CI: Select jobs by touched code #9637

Open · wants to merge 2 commits into main

Conversation

@ldoktor (Contributor) commented May 15, 2024

This PR adds 2 workflows:

  • gatekeeper-skipper
  • gatekeeper

The skipper can be used in workflows to detect changes and output the set of tests to be executed (or skipped); the gatekeeper watches all running jobs and checks that the expected jobs (based on the changed files) pass. This should replace the "required" feature of GH (once we make this workflow required).

Note this solution should be quite scalable, so things like advanced heuristics or overrides via GH comments/labels should be possible.

Example PRs (with slightly outdated gatekeeper but mainly similar):

Note you can get the gatekeeper output of any existing PR (yours or kata-containers') by running something like:

GITHUB_TOKEN="" REQUIRED_JOBS="foo;bar" REQUIRED_REGEXPS=".*" COMMIT_HASH=b8382cea886ad9a8f77d237bcfc0eba0c98775dd GITHUB_REPOSITORY=kata-containers/kata-containers python3 tools/testing/gatekeeper/jobs.py

To allow selective testing as well as a selective list of required tests,
let's add a mapping of required jobs/tests in "skips.py" and a
"gatekeeper" workflow that will ensure the expected required jobs were
successful. Then we can mark only the "gatekeeper" as the required job
and modify the logic to suit our needs.

Fixes: kata-containers#9237

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
GitHub Actions has a limit of 20 referenced workflows. Let's combine all
the amd64-related run-k8s-tests into a single job file to work around
this limitation. They all share the same prerequisites, so the result
should be the same.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
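
(The first commit message above mentions a mapping of required jobs/tests in "skips.py". The actual content of that file is not shown in this thread; the following is only an illustrative sketch of what such a path-to-tests mapping could look like, with made-up patterns and feature names.)

# Illustrative sketch only -- not the real "skips.py" from this PR.
# Idea: map regexps on changed file paths to the features/tests they require,
# ordered by importance (first match wins; an empty set means "skip everything").
import re

MAPPING = (
    (re.compile(r"^docs/"), set()),                    # docs-only changes: skip tests
    (re.compile(r"^src/runtime/"), {"build", "k8s"}),  # runtime changes: build + k8s
    (re.compile(r".*"), {"build", "k8s", "metrics"}),  # anything else: run everything
)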
@katacontainersbot katacontainersbot added the size/huge Largest and most complex task (probably needs breaking into small pieces) label May 15, 2024
if regexp.search(line):
    for feature in features:
        enabled_features.add(feature)
    break
Contributor:

Hi @ldoktor !

This break is very important, as it applies the sort-by-order-of-importance rule. For example, if the first match suggests an empty list of features then, in practice, it is a skip; if there are other matches for this same line that suggest a list of features, they won't apply. So shall we get this line documented, to avoid future bugs from accidental removal of this break or a change to the algorithm?

Contributor Author (@ldoktor):

The same here; I'll add a comment in v2 (my code/comment ratio is way below my average, as this is the product of many failed attempts).
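
(For illustration, the inline comment promised for v2 could read roughly like the sketch below; the surrounding loop and the self.mapping name are reconstructed from the snippet above, not taken from the actual code.)

for regexp, features in self.mapping:  # assumed to be sorted by order of importance
    if regexp.search(line):
        for feature in features:
            enabled_features.add(feature)
        # Only the first (most important) matching rule applies to this line;
        # an empty "features" set therefore acts as an explicit skip.  Do not
        # remove this break without rethinking the whole algorithm.
        break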

enabled_features = self.get_features(target_branch)
if not tests:
    for feature in self.all_set_of_tests:
        print(f"skip_{feature}=" +
Contributor:

The output of this script is redirected (appended) to $GITHUB_OUTPUT, which ends up as the outputs of the skipper job. The expected format is key=value, as in https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs. Maybe worth some inline documentation to, again, avoid changing this by accident.

Contributor Author (@ldoktor):

Sure, let me add it to v2
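
(A possible shape for that inline note is sketched below; the print statement is truncated in the diff above, so the "yes"/"no" values and the membership check are assumptions for this sketch.)

# The stdout of this script is appended to $GITHUB_OUTPUT by the workflow,
# so every line must follow the "key=value" format described in
# https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs
for feature in self.all_set_of_tests:
    print(f"skip_{feature}=" + ("no" if feature in enabled_features else "yes"))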

    headers=_GH_HEADERS,
    timeout=60
)
response.raise_for_status()
Contributor:

In two places this script can crash (response.raise_for_status() throwing an exception because the API is unavailable), failing the gatekeeper job and consequently blocking a PR. Can we resume the checking by re-running the gatekeeper job from the GitHub web UI?

Regardless, I think we should implement a retry mechanism for calling the GitHub API. The interwebs say requests has a retry implementation with backoff!
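
(A minimal sketch of such a retry with backoff, assuming requests plus urllib3's Retry; the URL, headers, and retry parameters below are placeholders, not the actual jobs.py values.)

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Placeholders; the real URL and headers come from jobs.py.
url = "https://api.github.com/repos/kata-containers/kata-containers/commits"
_GH_HEADERS = {"Accept": "application/vnd.github+json"}

session = requests.Session()
# Retry transient GitHub API failures (rate limiting, server errors) with
# exponential backoff instead of failing the gatekeeper job on the first error.
retry = Retry(total=5, backoff_factor=2,
              status_forcelist=(429, 500, 502, 503, 504))
session.mount("https://", HTTPAdapter(max_retries=retry))

response = session.get(url, headers=_GH_HEADERS, timeout=60)
response.raise_for_status()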

Contributor Author (@ldoktor):

At this point I'd like to keep it this way to experiment with the limits. If it becomes an issue, we should get the exception with details. I haven't found the limits for jobs on GH runners, but somewhere I read there should not be any limit.

Anyway, it's just another workflow; one can re-try it from the UI.

    fetch-depth: 0
- id: gatekeeper
  env:
    PR_NUMBER: ${{ github.event.pull_request.number }}
Contributor:

PR_NUMBER seems unused.

Contributor Author (@ldoktor):

True, let me remove it in v2

                    for job, status in self.results
                    if status == RUNNING])
print(f"{running_jobs} jobs are still running...")
time.sleep(60)
Contributor:

It would be good to parameterize this value; 60s seems too small (build and test jobs take several minutes to finish). I.e. we might need to tune that value to ensure we don't hit the API rate limits.

Contributor Author (@ldoktor):

Wow, you found the limits, good. We can change it, and we can also make it dynamic. Say we expect the first failures after 15 or 30 minutes: then we can start checking every minute, and after 2 hours increase the interval to an hour (the assumption being that tests don't finish earlier than in 30 minutes, so at first we want to check frequently to fail fast, but when things take too long we don't need that frequency and can check once an hour). Or anything else we can come up with.
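
(One way to express that dynamic interval is sketched below; the thresholds are just the numbers mentioned above, not a fixed proposal.)

def poll_interval(elapsed_seconds):
    """Return how long to sleep before the next job-status check."""
    if elapsed_seconds < 30 * 60:
        return 15 * 60   # tests are not expected to finish yet, no need to poll often
    if elapsed_seconds < 2 * 60 * 60:
        return 60        # fail-fast window: check every minute
    return 60 * 60       # long-running PRs: hourly is enough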

print(f"{running_jobs} jobs are still running...")
time.sleep(60)
continue
sys.exit(ret)
Contributor:

One improvement that I'm still unsure should be implemented now or later is proper reporting of the required jobs that failed. With this new approach we won't have the '(Required)' marker in front of the jobs, and I'm afraid users (mainly newcomers) won't immediately open the gatekeeper's logs to check which jobs failed.

And we might be on the edge of creating a GitHub App :)

Contributor Author (@ldoktor):

They would get a bunch of failed non-required tests, but also one (or multiple) failed required tests, one of them being the gatekeeper. And the last note from the gatekeeper should explain the situation and, where needed, navigate them to the right place.

* COMMIT_HASH: Full commit hash we want to be watching
* GITHUB_REPOSITORY: Github repository (user/repo)
Sample execution (GH token can be excluded):
GITHUB_TOKEN="..." REQUIRED_JOBS="skipper / skipper"
Contributor:

Apologies in advance if I'm just being boneheaded, but line 10 above says that REQUIRED_JOBS should be a comma-separated list, which "skipper / skipper" doesn't seem to be.

Contributor Author (@ldoktor):

It is; the full name of the skipper job is "skipper / skipper", as it includes the workflow_name.
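
(For illustration, and assuming the semicolon-separated format used in the example at the top of this PR, the env var might be consumed along these lines; the parsing shown here is a guess, not the actual jobs.py code.)

import os

# Full job names include the workflow name, e.g. "skipper / skipper".
required_jobs = [job.strip()
                 for job in os.environ.get("REQUIRED_JOBS", "").split(";")
                 if job.strip()]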

@zvonkok (Contributor) commented May 28, 2024

Would this also help in rerunning only failed jobs rather than rerunning a complete workflow?
