-
Notifications
You must be signed in to change notification settings - Fork 996
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CI: Select jobs by touched code #9637
base: main
Are you sure you want to change the base?
Conversation
to allow selective testing as well as selective list of required tests let's add a mapping of required jobs/tests in "skips.py" and a "gatekeaper" workflow that will ensure the expected required jobs were successful. Then we can only mark the "gatekeaper" as the required job and modify the logic to suit our needs. Fixes: kata-containers#9237 Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
github action has a limit of 20 referenced workflows. Let's combine all the amd64 related run-k8s-tests into a single job file to workaround this limitation. They all share the same prerequisits so the result should be the same. Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
if regexp.search(line): | ||
for feature in features: | ||
enabled_features.add(feature) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @ldoktor !
This break
is very important, as it applies the sort by order of importance
rule. For example, if the first match suggest an empty list of features then, in practice, it is a skip; if there are other matches to this same line that suggest a list of features then they won't apply. So shall we get this line document to avoid future bugs by accident removal of this break
or even a change on the algorithm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here, I'll add a comment in v2 (my code/comment ratio is way below my average as this is a product of many failed attempts)
enabled_features = self.get_features(target_branch) | ||
if not tests: | ||
for feature in self.all_set_of_tests: | ||
print(f"skip_{feature}=" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output of this script is redirected (appended) to $GITHUB_OUTPUT
which end up as outputs of the skipper job. It's is expected the format key=value
as in https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs . Maybe worth a inline documentation to, again, avoid changing this by accident.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me add it to v2
headers=_GH_HEADERS, | ||
timeout=60 | ||
) | ||
response.raise_for_status() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In two places it can crash this script ( response.raise_for_status()
throwing exception due API unavailable), failing the gatekeeper job and consequently blocking a PR. Can we resume the checking by re-running the gatekeeper job from the Github web UI?
Regardless, I think we should implement a retry mechanism to call the Github API. The interwebs says requests
has a retry implementation with backoff!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point I'd like to keep it this way to experiment with the limits. If it becomes an issue, we should get the exception with details. I haven't found the limits for jobs on GH runners but somewhere I read there should not be any limit.
Anyway it's just another workflow, one can re-try it from the UI.
fetch-depth: 0 | ||
- id: gatekeeper | ||
env: | ||
PR_NUMBER: ${{ github.event.pull_request.number }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR_NUMBER
seems unused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, let me remove it in v2
for job, status in self.results | ||
if status == RUNNING]) | ||
print(f"{running_jobs} jobs are still running...") | ||
time.sleep(60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to parameterize this value, 60s seems too small (build and test jobs take several minutes to finish). I.e. we might need to tune that value to ensure we don't hit the API rate limits:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, you found the limits, good. We can change it as well as we can make it dynamic (let's say we expect the failures first after 15 or 30m, then we can start checking per 1m and after 2h increase it to hour (the assumption would be we know tests don't finish earlier than in 30m, then we want to be checking frequently to fail fast but when things take too long, we don't need that frequency and can check once an hour). Or anything else we can come-up with.
print(f"{running_jobs} jobs are still running...") | ||
time.sleep(60) | ||
continue | ||
sys.exit(ret) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One improvement that I'm still unsure if it should implemented now or later, is about proper reporting the required jobs that failed. With this new approach we won't have the '(Required)' marker in front of the jobs and I'm afraid users (mainly newcomers) won't immediately open the gatekeeper's logs to check the jobs that failed.
And we might be on the edge for creating a Github App :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They would get a bunch of failed non-required tests but one (or multiple) failed required tests, one of them being the gatekeeper. And last note from gatekeeper should explain the situation and eventually navigate them to the right place.
* COMMIT_HASH: Full commit hash we want to be watching | ||
* GITHUB_REPOSITORY: Github repository (user/repo) | ||
Sample execution (GH token can be excluded): | ||
GITHUB_TOKEN="..." REQUIRED_JOBS="skipper / skipper" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apologies in advance if I'm just being boneheaded but line 10 above says that REQUIRED_JOBS
should be a comma-separated list which "skipper / skipper"
doesn't seem to be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is, the full name of skipper job is skipper / skipper
as it includes the workflow_name
.
Would this also help in rerunning only failed jobs rather than rerunning a complete workflow? |
This PR adds 2 workflows:
The skipper can be used in workflows to detect changes and output set of tests to be executed (or skipped), the gatekeeper watches all the running jobs and checks the status of expected jobs to pass (based on changed files), which should replace the "required" feature of GH (once we make this workflow required).
Note this solution should be quite scalable so things like advanced heuristic or GH comments/labels overrides should be possible.
Example PRs (with slightly outdated gatekeeper but mainly similar):
Note you can get gatekeeper output of any existing PR (yours or kata-containers) by running something like
GITHUB_TOKEN="" REQUIRED_JOBS="foo;bar" REQUIRED_REGEXPS=".*" COMMIT_HASH=b8382cea886ad9a8f77d237bcfc0eba0c98775dd GITHUB_REPOSITORY=kata-containers/kata-containers python3 tools/testing/gatekeeper/jobs.py