CI: Select jobs by touched code #9637

Open · wants to merge 2 commits into main

Conversation

@ldoktor (Contributor) commented May 15, 2024

This PR adds 2 workflows:

  • gatekeeper-skipper
  • gatekeeper

The skipper can be used in workflows to detect changes and output the set of tests to be executed (or skipped); the gatekeeper watches all running jobs and checks that the expected jobs (based on the changed files) pass. This should replace the "required" feature of GH (once we make this workflow required).

Note this solution should be quite scalable, so things like advanced heuristics or overrides via GH comments/labels should be possible.

Example PRs (with slightly outdated gatekeeper but mainly similar):

Note you can get the gatekeeper output of any existing PR (yours or kata-containers') by running something like:

GITHUB_TOKEN="" REQUIRED_JOBS="foo;bar" REQUIRED_REGEXPS=".*" COMMIT_HASH=b8382cea886ad9a8f77d237bcfc0eba0c98775dd GITHUB_REPOSITORY=kata-containers/kata-containers python3 tools/testing/gatekeeper/jobs.py

To allow selective testing as well as a selective list of required tests,
let's add a mapping of required jobs/tests in "skips.py" and a
"gatekeeper" workflow that will ensure the expected required jobs were
successful. Then we can mark only the "gatekeeper" as the required job
and modify the logic to suit our needs.

Fixes: kata-containers#9237

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
GitHub Actions has a limit of 20 referenced workflows. Let's combine all
the amd64-related run-k8s-tests into a single job file to work around
this limitation. They all share the same prerequisites, so the result
should be the same.

Signed-off-by: Lukáš Doktor <ldoktor@redhat.com>
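
(The first commit message above mentions a mapping of required jobs/tests in "skips.py". The actual content of that file is not shown in this thread; the following is only an illustrative sketch of what such a path-to-tests mapping could look like, with made-up patterns and feature names.)

# Illustrative sketch only -- not the real "skips.py" from this PR.
# Idea: map regexps on changed file paths to the features/tests they require,
# ordered by importance (first match wins; an empty set means "skip everything").
import re

MAPPING = (
    (re.compile(r"^docs/"), set()),                    # docs-only changes: skip tests
    (re.compile(r"^src/runtime/"), {"build", "k8s"}),  # runtime changes: build + k8s
    (re.compile(r".*"), {"build", "k8s", "metrics"}),  # anything else: run everything
)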
@katacontainersbot katacontainersbot added the size/huge Largest and most complex task (probably needs breaking into small pieces) label May 15, 2024
if regexp.search(line):
    for feature in features:
        enabled_features.add(feature)
    break
Contributor:

Hi @ldoktor !

This break is very important, as it applies the sort-by-order-of-importance rule. For example, if the first match suggests an empty list of features then, in practice, it is a skip; if there are other matches for this same line that suggest a list of features, they won't apply. So shall we get this line documented, to avoid future bugs from accidental removal of this break or a change to the algorithm?

Contributor Author (@ldoktor):

The same here; I'll add a comment in v2 (my code/comment ratio is way below my average, as this is the product of many failed attempts).
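
(For illustration, the inline comment promised for v2 could read roughly like the sketch below; the surrounding loop and the self.mapping name are reconstructed from the snippet above, not taken from the actual code.)

for regexp, features in self.mapping:  # assumed to be sorted by order of importance
    if regexp.search(line):
        for feature in features:
            enabled_features.add(feature)
        # Only the first (most important) matching rule applies to this line;
        # an empty "features" set therefore acts as an explicit skip.  Do not
        # remove this break without rethinking the whole algorithm.
        break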

enabled_features = self.get_features(target_branch)
if not tests:
    for feature in self.all_set_of_tests:
        print(f"skip_{feature}=" +
Contributor:

The output of this script is redirected (appended) to $GITHUB_OUTPUT, which ends up as the outputs of the skipper job. The expected format is key=value, as in https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs. Maybe worth some inline documentation to, again, avoid changing this by accident.

Contributor Author (@ldoktor):

Sure, let me add it to v2
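
(A possible shape for that inline note is sketched below; the print statement is truncated in the diff above, so the "yes"/"no" values and the membership check are assumptions for this sketch.)

# The stdout of this script is appended to $GITHUB_OUTPUT by the workflow,
# so every line must follow the "key=value" format described in
# https://docs.github.com/en/actions/using-jobs/defining-outputs-for-jobs
for feature in self.all_set_of_tests:
    print(f"skip_{feature}=" + ("no" if feature in enabled_features else "yes"))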

    headers=_GH_HEADERS,
    timeout=60
)
response.raise_for_status()
Contributor:

In two places this script can crash (response.raise_for_status() throwing an exception because the API is unavailable), failing the gatekeeper job and consequently blocking a PR. Can we resume the checking by re-running the gatekeeper job from the GitHub web UI?

Regardless, I think we should implement a retry mechanism for calling the GitHub API. The interwebs say requests has a retry implementation with backoff!
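
(A minimal sketch of such a retry with backoff, assuming requests plus urllib3's Retry; the URL, headers, and retry parameters below are placeholders, not the actual jobs.py values.)

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Placeholders; the real URL and headers come from jobs.py.
url = "https://api.github.com/repos/kata-containers/kata-containers/commits"
_GH_HEADERS = {"Accept": "application/vnd.github+json"}

session = requests.Session()
# Retry transient GitHub API failures (rate limiting, server errors) with
# exponential backoff instead of failing the gatekeeper job on the first error.
retry = Retry(total=5, backoff_factor=2,
              status_forcelist=(429, 500, 502, 503, 504))
session.mount("https://", HTTPAdapter(max_retries=retry))

response = session.get(url, headers=_GH_HEADERS, timeout=60)
response.raise_for_status()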

Contributor Author (@ldoktor):

At this point I'd like to keep it this way to experiment with the limits. If it becomes an issue, we should get the exception with details. I haven't found the limits for jobs on GH runners, but somewhere I read there should not be any limit.

Anyway, it's just another workflow; one can re-try it from the UI.

    fetch-depth: 0
- id: gatekeeper
  env:
    PR_NUMBER: ${{ github.event.pull_request.number }}
Contributor:

PR_NUMBER seems unused.

Contributor Author (@ldoktor):

True, let me remove it in v2

                    for job, status in self.results
                    if status == RUNNING])
print(f"{running_jobs} jobs are still running...")
time.sleep(60)
Contributor:

It would be good to parameterize this value; 60s seems too small (build and test jobs take several minutes to finish). I.e. we might need to tune that value to ensure we don't hit the API rate limits.

Contributor Author (@ldoktor):

Wow, you found the limits, good. We can change it, and we can also make it dynamic. Say we expect the first failures after 15 or 30 minutes: then we can start checking every minute, and after 2 hours increase the interval to an hour (the assumption being that tests don't finish earlier than in 30 minutes, so at first we want to check frequently to fail fast, but when things take too long we don't need that frequency and can check once an hour). Or anything else we can come up with.
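
(One way to express that dynamic interval is sketched below; the thresholds are just the numbers mentioned above, not a fixed proposal.)

def poll_interval(elapsed_seconds):
    """Return how long to sleep before the next job-status check."""
    if elapsed_seconds < 30 * 60:
        return 15 * 60   # tests are not expected to finish yet, no need to poll often
    if elapsed_seconds < 2 * 60 * 60:
        return 60        # fail-fast window: check every minute
    return 60 * 60       # long-running PRs: hourly is enough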

print(f"{running_jobs} jobs are still running...")
time.sleep(60)
continue
sys.exit(ret)
Contributor:

One improvement that I'm still unsure should be implemented now or later is proper reporting of the required jobs that failed. With this new approach we won't have the '(Required)' marker in front of the jobs, and I'm afraid users (mainly newcomers) won't immediately open the gatekeeper's logs to check which jobs failed.

And we might be on the edge of creating a GitHub App :)

Contributor Author (@ldoktor):

They would get a bunch of failed non-required tests, but also one (or multiple) failed required tests, one of them being the gatekeeper. And the last note from the gatekeeper should explain the situation and, where needed, navigate them to the right place.

* COMMIT_HASH: Full commit hash we want to be watching
* GITHUB_REPOSITORY: Github repository (user/repo)
Sample execution (GH token can be excluded):
GITHUB_TOKEN="..." REQUIRED_JOBS="skipper / skipper"
Contributor:

Apologies in advance if I'm just being boneheaded, but line 10 above says that REQUIRED_JOBS should be a comma-separated list, which "skipper / skipper" doesn't seem to be.

Contributor Author (@ldoktor):

It is; the full name of the skipper job is "skipper / skipper", as it includes the workflow_name.
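
(For illustration, and assuming the semicolon-separated format used in the example at the top of this PR, the env var might be consumed along these lines; the parsing shown here is a guess, not the actual jobs.py code.)

import os

# Full job names include the workflow name, e.g. "skipper / skipper".
required_jobs = [job.strip()
                 for job in os.environ.get("REQUIRED_JOBS", "").split(";")
                 if job.strip()]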

@zvonkok (Contributor) commented May 28, 2024

Would this also help in rerunning only failed jobs rather than rerunning a complete workflow?
