Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement retries in vmexport download when server returns unexpected status #11911

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

alromeros
Copy link
Contributor

@alromeros alromeros commented May 14, 2024

What this PR does

An issue with the hotplug pod termination/export pod creation caused the memory dump to lack sufficient permissions to be manipulated by the server. This uncommon bug is usually solved automatically when restarting the exporter pod. A simple retry mechanism in vmexport download should help to mitigate this bug.

This Pull Request introduces the two following changes:

  • Improves handling of error cases in export server so we return appropriate status codes when the server lacks permission to read the files.
  • The vmexport client now retries the download if the server returns a bad status, helping to mitigate the previously described bug.

Fixes # https://issues.redhat.com/browse/CNV-39141

Special notes for your reviewer

I tried handling this differently by exiting the server and then restarting the pod from the controller, but that led to raciness and incosnsitencies with the virtualmachineexport resource. The current solution seemed worst at first but doesn't require messing with the controllers/export resources and is overall cleaner and easier to implement.

Release note

Bugfix: Implement retry mechanism in vmexport download

Signed-off-by: Alvaro Romero <alromero@redhat.com>
@kubevirt-bot
Copy link
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@kubevirt-bot kubevirt-bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. dco-signoff: yes Indicates the PR's author has DCO signed all their commits. labels May 14, 2024
@kubevirt-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alicefr for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alromeros
Copy link
Contributor Author

/test all

@alromeros alromeros marked this pull request as ready for review May 15, 2024 18:45
@alromeros alromeros changed the title [WIP] Implement retries in vmexport download when server returns unexpected status Implement retries in vmexport download when server returns unexpected status May 16, 2024
@kubevirt-bot kubevirt-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 16, 2024
@alromeros
Copy link
Contributor Author

/test all

@alromeros
Copy link
Contributor Author

/retest

@alromeros
Copy link
Contributor Author

/cc @awels

@kubevirt-bot kubevirt-bot requested a review from awels May 17, 2024 15:26
@@ -475,6 +475,11 @@ func archiveHandler(mountPoint string) http.Handler {
w.WriteHeader(http.StatusBadRequest)
return
}
if hasPermissions := checkDirectoryPermissions(mountPoint); !hasPermissions {
w.WriteHeader(http.StatusForbidden)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think 5xx error codes are more appropriate in this case. 4xx errors indicate an issue with the client. See: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#4xx_client_errors

@@ -144,6 +144,8 @@ var (
manifestOutputFormat string
)

const downloadRetries = 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a --retry arg similar to curl (retry x times on transient error) would be better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, would we want to retry every time the download fails or just when the server returns a bad status (as originally intended)? I prefer the second but since we'll include a flag to explicitly request retries maybe it makes sense to retry even when other kind of error occurs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think it only makes sense to retry when the server returns bad status.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still using 1 as default to mitigate transient server errors

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think default retries should be 0 what do you think @alicefr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can use 0 in vmexport if you prefer, but for the memory dump integration, I think we should at least use 1 to mitigate the bug.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Mike the default should be 0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, left the higher default retries in the memory dump integration and justified it with a comment.

This commit implements a retry mechanism in the vmexport download command so, when the export server returns an unexpected status, we retry the download again.

Signed-off-by: Alvaro Romero <alromero@redhat.com>
Signed-off-by: Alvaro Romero <alromero@redhat.com>
@alromeros
Copy link
Contributor Author

/retest

1 similar comment
@alromeros
Copy link
Contributor Author

/retest

@alromeros alromeros requested a review from mhenriks May 23, 2024 11:54
return false
}

for _, item := range contents {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we cannot have subdirectories here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, with the current implementation we won't have any subdirectories here. This was discussed with the team and we decided there was no need to check permissions recursively.

This commit adds a flag that lets users specify the number of retrys the command will attempt before giving up.

Signed-off-by: Alvaro Romero <alromero@redhat.com>
@alromeros
Copy link
Contributor Author

/test pull-kubevirt-e2e-arm64

@alromeros alromeros requested a review from alicefr May 30, 2024 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/virtctl dco-signoff: yes Indicates the PR's author has DCO signed all their commits. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/storage size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants