Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QC-1171] CCDB api should timeout #13129

Merged
merged 8 commits into from
May 16, 2024
Merged

[QC-1171] CCDB api should timeout #13129

merged 8 commits into from
May 16, 2024

Conversation

Barthelemy
Copy link
Collaborator

The CCDB api never times out. If the server we are trying to reach is unreachable (e.g. QCDB from outside CERN), the workflow will just hang with no clear messages. Moreover, trying to kill it will also fail.

This is a proposal to introduce a timeout of 1 second.

@Barthelemy Barthelemy requested review from costing, sawenzel and a team as code owners May 13, 2024 07:38
Copy link
Contributor

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass3
async-2023-pbpb-apass4
async-2022-pp-apass6-2023-PbPb-apass2
async-2022-pp-apass4
async-2022-pp-apass4-accepted
async-2022-pp-apass6-2023-PbPb-apass2-accepted
async-2023-pbpb-apass3-accepted
async-2023-pbpb-apass4-accepted
async-2023-pp-apass4
async-2023-pp-apass4-accepted
async-2024-pp-apass1
async-2024-pp-apass1-accepted
async-2022-pp-apass7
async-2022-pp-apass7-accepted
async-2024-pp-cpass0
async-2024-pp-cpass0-accepted

@davidrohr
Copy link
Collaborator

@Barthelemy @costing : While I think it is reasonable to add a time out, I do not think 1 second will work. This will also affect all GRID jobs, and there can be delays longer than 1 sec, and we do not want such workflows failing immediately.

@sawenzel
Copy link
Collaborator

Yes. What about making the timeout a larger number and even make it configurable (via an environment variable)? Then we could adjust to the use-case.

@jgrosseo
Copy link
Collaborator

I agree, I would put at least 60 or 120 seconds as default....

@davidrohr
Copy link
Collaborator

In addition to making it configurable, we could make the default depend on the deployment type:

o2::framework::DeploymentMode deploymentMode = o2::framework::DefaultsHelpers::deploymentMode();
if (deploymentMode == o2::framework::DeploymentMode::OnlineDDS) { ... }

@Barthelemy
Copy link
Collaborator Author

Thanks for all your input.

I am going to add a default and make it configurable via an environment variable.
I understand that the default for async should be high, I propose to use 120 seconds.
For online, what would be the default ?

@davidrohr
Copy link
Collaborator

I think for online it should be fast, since it should access the local CCDB cache. I would put perhaps 5s, to have some margin? Note that there is OnlineECS and OnlineDDS or so, at least there are multiple online cases.

…en using the env var ALICEO2_CCDB_CURL_TIMEOUT.
@Barthelemy
Copy link
Collaborator Author

I have updated the code.

CCDB/src/CcdbApi.cxx Outdated Show resolved Hide resolved
@davidrohr
Copy link
Collaborator

ok, thx. Then I have one more question. If I have an unstable conenction locally, I assume I can run into this 1s timeout.
What error message will I get? Will it hint that I perhaps have to increase the timeout via ALICEO2_CCDB_CURL_TIMEOUT?
Otherwise, I assume this might leave people stuck. Perhaps we should also be a bit more conservative, and not use 1s for the local case?

@Barthelemy
Copy link
Collaborator Author

At the moment, if you have a connection problem , the processor will be stuck with no error messages. You can see how it confused Chiara in the JIRA ticket and we had to spend a bit of time together to understand what was going on.

With this change we get the following error message:

[68708:qc-check-TST-QcCheck]: [13:53:37][ALARM] curl_easy_perform() failed: Timeout was reached

Which seems to me pretty clear and certainly less confusing than the current situation. I could check if we could customize it to have a hint about the timeout env var.

I used a default local timeout of 1 second based on my own use case: either you can quickly connect to the QCDB or it means that it is not there. I am not sure what would be a reasonible timeout in general for people running locally but I would be surprise that it is acceptable to have longer delays.

Let me know !

@Barthelemy
Copy link
Collaborator Author

I have added a specific message in case of timeouts.

davidrohr
davidrohr previously approved these changes May 13, 2024
Copy link
Collaborator

@davidrohr davidrohr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be squash-merged

@chiarazampolli
Copy link
Collaborator

Could one get a message also while it tries to connect? In my case, I was indeed not getting errors, but I would have not expected a timeout, I was really clueless. While if there was a message "trying to connect... timeout will be reached in XXX s" it would be clear that there is nothing stuck but the connection.

@davidrohr
Copy link
Collaborator

I don't see the benefit of additional messages, if we now get an error after a timeout. In my opinion this just floods the logs further.

@chiarazampolli
Copy link
Collaborator

Locally, I would not wait 120 seconds, I would think something else is broken, especially when running from CERN.

@chiarazampolli
Copy link
Collaborator

...and it should not happen that much. In my case, that triggered this development, it was that QC was trying to access a prod QCDB which we cannot write to. So it would have never worked.

@davidrohr
Copy link
Collaborator

But the 120s are only for GRID jobs?

@chiarazampolli
Copy link
Collaborator

Yes, you are right.

@Barthelemy
Copy link
Collaborator Author

I would indeed keep it as it is to avoid extra log messages.

The formatting has been fixed.

It can and should be squashed-merged.

@Barthelemy Barthelemy enabled auto-merge (squash) May 13, 2024 13:03
@costing
Copy link
Collaborator

costing commented May 13, 2024

For the actual timeout values I think 5s in Online and 15s in Offline should be plenty enough. The HTTP call itself should be very short, and even taking the RTT of remote sites into account it should stay way below 15s. Waiting longer is unlikely to solve anything in practice.

@Barthelemy
Copy link
Collaborator Author

I'll let you, Costin, David and Chiara decide on the offline timeout.

@Barthelemy
Copy link
Collaborator Author

@davidrohr @jgrosseo @chiarazampolli what is your opinion ?

@jgrosseo
Copy link
Collaborator

For me following Costin's guidance is fine.

@costing
Copy link
Collaborator

costing commented May 14, 2024

One more comment on this. I think the default value should be the larger of the two, also because Offline there are potentially many users and Grid workflows that would need to be tuned. In the Online environment it should be one place where the env variable would be set and could override the default to a smaller effective timeout.

@Barthelemy
Copy link
Collaborator Author

The default value is already the bigger of all (120s). I will then use 15s for Grid and as default.

@Barthelemy Barthelemy changed the title [QC-1171] CCDB api should timeout after 1 sec [QC-1171] CCDB api should timeout May 15, 2024
@Barthelemy
Copy link
Collaborator Author

Could someone merge this please ?

mCurlTimeout = timeout;
}
} else { // set a default depending on the deployment mode
o2::framework::DeploymentMode deploymentMode = o2::framework::DefaultsHelpers::deploymentMode();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch()...

@ktf ktf disabled auto-merge May 16, 2024 06:46
@ktf ktf merged commit 92ee48d into dev May 16, 2024
12 of 13 checks passed
@ktf ktf deleted the timeout-ccdb-api branch May 16, 2024 06:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants