[QC-1171] CCDB api should timeout #13129
@Barthelemy @costing : While I think it is reasonable to add a time out, I do not think 1 second will work. This will also affect all GRID jobs, and there can be delays longer than 1 sec, and we do not want such workflows failing immediately. |
Yes. What about making the timeout a larger number and even make it configurable (via an environment variable)? Then we could adjust to the use-case. |
I agree, I would put at least 60 or 120 seconds as default.... |
In addition to making it configurable, we could make the default depend on the deployment type:
|
Thanks for all your input. I am going to add a default and make it configurable via an environment variable. |
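A minimal sketch of the "default plus environment override" idea discussed above, not the code from this PR. It assumes the variable holds an integer number of seconds; the helper names `timeoutFromEnv` and `effectiveTimeout` are hypothetical, and only the variable name `ALICEO2_CCDB_CURL_TIMEOUT` comes from this thread.

```cpp
#include <cstdlib>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper: reads a timeout (assumed to be in seconds) from the
// named environment variable. Returns std::nullopt when the variable is
// unset, empty, not a plain integer, or not strictly positive.
std::optional<int> timeoutFromEnv(const char* name = "ALICEO2_CCDB_CURL_TIMEOUT")
{
  const char* raw = std::getenv(name);
  if (raw == nullptr || *raw == '\0') {
    return std::nullopt;
  }
  try {
    const std::string text(raw);
    std::size_t pos = 0;
    const int value = std::stoi(text, &pos);
    // Reject trailing garbage such as "10abc" and non-positive values.
    if (pos != text.size() || value <= 0) {
      return std::nullopt;
    }
    return value;
  } catch (const std::exception&) {
    // std::stoi throws on non-numeric or out-of-range input.
    return std::nullopt;
  }
}

// Hypothetical helper: the environment variable wins when it is valid,
// otherwise the caller-supplied default is used.
int effectiveTimeout(int fallbackSeconds)
{
  return timeoutFromEnv().value_or(fallbackSeconds);
}
```

With this shape, an invalid value silently falls back to the default; whether that or a hard error is preferable is a design choice the thread does not settle.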
I think for online it should be fast, since it should access the local CCDB cache. I would put perhaps 5s, to have some margin? Note that there is OnlineECS and OnlineDDS or so, at least there are multiple online cases. |
…en using the env var ALICEO2_CCDB_CURL_TIMEOUT.
I have updated the code. |
ok, thx. Then I have one more question. If I have an unstable connection locally, I assume I can run into this 1s timeout. |
At the moment, if you have a connection problem, the processor will be stuck with no error messages. You can see how it confused Chiara in the JIRA ticket and we had to spend a bit of time together to understand what was going on. With this change we get the following error message:
Which seems to me pretty clear and certainly less confusing than the current situation. I could check if we could customize it to have a hint about the timeout env var. I used a default local timeout of 1 second based on my own use case: either you can quickly connect to the QCDB or it means that it is not there. I am not sure what would be a reasonable timeout in general for people running locally, but I would be surprised if longer delays were acceptable. Let me know! |
I have added a specific message in case of timeouts. |
should be squash-merged
Could one also get a message while it tries to connect? In my case, I was indeed not getting errors, but I would not have expected a timeout; I was really clueless. If there were a message like "trying to connect... timeout will be reached in XXX s", it would be clear that nothing is stuck except the connection. |
I don't see the benefit of additional messages, if we now get an error after a timeout. In my opinion this just floods the logs further. |
Locally, I would not wait 120 seconds, I would think something else is broken, especially when running from CERN. |
...and it should not happen that much. In my case, that triggered this development, it was that QC was trying to access a prod QCDB which we cannot write to. So it would have never worked. |
But the 120s are only for GRID jobs? |
Yes, you are right. |
I would indeed keep it as it is to avoid extra log messages. The formatting has been fixed. It can and should be squash-merged. |
For the actual timeout values I think 5s in Online and 15s in Offline should be plenty enough. The HTTP call itself should be very short, and even taking the RTT of remote sites into account it should stay way below 15s. Waiting longer is unlikely to solve anything in practice. |
I'll let you, Costin, David and Chiara decide on the offline timeout. |
@davidrohr @jgrosseo @chiarazampolli what is your opinion? |
For me following Costin's guidance is fine. |
One more comment on this. I think the default value should be the larger of the two, also because in Offline there are potentially many users and Grid workflows that would need to be tuned. In the Online environment there should be a single place where the env variable is set, and it could override the default with a smaller effective timeout. |
The default value is already the bigger of all (120s). I will then use 15s for Grid and as default. |
Could someone merge this please? |
mCurlTimeout = timeout;
}
} else { // set a default depending on the deployment mode
o2::framework::DeploymentMode deploymentMode = o2::framework::DefaultsHelpers::deploymentMode();
switch()...
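One way the `switch` suggested here could look, as a sketch only. The enum below is a stand-in declared for this example, not the real `o2::framework::DeploymentMode` (whose enumerators may differ), and `defaultTimeoutSeconds` is a hypothetical name. The values follow the numbers mentioned in this conversation: 1 s locally, 5 s online, and 15 s for Grid and as the general default.

```cpp
// Stand-in for o2::framework::DeploymentMode, declared here only so the
// example is self-contained. The real enum may have other enumerators.
enum class DeploymentMode { Local,
                            OnlineECS,
                            OnlineDDS,
                            Grid };

// Hypothetical helper: picks a default timeout in seconds from the
// deployment mode, using the values discussed in this thread.
constexpr int defaultTimeoutSeconds(DeploymentMode mode)
{
  switch (mode) {
    case DeploymentMode::OnlineECS:
    case DeploymentMode::OnlineDDS:
      return 5; // online: expected to hit the local CCDB cache
    case DeploymentMode::Local:
      return 1; // local: either the server answers quickly or it is not there
    case DeploymentMode::Grid:
    default:
      return 15; // Grid and any mode not listed above
  }
}
```

Keeping the largest value in the `default` branch matches the request that unlisted cases get the more generous timeout.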
The CCDB api never times out. If the server we are trying to reach is unreachable (e.g. QCDB from outside CERN), the workflow will just hang with no clear messages. Moreover, trying to kill it will also fail.
This is a proposal to introduce a timeout of 1 second.