[QC-1171] CCDB api should timeout #13129

Barthelemy · 2024-05-13T07:38:17Z

The CCDB API never times out. If the server we are trying to reach is unreachable (e.g. the QCDB from outside CERN), the workflow just hangs with no clear message. Moreover, trying to kill it also fails.

This is a proposal to introduce a timeout of 1 second.

github-actions · 2024-05-13T07:38:32Z

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and remove <label3>.

The following labels are available
async-2023-pbpb-apass3
async-2023-pbpb-apass4
async-2022-pp-apass6-2023-PbPb-apass2
async-2022-pp-apass4
async-2022-pp-apass4-accepted
async-2022-pp-apass6-2023-PbPb-apass2-accepted
async-2023-pbpb-apass3-accepted
async-2023-pbpb-apass4-accepted
async-2023-pp-apass4
async-2023-pp-apass4-accepted
async-2024-pp-apass1
async-2024-pp-apass1-accepted
async-2022-pp-apass7
async-2022-pp-apass7-accepted
async-2024-pp-cpass0
async-2024-pp-cpass0-accepted

davidrohr · 2024-05-13T07:40:12Z

@Barthelemy @costing : While I think it is reasonable to add a timeout, I do not think 1 second will work. This will also affect all GRID jobs, where delays can be longer than 1 second, and we do not want such workflows to fail immediately.

sawenzel · 2024-05-13T08:05:21Z

Yes. What about making the timeout a larger number and even making it configurable (via an environment variable)? Then we could adjust it to the use case.

jgrosseo · 2024-05-13T08:45:49Z

I agree, I would put at least 60 or 120 seconds as default....

davidrohr · 2024-05-13T08:50:27Z

In addition to making it configurable, we could make the default depend on the deployment type:

o2::framework::DeploymentMode deploymentMode = o2::framework::DefaultsHelpers::deploymentMode();
if (deploymentMode == o2::framework::DeploymentMode::OnlineDDS) { ... }

Barthelemy · 2024-05-13T09:01:24Z

Thanks for all your input.

I am going to add a default and make it configurable via an environment variable.
I understand that the default for async should be high; I propose to use 120 seconds.
For online, what would be the default?
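The plan above (a built-in default that an environment variable can override) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code of this PR: the helper names `timeoutFromText` and `resolveTimeoutSeconds` are hypothetical, the value is assumed to be a whole number of seconds, and only the variable name ALICEO2_CCDB_CURL_TIMEOUT comes from the PR's commit message.

```cpp
#include <cstdlib>

// Hypothetical helper (not from the PR): interpret `text` as a timeout in
// whole seconds. Falls back to `fallbackSeconds` when the text is missing,
// empty, not a plain integer, or not strictly positive.
long timeoutFromText(const char* text, long fallbackSeconds)
{
  if (text == nullptr || *text == '\0') {
    return fallbackSeconds;
  }
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  if (*end != '\0' || value <= 0) {
    return fallbackSeconds; // trailing characters or a non-positive value
  }
  return value;
}

// Hypothetical helper: the environment variable wins over the default
// whenever it holds a usable value.
long resolveTimeoutSeconds(long defaultSeconds)
{
  return timeoutFromText(std::getenv("ALICEO2_CCDB_CURL_TIMEOUT"), defaultSeconds);
}
```

With such a helper, `resolveTimeoutSeconds(120)` would return 120 unless the variable is set to a valid number. Ignoring malformed values silently is a design choice of this sketch; logging a warning would be an equally reasonable alternative.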

davidrohr · 2024-05-13T09:12:09Z

I think for online it should be fast, since it should access the local CCDB cache. I would put perhaps 5s, to have some margin? Note that there is OnlineECS and OnlineDDS or so, at least there are multiple online cases.

Barthelemy · 2024-05-13T11:56:22Z

I have updated the code.

CCDB/src/CcdbApi.cxx

davidrohr · 2024-05-13T12:05:09Z

OK, thanks. Then I have one more question. If I have an unstable connection locally, I assume I can run into this 1 s timeout.
What error message will I get? Will it hint that I perhaps have to increase the timeout via ALICEO2_CCDB_CURL_TIMEOUT?
Otherwise, I assume this might leave people stuck. Perhaps we should also be a bit more conservative and not use 1 s for the local case?

Barthelemy · 2024-05-13T12:10:29Z

At the moment, if you have a connection problem, the processor is stuck with no error message. You can see how it confused Chiara in the JIRA ticket, and we had to spend a bit of time together to understand what was going on.

With this change we get the following error message:

[68708:qc-check-TST-QcCheck]: [13:53:37][ALARM] curl_easy_perform() failed: Timeout was reached

Which seems to me pretty clear and certainly less confusing than the current situation. I could check if we could customize it to have a hint about the timeout env var.

I used a default local timeout of 1 second based on my own use case: either you can quickly connect to the QCDB or it means that it is not there. I am not sure what would be a reasonable timeout in general for people running locally, but I would be surprised if longer delays were acceptable.

Let me know !

Barthelemy · 2024-05-13T12:18:12Z

I have added a specific message in case of timeouts.
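A timeout-specific message that hints at the environment variable could look roughly like the sketch below. This is an illustration under stated assumptions, not the message that was actually added: the helper `describeCurlFailure` and the wording of the hint are hypothetical. The constant 28 is the numeric value of libcurl's CURLE_OPERATION_TIMEDOUT, copied so that the sketch compiles without the libcurl headers, and `curlMessage` stands for the text that `curl_easy_strerror()` would return.

```cpp
#include <string>

// Numeric value of libcurl's CURLE_OPERATION_TIMEDOUT, duplicated here only
// to keep the sketch independent of <curl/curl.h>.
constexpr int kCurlOperationTimedOut = 28;

// Hypothetical helper (not from the PR): build the log line for a failed
// transfer and append a hint when the failure was a timeout.
std::string describeCurlFailure(int curlCode, const std::string& curlMessage, long timeoutSeconds)
{
  std::string line = "curl_easy_perform() failed: " + curlMessage;
  if (curlCode == kCurlOperationTimedOut) {
    line += " (timeout of " + std::to_string(timeoutSeconds) +
            " s; it can be changed with the environment variable ALICEO2_CCDB_CURL_TIMEOUT)";
  }
  return line;
}
```

Other failures keep the plain message, so only the timeout case grows longer and the logs are not flooded further.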

davidrohr

should be squash-merged

chiarazampolli · 2024-05-13T12:33:57Z

Could one also get a message while it tries to connect? In my case I was indeed not getting any errors, but I would not have expected a timeout; I was really clueless. If there were a message such as "trying to connect... timeout will be reached in XXX s", it would be clear that nothing is stuck except the connection.

davidrohr · 2024-05-13T12:36:23Z

I don't see the benefit of additional messages, if we now get an error after a timeout. In my opinion this just floods the logs further.

chiarazampolli · 2024-05-13T12:38:18Z

Locally, I would not wait 120 seconds, I would think something else is broken, especially when running from CERN.

chiarazampolli · 2024-05-13T12:39:16Z

...and it should not happen that much. In my case, that triggered this development, it was that QC was trying to access a prod QCDB which we cannot write to. So it would have never worked.

davidrohr · 2024-05-13T12:39:41Z

But the 120s are only for GRID jobs?

chiarazampolli · 2024-05-13T12:52:37Z

Yes, you are right.

Barthelemy · 2024-05-13T13:02:36Z

I would indeed keep it as it is to avoid extra log messages.

The formatting has been fixed.

It can and should be squash-merged.

costing · 2024-05-13T13:48:51Z

For the actual timeout values I think 5s in Online and 15s in Offline should be plenty enough. The HTTP call itself should be very short, and even taking the RTT of remote sites into account it should stay way below 15s. Waiting longer is unlikely to solve anything in practice.

Barthelemy · 2024-05-13T14:28:36Z

I'll let you, Costin, David and Chiara decide on the offline timeout.

Barthelemy · 2024-05-14T05:54:15Z

@davidrohr @jgrosseo @chiarazampolli what is your opinion ?

jgrosseo · 2024-05-14T06:35:57Z

For me following Costin's guidance is fine.

costing · 2024-05-14T06:37:11Z

One more comment on this. I think the default value should be the larger of the two, also because in Offline there are potentially many users and Grid workflows that would otherwise need to be tuned. In the Online environment there should be one place where the env variable is set, and it could override the default with a smaller effective timeout.

Barthelemy · 2024-05-14T08:05:47Z

The default value is already the largest of all (120 s). I will then use 15 s for Grid and as the default.

Barthelemy · 2024-05-15T12:34:17Z

Could someone merge this please ?

ktf · 2024-05-16T06:46:14Z

CCDB/src/CcdbApi.cxx

+      mCurlTimeout = timeout;
+    }
+  } else { // set a default depending on the deployment mode
+    o2::framework::DeploymentMode deploymentMode = o2::framework::DefaultsHelpers::deploymentMode();


switch()...
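The `switch()` suggested here could take roughly the following shape. This is a sketch under stated assumptions rather than the merged code: the enum is a local stand-in for `o2::framework::DeploymentMode` that lists only modes named in this discussion (the enumerator `Local` in particular is assumed), `defaultTimeoutSeconds` is a hypothetical name, and the values are the ones proposed in the thread (5 s online, 1 s local, 15 s for Grid and as the fallback).

```cpp
// Stand-in for o2::framework::DeploymentMode so that the sketch compiles on
// its own. The real enum may have other or additional enumerators.
enum class DeploymentMode { Local, OnlineECS, OnlineDDS, Grid };

// Hypothetical helper: pick the default timeout (in seconds) for a mode.
long defaultTimeoutSeconds(DeploymentMode mode)
{
  switch (mode) {
    case DeploymentMode::OnlineECS:
    case DeploymentMode::OnlineDDS:
      return 5; // online should hit the local CCDB cache, so keep it short
    case DeploymentMode::Local:
      return 1; // value proposed in the thread for local running
    case DeploymentMode::Grid:
      return 15; // value agreed on for Grid jobs
  }
  return 15; // fallback, same as Grid, for any mode not listed above
}
```

Compared with a chain of `if` statements, grouping the online cases under one label keeps every mode visible in one place, and a compiler can warn when a newly added enumerator is not handled.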

[QC-1171] CCDB api should timeout after 1 sec

9af9a0b

Barthelemy requested review from costing, sawenzel and a team as code owners May 13, 2024 07:38

Curl timeout is set according to deployment mode. It can be overwritten using the env var ALICEO2_CCDB_CURL_TIMEOUT.

52868c5

davidrohr reviewed May 13, 2024

CCDB/src/CcdbApi.cxx (outdated)

Barthelemy added 2 commits May 13, 2024 14:01

removing log mesage

f1bd700

format

0c55063

add an error message specific to timeouts

c5d38b9

davidrohr previously approved these changes May 13, 2024

formatting

246b5c6

Barthelemy dismissed davidrohr’s stale review via 246b5c6 May 13, 2024 13:02

Barthelemy enabled auto-merge (squash) May 13, 2024 13:03

Barthelemy added 2 commits May 14, 2024 10:07

update the default and offline value to 15 seconds.

1921e9b

formatting

0ca4361

Barthelemy changed the title ~~[QC-1171] CCDB api should timeout after 1 sec~~ [QC-1171] CCDB api should timeout May 15, 2024

ktf reviewed May 16, 2024

ktf disabled auto-merge May 16, 2024 06:46

ktf merged commit 92ee48d into dev May 16, 2024
12 of 13 checks passed

ktf deleted the timeout-ccdb-api branch May 16, 2024 06:46
