infinite loop when shutting down #2537
Comments
Hi Freedge! Thanks a lot for reporting this! I am trying to reproduce it. Just to confirm: do you observe the same with 3.0-dev7-ab7f05-82 2024/04/17? Many thanks in advance,
So for my reproducer I am using the main branch, which is at commit ab7f05d at the moment, and I do indeed read "3.0-dev7-ab7f05-82" as the version. But of course I had to add one line with a sleep to make it fail as intended. I can work on a reproducer with just gdb or stap if you believe there's some value in it. On OpenShift the problem is reproduced with haproxy26-2.6.13-2.rhaos4.14.el8. The perf stat output and gdb backtrace are identical there. Overall, it was visible because of the high "system" CPU consumption on the node.
Ok, thanks for the details! The steps you've provided and a patch to get stable behavior are completely sufficient for the moment, thanks for this!
Hi Freedge! Could you please share your glibc version with us (ldd --version)? Regards,
on my centos VM: ldd (GNU libc) 2.34 from glibc-common-2.34-105.el9.x86_64
Hi Freedge! Could you please try to compile haproxy with this flag: DEBUG=-DDEBUG_THREAD, and reproduce the infinite loop: make TARGET=linux-glibc USE_OPENSSL=0 USE_ZLIB=0 USE_PCRE=0 USE_LUA=0 USE_EPOLL=1 DEBUG=-DDEBUG_THREAD This will enable the dump of lock statistics to stderr by the main thread in wait_for_threads_completion() here. This could provide more insights. Kind regards,
hi! This should be trivial to reproduce. As per the code you link, show_lock_stats will only log once pthread_join completes, which is not the case here. But I can trigger this from gdb if you feel it would be useful:
Hi Freedge! Could you please try the patch below and give us your feedback?
Kind regards,
hi, I tried it and it indeed fixes the infinite loop.
hi, why is this closed as completed?
can you reopen this?
This patch fixes the commit eea152e ("BUG/MINOR: signals/poller: ensure wakeup from signals"). There is some probability that run_poll_loop() becomes infinite if TH_FL_SLEEPING is withdrawn from all threads in the second signal_queue_len check, when a signal is received just after the first one. In this particular case, the 'wake' variable, which is used to terminate the thread's poll loop, is never reset to 0. So we never enter the "stopping" part of run_poll_loop(), and threads other than the one with id 0 (tid 0 handles signals) will continue to call _do_poll() eternally and will never sleep, as their TH_FL_SLEEPING flag was unset. This flag needs to be removed only for tid 0, as is done in the first signal_queue_len check. This fixes issue #2537 "infinite loop when shutting down". This fix must be backported to every stable version.
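To make the description above easier to follow, here is a minimal sketch of the sleep decision as the commit message describes it. It is not the HAProxy source: `model_thread`, `MODEL_FL_SLEEPING`, `model_decide_wake()` and the `with_fix` switch are hypothetical names introduced only for illustration, and the real run_poll_loop() does considerably more (atomics, task checks, the stopping logic).

```c
#include <stdbool.h>

/* Hypothetical stand-in for HAProxy's per-thread TH_FL_SLEEPING flag. */
#define MODEL_FL_SLEEPING 0x1u

/* Hypothetical, heavily reduced view of a thread's state. */
struct model_thread {
    int tid;
    unsigned int flags;
};

/*
 * Simplified model of the "may this thread sleep?" decision.
 * Returns the value of 'wake': 1 means the thread keeps polling without
 * sleeping, 0 means it may sleep or proceed to the "stopping" path.
 * with_fix selects between the behavior described as buggy (false)
 * and the behavior described as fixed (true).
 */
static int model_decide_wake(struct model_thread *th, int signal_queue_len,
                             bool has_tasks, bool with_fix)
{
    int wake = 1;

    if (has_tasks)
        return wake;

    /* First check: tid 0 handles signals, so only tid 0 stays awake. */
    if (signal_queue_len && th->tid == 0) {
        th->flags &= ~MODEL_FL_SLEEPING;
        return wake;
    }

    th->flags |= MODEL_FL_SLEEPING;

    /* Second check, repeated after the flag is set.
     * Without the fix it applies to every thread, so a non-zero tid
     * loses its sleeping flag and 'wake' is never reset to 0. */
    if (signal_queue_len && (!with_fix || th->tid == 0))
        th->flags &= ~MODEL_FL_SLEEPING;
    else
        wake = 0;

    return wake;
}
```

In this model, a thread with tid 1 and a pending signal gets wake == 1 without the fix (it keeps spinning), and wake == 0 with the fix, while tid 0 stays awake in both cases to handle the signal.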
The fix was merged and will be backported soon.
@vkssv please don't close tickets until the fixes are backported to all the branches concerned.
Backport of the patch above, with the same commit message. (cherry picked from commit 4a9e3e1) Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com>
Backport of the patch above, with the same commit message. (cherry picked from commit 4a9e3e1) Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com> (cherry picked from commit 02819d2) Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com>
Backport of the patch above, with the same commit message. (cherry picked from commit 4a9e3e1) Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com> (cherry picked from commit 02819d2) Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com> (cherry picked from commit c58e4a4) [ad: context adjustment] Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com> (cherry picked from commit 722b22c) Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com>
Backport of the patch above, with the same commit message. (cherry picked from commit 4a9e3e1) Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com> (cherry picked from commit 02819d2) Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com> (cherry picked from commit c58e4a4) [ad: context adjustment] Signed-off-by: Amaury Denoyelle <adenoyelle@haproxy.com>
Detailed Description of the Problem
This was originally observed on a haproxy 2.6 deployment in OpenShift.
When shutting down, we observe a race condition where the thread with tid=0 waits for the other threads to complete, while some of the remaining threads loop and take 100% CPU making epoll_wait and clock_gettime syscalls.
Expected Behavior
haproxy should shut down cleanly
Steps to Reproduce the Behavior
make TARGET=linux-glibc USE_OPENSSL=0 USE_ZLIB=0 USE_PCRE=0 USE_LUA=0 USE_EPOLL=1
./haproxy -f examples/quick-test.cfg -p ${XDG_RUNTIME_DIR}/haproxy.pid -d
while sleep 1 ; do ps -p `pgrep haproxy` -T -o tid,wchan; done
pkill -USR1 haproxy
pkill -USR1 haproxy
Do you have any idea what may have caused this?
I believe some sort of race condition
Do you have an idea how to solve the issue?
No response
What is your configuration?
haproxy from main branch
Output of haproxy -vv
Last Outputs and Backtraces
Additional Information