i#6814: Fix stack overflow on signal delivery to mid-detach thread #6815

Merged: 35 commits into master from i6814-signal-detach-crash, May 22, 2024

Conversation

abhinav92003 (Contributor) commented on May 16, 2024

Fixes two stack overflow scenarios that occur when DR delivers an app signal to
the native signal handler for a thread that is mid-detach.

First case: when a thread is handling the suspend signal and is waiting for the
detacher thread to wake it up and tell it to continue detaching. Currently, DR
unblocks signals before starting the wait. If the signal is delivered at this
point, execute_native_handler() incorrectly delivers it to the native handler
on DR's own signal stack. The fix: we no longer unblock signals during this
wait, since doing so complicates native signal delivery, particularly for the
second case described below.
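
For illustration, here is a rough sketch of the shape of the first fix. This is not DR's actual code: detach_lock/detach_cond/continue_detach and app_sigmask are hypothetical stand-ins for DR's internal wait primitive and saved app mask.

```c
/*
 * Minimal sketch, NOT DR's actual code: the thread suspended for detach keeps
 * app signals blocked while it waits for the detacher thread to tell it to
 * continue.  All names below are illustrative stand-ins.
 */
#include <pthread.h>
#include <signal.h>

static pthread_mutex_t detach_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t detach_cond = PTHREAD_COND_INITIALIZER;
static int continue_detach;
static sigset_t app_sigmask; /* app's blocked-signal mask, saved at suspend time */

static void
wait_for_detacher(void)
{
    /* Old behavior (removed): unblock app signals before waiting, which let an
     * app signal arrive here and be delivered on DR's signal stack:
     *     pthread_sigmask(SIG_SETMASK, &app_sigmask, NULL);
     * New behavior: leave signals blocked for the duration of the wait. */
    pthread_mutex_lock(&detach_lock);
    while (!continue_detach)
        pthread_cond_wait(&detach_cond, &detach_lock);
    pthread_mutex_unlock(&detach_lock);
    /* The app mask is reinstated later, via the mask stored in the signal
     * frame, once the thread is off DR's signal stack (see below). */
}
```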

Additionally, for a detaching thread, we no longer explicitly restore the app's
sigblocked mask: DR already places the app's mask in the signal frame, and the
kernel reinstates it automatically when the thread returns from DR's detach
signal handler. This avoids another case where the app may still be on DR's
signal stack when the native signal is delivered.
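
The mechanism being relied on is standard Linux signal-frame behavior; the standalone demo below (not DR code) shows that editing the uc_sigmask stored in the frame is enough to control the mask in effect after the handler returns.

```c
/*
 * Standalone demo, not DR code: the ucontext_t passed to an SA_SIGINFO handler
 * holds the signal mask saved in the kernel-built signal frame (uc_sigmask),
 * and the kernel reinstates that mask when the handler returns.  No explicit
 * sigprocmask() call is needed on the way out of the handler.
 */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

static void
handler(int sig, siginfo_t *info, void *ucxt)
{
    ucontext_t *uc = (ucontext_t *)ucxt;
    /* Whatever is in uc_sigmask when we return becomes the thread's blocked
     * mask: here we arrange for SIGUSR2 to be unblocked after the handler. */
    sigdelset(&uc->uc_sigmask, SIGUSR2);
}

int
main(void)
{
    sigset_t block, cur;
    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    sigprocmask(SIG_BLOCK, &block, NULL); /* SIGUSR2 starts out blocked */

    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_sigaction = handler;
    act.sa_flags = SA_SIGINFO;
    sigaction(SIGUSR1, &act, NULL);

    raise(SIGUSR1);

    sigprocmask(SIG_SETMASK, NULL, &cur);
    /* Prints 0: the mask restored from the frame no longer blocks SIGUSR2. */
    printf("SIGUSR2 blocked after handler: %d\n", sigismember(&cur, SIGUSR2));
    return 0;
}
```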

Second case: when the thread has been woken up by the detacher thread, has
executed sig_detach, and has reinstated the app signal stack (if available). If
the signal is delivered at this point, execute_native_handler() adds a new
signal frame on top of DR's own signal frame on the app stack and invokes the
native signal handler. This consumes too much stack space and causes a stack
overflow, as observed in apps with frequent profiling signals that use the
stack-intensive libunwind to collect a stack trace for every thread. The fix:
when the app does not need a non-RT frame, we reuse the same signal frame to
deliver the signal to the native signal handler.
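
To make the frame-reuse idea concrete, here is a heavily simplified, x86-64-Linux-only sketch. It is not DR's implementation: the real logic in execute_native_handler() also copies the siginfo and app sigmask into place and handles the non-RT-frame case, all of which this toy omits.

```c
/*
 * Conceptual sketch of "reuse the existing signal frame": instead of pushing a
 * second frame for the target handler, edit the machine context saved in the
 * current frame so that sigreturn resumes directly in the target, consuming
 * only one frame's worth of stack.  x86-64 Linux only; illustrative names.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ucontext.h>

static void
target(void)
{
    /* Runs where the interrupted context would have resumed, on the stack
     * already in place; it must not "return" into the clobbered context. */
    puts("resumed in target via the reused frame");
    exit(0);
}

static void
redirecting_handler(int sig, siginfo_t *info, void *ucxt)
{
    ucontext_t *uc = (ucontext_t *)ucxt;
    /* Skip the red zone, realign to 16 bytes, and leave the 8-byte slot a
     * 'call' would have used, so target() sees a conventional stack. */
    greg_t sp = uc->uc_mcontext.gregs[REG_RSP];
    sp = ((sp - 128) & ~(greg_t)0xf) - 8;
    uc->uc_mcontext.gregs[REG_RSP] = sp;
    /* Point the saved PC at target(): sigreturn pops this same frame and
     * continues in target() rather than in a freshly built nested frame. */
    uc->uc_mcontext.gregs[REG_RIP] = (greg_t)(uintptr_t)target;
}

int
main(void)
{
    struct sigaction act;
    memset(&act, 0, sizeof(act));
    act.sa_sigaction = redirecting_handler;
    act.sa_flags = SA_SIGINFO;
    sigaction(SIGUSR1, &act, NULL);
    raise(SIGUSR1);
    return 1; /* not reached: the frame was redirected to target() */
}
```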

The new code is exercised by the existing detach_signal test. The test is also
modified so that some threads use a very small sigstack, which helps reproduce
the crash originally seen on a real app. (The detach_signal test already had a
note about using a large sigstack to avoid this stack overflow.) A rough sketch
of that test tweak follows.
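
The shape of the tweak is roughly the following; the size and helper name are illustrative, not the test's actual values.

```c
/*
 * Illustrative only: give a thread a deliberately tiny alternate signal stack,
 * so that any extra nested frame pushed during detach overflows it and the
 * original crash reproduces.
 */
#include <signal.h>
#include <stdlib.h>

static void
install_small_sigstack(void)
{
    stack_t ss;
    ss.ss_sp = malloc(MINSIGSTKSZ); /* as small as the system allows */
    ss.ss_size = MINSIGSTKSZ;
    ss.ss_flags = 0;
    sigaltstack(&ss, NULL); /* SA_ONSTACK handlers in this thread now run here */
}
```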

Tested on an internal app where failures reduce from ~136/4000 to ~1/4000.

Issue: #6814

Files with review discussion: core/dynamo.c, core/unix/os.c, core/unix/signal.c (all review threads resolved).
abhinav92003 (Contributor, Author) commented:

detach_signal seems to be failing on Github Actions, but on my local runs (on both x86 and AArch64) it works fine.

abhinav92003 (Contributor, Author) commented:

> detach_signal seems to be failing on Github Actions, but on my local runs (on both x86 and AArch64) it works fine.

The sigmask to be restored before passing control to the app handler also needed to include the app's native sigmask from the frame. Adding that fixed the failure.

abhinav92003 (Contributor, Author) commented:

The opcode_mix test is still flaky on x86-32 and needed some reruns (though x86-32 is not blocking submission).

abhinav92003 merged commit 427e33e into master on May 22, 2024; 17 checks passed.
abhinav92003 deleted the i6814-signal-detach-crash branch on May 22, 2024 at 19:34.