Stack overflow crash on app signal delivery to gone-native threads #6814

abhinav92003 · 2024-05-16T17:36:59Z

DR may attempt to execute the native app signal handler for a thread in the middle of detach at the following points:

it is waiting for the detacher thread to wake it up and continue sig_detach (

dynamorio/core/unix/signal.c

Line 8598 in a559edc

while (ksynch_get_value(&ostd->wakeup) == 0) {

)
it has completed sig_detach and reinstated the app sigblocked mask, but DR's signal handler is still installed

In both these cases execute_native_handler puts the native signal frame on top of DR's own signal frame, and notes the potential of a stack overflow

dynamorio/core/unix/signal.c

Line 3524 in a559edc

/* Handle DR's frame already being on the app stack. For native delivery we

In case (1), the stack overflow may happen on DR's own signal stack, whereas in (2) it may happen on the app stack (signal stack or regular). We're seeing this happen on some large apps that deal with frequent profiling signals that use libunwind to get the thread's stack trace (libunwind is known to have high stack usage).

The solution:

(1) do not unblock signals when detaching
(2) when the native signal handler has to be executed using the same stack as the one DR's signal handler is currently on, we can just reuse the same frame and pass control on to the app's signal handler

Fixes two stack overflow scenarios caused when DR delivers an app signal to the native signal handler for a thread that is in mid-detach. First case: when a thread is handling the suspend signal and is waiting for the detacher thread to wake it up and tell it to continue detaching. Currently, DR unblocks signals before starting the wait. If the signal is delivered at this point, currently execute_native_handler() incorrectly delivers the signal to the native handler on DR's own signal stack. To fix this: we now do not unblock signals during this wait as it complicates native signal delivery, particularly for the second case described below. Additionally, for a detaching thread, we now do not explicitly restore the app's sigblocked mask; DR already restores the mask on the signal frame, which would be restored automatically when the thread returns from the DR detach signal handler. This avoids another case where the app may be on DR's signal stack when the native signal is delivered. Second case: when the thread has been woken up by the detacher thread, executed sig_detach, and reinstated the app signal stack (if available). If the signal is delivered at this point, execute_native_handler() adds a new signal frame on top of DR's own signal frame and invokes the native signal handler. This ends up taking too much stack space which causes a stack overflow, as observed on apps with frequent profiling signals that use the stack-intensive libunwind to get a stack trace for all threads. To fix this: we reuse the same signal frame for delivering the signal to the native signal handler. Tested on an internal app where failures reduce from ~136/4000 to ~1/4000. Issue: #6814

…6815) Fixes two stack overflow scenarios that occur when DR delivers an app signal to the native signal handler for a thread that is mid-detach. First case: when a thread is handling the suspend signal and is waiting for the detacher thread to wake it up and tell it to continue detaching. Currently, DR unblocks signals before starting the wait. If the signal is delivered at this point, currently execute_native_handler() incorrectly delivers the signal to the native handler on DR's own signal stack. To fix this: we now do not unblock signals during this wait as it complicates native signal delivery, also for the second case described below. Additionally, for a detaching thread, we now do not explicitly restore the app's sigblocked mask; DR already restores the mask on the signal frame, which would be restored automatically when the thread returns from the DR detach signal handler. This avoids another case where the app may be on DR's signal stack when the native signal is delivered. Second case: when the thread has been woken up by the detacher thread, executed sig_detach, and reinstated the app signal stack (if available). If the signal is delivered at this point, execute_native_handler() adds a new signal frame on top of DR's own signal frame on the app stack and invokes the native signal handler. This sometimes ends up taking too much stack space which causes a stack overflow, as observed on an internal app with frequent profiling signals that use the stack-intensive libunwind to get a stack trace for all threads. To fix this: we reuse the same signal frame for delivering the signal to the native signal handler, when the app doesn't need a non-RT frame. The new code is exercised by the existing detach_signal test. Also modified the test to have some threads that have a very small sigstack, which helps reproduce the crash originally seen on a real app. (There was already a note in detach_signal test about using a large sigstack to avoid this stack overflow.) Tested on an internal app where failures reduce from ~136/4000 to ~1/4000. Issue: #6814

Sets the x30 link register on AArch64 appropriately to dynamorio_sigreturn when reusing the current frame for native signal delivery. The missing x30 value caused issues in an internal test. This doesn't affect the detach_signal test likely because it uses a longjmp to return from the signal, not a sigreturn. Issue: #6814

Sets the x30 link register on AArch64 appropriately to dynamorio_sigreturn when reusing the current frame for native signal delivery. The missing x30 value doesn't affect our existing detach_signal test because it uses a longjmp to return from the signal, not a sigreturn. Modifies detach_signal to fail without this fix. Issue: #6814

abhinav92003 · 2024-05-29T02:01:25Z

Xref #6830 that fixes another related scenario.

abhinav92003 added the Bug-DRCrash label May 16, 2024

abhinav92003 mentioned this issue May 16, 2024

i#6814: Fix stack overflow on signal delivery to mid-detach thread #6815

Merged

abhinav92003 mentioned this issue May 24, 2024

i#6814: Set AArch64 link reg for native sig delivery #6827

Merged

abhinav92003 closed this as completed May 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Stack overflow crash on app signal delivery to gone-native threads #6814

Stack overflow crash on app signal delivery to gone-native threads #6814

abhinav92003 commented May 16, 2024

abhinav92003 commented May 29, 2024

Stack overflow crash on app signal delivery to gone-native threads #6814

Stack overflow crash on app signal delivery to gone-native threads #6814

Comments

abhinav92003 commented May 16, 2024

abhinav92003 commented May 29, 2024