[NativeAOT] System.Collections.Concurrent.Tests crashing on linux-arm64 #102140
Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
The common pattern: We crash during GC stackroot enumeration:
The stacktrace of the target thread looks like this:
The IP that we are enumerating the GC roots for is near the start of |
@VSadov Could you please take a look?
Looks like a GC hole. I run the concurrent collections tests with NativeAOT fairly often, so this must be something very recent.
Ah, these are on arm64. Not often, then.
So far the test has been passing for me locally. I've tried linux-arm64 and win-arm64. I will try the exact commit that was failing. Maybe the bug has been fixed since then.
Reproduced in https://dev.azure.com/dnceng-public/public/_build/results?buildId=674034&view=logs&j=2f6a7d26-0d60-5ade-d191-981fe0847989. There is a suspicious safepoint in a partially interruptible method that does not follow a call instruction.
The reason for the failure is GC reporting of an uninitialized temp. We declare a temp to hold the address of the managed TLS blob. The temp is initialized by indirecting into native TLS. In some cases, fetching the native TLS involves a method call. Roughly, it looks like:
The optimizer assumes that if it did not see a safe point between the prolog and the first assignment, then zero-initializing the temp is unnecessary. The problem is that the optimizer does not know that TLS access may emit a call, and that the call may introduce a safe point (as calls do by default). The simplest fix would be to not emit safe points for calls into native TLS. They cannot participate in GC stack walks anyway.
The reason for the "suspicious" safe point that does not follow a call is that the linker replaces the TLS pattern with completely different code that does not involve calls, so the safe point ends up not trailing a call. That by itself is OK: as long as we record correct GC info, we can make any instruction a safe point. Only method calls that can participate in a GC stack walk really have to be safe points. The only part that is not OK here is that the optimizer does not know about this pattern. So the options are either to teach the optimizer about cases where TLS access may introduce a safe point, or to make TLS_GET_ADDR() not have a safe point. The latter is simpler.
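To make the interaction concrete, here is a toy model of the reasoning above. It is only a sketch and not the actual JIT or GC-info code: `Instr`, `needs_zero_init`, and `gc_reports_garbage` are hypothetical names invented for illustration, and a method body is reduced to a list of instructions that may or may not be safe points.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: op is 'tls_get_addr' or 'assign_temp'."""
    op: str
    is_safepoint: bool = False


def needs_zero_init(body, optimizer_knows_tls_calls):
    """Model of the optimizer's decision to zero the GC-tracked temp in the prolog.

    The temp is zeroed only if the optimizer sees a safe point before the
    first assignment. A safe point on the TLS access is invisible to it
    unless optimizer_knows_tls_calls is True.
    """
    for ins in body:
        if ins.op == "assign_temp":
            return False  # assigned before any safe point the optimizer saw
        hidden = ins.op == "tls_get_addr" and not optimizer_knows_tls_calls
        if ins.is_safepoint and not hidden:
            return True
    return False


def gc_reports_garbage(body, zero_init):
    """True if a safe point is reached while the temp is still uninitialized."""
    temp_valid = zero_init
    for ins in body:
        if ins.is_safepoint and not temp_valid:
            return True  # GC would enumerate a garbage stack slot here
        if ins.op == "assign_temp":
            temp_valid = True
    return False


# The failing shape: the TLS access carries a safe point the optimizer cannot see.
buggy = [Instr("tls_get_addr", is_safepoint=True), Instr("assign_temp")]
# Simpler fix from the discussion: the TLS access is not a safe point at all.
no_safepoint = [Instr("tls_get_addr", is_safepoint=False), Instr("assign_temp")]

hole = gc_reports_garbage(buggy, needs_zero_init(buggy, False))
taught = gc_reports_garbage(buggy, needs_zero_init(buggy, True))
simple = gc_reports_garbage(no_safepoint, needs_zero_init(no_safepoint, False))
print(hole, taught, simple)
```

In this model the first scenario reports garbage, while both proposed fixes avoid it: teaching the optimizer forces the zero-init, and dropping the safe point from the TLS access removes the window in which the temp could be reported.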
This is crashing in nearly every nativeaot outer loop run:
https://dev.azure.com/dnceng-public/public/_build/results?buildId=673104&view=ms.vss-test-web.build-test-results-tab&runId=16686708&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=155821
https://dev.azure.com/dnceng-public/public/_build/results?buildId=672653&view=ms.vss-test-web.build-test-results-tab&runId=16679414&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=155211