Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[NativeAOT] System.Collections.Concurrent.Tests crashing on linux-arm64 #102140

Closed
jkotas opened this issue May 12, 2024 · 10 comments · Fixed by #102237
Closed

[NativeAOT] System.Collections.Concurrent.Tests crashing on linux-arm64 #102140

jkotas opened this issue May 12, 2024 · 10 comments · Fixed by #102237
Labels
Milestone

Comments

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label May 12, 2024
Copy link
Contributor

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

@jkotas
Copy link
Member Author

jkotas commented May 12, 2024

The common pattern:

We crash during GC stackroot enumeration:

* thread #1, name = 'System.Collecti', stop reason = signal SIGSEGV
  * frame #0: 0x0000aaaaea9db520 System.Collections.Concurrent.Tests`MethodTable::Validate(bool) [inlined] MethodTable::HasComponentSize(this=0x7fffffff000438d0) at MethodTable.h:226:25 [opt]
    frame #1: 0x0000aaaaea9db520 System.Collections.Concurrent.Tests`MethodTable::Validate(this=0x7fffffff000438d0, assertOnFail=<unavailable>) at MethodTable.cpp:29:9 [opt]
    frame #2: 0x0000aaaaea9a8a1c System.Collections.Concurrent.Tests`WKS::CObjectHeader::Validate(int, int, int) [inlined] MethodTable::SanityCheck(this=<unavailable>) at MethodTable.h:294:40 [opt]
    frame #3: 0x0000aaaaea9a8a18 System.Collections.Concurrent.Tests`WKS::CObjectHeader::Validate(this=0x0000ffbfc6413320, bDeep=YES, bVerifyNextHeader=16, bVerifySyncBlock=16) at gc.cpp:4680:9 [opt]
    frame #4: 0x0000aaaaea9c259c System.Collections.Concurrent.Tests`WKS::GCHeap::Promote(ppObject=0x0000ffbf097ad418, sc=<unavailable>, flags=0) at gc.cpp:49232:26 [opt]
    frame #5: 0x0000aaaaea9eac78 System.Collections.Concurrent.Tests`GcInfoDecoder::EnumerateLiveSlots(this=0x0000ffbeebffda40, pRD=<unavailable>, reportScratchSlots=false, inputFlags=1, pCallBack=<unavailable>, hCallBack=<unavailable>)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:896:21 [opt]
    frame #6: 0x0000aaaaea9eddc0 System.Collections.Concurrent.Tests`UnixNativeCodeManager::EnumGcRefs(this=<unavailable>, pMethodInfo=0x0000ffbeebffdfe8, safePointAddress=<unavailable>, pRegisterSet=0x0000ffbeebffde98, hCallback=0x0000ffbeebffdb30, isActiveStackFrame=<unavailable>) at UnixNativeCodeManager.cpp:263:18 [opt]
    frame #7: 0x0000aaaaea994028 System.Collections.Concurrent.Tests`EnumGcRefs(pCodeManager=0x0000aaab044e2830, pMethodInfo=0x0000ffbeebffdfe8, safePointAddress=0x0000aaaaead28a00, pRegisterSet=0x0000ffbeebffde98, pfnEnumCallback=(System.Collections.Concurrent.Tests`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) at gc.cpp:49179), pvCallbackData=0x0000ffbeebffe190, isActiveStackFrame=<unavailable>)(Object**, ScanContext*, unsigned int), ScanContext*, bool) at GcEnum.cpp:139:19 [opt]
    frame #8: 0x0000aaaaea99a774 System.Collections.Concurrent.Tests`Thread::GcScanRootsWorker(this=0x0000ffbf097af7f0, pfnEnumCallback=(System.Collections.Concurrent.Tests`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) at gc.cpp:49179), pvCallbackData=0x0000ffbeebffe190, frameIterator=0x0000ffbeebffde78)(Object**, ScanContext*, unsigned int), ScanContext*, StackFrameIterator&) at thread.cpp:509:13 [opt]
    frame #9: 0x0000aaaaea99a3cc System.Collections.Concurrent.Tests`Thread::GcScanRoots(this=0x0000ffbf097af7f0, pfnEnumCallback=(System.Collections.Concurrent.Tests`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) at gc.cpp:49179), pvCallbackData=0x0000ffbeebffe190)(Object**, ScanContext*, unsigned int), ScanContext*) at thread.cpp:401:5 [opt]
    frame #10: 0x0000aaaaea992dec System.Collections.Concurrent.Tests`GCToEEInterface::GcScanRoots(fn=(System.Collections.Concurrent.Tests`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) at gc.cpp:49179), condemned=<unavailable>, max_gen=<unavailable>, sc=0x0000ffbeebffe190)(Object**, ScanContext*, unsigned int), int, int, ScanContext*) at gcenv.ee.cpp:122:22 [opt]
    frame #11: 0x0000aaaaea9b4590 System.Collections.Concurrent.Tests`WKS::gc_heap::mark_phase(condemned_gen_number=0) at gc.cpp:29664:9 [opt]
    frame #12: 0x0000aaaaea9b12b8 System.Collections.Concurrent.Tests`WKS::gc_heap::gc1() at gc.cpp:22184:13 [opt]
    frame #13: 0x0000aaaaea9bc2dc System.Collections.Concurrent.Tests`WKS::gc_heap::garbage_collect(n=0) at gc.cpp:0:21 [opt]
    frame #14: 0x0000aaaaea9ac520 System.Collections.Concurrent.Tests`WKS::GCHeap::GarbageCollectGeneration(this=<unavailable>, gen=0, reason=reason_alloc_soh) at gc.cpp:50733:9 [opt]
    frame #15: 0x0000aaaaea9ad8bc System.Collections.Concurrent.Tests`WKS::gc_heap::trigger_gc_for_alloc(gen_number=0, gr=reason_alloc_soh, msl=<unavailable>, loh_p=<unavailable>, take_state=<unavailable>) at gc.cpp:18766:14 [opt]
    frame #16: 0x0000aaaaea9ae65c System.Collections.Concurrent.Tests`WKS::gc_heap::try_allocate_more_space(acontext=0x0000ffbeebfff7f0, size=24, flags=2, gen_number=0) at gc.cpp:18904:34 [opt]
    frame #17: 0x0000aaaaea9d6270 System.Collections.Concurrent.Tests`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) [inlined] WKS::gc_heap::allocate_more_space(acontext=0x0000ffbeebfff7f0, size=24, flags=2, alloc_generation_number=0) at gc.cpp:19404:18 [opt]
    frame #18: 0x0000aaaaea9d625c System.Collections.Concurrent.Tests`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) at gc.cpp:19435:19 [opt]
    frame #19: 0x0000aaaaea9d6234 System.Collections.Concurrent.Tests`WKS::GCHeap::Alloc(this=0x0000aaab044daeb0, context=0x0000ffbeebfff7f0, size=24, flags=2) at gc.cpp:49657:34 [opt]
    frame #20: 0x0000aaaaea991930 System.Collections.Concurrent.Tests`GcAllocInternal(pEEType=0x0000aaaaeb9f5aa8, uFlags=2, numElements=0, pThread=0x0000ffbeebfff7f0) at GCHelpers.cpp:542:53 [opt]
    frame #21: 0x0000aaaaea9f36ec System.Collections.Concurrent.Tests`RhpNewObject at AllocFast.S:88
    frame #22: 0x0000aaaaeb09cdd4 System.Collections.Concurrent.Tests`xunit_assert_Xunit_Sdk_AssertEqualityComparer_1<Int32>___ctor(this=0x0000ffbfc6c27c60, innerComparer=0x0000000000000000) at ParallelLoopState.cs:400
    frame #23: 0x0000aaaaeab30700 System.Collections.Concurrent.Tests`System_Collections_Concurrent_Tests_System_Collections_Concurrent_Tests_ProducerConsumerCollectionTests___c__DisplayClass40_0___BlockingCollection_WrappingCollection_ExpectedElementsTransferred_b__3(this=0x0000ffbfc6413198) at ProducerConsumerCollectionTests.cs:507

The stacktrace of the target thread looks like this:

    frame #5: 0x0000aaaaea9e61d8 System.Collections.Concurrent.Tests`GCEvent::Impl::Wait(this=0x0000aaab044dcbf0, milliseconds=4294967295, alertable=<unavailable>) at events.cpp:149:22 [opt]
    frame #6: 0x0000aaaaea99af7c System.Collections.Concurrent.Tests`Thread::InlineSuspend(UNIX_CONTEXT*) [inlined] Thread::WaitForGC(this=0x0000ffbf097af7f0, pTransitionFrame=0xfffffffffffffffe) at thread.cpp:80:39 [opt]
    frame #7: 0x0000aaaaea99af1c System.Collections.Concurrent.Tests`Thread::InlineSuspend(this=0x0000ffbf097af7f0, interruptedContext=<unavailable>) at thread.cpp:876:5 [opt]
    frame #8: 0x0000aaaaea9e3a28 System.Collections.Concurrent.Tests`ActivationHandler(code=34, siginfo=0x0000ffbf097ad240, context=0x0000ffbf097ad2c0) at PalRedhawkUnix.cpp:1004:9 [opt]
    frame #9: 0x0000ffffbeff678c linux-vdso.so.1
    frame #10: 0x0000aaaaead33370 System.Collections.Concurrent.Tests`System.Threading.SemaphoreSlim__Wait_4(this=0x0000000000000000, millisecondsTimeout=<unavailable>, cancellationToken=<unavailable>) at SemaphoreSlim.cs:344
    frame #11: 0x0000aaaaead33370 System.Collections.Concurrent.Tests`System.Threading.SemaphoreSlim__Wait_4(this=0x0000ffbfc64132f8, millisecondsTimeout=0, cancellationToken=<unavailable>) at SemaphoreSlim.cs:344
    frame #12: 0x0000aaaaeb07e450 System.Collections.Concurrent.Tests`System_Collections_Concurrent_System_Collections_Concurrent_BlockingCollection_1<Int32>__TryTakeWithNoTimeValidation(this=0x0000ffbfc64131d0, item=0x0000ffbf097ae5f8, millisecondsTimeout=100000, cancellationToken=<unavailable>, combinedTokenSource=0x0000000000000000) at BlockingCollection.cs:663

The IP that we are enumerating the GC roots for is near the start of Monitor.Enter that is called by SemaphoreSlim.Wait.

@jkotas
Copy link
Member Author

jkotas commented May 12, 2024

@VSadov Could you please take a look?

@jkotas jkotas added the blocking-clean-ci-optional Blocking optional rolling runs label May 12, 2024
@VSadov
Copy link
Member

VSadov commented May 13, 2024

Looks like a GC hole. I run concurrent collection test with NativeAOT fairly often, so this must be something very recent.
I will take a look.

@VSadov
Copy link
Member

VSadov commented May 13, 2024

Ah these are on arm64. Not often then.
Will take a look.

@agocke agocke added this to the 9.0.0 milestone May 13, 2024
@VSadov
Copy link
Member

VSadov commented May 13, 2024

So far the test has been passing for me locally. I've tried lin-arm64 and win-arm64.

I will try the exact commit which was failing. Maybe the bug got fixed since then.

@VSadov
Copy link
Member

VSadov commented May 13, 2024

@VSadov
Copy link
Member

VSadov commented May 13, 2024

reproduced in https://dev.azure.com/dnceng-public/public/_build/results?buildId=674034&view=logs&j=2f6a7d26-0d60-5ade-d191-981fe0847989

There is a suspicious safepoint in a partially interruptible method that is not following a call instruction.

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label May 14, 2024
@VSadov
Copy link
Member

VSadov commented May 15, 2024

The reason for the failure is GC-reporting of uninitialized temp.

We declare a temp to hold the address to the managed TLS blob. The temp is initialized by indirecting into native TLS. In some cases fetching the native TLS involves a method call.

Roughly it looks like:

managed_ptr tlsRoot;

// unmanaged TLS pattern starts
...
...
unmanaged_ptr nativeTLS = TLS_GET_ADDR();   
nativeTLS += someAdjustment;                           // if this is a safe point, tlsRoot must be zero-inited
// unmanaged TLS pattern ends

// only here tlsRoot gets a real value
tlsRoot = nativeTLS[some indirection];

The optimizer assumes that if we did not see a safe point between prolog and the first assignment, then zero-initing is unnecessary. The problem is that optimizer does not know that TLS access may emit a call, and that call may introduce a safe point (as calls do by default).

A simplest fix would be to not emit safepoints for calls into native TLS. They cannot participate in GC stackwalks anyways.

@VSadov
Copy link
Member

VSadov commented May 15, 2024

The reason for the "suspicious" safe point that does not follow a call is that linker replaces the TLS pattern with completely different code that does not involve calls. So the safe point ends up not trailing a call. That by itself is ok, as long as we record correct GC info we can make any instruction a safe point.

Only method calls that can participate in GC stack walk really have to be safe points.
In this case we see a call, that cannot possibly be a part of GC stackwalk. It may not even be a call after the linker relaxation. We do not need to have a safe point on this call.
It would still be ok to have a safe point here. It is not uncommon to see safepoints on calls that are not really GC-capable. We do not do very thorough analysis for that, because it is generally ok.

The only part that is not ok here is that optimizer does not know about this pattern. So it is either - teach optimizer about cases where TLS may introduce a safe point, or make TLS_GET_ADDR() not have a safe point. The latter is simpler.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: No status
Development

Successfully merging a pull request may close this issue.

4 participants