runtime: go sometimes hangs under qemu-arm-static since 1.21 #67355

Closed
martinetd opened this issue May 14, 2024 · 4 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

Comments

@martinetd

Go version

go version devel go1.23-8623c0ba95 Mon May 13 21:47:29 2024 +0000 linux/arm

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='arm'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/goroot'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/goroot/pkg/tool/linux_arm'
GOVCS=''
GOVERSION='devel go1.23-8623c0ba95 Mon May 13 21:47:29 2024 +0000'
GODEBUG=''
GCCGO='gccgo'
GOARM='7,hardfloat'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -marm -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build4253476286=/tmp/go-build -gno-record-gcc-switches'

What did you do?

  1. set up qemu-user, e.g. apt install qemu-user-static binfmt on Debian
  2. $ docker run --rm -ti docker.io/arm32v7/alpine:latest
  3. / # apk add go
  4. / # for i in $(seq 1 100); do go env GOPATH; done

What did you see happen?

go hangs in about 1 in 10 runs

(I wasn't sure if I should have commented on #1508 but it's a new regression so figured it'd be easier to discuss separately)

Some more facts:

  • using binaries from https://go.dev/dl/ on debian, go 1.21.10 and 1.22.3 reproduce this; go 1.20.14 doesn't. I tried building from source, but my cross-built 1.20.14 also reproduces the hang, so I'll try to build differently tomorrow...
  • doesn't happen while pinning a cpu e.g. running under taskset -c 1
  • doesn't happen in qemu-system-arm with multiple cpus
  • (there might be a segfault from time to time, that isn't new with this version, but that's both much rarer and easier to work around than a hang for me)
  • Unfortunately I cannot get a stacktrace of where exactly it is stuck (delve doesn't support linux/arm, and I cannot get gdb to work nor reproduce with strace...); sigquit produces the following:
^\SIGQUIT: quit
PC=0x95d74 m=1 sigcode=128

goroutine 0 gp=0xc02368 m=1 mp=0xc62008 [idle]:
runtime.usleep(0x2710)
	/goroot/src/runtime/sys_linux_arm.s:577 +0x2c fp=0xc25f7c sp=0xc25f6c pc=0x95d74
runtime.sysmon()
	/goroot/src/runtime/proc.go:5906 +0xd0 fp=0xc25fd8 sp=0xc25f7c pc=0x63b34
runtime.mstart1()
	/goroot/src/runtime/proc.go:1773 +0x7c fp=0xc25fe8 sp=0xc25fd8 pc=0x59750
runtime.mstart0()
	/goroot/src/runtime/proc.go:1730 +0x7c fp=0xc25ffc sp=0xc25fe8 pc=0x596c4
runtime.mstart()
	/goroot/src/runtime/asm_arm.s:210 +0x8 fp=0xc26000 sp=0xc25ffc pc=0x92f54

goroutine 1 gp=0xc02128 m=0 mp=0xb322e0 [running, locked to thread]:
	goroutine running on other thread; stack unavailable

goroutine 2 gp=0xc02488 m=nil [force gc (idle)]:
runtime.gopark(0x744abc, 0xb302c8, 0x11, 0xa, 0x1)
	/goroot/src/runtime/proc.go:401 +0x104 fp=0xc5efd4 sp=0xc5efc0 pc=0x561f0
runtime.goparkunlock(...)
	/goroot/src/runtime/proc.go:407
runtime.forcegchelper()
	/goroot/src/runtime/proc.go:325 +0xe4 fp=0xc5efec sp=0xc5efd4 pc=0x5602c
runtime.goexit({})
	/goroot/src/runtime/asm_arm.s:884 +0x4 fp=0xc5efec sp=0xc5efec pc=0x94b7c
created by runtime.init.5 in goroutine 1
	/goroot/src/runtime/proc.go:313 +0x1c

trap    0x0
error   0x0
oldmask 0x0
r0      0xfffffffc
r1      0x0
r2      0x0
r3      0x2710
r4      0x0
r5      0x0
r6      0x0
r7      0xa2
r8      0x0
r9      0x0
r10     0xc02368
fp      0x2710
ip      0xc02368
sp      0xc25f6c
lr      0x95d54
pc      0x95d74
cpsr    0x20000010
fault   0x0

-- can we get anything useful out of that pc, perhaps?
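Some information can be read off the register dump directly. On 32-bit ARM Linux (EABI), r7 carries the syscall number and r0 the return value, with failures reported as a negative errno. The sketch below decodes the two values from the dump above; the constants (162 for nanosleep, 4 for EINTR) are filled in from the ARM syscall table and errno list, not from the report, and `syscallResult` is a hypothetical helper name.

```go
package main

import "fmt"

// Values assumed from the Linux ARM EABI; only the two needed here are listed.
const (
	sysNanosleep = 162 // nanosleep syscall number on 32-bit ARM
	eintr        = 4   // EINTR
)

// syscallResult interprets r0 after a syscall returns. The kernel reports
// failures as small negative numbers in the range -4095..-1 (-errno).
func syscallResult(r0 uint32) (errno int, failed bool) {
	v := int32(r0)
	if v < 0 && v >= -4095 {
		return int(-v), true
	}
	return 0, false
}

func main() {
	// Register values copied from the SIGQUIT dump.
	r0, r7 := uint32(0xfffffffc), uint32(0xa2)
	errno, failed := syscallResult(r0)
	fmt.Printf("r7=%d (nanosleep: %v)\n", r7, r7 == sysNanosleep)
	fmt.Printf("r0: failed=%v errno=%d (EINTR: %v)\n", failed, errno, errno == eintr)
}
```

Under those assumptions the dump shows a nanosleep call interrupted by the signal, which agrees with the `runtime.usleep` frame under `runtime.sysmon`. In other words, this pc belongs to the monitor thread (m=1), not to the goroutine that is actually stuck on m=0.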

I'm honestly not familiar with go at all, but I can probably find my way - I'll probably keep digging a bit trying to build a "good" version so I can bisect, but if there's a more straightforward way to get the running stack trace I'll be happy to try messing around that part of the code.

Thanks!

What did you expect to see?

go doesn't get stuck.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label May 14, 2024
@mauri870
Member

In my experience QEMU has always been very buggy. We also don't have a builder set up for QEMU, so we can't make any promises about compatibility. See the porting policy.

I'm not sure if we can do anything about this.

cc @golang/runtime

@martinetd
Author

qemu has made real progress. It's true there were many bugs when I last used it ~10 years ago, but these past few years the only problems I see are with go, and until 1.21 they were easy to ignore (rare, and, while ugly, just wrapping the whole build script in a retry loop was good enough).
This regression is much more likely to happen and requires adding timeouts in various places, so I'd rather try to fix it instead.

I'm not asking for stability promises or any fix here, just some direction to help debug this -- it's entirely possible the fix would go into qemu side, but I can't do anything if I don't understand what goes wrong in the first place...

I guess that since I can build a broken version I can sprinkle print statements and see where it hangs, but given that it looks like a race and I can't reproduce it under strace, it's likely print statements will also make it go away.
So specifically, regardless of qemu:

  • with delve unavailable (not supporting linux/arm), is there a way to get that "goroutine running on other thread; stack unavailable" stack?
  • is there any low-overhead tracing that'd work on arm?

Thanks!

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 14, 2024
@martinetd
Author

Grmpf. A coworker reported another problem with our armv7 builder, and upon further inspection the binfmt entry for qemu-arm was weird (as in, different from another debian system where I couldn't reproduce the unsupported syscall that came up in the new error):

enabled
interpreter /usr/bin/qemu-arm-static
flags: OCF
offset 0
magic 7f454c4601010100000000000000000002002800
mask ffffffffffffff00fffffffffffffffffeffffff

instead of:

enabled
interpreter /usr/libexec/qemu-binfmt/arm-binfmt-P
flags: POCF
offset 0
magic 7f454c4601010100000000000000000002002800
mask ffffffffffffff00fffffffffffffffffeffffff

(P flag to preserve argv0 of the emulated binary and interpreter path changed)
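For readers unfamiliar with binfmt_misc entries: the kernel compares the first bytes of an executable (starting at `offset`) against `magic`, looking only at the bits set in `mask`. The sketch below applies that rule to the magic and mask shown above. The helper names (`matches`, `mustHex`, `elfHeader`) are hypothetical, and the header builder only fills in the ELF fields the entry inspects.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// matches applies the binfmt_misc rule: every header byte must equal the
// corresponding magic byte in all bit positions selected by the mask.
func matches(header, magic, mask []byte) bool {
	if len(header) < len(magic) || len(mask) != len(magic) {
		return false
	}
	for i := range magic {
		if (header[i]^magic[i])&mask[i] != 0 {
			return false
		}
	}
	return true
}

// mustHex decodes a hex string and panics on malformed input.
func mustHex(s string) []byte {
	b, err := hex.DecodeString(s)
	if err != nil {
		panic(err)
	}
	return b
}

// elfHeader builds the first 20 bytes of a 32-bit little-endian ELF header
// with the given low bytes of e_type (offset 16) and e_machine (offset 18).
func elfHeader(eType, eMachine byte) []byte {
	h := make([]byte, 20)
	copy(h, []byte{0x7f, 'E', 'L', 'F', 1, 1, 1})
	h[16] = eType
	h[18] = eMachine
	return h
}

func main() {
	magic := mustHex("7f454c4601010100000000000000000002002800")
	mask := mustHex("ffffffffffffff00fffffffffffffffffeffffff")

	fmt.Println("ARM ET_EXEC:", matches(elfHeader(2, 0x28), magic, mask))
	fmt.Println("ARM ET_DYN: ", matches(elfHeader(3, 0x28), magic, mask))
	fmt.Println("x86-64:     ", matches(elfHeader(2, 0x3e), magic, mask))
}
```

The `fe` byte in the mask ignores the lowest bit of `e_type`, so both fixed-position (type 2) and position-independent (type 3) ARM executables are handed to the interpreter, and the `00` byte ignores the OS ABI field. Since magic and mask are identical in both entries, the difference between them is confined to the interpreter path and the flags.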

After fixing this I noticed qemu-arm-static has a -strace flag and various trace options, so I wanted to reproduce by invoking qemu-user directly, but I couldn't reproduce... I tried again from the same container and the bug is gone.

So that would be an issue related to how qemu is registered, I guess? I got confused because the go version update coincided with the alpine update since our last builds, where I couldn't reproduce this, and older go didn't exhibit the hang, so this was all weird...

Either way, go's change of behavior probably really exists and could likely be tracked down if I had time to restore the old binfmt registration, but I'll skip this exercise for now -- I'll reopen this or open a new issue if something similar turns up. Thanks for the quick reply anyhow!

(by the way, the segfault I was seeing occasionally also seems to have stopped reproducing, so qemu-user must have improved in the past couple of years and actually does even better than I was expecting now that it's properly set up...)

@prattmic
Member

is there a way to get that "goroutine running on other thread; stack unavailable" stack?

For future reference, if you set GOTRACEBACK=crash, the signal handler will explicitly send a signal to every thread to dump its own stack, which should get you a stack from the running goroutine.
