Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Malicious Stream Causing GPU Crashes and Node Disruptions #2970

Open
FranckUltima opened this issue Mar 5, 2024 · 6 comments
Open

Malicious Stream Causing GPU Crashes and Node Disruptions #2970

FranckUltima opened this issue Mar 5, 2024 · 6 comments
Labels
status: triage this issue has not been evaluated yet

Comments

@FranckUltima
Copy link

Several orchestrators have noticed, in recent weeks, the existence of one or more streams causing GPU crashes (cuda error) which leads to the loss of all running streams, and even to the crash of the transcoder/node. The stream then switches to another orchestrator, causing the same results.

I am concerned about the impact of this kind of event on the quality of the network, and the possibility that such streams could also be voluntarily injected into the network.

Following the discussion yesterday evening on the watercooler chat, Doug considered the possibility that the Broadcaster node could identify the stream and delete it automatically to prevent it from spreading, but if the stream was injected voluntarily, it might be better to be able to detect it and block it at the orchestrator level.

I have not been able to identify the responsible stream, and this event remains rather rare, but when it happens, several orchestrators usually report it on Discord.

@github-actions github-actions bot added the status: triage this issue has not been evaluated yet label Mar 5, 2024
@thomshutt
Copy link
Contributor

@FranckUltima I'd assume for now that it's a problem with the stream rather than something malicious but let's try to get to the bottom of it as soon as w e can. Please update the ticket once you have more details

@eliteprox
Copy link
Contributor

Here's logs for a stream that crashed both transcoder processes behind my orchestrator

Transcoder

[AVHWDeviceContext @ 0x7f4ae3d9c600] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_NOT_PERMITTED: operation not permitted
ERROR: decoder.c:313] Unable to open hardware context for decoding : Unknown error occurred
ERROR: decoder.c:348] Unable to open video decoder : Error number -1448234581 occurred
E0306 06:53:47.127451       1 ffmpeg.go:986] Transcoder Return : Unrecoverable state, restart process
E0306 06:53:47.127523       1 ot_rpc.go:197] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c taskId=560581 Transcoding done for taskId=560581 url=https://100.4.105.177:8933/stream/dffd916c/13738.tempfile dur=45.193225ms err="Unrecoverable state, restart process"
E0306 06:53:47.127537       1 ot_rpc.go:252] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c taskId=560581 Unable to transcode err="Unrecoverable state, restart process"
panic: Unrecoverable state, restart process

goroutine 554202 [running]:
github.com/livepeer/go-livepeer/server.runTranscode.func2()
        /src/server/ot_rpc.go:200 +0x2a
github.com/livepeer/go-livepeer/server.runTranscode(0xc000254820, {0xc00060e008, 0x19}, 0xc00081c000?, 0xc000730540)
        /src/server/ot_rpc.go:203 +0x1002
github.com/livepeer/go-livepeer/server.runTranscoder.func2()
        /src/server/ot_rpc.go:135 +0x36
created by github.com/livepeer/go-livepeer/server.runTranscoder
        /src/server/ot_rpc.go:134 +0x7ab

Orchestrator

E0306 06:53:46.476627       1 ot_rpc.go:361] Transcoding error for taskId=560580 err="Unrecoverable state, restart process"
E0306 06:53:46.476665       1 orchestrator.go:846] manifestID=manifest-chlrgw18 seqNo=0 orchSessionID=8aa951b6 clientIP=79.127.219.2 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=172.18.0.1:49808 segments=0 taskId=560580 fname=https://100.4.105.177:8933/stream/8aa951b6/0.tempfile dur=29.021356ms err="Unrecoverable state, restart process"
E0306 06:53:46.476676       1 orchestrator.go:592] manifestID=manifest-chlrgw18 seqNo=0 orchSessionID=8aa951b6 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E clientIP=79.127.219.2 Error transcoding segName=https://100.4.105.177:8933/stream/8aa951b6/0.tempfile err="Unrecoverable state, restart process"
E0306 06:53:46.476818       1 segment_rpc.go:230] manifestID=manifest-chlrgw18 seqNo=0 orchSessionID=8aa951b6 clientIP=79.127.219.2 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="Unrecoverable state, restart process"
E0306 06:53:46.612836       1 orchestrator.go:927] Stream closed for transcoder=172.18.0.1:49808, err="context canceled"
I0306 06:53:46.612858       1 orchestrator.go:945] Got transcoder=172.18.0.1:49808 eof, removing from live transcoders map
E0306 06:53:47.127783       1 ot_rpc.go:361] Transcoding error for taskId=560581 err="Unrecoverable state, restart process"
E0306 06:53:47.127835       1 orchestrator.go:846] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c clientIP=79.127.199.49 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Successfully received results from remote transcoder=172.18.0.1:49792 segments=0 taskId=560581 fname=https://100.4.105.177:8933/stream/dffd916c/13738.tempfile dur=53.993053ms err="Unrecoverable state, restart process"
E0306 06:53:47.127848       1 orchestrator.go:592] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E clientIP=79.127.199.49 Error transcoding segName=https://100.4.105.177:8933/stream/dffd916c/13738.tempfile err="Unrecoverable state, restart process"
E0306 06:53:47.128030       1 segment_rpc.go:230] manifestID=c632474d-50b3-4717-bbfc-03d7994f9fc4 seqNo=13738 orchSessionID=dffd916c clientIP=79.127.199.49 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Could not transcode err="Unrecoverable state, restart process"
E0306 06:53:47.276797       1 orchestrator.go:927] Stream closed for transcoder=172.18.0.1:49792, err="context canceled"

@papabear99
Copy link
Contributor

Scrolling back through my logs, I found this on my NYC combined OT (running in docker) that caused the container to restart

I0305 02:46:39.839962       1 orchestrator.go:612] manifestID=ac901a60-bab6-4dd9-9a59-15626f3ca54c seqNo=28159 orchSessionID=26e483d7 clientIP=79.127.221.98 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E Transcoded segment profile=720p bytes=147392
[AVHWDeviceContext @ 0x7f8d0d453c00] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_NOT_PERMITTED: operation not permitted
ERROR: decoder.c:313] Unable to open hardware context for decoding : Unknown error occurred
ERROR: decoder.c:348] Unable to open video decoder : Error number -1448234581 occurred
E0305 02:46:39.860655       1 ffmpeg.go:990] Transcoder Return : Unrecoverable state, restart process
I0305 02:46:39.862226       1 lb.go:216] manifestID=manifest-ob5vqh6p seqNo=979 orchSessionID=f471acc1 clientIP=79.127.219.9 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Stopping transcoder due to error for key=f471acc1_0
I0305 02:46:39.862265       1 lb.go:139] manifestID=manifest-ob5vqh6p seqNo=979 orchSessionID=f471acc1 clientIP=79.127.219.9 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Deleted transcode session for key=f471acc1_0
panic: Unrecoverable state, restart process
goroutine 15397716 [running]:
github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSeg(0xc000176000, {0x89d2d0, 0xc00ba64840}, {{0x8a39d8?, 0xc00165f340?}, {0x8a39d8?, 0xc00165f340?}}, 0xc00165f300, 0xc00f5809a0)
	/src/core/orchestrator.go:579 +0xb50
github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSegmentLoop.func1()
	/src/core/orchestrator.go:671 +0x9b
created by github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSegmentLoop
	/src/core/orchestrator.go:656 +0x47a

@Tibia2000
Copy link

Tibia2000 commented Mar 7, 2024

Here are mine from some time ago:

Feb 20 17:05:56  livepeer[2796953]: I0220 17:05:56.699107 2796953 lb.go:123] manifestID=369082d3-0c72-4a77-8169-bb462bbbd794 seqNo=12248 orchSessionID=6d5f998a clientIP=143.244.60.233 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Deleted transcode session for key=6d5f998a_0
Feb 20 17:05:56  livepeer[2796953]: panic: Unrecoverable state, restart process
Feb 20 17:05:56  livepeer[2796953]: goroutine 1925973 [running]:
Feb 20 17:05:56  livepeer[2796953]: github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSeg(0xc000192680, {0x8c26b0, 0xc000d493b0}, {{0x8c8e10?, 0xc0009bed00?}, {0x8c8e10?, 0xc0009bed00?}}, 0xc0009becc0, 0xc001664480)
Feb 20 17:05:56 livepeer[2796953]:         /__w/go-livepeer/go-livepeer/core/orchestrator.go:590 +0xb45
Feb 20 17:05:56  livepeer[2796953]: github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSegmentLoop.func1()
Feb 20 17:05:56  livepeer[2796953]:         /__w/go-livepeer/go-livepeer/core/orchestrator.go:682 +0x9b
Feb 20 17:05:56 livepeer[2796953]: created by github.com/livepeer/go-livepeer/core.(*LivepeerNode).transcodeSegmentLoop
Feb 20 17:05:56  livepeer[2796953]:         /__w/go-livepeer/go-livepeer/core/orchestrator.go:667 +0x41f````

@Tibia2000
Copy link

Tibia2000 commented Mar 12, 2024

USA again:

Mar 12 18:39:46  livepeer[1406531]: [AVHWDeviceContext @ 0x7f44f77524c0] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Mar 12 18:39:46  livepeer[1406531]: ERROR: decoder.c:313] Unable to open hardware context for decoding : Unknown error occurred
Mar 12 18:39:46  livepeer[1406531]: ERROR: decoder.c:348] Unable to open video decoder : Error number -1448234581 occurred
Mar 12 18:39:46  livepeer[1406531]: E0312 18:39:46.694127 1406531 ffmpeg.go:986] Transcoder Return : Unrecoverable state, restart process
Mar 12 18:39:46  livepeer[1406531]: I0312 18:39:46.694208 1406531 lb.go:200] manifestID=ecd66igl6cduxdga seqNo=6 orchSessionID=c0ff46a0 clientIP=143.244.61.193 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Stopping transcoder due to error for key=c0ff46a0_0```

@Tibia2000
Copy link

Mar 12 18:31:18  livepeer[1057068]: [AVHWDeviceContext @ 0x7fc6dd293100] cu->cuCtxCreate(&hwctx->cuda_ctx, desired_flags, hwctx->internal->cuda_device) failed -> CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Mar 12 18:31:18  livepeer[1057068]: ERROR: decoder.c:313] Unable to open hardware context for decoding : Unknown error occurred
Mar 12 18:31:18  livepeer[1057068]: ERROR: decoder.c:348] Unable to open video decoder : Error number -1448234581 occurred
Mar 12 18:31:18  livepeer[1057068]: E0312 18:31:18.408003 1057068 ffmpeg.go:986] Transcoder Return : Unrecoverable state, restart process
Mar 12 18:31:18  livepeer[1057068]: I0312 18:31:18.408473 1057068 lb.go:200] manifestID=53dcd30f-d785-460e-8382-b0accc7e0db4 seqNo=3811 orchSessionID=3c2de895 clientIP=79.127.199.50 sender=0xc3c7c4C8f7061B7d6A72766Eee5359fE4F36e61E LB: Stopping transcoder due to error for key=3c2de895_0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: triage this issue has not been evaluated yet
Projects
None yet
Development

No branches or pull requests

5 participants