
Batch size too large will crash pipeline with 500 errors #49

Open
ad-astra-video opened this issue Apr 9, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@ad-astra-video

Describe the bug

If a batch size that is too large for the GPU is requested, the container starts throwing 500 errors. I believe the container does not actually crash in this case, but it is a bad experience for the user.

Reproduction steps

  1. Send a text-to-image request with a batch size of 10 to the ByteDance/SDXL-Lightning model (see the example request after this list).
  2. The ai-runner container starts throwing 500 errors as the retries continue, if running on an RTX 4090 or a GPU with less VRAM.
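
A minimal reproduction sketch, assuming the runner is reachable on localhost:8000 and exposes a text-to-image endpoint; the field names (`model_id`, `prompt`, `num_images_per_prompt`) are illustrative and may not match the actual request schema:

```python
# Illustrative reproduction script, not taken from the actual ai-runner API:
# the endpoint path and field names are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/text-to-image",
    json={
        "model_id": "ByteDance/SDXL-Lightning",
        "prompt": "a photo of a cat",
        "num_images_per_prompt": 10,  # batch size large enough to exhaust a 24 GB GPU
    },
)
print(resp.status_code)  # returns 500 once the pipeline hits an out-of-memory error
```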

Expected behaviour

The O/T should be able to specify a max batch size, with a default of 1. Any requested batch size above the set max batch size would be run sequentially in the ai-runner to produce the requested batch of images.

For ByteDance/SDXL-Lightning, processing a batch of images in one request takes roughly the same time as processing the same number of images sequentially, i.e. it scales linearly (1 image takes 700 ms, 3 batched together take 2.1 s). It's not exact, but it's close enough that the user experience would be very similar whether the images are generated in a batch or sequentially. I don't expect this to hold for all models, though. Testing and experience would drive how this works model to model, but a max batch size of 1 should be a safe start.
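
A rough sketch of what the sequential fallback could look like inside the runner, assuming a diffusers-style pipeline call; the names `generate_images` and `MAX_BATCH_SIZE` are hypothetical, not existing ai-runner code:

```python
# Minimal sketch of the proposed behaviour: cap the per-call batch at
# MAX_BATCH_SIZE (default 1) and loop until the requested number of images
# is produced. `pipeline` stands in for the diffusers pipeline the runner
# already holds; MAX_BATCH_SIZE would come from O/T configuration.
MAX_BATCH_SIZE = 1  # safe default proposed above

def generate_images(pipeline, prompt: str, requested: int):
    """Produce `requested` images, never batching more than MAX_BATCH_SIZE per call."""
    images = []
    remaining = requested
    while remaining > 0:
        batch = min(remaining, MAX_BATCH_SIZE)
        out = pipeline(prompt, num_images_per_prompt=batch)
        images.extend(out.images)
        remaining -= batch
    return images
```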

For some requests it may make sense to start returning images, or making them available for download by the B, as they are processed. For fast models like ByteDance/SDXL-Lightning, this is probably not a concern.

It could also be argued that, if tickets are sized to the pixels requested, the batch request should be split between multiple Os to get the batch done faster.

Severity

None

Screenshots / Live demo link

No response

OS

Linux

Running on

Docker

AI-worker version

latest (alpha testnet)

Additional context

No response

@rickstaa
Contributor

rickstaa commented May 8, 2024

Tracked internally at https://linear.app/livepeer-ai-spe/issue/LIV-172.
