Multi-stream lets us get more out of the GPU by overlapping requests: even with only two search threads on a single GPU it can give a bonus, and sometimes three search threads may even start to make sense.
The demux backend implementation gets in the way of this. Each search-thread request is split into parts that exactly cover the number of worker threads servicing the real backends, so if another search thread comes along, all workers can be expected to be blocked and no overlapping occurs.
It is possible to 'force' overlapping by increasing the minibatch size and doubling the number of workers the demux backend creates, but this assumes that gathering batches is efficient, probably misses out on 'continuous overlapping' (so a small amount of performance is likely left on the table), and the batch size required may exceed what is actually needed in practice.
If demux instead split tasks into separate pools per GPU, then setting threads-per-GPU equal to the number of search threads would be a much closer experience to what a single GPU gets with multi-stream.
Possibly this should be a completely separate new backend, as the per-GPU pool logic is quite different from how demux currently works and may affect some esoteric use cases that the current demux supports.
I think what is needed for demux to work OK with multi-stream is to set the minimum-split-size to the expected batch size per GPU. Then increasing the demux threads will not reduce the batch size used for each GPU.
Possibly better than what is achievable without minimum-split-size, but nothing stops two workers for the same GPU from picking up both splits rather than the splits going one to each GPU, so I still think we can do better.