Limit shard snapshot work-in-progress #108739
Labels
>bug
:Distributed/Snapshot/Restore
Anything directly related to the `_snapshot/*` APIs
Team:Distributed
Meta label for distributed team
The pause-on-shutdown mechanism introduced in #101717 in fact resets every affected shard-level snapshot, so each one restarts from the beginning on its new node and its work-in-progress uploads are discarded. Today we make no attempt to limit the amount of work-in-progress that might be discarded on a node shutdown, since we interleave the uploads of the files from every shard that is being snapshotted. As a result, the discarded work can be very substantial (in one case we observed it set the overall snapshot progress back by over 10TiB).
We should find some way to limit this WIP, focussing on completing individual shard snapshots sooner, to reduce the effects of a shutdown mid-snapshot.
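To illustrate why upload ordering matters, here is a toy model (not Elasticsearch's actual scheduler, which uploads concurrently from a thread pool). It compares the peak at-risk work-in-progress when files are uploaded round-robin across shards versus one shard at a time. The function names, the single-threaded upload loop, and the file sizes are all assumptions made for illustration.

```python
from itertools import zip_longest


def interleaved_order(shards):
    """Round-robin: upload one file from each shard in turn (hypothetical model)."""
    per_shard = [[(i, size) for size in files] for i, files in enumerate(shards)]
    order = []
    for round_ in zip_longest(*per_shard):
        order.extend(item for item in round_ if item is not None)
    return order


def shard_at_a_time_order(shards):
    """Upload every file of one shard before starting the next shard."""
    return [(i, size) for i, files in enumerate(shards) for size in files]


def peak_wip(shards, order):
    """Largest total of bytes uploaded for shards whose snapshot is incomplete.

    This is the most work a shutdown could discard at any point in the
    given upload order, assuming a completed shard snapshot is never lost.
    """
    remaining = [len(files) for files in shards]
    uploaded = [0] * len(shards)
    peak = 0
    for shard, size in order:
        uploaded[shard] += size
        remaining[shard] -= 1
        if remaining[shard] == 0:
            uploaded[shard] = 0  # shard snapshot complete, no longer at risk
        peak = max(peak, sum(uploaded))
    return peak


# Three shards with three files of 10 units each (made-up sizes).
shards = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
print(peak_wip(shards, interleaved_order(shards)))      # 60
print(peak_wip(shards, shard_at_a_time_order(shards)))  # 20
```

In this model the interleaved order leaves almost all uploaded data at risk until the final round, whereas completing shards one at a time bounds the at-risk data by roughly the size of a single shard.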