Limit shard snapshot work-in-progress #108739

DaveCTurner · 2024-05-16T18:49:01Z

The pause-on-shutdown mechanism introduced in #101717 in fact resets every affected shard-level snapshot so that it restarts from the beginning on its new node, discarding that shard's work-in-progress uploads. Today we make no attempt to limit the amount of work-in-progress that might be discarded on a node shutdown, since we interleave the uploads of the files from every shard that is being snapshotted. This means the discarded work can be very substantial (in one case we observed it setting the overall snapshot progress back by over 10TiB).

We should find some way to limit this WIP, focussing on completing individual shard snapshots sooner, to reduce the effects of a shutdown mid-snapshot.
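To illustrate why the upload order matters, here is a minimal toy model rather than Elasticsearch code. It assumes that a shard snapshot which has uploaded all of its files survives a shutdown, while a partially uploaded shard loses everything uploaded so far. The function names, the single-uploader simplification, and the file sizes are all hypothetical.

```python
from itertools import zip_longest


def interleaved_order(shards):
    """Round-robin across shards, one file at a time (a rough model of interleaved uploads)."""
    per_shard = [[(name, size) for size in sizes] for name, sizes in shards.items()]
    order = []
    for row in zip_longest(*per_shard):
        # zip_longest pads shorter shards with None; skip the padding.
        order.extend(item for item in row if item is not None)
    return order


def shard_at_a_time_order(shards):
    """Upload every file of one shard before starting the next shard."""
    return [(name, size) for name, sizes in shards.items() for size in sizes]


def discarded_wip(order, shards, files_uploaded):
    """Bytes uploaded for shards that are still incomplete after `files_uploaded` uploads.

    Assumption: a shutdown at this point discards exactly these bytes.
    """
    uploaded_bytes = {}
    uploaded_count = {}
    for name, size in order[:files_uploaded]:
        uploaded_bytes[name] = uploaded_bytes.get(name, 0) + size
        uploaded_count[name] = uploaded_count.get(name, 0) + 1
    return sum(
        total
        for name, total in uploaded_bytes.items()
        if uploaded_count[name] < len(shards[name])
    )


# Hypothetical workload: three shards, three files of 10 units each.
shards = {"s0": [10, 10, 10], "s1": [10, 10, 10], "s2": [10, 10, 10]}

# Shutdown after six files have been uploaded.
print(discarded_wip(interleaved_order(shards), shards, 6))      # 60: no shard has finished
print(discarded_wip(shard_at_a_time_order(shards), shards, 6))  # 0: s0 and s1 are complete
```

In this model the interleaved order can lose up to the total size of all in-flight shards, whereas the shard-at-a-time order loses at most one shard's worth of uploads. A real implementation uploads with several threads, so the bound would scale with the number of shards allowed to be in flight at once.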

elasticsearchmachine · 2024-05-16T18:49:25Z

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added the `>bug` and `:Distributed/Snapshot/Restore` (Anything directly related to the `_snapshot/*` APIs) labels on May 16, 2024

elasticsearchmachine added the `Team:Distributed` (Meta label for distributed team) label on May 16, 2024
