Scale up / scale down can break control plane / metastore consistency. #5013

fulmicoton · 2024-05-21T09:10:31Z

(Today we ignore metastore errors.)

On scale up, rebalance, and get_or_open_shards, the control plane was:
- recording a shard on the metastore
- writing the shard
- recording the shard on the control plane model.

Error handling was not done, so a transport failure on the metastore or (more likely) a failure on init would break consistency between the metastore and the control plane.

This PR factors out the logic for opening a new shard, which is shared by scale up, rebalance, and get_or_open_shards. The new logic goes:
- init first
- record in the metastore
- record the shard on the control plane model.

The last two steps are also factored together to emphasize that we keep the control plane and metastore in sync. It works by forcing a restart of the control plane if the metastore returns an error for which we don't know whether the write succeeded.

Closes #5008
Closes #5020
Closes #5013

test compilation
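The ordering described above can be illustrated with a minimal sketch. It is not the code from the PR: `ControlPlaneModel`, `MetastoreError`, `OpenShardError`, `open_shard`, and `record_shard` are hypothetical names, and the ingester and metastore calls are stood in for by closures. The sketch assumes that metastore errors fall into two groups: those where the write is known not to have happened, and those where the outcome is unknown.

```rust
use std::collections::BTreeSet;

/// Hypothetical classification of metastore failures (an assumption of this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MetastoreError {
    /// The metastore rejected the write: nothing was recorded.
    Rejected,
    /// Transport failure or timeout: the write may or may not have been applied.
    Uncertain,
}

/// Hypothetical error returned when opening a shard fails.
#[derive(Debug, PartialEq, Eq)]
pub enum OpenShardError {
    /// Shard init failed; the metastore and the model were not touched.
    InitFailed,
    /// The metastore refused the write; the model was not touched.
    MetastoreRejected,
    /// The outcome of the metastore write is unknown; the caller should
    /// restart the control plane so that it reloads its state.
    RestartRequired,
}

/// Hypothetical in-memory view of the shards known to the control plane.
#[derive(Debug, Default)]
pub struct ControlPlaneModel {
    shards: BTreeSet<u64>,
}

impl ControlPlaneModel {
    pub fn contains(&self, shard_id: u64) -> bool {
        self.shards.contains(&shard_id)
    }
}

/// Steps 2 and 3 kept together: the model is updated only after the
/// metastore has confirmed the write.
fn record_shard(
    model: &mut ControlPlaneModel,
    shard_id: u64,
    metastore_write: impl FnOnce(u64) -> Result<(), MetastoreError>,
) -> Result<(), OpenShardError> {
    match metastore_write(shard_id) {
        Ok(()) => {
            model.shards.insert(shard_id);
            Ok(())
        }
        Err(MetastoreError::Rejected) => Err(OpenShardError::MetastoreRejected),
        Err(MetastoreError::Uncertain) => Err(OpenShardError::RestartRequired),
    }
}

/// Opens a shard: init first, then record in the metastore and the model.
pub fn open_shard(
    model: &mut ControlPlaneModel,
    shard_id: u64,
    init: impl FnOnce(u64) -> Result<(), String>,
    metastore_write: impl FnOnce(u64) -> Result<(), MetastoreError>,
) -> Result<(), OpenShardError> {
    // Step 1: init. On failure we return before any state is recorded.
    init(shard_id).map_err(|_| OpenShardError::InitFailed)?;
    // Steps 2 and 3: metastore, then control plane model.
    record_shard(model, shard_id, metastore_write)
}
```

In this sketch, a failed init leaves both the metastore and the model untouched, and the model is never updated unless the metastore write is confirmed. Only the `RestartRequired` case leaves the two possibly out of sync, which is why the caller is expected to restart the control plane.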


fulmicoton added the bug label May 21, 2024

fulmicoton self-assigned this May 21, 2024

fulmicoton mentioned this issue May 22, 2024

Divergence of the number of shards in the metastore vs as seen from the control plane #5008

Closed

fulmicoton mentioned this issue May 26, 2024

Issue/5020 cp metastore inconsistency shard init #5029

Merged

fulmicoton closed this as completed in #5029 May 30, 2024

fulmicoton closed this as completed in 3916104 May 30, 2024
