transport: reset the node up/down marker on topology update #607

Draft
havaker wants to merge 3 commits into main
Conversation

@havaker (Contributor) commented Nov 25, 2022

For node up/down markers to work reliably, a source of truth other than CQL events is needed, because CQL event delivery is prone to connection failures. By querying the `system.cluster_status` table from time to time, we are able to get accurate information about the status of the nodes in the cluster.

This pull request implements this periodic querying and resets each node's up/down marker to the value it reports.
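For illustration, a minimal sketch of the idea - the names (`Node`, `fetch_cluster_status`), the refresh interval, and the use of an `AtomicBool` marker are assumptions, not the driver's actual internals:

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::time::Duration;

// Hypothetical node handle; the real driver keeps more state per node.
struct Node {
    up: AtomicBool,
}

// Placeholder for the actual CQL query; it would return pairs of
// (node address, value of the `up` column) from system.cluster_status.
async fn fetch_cluster_status() -> HashMap<IpAddr, bool> {
    todo!("SELECT peer, up FROM system.cluster_status")
}

// Periodically overwrite each node's marker with the queried state,
// so markers converge back to the truth even if CQL events were lost.
async fn refresh_markers(nodes: Arc<HashMap<IpAddr, Arc<Node>>>) {
    loop {
        tokio::time::sleep(Duration::from_secs(60)).await;
        let status = fetch_cluster_status().await;
        for (address, node) in nodes.iter() {
            if let Some(&up) = status.get(address) {
                node.up.store(up, Ordering::Relaxed);
            }
        }
    }
}
```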

Pre-review checklist

  • I have split my patch into logically separate commits.
  • All commit messages clearly explain what they change and why.
  • I added relevant tests for new features and bug fixes.
  • All commits compile, pass static checks and pass tests.
  • PR description sums up the changes and reasons why they should be introduced.
  • I added appropriate Fixes: annotations to PR description.

For node up/down markers to work reliably, a source of truth other than
CQL events is needed, because CQL event delivery is prone to connection
failures. By querying the `system.cluster_status` table from time to
time, we are able to get accurate information about the status of the
nodes in the cluster.

This change will enable us to correct the up/down markers, because
every `Node` is created (and can also be updated) based on information
from a corresponding `Peer` (see the sketch below the commit messages).

Done to match the marker name with the `system.cluster_status` column name - `up`.

Node up markers should be reset according to the information from
`Metadata` - we treat it as the source of truth about the cluster (in
contrast to the information received via events, whose delivery is
unreliable).
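As a rough illustration of the `Peer`/`Node` relationship described above - the field and method names here are hypothetical, not the driver's actual structs:

```rust
use std::net::IpAddr;
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical shapes; the driver's actual Peer and Node carry more state.
struct Peer {
    address: IpAddr,
    up: Option<bool>, // the `up` column of system.cluster_status, if known
}

struct Node {
    address: IpAddr,
    up: AtomicBool,
}

impl Node {
    // A Node starts out from the information carried by its Peer...
    fn from_peer(peer: &Peer) -> Self {
        Node {
            address: peer.address,
            up: AtomicBool::new(peer.up.unwrap_or(true)),
        }
    }

    // ...and on every metadata refresh the marker is reset from the
    // freshly queried Peer, overriding whatever events reported.
    fn update_from_peer(&self, peer: &Peer) {
        if let Some(up) = peer.up {
            self.up.store(up, Ordering::Relaxed);
        }
    }
}
```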
@havaker (Contributor, Author) commented Nov 25, 2022

Oops, looks like `system.cluster_status` appeared in ScyllaDB only recently (scylladb/scylladb@7c95bd3), and it does not exist at all in Cassandra.

In the absence of the `system.cluster_status` table, we will be forced to use another way to detect down nodes. I think that periodically sending a keepalive query to each node would be a good substitute - the infrastructure to do so already exists (#395), but the question is how it can be used.
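A minimal sketch of what such a keepalive-based marker reset could look like, assuming a hypothetical per-connection handle - `NodeConnection`, `run_keepalive`, and the timeout are illustrative, not the driver's API:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::time::Duration;

// Hypothetical handle to one node's connection.
trait NodeConnection {
    // A cheap request, e.g. `SELECT key FROM system.local`;
    // Err means the node did not answer in time.
    fn run_keepalive(&self, timeout: Duration) -> Result<(), ()>;
}

// Reset the up marker from the probe result instead of relying on events.
fn probe(conn: &dyn NodeConnection, up_marker: &AtomicBool) {
    let alive = conn.run_keepalive(Duration::from_secs(5)).is_ok();
    up_marker.store(alive, Ordering::Relaxed);
}
```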

For now, I'm converting this PR into a draft.
