
Cluster #3593

Draft · wants to merge 17 commits into main
Conversation

@irevoire (Member) commented Mar 16, 2023

How does it work?

In this first implementation, we went with a leader/follower approach with a pre-selected leader that can't change.
The followers only follow the orders of the leader but still serve reads.
The leader is in charge of replicating all the writes to the followers and to itself.

Processing a task

The leader will send the tasks to process to the followers.
Then, after indexing everything but right before committing the changes on disk, it'll wait for the state of the followers.
At the same time, the followers get the batch to process from the leader and also wait before committing.
Depending on the consistency rule, the leader might tell them to commit right away or later.

If the consistency has been set to:

  • one: the leader will tell everyone to commit without waiting for any followers.
  • two: the leader will wait for one follower to be ready to commit before telling everyone to commit and moving on.
  • quorum: the leader will wait until more than half of the cluster is ready to commit.
  • all: the leader will wait until all the followers are ready to commit.
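
Below is a minimal sketch, in Rust, of how the leader could translate these rules into a number of follower acknowledgements to wait for before broadcasting the commit order. The type and method names are hypothetical; this is not the PR's actual code.

```rust
/// The consistency level configured on the leader (names are illustrative).
#[derive(Clone, Copy)]
enum Consistency {
    One,
    Two,
    Quorum,
    All,
}

impl Consistency {
    /// How many "ready to commit" messages from followers the leader waits for
    /// before telling the whole cluster to commit. `followers` is the number of
    /// currently active followers; the leader itself always counts as ready.
    fn required_acks(self, followers: usize) -> usize {
        match self {
            // Commit without waiting for anyone.
            Consistency::One => 0,
            // Wait for a single follower (if there is one).
            Consistency::Two => followers.min(1),
            // More than half of the cluster (leader included) must be ready.
            Consistency::Quorum => (followers + 1) / 2,
            // Every follower must be ready.
            Consistency::All => followers,
        }
    }
}
```

With such a rule, a 3-node cluster on `quorum` would only wait for one of its two followers before committing, since the leader plus one follower already form a majority.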

Not implemented yet: If a follower doesn't get the same result as the leader, it should either:

  • kill itself
  • not commit but keep accepting reads (it will become outdated)

Joining the cluster

When a node joins the cluster, it won't be active straight away.
The leader will accept the connection with the follower, but it'll wait until the current task has been processed.
Then, in between two tasks, the new followers will « officially » join the cluster (we say they become active).
To share its state with the new followers, the leader creates a dump and sends it to them so they can catch up to the current state of the cluster.
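
To make the flow above concrete, here is a hedged sketch of the kind of messages the leader and its followers could exchange; the type and variant names are hypothetical and not necessarily the ones used in this PR.

```rust
use serde::{Deserialize, Serialize};

/// Messages sent by the leader to its followers (illustrative names).
#[derive(Serialize, Deserialize)]
enum LeaderMsg {
    /// Sent to a freshly accepted follower between two tasks: a full dump of the
    /// leader's state so the new follower can catch up before becoming active.
    JoinFromDump { dump: Vec<u8> },
    /// A batch of tasks that every active follower must process.
    StartBatch { batch_id: u64, tasks: Vec<u8> },
    /// The consistency rule is satisfied: everyone can commit this batch on disk.
    Commit { batch_id: u64 },
    /// An API key operation forwarded as-is to every follower.
    ApiKeyOperation { operation: Vec<u8> },
}

/// Messages sent by a follower back to the leader (illustrative names).
#[derive(Serialize, Deserialize)]
enum FollowerMsg {
    /// The follower finished indexing the batch and is waiting for the commit order.
    ReadyToCommit { batch_id: u64 },
}
```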

The leader and the followers must share the same master key.
If that's not the case, the follower won't be able to join the cluster.
Also: the connections between the leader and the followers are encrypted with ChaCha20 and the master key; thus, it's recommended to have a secure, autogenerated master key of at least 32 bytes.
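
As an illustration only, encrypting a frame with a 32-byte master key could look like the sketch below. It assumes the `chacha20poly1305` crate and the ChaCha20-Poly1305 AEAD (the PR only says "chacha20", so the authenticated variant is an assumption here); the function name and framing are made up, not the PR's actual wire format.

```rust
use chacha20poly1305::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    ChaCha20Poly1305, Key,
};

/// Encrypts one frame with a key taken directly from the shared master key.
/// The random 96-bit nonce is prepended so the receiver can decrypt the frame.
fn encrypt_frame(
    master_key: &[u8; 32],
    plaintext: &[u8],
) -> Result<Vec<u8>, chacha20poly1305::aead::Error> {
    let cipher = ChaCha20Poly1305::new(Key::from_slice(master_key));
    let nonce = ChaCha20Poly1305::generate_nonce(&mut OsRng);

    let mut frame = Vec::with_capacity(12 + plaintext.len() + 16);
    frame.extend_from_slice(nonce.as_slice());
    frame.extend_from_slice(&cipher.encrypt(&nonce, plaintext)?);
    Ok(frame)
}
```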

Synchronizing the API key

The leader forwards API key operations to every follower, and they're applied as soon as possible without any synchronization.

What new API pieces have been introduced:

  • CLI:
    • A new --experimental-enable-ha <EXPERIMENTAL_ENABLE_HA> flag has been introduced. Its values are either leader or follower.
    • A new --leader <LEADER> flag has been introduced. It lets you specify the address of the leader, and it's mandatory if you're a follower.
    • A new --consistency <CONSISTENCY> flag has been introduced to configure the consistency rules. Its possible values are:
      • one => The leader progresses as fast as possible
      • two => The leader + one node are in sync
      • quorum => The majority of the cluster stays synchronized
      • all => The whole cluster stays in sync
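
For illustration, here is roughly how such flags could be declared with clap's derive API. The struct and field names are hypothetical; this is not the actual Meilisearch `Opt` struct.

```rust
use clap::Parser;

/// Possible values for `--consistency` (derived names are `one`, `two`, `quorum`, `all`).
#[derive(Clone, Copy, clap::ValueEnum)]
enum Consistency {
    One,
    Two,
    Quorum,
    All,
}

#[derive(Parser)]
struct ClusterOpt {
    /// Enables the experimental HA mode; either `leader` or `follower`.
    #[arg(long = "experimental-enable-ha")]
    experimental_enable_ha: Option<String>,

    /// Address of the leader; mandatory when running as a follower.
    #[arg(long)]
    leader: Option<String>,

    /// Consistency rule enforced by the leader before committing (defaults to `one` when omitted).
    #[arg(long, value_enum)]
    consistency: Option<Consistency>,
}
```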

What is utterly broken/ugly currently and should be rewritten / handled correctly

  • 1️⃣ The TCP connection used between the leader and the followers doesn't have the keepalive option enabled. Thus the connections are probably going to die often.
  • 1️⃣ Add an internal interface
  • 1️⃣ The tasks received while joining the cluster might be lost
  • 1️⃣ Handle what happens when a follower doesn’t get the same result the leader got from processing a batch
  • 1️⃣ What happens when the tick function needs to re-run (cancel + MaxDatabaseSizeReached). For the potential users reading us, I think the whole cluster might get stuck forever. If you have a « normal » Linux machine, that should never happen, though
  • 1️⃣ Currently, the only sync we have is made on the index operation
  • 2️⃣ Don’t truncate the master-key when starting the cluster -> handle the error when the master-key is wrong / we can’t connect
  • 2️⃣ When we send a task (with its update file) or a dump (to let the followers join the cluster), it must be stored entirely in RAM, which definitely won't scale on a small computer
  • 2️⃣ The followers are still able to receive writes (tasks or API keys), and I don't exactly know what happens in this case, but it's definitely nothing good
  • 3️⃣ We spawn around 200 threads that could each be a tiny async Rust task that costs almost nothing
  • 3️⃣ It doesn't work on Windows
  • 3️⃣ The instance-uid should be shared for the whole cluster

Below are tamo's notes; don't try to understand anything.

  • Make the consistency configurable at the task level
  • Synchronize the API key
  • Synchronize the instance uid, maybe?


@curquiza linked an issue on Mar 30, 2023 that may be closed by this pull request: About replicating Meilisearch
@Kerollmops (Member) commented Jul 6, 2023

Hey 👋

We investigated this solution and wanted to look at what we could have and fix with a Raft-based implementation.

1️⃣ The TCP connection used between the leader and the followers doesn't have the keepalive option enabled. Thus the connections are probably going to die often.

The Raft library could fix it, as we are able to use whatever we want as the network connection protocol. In the basic example, they use an HTTP connection with reqwest.
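
For instance, an HTTP transport built on reqwest gets keepalive for a couple of builder calls. This is a generic sketch, not tied to any specific Raft crate, and the timeouts are illustrative values.

```rust
use std::time::Duration;

/// Builds an HTTP client suitable for node-to-node RPCs: connections are pooled
/// and kept alive between calls instead of being left to silently die.
fn build_rpc_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .tcp_keepalive(Some(Duration::from_secs(30)))
        .pool_idle_timeout(Some(Duration::from_secs(90)))
        .connect_timeout(Duration::from_secs(5))
        .build()
}
```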

1️⃣ Add an internal interface

If you are talking about the public and private IPs in a cluster, it should be possible to have two interfaces with Raft.

1️⃣ The tasks received while joining the cluster might be lost

It should be ok with Raft. I hope. We must investigate that.

1️⃣ Handle what happens when a follower doesn’t get the same result the leader got from processing a batch

We must make sure we can express that a node is in such a state. Returning that a node is non-healthy from the /health route, maybe?
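
A hedged sketch of that idea with actix-web; the handler, state type, and response bodies are hypothetical, not Meilisearch's actual /health implementation.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

use actix_web::{get, web, HttpResponse};

/// Shared cluster state: `diverged` is set when this node's batch result
/// doesn't match the leader's.
struct ClusterHealth {
    diverged: AtomicBool,
}

#[get("/health")]
async fn health(state: web::Data<ClusterHealth>) -> HttpResponse {
    if state.diverged.load(Ordering::Relaxed) {
        // 503 so load balancers and orchestrators stop routing traffic here.
        HttpResponse::ServiceUnavailable().json(serde_json::json!({ "status": "unhealthy" }))
    } else {
        HttpResponse::Ok().json(serde_json::json!({ "status": "available" }))
    }
}
```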

1️⃣ What happens when the tick function needs to re-run (cancel + MaxDatabaseSizeReached). For the potential users reading us, I think the whole cluster might get stuck forever. If you have a « normal » Linux machine, that should never happen, though

It should be hard to trigger now that the index size is dynamic.

1️⃣ Currently, the only sync we have is made on the index operation

If you mean that the commit is only made on the index level, not the index-scheduler level, e.g., index swap, we can improve it later. If you mean that the API Keys are not supported, it is just a matter of time and effort.

2️⃣ When we send a task (with its update file) or a dump (to let the followers join the cluster), it must be stored entirely in RAM, which definitely won't scale on a small computer

It can be stored on disk in the future.
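
As an example of the direction this could take (a hypothetical helper, assuming the `tempfile` crate and a blocking reader), the payload could be spooled to a temporary file while it is received instead of being buffered in RAM.

```rust
use std::io::{self, Read, Write};

/// Streams an incoming dump or update file to a temporary file on disk.
/// `io::copy` works in small fixed-size chunks, so memory usage stays constant
/// no matter how large the payload is.
fn spool_to_disk(mut incoming: impl Read) -> io::Result<tempfile::NamedTempFile> {
    let mut file = tempfile::NamedTempFile::new()?;
    io::copy(&mut incoming, &mut file)?;
    file.flush()?;
    Ok(file)
}
```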

2️⃣ The followers are still able to receive writes (tasks or API keys), and I don't exactly know what happens in this case, but it's definitely nothing good

We must refuse direct API calls when not the leader.
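
As a rough illustration (the function name, status code, and error body are invented for this sketch), write routes on a follower could run a check like this before doing anything:

```rust
use actix_web::HttpResponse;

/// Followers keep serving reads but refuse writes with an explicit error
/// pointing the client to the leader.
fn check_writes_allowed(is_leader: bool) -> Result<(), HttpResponse> {
    if is_leader {
        Ok(())
    } else {
        Err(HttpResponse::BadRequest().json(serde_json::json!({
            "message": "this node is a follower; send write requests to the leader",
            "code": "not_leader",
        })))
    }
}
```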

3️⃣ We spawn around 200 threads that could each be a tiny async Rust task that costs almost nothing

OpenRaft is async and uses tokio. We could even maybe share the same tokio runtime as the one we use with Actix 😍
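
The difference in practice, as a small hedged sketch (the function names are illustrative):

```rust
use tokio::net::TcpStream;

/// Handles one follower connection; mostly idle, waiting on the socket.
async fn handle_follower(_socket: TcpStream) {
    // read/write replication messages here
}

/// Called from inside the existing tokio runtime (the one already driving the
/// HTTP server): each connection becomes a lightweight task instead of a
/// dedicated OS thread with its own stack.
fn accept_follower(socket: TcpStream) {
    tokio::spawn(handle_follower(socket));
}
```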

3️⃣ It doesn't work on Windows.

It should be ok with OpenRaft. That's just a matter of network and disk operations.

Be able to compute the same batches of tasks as the other nodes

We must ensure that it is possible with OpenRaft to define the tasks we can process.

Only commit when the quorum is ready to commit (direct consistency, not eventual consistency)

It should be possible, but we must look at what we can do about it.

Manage the healthiness of a node based on the number of tasks to catch up

We could compare the number of tasks this node still has to process with the number of tasks already processed by the leader, and change our healthiness based on that lag.
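
In other words, something along these lines (the threshold and names are made up for the sketch):

```rust
/// Maximum number of tasks this node may lag behind the leader before it
/// reports itself as unhealthy (illustrative value).
const MAX_TASK_LAG: u64 = 1_000;

fn is_healthy(last_task_processed_here: u64, last_task_processed_by_leader: u64) -> bool {
    last_task_processed_by_leader.saturating_sub(last_task_processed_here) <= MAX_TASK_LAG
}
```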
