
Cluster #3593

Draft · wants to merge 17 commits into main
Conversation

@irevoire (Member) commented Mar 16, 2023

How does it work?

In this first implementation, we went with a leader/follower approach with a pre-selected leader that can't change.
The followers only follow the orders of the leader but still serve reads.
The leader is in charge of replicating all the writes to the followers and to itself.

Processing a task

The leader will send the tasks to process to the followers.
Then, after indexing everything but right before committing the changes on disk, it'll wait for the state of the followers.
At the same time, the followers get the batch to process from the leader and also wait before committing.
Depending on the consistency rule, the leader might tell them to commit right away or later.

If the consistency has been set to:

  • one: the leader will tell everyone to commit without waiting for any followers.
  • two: the leader will wait for one follower to be ready to commit before telling everyone to commit and moving on.
  • quorum: the leader will wait until more than half of the cluster is ready to commit.
  • all: the leader will wait until all the followers are ready to commit.
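
Below is a minimal sketch, in Rust, of how the leader could translate these rules into a number of follower acknowledgements to wait for before broadcasting the commit order. The type and method names are hypothetical; this is not the PR's actual code.

```rust
/// The consistency level configured on the leader (names are illustrative).
#[derive(Clone, Copy)]
enum Consistency {
    One,
    Two,
    Quorum,
    All,
}

impl Consistency {
    /// How many "ready to commit" messages from followers the leader waits for
    /// before telling the whole cluster to commit. `followers` is the number of
    /// currently active followers; the leader itself always counts as ready.
    fn required_acks(self, followers: usize) -> usize {
        match self {
            // Commit without waiting for anyone.
            Consistency::One => 0,
            // Wait for a single follower (if there is one).
            Consistency::Two => followers.min(1),
            // More than half of the cluster (leader included) must be ready.
            Consistency::Quorum => (followers + 1) / 2,
            // Every follower must be ready.
            Consistency::All => followers,
        }
    }
}
```

With such a rule, a 3-node cluster on `quorum` would only wait for one of its two followers before committing, since the leader plus one follower already form a majority.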

Not implemented yet: If a follower doesn't get the same result as the leader, it should either:

  • kill itself
  • not commit but keep accepting reads (it will become outdated)

Joining the cluster

When a node joins the cluster, it won't be active straight away.
The leader will accept the connection with the follower, but it'll wait until the current task has been processed.
Then, in between two tasks, the new followers will « officially » join the cluster (we say they become active).
To share its state with the new followers, the leader creates a dump and sends it to them so they can catch up to the current state of the cluster.
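
To make the flow above concrete, here is a hedged sketch of the kind of messages the leader and its followers could exchange; the type and variant names are hypothetical and not necessarily the ones used in this PR.

```rust
use serde::{Deserialize, Serialize};

/// Messages sent by the leader to its followers (illustrative names).
#[derive(Serialize, Deserialize)]
enum LeaderMsg {
    /// Sent to a freshly accepted follower between two tasks: a full dump of the
    /// leader's state so the new follower can catch up before becoming active.
    JoinFromDump { dump: Vec<u8> },
    /// A batch of tasks that every active follower must process.
    StartBatch { batch_id: u64, tasks: Vec<u8> },
    /// The consistency rule is satisfied: everyone can commit this batch on disk.
    Commit { batch_id: u64 },
    /// An API key operation forwarded as-is to every follower.
    ApiKeyOperation { operation: Vec<u8> },
}

/// Messages sent by a follower back to the leader (illustrative names).
#[derive(Serialize, Deserialize)]
enum FollowerMsg {
    /// The follower finished indexing the batch and is waiting for the commit order.
    ReadyToCommit { batch_id: u64 },
}
```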

The leader and the followers must share the same master key.
If that's not the case, the follower won't be able to join the cluster.
Also: the connections between the leader and the followers are encrypted with ChaCha20 and the master key; thus, it's recommended to have a secure, autogenerated master key of at least 32 bytes.
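
As an illustration only, encrypting a frame with a 32-byte master key could look like the sketch below. It assumes the `chacha20poly1305` crate and the ChaCha20-Poly1305 AEAD (the PR only says "chacha20", so the authenticated variant is an assumption here); the function name and framing are made up, not the PR's actual wire format.

```rust
use chacha20poly1305::{
    aead::{Aead, AeadCore, KeyInit, OsRng},
    ChaCha20Poly1305, Key,
};

/// Encrypts one frame with a key taken directly from the shared master key.
/// The random 96-bit nonce is prepended so the receiver can decrypt the frame.
fn encrypt_frame(
    master_key: &[u8; 32],
    plaintext: &[u8],
) -> Result<Vec<u8>, chacha20poly1305::aead::Error> {
    let cipher = ChaCha20Poly1305::new(Key::from_slice(master_key));
    let nonce = ChaCha20Poly1305::generate_nonce(&mut OsRng);

    let mut frame = Vec::with_capacity(12 + plaintext.len() + 16);
    frame.extend_from_slice(nonce.as_slice());
    frame.extend_from_slice(&cipher.encrypt(&nonce, plaintext)?);
    Ok(frame)
}
```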

Synchronizing the API key

The leader forwards API key operations to every follower, and they're applied as soon as possible without any synchronization.

What new API pieces have been introduced:

  • CLI:
    • A new --experimental-enable-ha <EXPERIMENTAL_ENABLE_HA> flag has been introduced. Its values are either leader or follower.
    • A new --leader <LEADER> flag has been introduced. It lets you specify the address of the leader, and it's mandatory if you're a follower.
    • A new --consistency <CONSISTENCY> flag has been introduced to configure the consistency rules. Its possible values are:
      • one => The leader progresses as fast as possible
      • two => The leader + one node are in sync
      • quorum => The majority of the cluster stays synchronized
      • all => The whole cluster stays in sync
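
For illustration, here is roughly how such flags could be declared with clap's derive API. The struct and field names are hypothetical; this is not the actual Meilisearch `Opt` struct.

```rust
use clap::Parser;

/// Possible values for `--consistency` (derived names are `one`, `two`, `quorum`, `all`).
#[derive(Clone, Copy, clap::ValueEnum)]
enum Consistency {
    One,
    Two,
    Quorum,
    All,
}

#[derive(Parser)]
struct ClusterOpt {
    /// Enables the experimental HA mode; either `leader` or `follower`.
    #[arg(long = "experimental-enable-ha")]
    experimental_enable_ha: Option<String>,

    /// Address of the leader; mandatory when running as a follower.
    #[arg(long)]
    leader: Option<String>,

    /// Consistency rule enforced by the leader before committing (defaults to `one` when omitted).
    #[arg(long, value_enum)]
    consistency: Option<Consistency>,
}
```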

What is utterly broken/ugly currently and should be rewritten / handled correctly

  • 1️⃣ The TCP connection used between the leader and the followers doesn't have the keepalive option enabled. Thus the connections are probably going to die often.
  • 1️⃣ Add an internal interface
  • 1️⃣ The tasks received while joining the cluster might be lost
  • 1️⃣ Handle what happens when a follower doesn’t get the same result the leader got from processing a batch
  • 1️⃣ What happens when the tick function needs to re-run (cancel + MaxDatabaseSizeReached). For the potential users reading us, I think the whole cluster might get stuck forever. If you have a « normal » Linux machine, that should never happen, though
  • 1️⃣ Currently, the only sync we have is made on the index operation
  • 2️⃣ Don’t truncate the master-key when starting the cluster -> handle the error when the master-key is wrong / we can’t connect
  • 2️⃣ When we send a task (with its update file) or a dump (to let the followers join the cluster), it must be stored entirely in RAM, which definitely won't scale on a small computer
  • 2️⃣ The followers are still able to receive writes (tasks or API keys), and I don't exactly know what happens in this case, but it's definitely nothing good
  • 3️⃣ We spawn around 200 threads that could each be a tiny async Rust task that costs almost nothing
  • 3️⃣ It doesn't work on Windows
  • 3️⃣ The instance-uid should be shared for the whole cluster

Below are tamo's notes; don't try to understand anything.

  • Make the consistency configurable at the task level
  • Synchronize the API key
  • Synchronize the instance uid, maybe?


@curquiza linked an issue on Mar 30, 2023 that may be closed by this pull request: About replicating Meilisearch
@Kerollmops (Member) commented Jul 6, 2023

Hey 👋

We investigated this solution and wanted to look at what we could have and fix with a Raft-based implementation.

1️⃣ The TCP connection used between the leader and the followers doesn't have the keepalive option enabled. Thus the connections are probably going to die often.

The Raft library could fix it, as we are able to use whatever we want as the network connection protocol. In the basic example, they use an HTTP connection with reqwest.
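
For instance, an HTTP transport built on reqwest gets keepalive for a couple of builder calls. This is a generic sketch, not tied to any specific Raft crate, and the timeouts are illustrative values.

```rust
use std::time::Duration;

/// Builds an HTTP client suitable for node-to-node RPCs: connections are pooled
/// and kept alive between calls instead of being left to silently die.
fn build_rpc_client() -> reqwest::Result<reqwest::Client> {
    reqwest::Client::builder()
        .tcp_keepalive(Some(Duration::from_secs(30)))
        .pool_idle_timeout(Some(Duration::from_secs(90)))
        .connect_timeout(Duration::from_secs(5))
        .build()
}
```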

1️⃣ Add an internal interface

If you are talking about the public and private IPs in a cluster, it should be possible to have two interfaces with Raft.

1️⃣ The tasks received while joining the cluster might be lost

It should be ok with Raft. I hope. We must investigate that.

1️⃣ Handle what happens when a follower doesn’t get the same result the leader got from processing a batch

We must make sure we can express that a node is in such a state. Returning that a node is non-healthy from the /health route, maybe?
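
A hedged sketch of that idea with actix-web; the handler, state type, and response bodies are hypothetical, not Meilisearch's actual /health implementation.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

use actix_web::{get, web, HttpResponse};

/// Shared cluster state: `diverged` is set when this node's batch result
/// doesn't match the leader's.
struct ClusterHealth {
    diverged: AtomicBool,
}

#[get("/health")]
async fn health(state: web::Data<ClusterHealth>) -> HttpResponse {
    if state.diverged.load(Ordering::Relaxed) {
        // 503 so load balancers and orchestrators stop routing traffic here.
        HttpResponse::ServiceUnavailable().json(serde_json::json!({ "status": "unhealthy" }))
    } else {
        HttpResponse::Ok().json(serde_json::json!({ "status": "available" }))
    }
}
```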

1️⃣ What happens when the tick function needs to re-run (cancel + MaxDatabaseSizeReached). For the potential users reading us, I think the whole cluster might get stuck forever. If you have a « normal » Linux machine, that should never happen, though

It should be hard to trigger now that the index size is dynamic.

1️⃣ Currently, the only sync we have is made on the index operation

If you mean that the commit is only made on the index level, not the index-scheduler level, e.g., index swap, we can improve it later. If you mean that the API Keys are not supported, it is just a matter of time and effort.

2️⃣ When we send a task (with its update file) or a dump (to let the followers join the cluster), it must be stored entirely in RAM, which definitely won't scale on a small computer

It can be stored on disk in the future.
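
As an example of the direction this could take (a hypothetical helper, assuming the `tempfile` crate and a blocking reader), the payload could be spooled to a temporary file while it is received instead of being buffered in RAM.

```rust
use std::io::{self, Read, Write};

/// Streams an incoming dump or update file to a temporary file on disk.
/// `io::copy` works in small fixed-size chunks, so memory usage stays constant
/// no matter how large the payload is.
fn spool_to_disk(mut incoming: impl Read) -> io::Result<tempfile::NamedTempFile> {
    let mut file = tempfile::NamedTempFile::new()?;
    io::copy(&mut incoming, &mut file)?;
    file.flush()?;
    Ok(file)
}
```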

2️⃣ The followers are still able to receive writes (tasks or API keys), and I don't exactly know what happens in this case, but it's definitely nothing good

We must refuse direct API calls when not the leader.
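
As a rough illustration (the function name, status code, and error body are invented for this sketch), write routes on a follower could run a check like this before doing anything:

```rust
use actix_web::HttpResponse;

/// Followers keep serving reads but refuse writes with an explicit error
/// pointing the client to the leader.
fn check_writes_allowed(is_leader: bool) -> Result<(), HttpResponse> {
    if is_leader {
        Ok(())
    } else {
        Err(HttpResponse::BadRequest().json(serde_json::json!({
            "message": "this node is a follower; send write requests to the leader",
            "code": "not_leader",
        })))
    }
}
```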

3️⃣ We spawn around 200 threads that could each be a tiny async Rust task that costs almost nothing

OpenRaft is async and uses tokio. We could even maybe share the same tokio runtime as the one we use with Actix 😍
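
The difference in practice, as a small hedged sketch (the function names are illustrative):

```rust
use tokio::net::TcpStream;

/// Handles one follower connection; mostly idle, waiting on the socket.
async fn handle_follower(_socket: TcpStream) {
    // read/write replication messages here
}

/// Called from inside the existing tokio runtime (the one already driving the
/// HTTP server): each connection becomes a lightweight task instead of a
/// dedicated OS thread with its own stack.
fn accept_follower(socket: TcpStream) {
    tokio::spawn(handle_follower(socket));
}
```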

3️⃣ It doesn't work on Windows.

It should be ok with OpenRaft. That's just a matter of network and disk operations.

Be able to compute the same batches of tasks as the other nodes

We must ensure that it is possible with OpenRaft to define the tasks we can process.

Only commit when the quorum is ready to commit (direct consistency, not eventual consistency)

It should be possible, but we must look at what we can do about it.

Manage the healthiness of a node based on the number of tasks to catch up

We could compare the number of tasks this node still has to process with the number of tasks already processed by the leader, and change our healthiness based on that lag.
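
In other words, something along these lines (the threshold and names are made up for the sketch):

```rust
/// Maximum number of tasks this node may lag behind the leader before it
/// reports itself as unhealthy (illustrative value).
const MAX_TASK_LAG: u64 = 1_000;

fn is_healthy(last_task_processed_here: u64, last_task_processed_by_leader: u64) -> bool {
    last_task_processed_by_leader.saturating_sub(last_task_processed_here) <= MAX_TASK_LAG
}
```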
