Tablets: consider alternative implementation #983

Lorak-mmk · 2024-04-18T16:29:28Z

The current implementation of tablets that is going to be merged stores the tablets for a given table in a Vec.
This is OK for searching, but inserting / deleting tablets requires moving part of the vector.
The alternative would be an implementation based on some tree (maybe BTreeMap is enough, maybe we would need an external crate, I'm not sure). It would make insertion / deletion logarithmic instead of linear, but could have worse cache behavior.

It's hard to guess which version will be faster in practice - we should implement both and benchmark them on some workloads.
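
To make the two candidates concrete, here is a minimal sketch of both lookup structures. It is not the driver's actual code: `Tablet`, `VecTablets`, and `TreeTablets` are hypothetical names, a tablet is assumed to own the token range (first_token, last_token], and both variants assume the stored tablets never overlap.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified tablet owning the token range (first_token, last_token].
#[derive(Debug, Clone, PartialEq)]
pub struct Tablet {
    pub first_token: i64,
    pub last_token: i64,
    pub replica: u32, // stand-in for real replica information
}

impl Tablet {
    pub fn contains(&self, token: i64) -> bool {
        self.first_token < token && token <= self.last_token
    }
}

/// Variant A: tablets kept in a Vec sorted by `last_token`.
#[derive(Debug, Default)]
pub struct VecTablets(Vec<Tablet>);

impl VecTablets {
    /// Binary search for the position, then shift the tail of the vector (linear).
    pub fn insert(&mut self, t: Tablet) {
        let idx = self.0.partition_point(|x| x.last_token < t.last_token);
        self.0.insert(idx, t);
    }

    /// Binary search for the first tablet whose range ends at or after `token`.
    pub fn lookup(&self, token: i64) -> Option<&Tablet> {
        let idx = self.0.partition_point(|x| x.last_token < token);
        self.0.get(idx).filter(|t| t.contains(token))
    }
}

/// Variant B: tablets kept in a BTreeMap keyed by `last_token`.
#[derive(Debug, Default)]
pub struct TreeTablets(BTreeMap<i64, Tablet>);

impl TreeTablets {
    /// Logarithmic insertion, no element shifting.
    pub fn insert(&mut self, t: Tablet) {
        self.0.insert(t.last_token, t);
    }

    /// The first key at or after `token` identifies the only candidate tablet.
    pub fn lookup(&self, token: i64) -> Option<&Tablet> {
        self.0
            .range(token..)
            .next()
            .map(|(_, t)| t)
            .filter(|t| t.contains(token))
    }
}
```

Both variants answer lookups in O(log n). They differ in insertion cost (shifting elements in the Vec versus updating tree nodes) and in memory layout, which is what a benchmark would have to measure.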


mykaul · 2024-04-18T16:37:18Z

We are aiming (at least initially) for tablets that aren't too small: around 5-10 GB each.
I'm not sure if that makes the design easier. I don't think inserting or deleting a tablet happens that often; splits and merges should be rare.

@avikivity - thoughts?

Lorak-mmk · 2024-04-18T16:45:53Z

> We are aiming (at least initially) for tablets that aren't too small: around 5-10 GB each. I'm not sure if that makes the design easier. I don't think inserting or deleting a tablet happens that often; splits and merges should be rare.

Inserts into the tablets structure happen any time we receive info about a tablet. This is not only during split / merge but also during the normal operation of a new session. After session creation the vector will be empty, and all queries will be non-token-aware. We'll then start receiving tablets because we send queries to the wrong nodes, and we'll insert those tablets into the Vec.

Deletes will happen when a received tablet overlaps with some tablet already in the vector - this will happen, iiuc, during split / merge and after topology changes (because tablets will be rebalanced).

How many tablets may there be in a single table?
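
A rough sketch of the delete-then-insert step on a sorted Vec, to show where the linear cost comes from. `Tablet` and `add_tablet` are hypothetical names, not the driver's API; the sketch assumes each tablet owns the range (first_token, last_token] with first_token < last_token, and that the Vec is sorted by `last_token` and free of overlaps before the call.

```rust
/// Hypothetical, simplified tablet owning the token range (first_token, last_token].
#[derive(Debug, Clone, PartialEq)]
pub struct Tablet {
    pub first_token: i64,
    pub last_token: i64,
}

/// Inserts `new` into `tablets`, first removing every stored tablet that
/// overlaps with it. Returns the removed tablets.
pub fn add_tablet(tablets: &mut Vec<Tablet>, new: Tablet) -> Vec<Tablet> {
    // Tablets ending at or before `new.first_token` cannot overlap with `new`.
    let start = tablets.partition_point(|t| t.last_token <= new.first_token);
    // From `start` on, tablets overlap as long as they begin before `new.last_token`.
    let overlapping = tablets[start..]
        .iter()
        .take_while(|t| t.first_token < new.last_token)
        .count();
    // Replace the overlapping run with the new tablet; the tail of the
    // vector is shifted once, which is the linear part of the operation.
    tablets
        .splice(start..start + overlapping, std::iter::once(new))
        .collect()
}
```

When a fresh session fills the structure one tablet at a time, `overlapping` is zero and the call degenerates to a plain sorted insert.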

piodul · 2024-04-18T16:57:30Z

> We are aiming (at least initially) for tablets that aren't too small: around 5-10 GB each. I'm not sure if that makes the design easier. I don't think inserting or deleting a tablet happens that often; splits and merges should be rare.

> Inserts into the tablets structure happen any time we receive info about a tablet. This is not only during split / merge but also during the normal operation of a new session. After session creation the vector will be empty, and all queries will be non-token-aware. We'll then start receiving tablets because we send queries to the wrong nodes, and we'll insert those tablets into the Vec.

Aren't we going to batch updates to the tablets lookup structure and then do read-copy-update? I thought that we wanted to heavily optimize for the read case - search in a Vec will be faster than lookup in a BTreeMap if the structure is large enough.

Lorak-mmk · 2024-04-18T18:11:43Z

> We are aiming (at least initially) for tablets that aren't too small: around 5-10 GB each. I'm not sure if that makes the design easier. I don't think inserting or deleting a tablet happens that often; splits and merges should be rare.

> Inserts into the tablets structure happen any time we receive info about a tablet. This is not only during split / merge but also during the normal operation of a new session. After session creation the vector will be empty, and all queries will be non-token-aware. We'll then start receiving tablets because we send queries to the wrong nodes, and we'll insert those tablets into the Vec.

> Aren't we going to batch updates to the tablets lookup structure and then do read-copy-update? I thought that we wanted to heavily optimize for the read case - search in a Vec will be faster than lookup in a BTreeMap if the structure is large enough.

We do batching, but it's not related to this issue. The procedure to update the tablet structure is:

1. Receive all tablets from the channel
2. Clone ClusterData. We get new ClusterData, which contains cloned TabletsInfo.
3. For each received tablet t:
    3.1 Remove all tablets that overlap with t from TabletsInfo
    3.2 Add t to TabletsInfo
4. Replace old ClusterData with new ClusterData (which contains new tablets).

Batching here means receiving all tablets from the channel in step 1, not just one tablet. This reduces the number of times we need to perform step 2, which is beneficial because ClusterData is a big structure, so cloning it is expensive.
This issue is about optimizing step 3.
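
The four steps can be sketched roughly as below. This is an illustration under assumptions, not the driver's code: `ClusterData` here is a tiny stand-in holding only a Vec of tablets in place of TabletsInfo, `SharedClusterData` and `apply_batch` are hypothetical names, and the swap in step 4 uses a plain `RwLock<Arc<_>>` from std, which may differ from the mechanism the driver actually uses.

```rust
use std::sync::mpsc::Receiver;
use std::sync::{Arc, RwLock};

/// Hypothetical, simplified tablet owning the token range (first_token, last_token].
#[derive(Debug, Clone, PartialEq)]
pub struct Tablet {
    pub first_token: i64,
    pub last_token: i64,
}

/// Stand-in for ClusterData; `tablets` plays the role of TabletsInfo.
#[derive(Debug, Clone, Default)]
pub struct ClusterData {
    pub tablets: Vec<Tablet>,
}

impl ClusterData {
    pub fn add_tablet(&mut self, new: Tablet) {
        // Step 3.1: remove all tablets that overlap with the new one.
        self.tablets
            .retain(|t| !(t.first_token < new.last_token && new.first_token < t.last_token));
        // Step 3.2: add the new tablet, keeping the Vec sorted by last_token.
        let idx = self.tablets.partition_point(|t| t.last_token < new.last_token);
        self.tablets.insert(idx, new);
    }
}

/// Readers take a cheap snapshot; the updater replaces the whole value.
#[derive(Default)]
pub struct SharedClusterData {
    current: RwLock<Arc<ClusterData>>,
}

impl SharedClusterData {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn load(&self) -> Arc<ClusterData> {
        Arc::clone(&self.current.read().unwrap())
    }

    /// Applies every tablet currently waiting in the channel; returns how many.
    pub fn apply_batch(&self, rx: &Receiver<Tablet>) -> usize {
        // Step 1: receive all tablets available right now, without blocking.
        let batch: Vec<Tablet> = rx.try_iter().collect();
        if batch.is_empty() {
            return 0;
        }
        // Step 2: clone the current data once for the whole batch.
        let mut new_data = (*self.load()).clone();
        // Step 3: apply each received tablet to the private copy.
        let applied = batch.len();
        for tablet in batch {
            new_data.add_tablet(tablet);
        }
        // Step 4: publish the new data; old snapshots stay valid for readers.
        *self.current.write().unwrap() = Arc::new(new_data);
        applied
    }
}
```

The clone in step 2 happens once per batch regardless of batch size, while the work in step 3 is repeated for every received tablet.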

Lorak-mmk · 2024-05-01T17:03:38Z

Another performance optimization that I think is best left for a follow-up: TabletReplicas should do fewer allocations in the typical case.
To achieve this, SmallVec should be used, along with something similar for the map.
We should also consider using Arc for the key in the map there.
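
A small sketch of the Arc-key idea only (SmallVec is an external crate and is left out here). Everything in it is an assumption for illustration: `ReplicasByKey` and `build` are hypothetical names, the map is assumed to be keyed by a string such as a datacenter name, and replicas are reduced to plain `u32` values.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical per-tablet replica list grouped by a string key.
#[derive(Debug, Clone, Default)]
pub struct ReplicasByKey {
    pub per_key: HashMap<Arc<str>, Vec<u32>>,
}

/// Builds one map per tablet, sharing the key strings between all of them.
pub fn build(keys: &[Arc<str>], tablets: usize) -> Vec<ReplicasByKey> {
    (0..tablets)
        .map(|i| {
            let mut replicas = ReplicasByKey::default();
            for (k, key) in keys.iter().enumerate() {
                // Arc::clone only bumps a reference count; the string itself
                // is not copied, unlike cloning a String key.
                replicas
                    .per_key
                    .entry(Arc::clone(key))
                    .or_default()
                    .push((i + k) as u32);
            }
            replicas
        })
        .collect()
}
```

Lookups still work with a plain `&str`, because `Arc<str>` implements `Borrow<str>`.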

Lorak-mmk mentioned this issue Apr 18, 2024

Introduce support for Tablets #937

Merged

18 tasks

wprzytula added the performance Improves performance of existing features label May 14, 2024

wprzytula added this to the 1.1.0 milestone May 14, 2024

wprzytula added the load-balancing label May 14, 2024
