Make Sled the only Oxigraph storage system #89

Tpt · 2021-03-27T17:38:47Z

Tpt
Mar 27, 2021
Maintainer

TL;DR: I am considering making Sled the only storage backend for Oxigraph.

Currently Oxigraph provides three different storage systems:

In memory
RocksDB
Sled

Providing three different storages is a challenge for Oxigraph development: often features have to be tweaked three times for the three different stores because of their differences. Having something performant requires the use of complex generics everywhere, cluttering the code and making it much harder to read and write. This significantly slows Oxigraph development speed.
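
To make the generics point above concrete, here is a minimal sketch of the pattern, not Oxigraph's actual code. The `QuadStorage` trait, the `MemoryBackend` type, the `Store<S>` wrapper and the integer quad encoding are all hypothetical names chosen for illustration. It only shows how a storage type parameter tends to propagate into every layer built on top of it.

```rust
use std::collections::BTreeSet;

/// Hypothetical encoded quad; real stores use richer term encodings.
type EncodedQuad = (u32, u32, u32, u32);

/// Hypothetical storage abstraction (an assumption, not Oxigraph's API).
trait QuadStorage {
    /// Returns true if the quad was not already present.
    fn insert(&mut self, quad: EncodedQuad) -> bool;
    fn contains(&self, quad: &EncodedQuad) -> bool;
}

/// Toy in-memory backend standing in for one of the three stores.
#[derive(Default)]
struct MemoryBackend {
    quads: BTreeSet<EncodedQuad>,
}

impl QuadStorage for MemoryBackend {
    fn insert(&mut self, quad: EncodedQuad) -> bool {
        self.quads.insert(quad)
    }

    fn contains(&self, quad: &EncodedQuad) -> bool {
        self.quads.contains(quad)
    }
}

/// Every layer above the storage has to carry the `S` type parameter.
struct Store<S: QuadStorage> {
    storage: S,
}

impl<S: QuadStorage> Store<S> {
    fn new(storage: S) -> Self {
        Store { storage }
    }

    /// Inserts all quads and returns how many were new.
    fn load(&mut self, quads: &[EncodedQuad]) -> usize {
        let mut added = 0;
        for quad in quads {
            if self.storage.insert(*quad) {
                added += 1;
            }
        }
        added
    }

    fn contains(&self, quad: &EncodedQuad) -> bool {
        self.storage.contains(quad)
    }
}
```

With a single backend, `Store` could hold a concrete type and the `<S: QuadStorage>` bounds would disappear from the loader, the query evaluator and everything else that touches storage.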

The in-memory store is currently very simplistic (global lock, copies on lookups...). The RocksDB store is efficient but very slow to compile (1-2 minutes) and has limited transaction features. The Sled store provides faster reads than RocksDB and competitive writes, but it is not stable yet and uses more disk space. A benchmark comparing RocksDB and Sled is provided at the end of this post.

However, Sled 1.0, which should bring storage format stability, is coming soon, so I believe that making Oxigraph "Sled-only" might be a suitable choice.

Pros:

Simplify development (only one storage system)
Good read/write performance
Transaction support
Fast and easy compilation (no need for a C++ compiler)
Having one storage system makes it easy to optimize the query/update system against it, leading to better performance
Sled allows us to use a temporary in-memory directory easily

Cons:

Sled is mostly a one-person project, though it seems to be gaining traction.
Sled's disk amplification is higher than RocksDB's.
It requires dropping support for Node.js/WASM, at least for now.
It means Oxigraph won't ever target distributed storage systems like TiKV.
Switching to a single storage system strongly links Oxigraph's fate to this storage system.

About WASM: it seems that much better approaches than Oxigraph are in progress for browsers, and there seem to be people interested in making Sled work with WASI. So it might be possible to add back Node.js support in a few months or years.

Querying distributed storage is very different from the local use case, so it is reasonable to move it out of Oxigraph's scope. Implementing efficient distributed querying in Oxigraph would mean completely reworking the query evaluation system anyway, so it might make sense to leave this work to another project.

If we go "Sled-only" in Rust, we should make the SPARQL parser and algebra reusable without a dependency on Oxigraph, just as has been done for the RDF parsers, in order to encourage other SPARQL implementations.

To replace the most common MemoryStore use cases, it might also be relevant to provide simple in-memory data structures for graphs and datasets without SPARQL support, as toolkits like RDF4J or Jena do.

@gtfierro @pchampin @edmondchuc @dougli1sqrd @ktk @dwhitney Sorry for pinging you directly. You have interacted with me on Oxigraph in the past. Do you have any opinion on my proposal, or do you see challenges I have not considered?

Benchmark: BSBM explore+update

The dataset used in the following charts is generated with 10k "products" (see its spec), which leads to the creation of 3.5M triples. The benchmark was run on a PrevailPro P3000 with 32GB of RAM.

The systems compared are the latest version of Oxigraph with RocksDB, Oxigraph with the current in-development version of Sled, Blazegraph 2.1.5 and GraphDB 9.3.3.

The parallelism factor is 5.

ktk · 2021-03-28T04:49:17Z

ktk
Mar 28, 2021

Thanks for the detailed description Thomas!

In general I support your proposal, and I agree that focus is a good thing. Having too many targets is not helpful, for the reasons you explain well in this post.

What you propose is a SPARQL graph database built on a Rust only stack, with a focus on a single-node engine. That is in my opinion a very good focus and would definitely add something to the semantic web stack. There is clearly a need for a SPARQL endpoint that can be compiled and runs on everything from phones to large servers. I would expect it to work well on systems with many cores but from my (limited) understanding of Rust, that's part of the design. I also assume ARM support should not be a problem.

You point to an interesting paper for the WASM use-case, so dropping that target (at least for now) is not a problem IMO.

Using Sled instead of RocksDB might be a bit of a risk right now but it might also lead to a bigger reward once Sled evolves.

I never understood the in-memory use case. In the past 12 years I used that rarely if ever. That makes IMO more sense in environments like node & the browser so dropping that is perfectly fine for me. If someone wants that there are alternatives out there like Jena Fuseki and some others.

Distributed storage is again yet another field I would not go into right now. There are surely use-cases but right now they are niche and they can/could be built on existing stores as well.

1 reply

Tpt Mar 28, 2021
Maintainer Author

Thank you for your reply!

I would expect it to work well on systems with many cores but from my (limited) understanding of Rust, that's part of the design.

Yes and no. For now the query evaluation system is single-threaded. But executing multiple requests in parallel works very well. The query evaluation time is fairly stable when I scale the number of concurrent requests, as long as this number is lower than the number of CPU cores.
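
As an illustration of that model (each query evaluated on one thread, several queries running side by side), here is a minimal std-only sketch. It is not Oxigraph's evaluator: the `Triple` encoding and the `count_with_predicate` "query" are hypothetical stand-ins, and the only point is that read-only data can be shared across threads, one per request.

```rust
use std::thread;

/// Hypothetical encoded triple (subject, predicate, object).
type Triple = (u32, u32, u32);

/// One "query", evaluated on a single thread: count triples with a predicate.
fn count_with_predicate(data: &[Triple], predicate: u32) -> usize {
    data.iter().filter(|t| t.1 == predicate).count()
}

/// Runs one query per thread over shared, read-only data and
/// returns the results in the same order as `predicates`.
fn run_concurrently(data: &[Triple], predicates: &[u32]) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn every query first so that they actually overlap.
        let handles: Vec<_> = predicates
            .iter()
            .map(|&p| scope.spawn(move || count_with_predicate(data, p)))
            .collect();
        // Then wait for each one in order.
        handles
            .into_iter()
            .map(|h| h.join().expect("query thread panicked"))
            .collect()
    })
}
```

Because each query only reads, no locking is needed here; throughput in such a design is bounded by the number of cores rather than by the speed of any single query.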

I also assume ARM support should not be a problem.

Yes, should work without problems.

I never understood the in-memory use case.

I use it sometimes for quick experiments. For this use case we would still provide SledStore::new, which creates a temporary directory that is automatically deleted when the database is closed. If the temporary space is in RAM, it should provide a fair in-memory storage.
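
The "temporary directory deleted on close" behaviour can be sketched with a small drop guard. This is only an illustration using the standard library, not how Sled or `SledStore::new` is implemented; the `TempStoreDir` type and its directory naming scheme are hypothetical.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical guard owning a temporary database directory.
struct TempStoreDir {
    path: PathBuf,
}

impl TempStoreDir {
    /// Creates a directory under the system temporary location.
    /// The process id is appended to reduce name collisions.
    fn create(name: &str) -> io::Result<Self> {
        let dir_name = format!("{}-{}", name, std::process::id());
        let path = std::env::temp_dir().join(dir_name);
        fs::create_dir_all(&path)?;
        Ok(TempStoreDir { path })
    }

    fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for TempStoreDir {
    /// Removes the directory and its content when the guard goes away,
    /// which stands in for "when the database is closed".
    fn drop(&mut self) {
        // Errors are ignored: there is nothing useful to do with them here.
        let _ = fs::remove_dir_all(&self.path);
    }
}
```

If the system temporary location is backed by RAM (for example a tmpfs mount), a store opened on such a directory never touches the disk, which is the "fair in-memory storage" mentioned above.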

pchampin · 2021-03-29T09:34:01Z

pchampin
Mar 29, 2021

Thanks indeed @Tpt for this detailed account.

Thanks also for citing our work on WASMTree, but I find you overly generous in calling it a "much better approach". At the moment, SPARQL query answering is much faster with Oxigraph than with our approach. Granted, progress is still possible on our side (especially by integrating more SPARQL processing in the Rust part), but this still requires some significant work...

That being said, I understand your wish to focus on Sled, and I think it is indeed a good move. Especially, I must say, if that helps make your SPARQL parser and algebra reusable in other projects, such as Sophia 😉. Self-interest aside, I think the two projects can move on in complementary ways: Oxigraph focusing on one specific and optimized implementation, and Sophia providing the generic traits to make it interoperable with other implementations...

PS: I feel your pain about generics ;-)

7 replies

Tpt Mar 29, 2021
Maintainer Author

Thank you for your feedback!

Especially, I must say, if that helps make your SPARQL parser and algebra reusable in other projects, such as Sophia.

Oxigraph 0.2 already publicly exposes the parser and algebra. I just have to move the code out of the main Oxigraph crate so that users do not have to pull in the complete Oxigraph library.

Self-interest aside, I think the two projects can move on in complementary ways: Oxigraph focusing on one specific and optimized implementation, and Sophia providing the generic traits to make it interoperable with other implementations...

Yes, definitely.

ktk Mar 30, 2021

It has been accepted at ESWC2021. It will probably have a different URL once actually "published" (this is the OpenReview system), but this is the camera-ready version.

ok if changes are still possible: The short form of RDF JavaScript Libraries is RDFJS, not RDF/JS or RDF.JS.

pchampin Mar 30, 2021

@ktk Ok I'll see if we can still update it... But you might want to also inform the editors of the specs 😈 http://rdf.js.org/data-model-spec/

ktk Mar 30, 2021

@pchampin yeah I noticed it's using multiple forms, but RDF.JS is not one of them ;) Will check that we align that.

pricesmith Jan 27, 2023

Thanks indeed @Tpt for this detailed account.

Thanks also for citing our work on WASMTree, but I find you overly generous in calling it a "much better approach". At the moment, SPARQL query answering is much faster with Oxigraph than with our approach. Granted, progress is still possible on our side (especially by integrating more SPARQL processing in the Rust part), but this still requires some significant work...

That being said, I understand your wish to focus on Sled, and I think it is indeed a good move. Especially, I must say, if that helps make your SPARQL parser and algebra reusable in other projects, such as Sophia 😉. Self-interest aside, I think the two projects can move on in complementary ways: Oxigraph focusing on one specific and optimized implementation, and Sophia providing the generic traits to make it interoperable with other implementations...

PS: I feel your pain about generics ;-)

I've been looking

WASMTree

I'd planned on mentioning what seems like some adjacent, parallel work between both of these projects ("all three" including WASMTree, which I've only just discovered), and am glad to see you chimed in to speak to it all better than I can! I'm hoping to contribute to these much more in the future. And if there is some alignment that furthers both Sophia and Oxigraph simultaneously, I'd be more than excited to take on some GitHub issues, if they're in my wheelhouse.

gtfierro · 2021-03-29T18:29:14Z

gtfierro
Mar 29, 2021

Thanks @Tpt for pinging me for feedback and for the detailed description of the issues and tradeoffs! I agree with the others that the move to Sled-only is well-motivated and should hopefully pay off as an investment that leads to more growth for Sled. My main concern about Sled is that I have observed it using a fair amount of I/O even when idle, which presents challenges for "embedded" deployments that may use SD cards for storage, but hopefully this is something that becomes addressed as Sled matures.

It is also nice to use in-memory storage for temporary deployments or testing, so having the ability for Sled to handle that from the oxigraph frontend (without having to manually create a tmpfs filesystem) is great as well

1 reply

Tpt Mar 29, 2021
Maintainer Author

Thank you for your feedback and your support!

My main concern about Sled is that I have observed it using a fair amount of I/O even when idle, which presents challenges for "embedded" deployments that may use SD cards for storage, but hopefully this is something that becomes addressed as Sled matures.

That's a limitation indeed. Sled has evolved quite a lot in the past few months, so it will be interesting to see whether this improves in the next releases.

edmondchuc · 2021-03-31T03:39:30Z

edmondchuc
Mar 31, 2021

@Tpt Thanks for including me in this discussion. I agree with your approach on focusing on one storage system. As others have mentioned already, the only perceivable risk is the maturity level of Sled, but likely this decision will pay off in the future.

Similar projects - IndraDB
I am not sure if you are aware, but there is a very similar graph database project called IndraDB. It is also using RocksDB and Sled. They have a few GitHub issues providing some insight into some of the problems they are facing using RocksDB and Sled.

On their README, they state:

NOTE: The sled datastore is not production-ready yet. sled itself is pre-1.0, and makes no guarantees about on-disk format stability. Upgrading IndraDB may require you to manually migrate the sled datastore. Additionally, there is a standing issue that prevents the sled datastore from having the same level of safety as the RocksDB datastore.

Probably a worthwhile project to keep an eye on to learn from what they find with using Sled in the future.

1 reply

Tpt Mar 31, 2021
Maintainer Author

Thank you for talking about IndraDB and your nice feedback. It is indeed a very similar project.

About their two issues:

RocksDB tuning indradb/indradb#115: Indeed, RocksDB configuration can be tricky. I have not done much about it in Oxigraph. But if we drop RocksDB support, it won't matter much anyway.
Sled datastore: operations should be run in a transaction indradb/indradb#98: Indeed, Sled does not support ACID read/write transactions with range queries (cf. Range scans in transactional context spacejam/sled#1143). This task seems to be more about having IndraDB do write-only transactions properly. RocksDB supports snapshots, and it might be possible to combine them with batch writes to create proper read/write transactions. But it might be tricky to get right, so waiting for Sled to implement read/write transactions and focusing more on query optimisations might be a good plan for Oxigraph at the moment. There are still a lot of possible performance improvements in this area.

About Sled stability, I am starting this discussion about dropping RocksDB support in Oxigraph because Sled 1.0, with disk-format stability, should hopefully be released soon. I plan to wait for Sled 1.0 before making a "Sled-only" Oxigraph release.

Tpt · 2021-04-29T06:24:47Z

Tpt
Apr 29, 2021
Maintainer Author

Thank you so much for your feedback. I have moved forward with my proposal and implemented it in the v0.3 git branch. I am currently working in this branch to add new features like RDF-star support and garbage collection of unused terms.

0 replies

dougli1sqrd · 2022-02-15T20:29:55Z

dougli1sqrd
Feb 15, 2022

Hi @Tpt! Sorry I missed this notification a long time ago!

I see now that you've decided on RocksDB which I totally understand. I do love the romantic notion that we'd have a fully rust-only RDF graph store, but get why you chose RocksDB. I'm still reading up on everything that's happened since the last time I was looking at oxigraph, but would love to contribute and play around with the code some more! I'm still interested in Shex/Shacl parsing and implementation as well. Glad to see that this is still going!

1 reply

Tpt Feb 16, 2022
Maintainer Author

I do love the romantic notion that we'd have a fully rust-only RDF graph store, but get why you chose RocksDB.

To tell the truth, me too. We might change this decision in a few years if it makes sense. Sled seems to be getting rewritten from the ground up, so we would have had to do a big migration anyway if we had gone down the Sled road. I am also starting to like more and more the idea of building a custom storage system for Oxigraph that would allow fancy indexing techniques. But it's definitely not for the next couple of years (unless there are volunteers, of course).

I'm still interested in Shex/Shacl parsing and implementation as well.

Amazing! If you want to contribute around that, feel free to ping me. A great first step might be to create something like spargebra for ShEx and/or SHACL.

gtfierro · 2022-02-16T20:23:02Z

gtfierro
Feb 16, 2022

I am ok with not having a rust-only RDF graph store, but I would like to lend support for the in-memory backend that can be used to run web-based query processing (e.g. https://sparql.gtf.fyi/). This has proven invaluable for the many non-technical people I work with because they don't need to install or run anything --- they just bookmark a page in the browser and can run all the queries they want. Sure, it doesn't scale as much, but it doesn't need to.

I am also very interested in SHACL processing w/n the Oxigraph ecosystem. I've done a little work with OWL 2 RL (https://github.com/gtfierro/reasonable) --- if anyone is interested in digging into SHACL, let me know! I'd love to collaborate

3 replies

Tpt Feb 16, 2022
Maintainer Author

The 0.3 version (currently in beta) still provides "Oxigraph JS" with an in-memory backend. This backend is very simplistic and prone to deadlocks (but no worse than in previous versions, and it's likely not a big deal because JS is single-threaded anyway). In the distant future I would love to see a fast SPARQL implementation for the web platform, maybe allowing persistent storage with IndexedDB. I am quite unhappy with the current OxigraphJS design, and I have the feeling that making Oxigraph both a great web library and a great "regular" on-disk database sometimes seems like going in two opposite directions at the same time. I am starting to move as much code as possible out of the main Oxigraph crate in order to make a separate "SPARQL implementation for the Web in WASM" much easier.

Amazing to see more SHACL interest! My current priority is mostly improving the performance of the current features, but I would love to help if someone is willing to take the lead on SHACL/ShEx.

gtfierro Feb 16, 2022

I am quite unhappy with the current OxigraphJS design, and I have the feeling that making Oxigraph both a great web library and a great "regular" on-disk database sometimes seems like going in two opposite directions at the same time.

I can see the wisdom behind this, especially if you are planning on eventually scaling storage/query processing beyond one machine.

I am starting to move as much code as possible out of the main Oxigraph crate in order to make a separate "SPARQL implementation for the Web in WASM" much easier.

I really like this idea, and I've already found the https://crates.io/crates/spargebra crate, which is going to be a huge help to my own development efforts. I'll keep an eye out for the future decoupling of the Oxigraph components. I'm hoping to have some cycles w/n the next couple months to start digging into the SHACL/ShEx stuff.

Thanks for all your work on Oxigraph! It's been great to work with

Tpt Feb 17, 2022
Maintainer Author

I'll keep an eye out for the future decoupling of the Oxigraph components.

Great! There are now also sparesults and the older rio, which I believe you already know.

I'm hoping to have some cycles w/n the next couple months to start digging into the SHACL/ShEx stuff.

Amazing! Thank you!

solson · 2023-07-30T21:59:59Z

solson
Jul 30, 2023

@Tpt In case you haven't seen this yet — it's early days, but after a long time working in private and on the Komora GitHub org, Tyler Neely has started doing sled 1.0 prereleases.

Ok, the rough-cut version of sled 1.0 is ready for people to try to break!

— https://twitter.com/sadisticsystems/status/1684906383227961344

Here's a description of the sled 1.0 storage architecture.

— https://twitter.com/sadisticsystems/status/1685357851210883072

1 reply

Tpt Jul 31, 2023
Maintainer Author

Yes, it's amazing! And Komora components are now under the same license as Oxigraph so we might consider building our own storage system using these components. It might allow us to implement things like better data structures for string encoding and quad indexes without having to implement a storage system from scratch.

donpellegrino · 2023-10-19T18:30:08Z

donpellegrino
Oct 19, 2023

I see from the CHANGELOG.md for "[0.3.0-beta.1] - 2022-01-29" that Sled was removed. Reading through this discussion, I am not sure what the final conclusion was. I think a pure-Rust storage implementation is still appealing. Would using Cargo.toml features to include multiple storage backends be feasible?

4 replies

Tpt Oct 19, 2023
Maintainer Author

That's a great question.

Would using Cargo.toml features to include multiple storage backends be feasible?

Yes, but that would be cumbersome with the current Oxigraph architecture. I hope to do some refactoring (closely related to the one regarding HDT) that would make it much easier. Sled seemed abandoned at the time of this decision, and I found maintaining two backends painful, so I made the choice of dropping Sled.
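
For readers unfamiliar with the mechanism being asked about, here is a minimal sketch of feature-gated backend selection. Everything in it is hypothetical: the `Backend` trait, the backend types, and the `rocksdb-backend` feature name (which would have to be declared under `[features]` in Cargo.toml) are assumptions for illustration and do not reflect Oxigraph's architecture.

```rust
/// Hypothetical backend abstraction (an assumption, not Oxigraph's API).
trait Backend {
    fn name(&self) -> &'static str;
}

/// Always-available fallback backend.
struct MemoryBackend;

impl Backend for MemoryBackend {
    fn name(&self) -> &'static str {
        "memory"
    }
}

/// Compiled only when the hypothetical `rocksdb-backend` feature is enabled.
#[cfg(feature = "rocksdb-backend")]
struct RocksDbBackend;

#[cfg(feature = "rocksdb-backend")]
impl Backend for RocksDbBackend {
    fn name(&self) -> &'static str {
        "rocksdb"
    }
}

/// Picks the backend at compile time, depending on the enabled features.
#[cfg(feature = "rocksdb-backend")]
fn default_backend() -> Box<dyn Backend> {
    Box::new(RocksDbBackend)
}

#[cfg(not(feature = "rocksdb-backend"))]
fn default_backend() -> Box<dyn Backend> {
    Box::new(MemoryBackend)
}
```

When no feature is enabled, `default_backend()` falls back to the in-memory type. The selection mechanism is the easy part; as noted above, the cost lies in keeping every feature that touches storage working against each backend.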

adamreichold Oct 19, 2023

If you do make the storage backends pluggable, then another contender for a pure Rust backend might be redb which is most likely not as scalable as RocksDB or Sled, but could be a nice alternative for "small" use cases.

Tpt Oct 19, 2023
Maintainer Author

Yes! That would be amazing!

donpellegrino Oct 20, 2023

Thanks for the clarification. That makes perfect sense.
