jena-geosparql - Add assembler option to disable spatial index #1344

vtermanis · 2022-05-31T10:54:20Z

The Spatial Index is generated on server startup and, as per design, thereafter cannot be updated (until the next Fuseki restart).

Currently there are two options for the Spatial Index in assembler configuration:

geosparql:spatialIndexFile set => Index loaded from / generated + written to disk on startup
geosparql:spatialIndexFile unset => Index generated in memory

For a read + write dataset, this index is not very useful: startup time is spent regenerating an index that is out of date after the next write operation. This proposal adds a new assembler option, geosparql:spatialIndexEnabled (defaulting to true), so that there is now a third mode:

geosparql:spatialIndexEnabled set to false => geosparql:spatialIndexFile is ignored and no index is loaded or generated

- Now have to/from-file, in-memory and no index options
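
The three modes can be summarised as a small decision function. This is only a sketch: the class, enum and method names are hypothetical and are not taken from the PR or from jena-geosparql. Only the two property names and the default of true come from the description above.

```java
// Hypothetical sketch; not the PR's code. Only the two assembler property
// names mentioned in the comments come from the PR description.
class SpatialIndexModeSketch {

    enum Mode { FILE, IN_MEMORY, DISABLED }

    /**
     * @param enabled   value of geosparql:spatialIndexEnabled (true when unset)
     * @param indexFile value of geosparql:spatialIndexFile, or null when unset
     */
    static Mode resolve(boolean enabled, String indexFile) {
        if (!enabled) {
            // spatialIndexFile is ignored; no index is loaded or generated.
            return Mode.DISABLED;
        }
        // File set: load from / generate and write to disk. Unset: build in memory.
        return indexFile != null ? Mode.FILE : Mode.IN_MEMORY;
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, "spatial.index"));
        System.out.println(resolve(true, null));
        System.out.println(resolve(false, "spatial.index"));
    }
}
```

The point of checking the enabled flag first is that a configured index file has no effect once the index is switched off.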

afs · 2022-05-31T12:01:28Z

There are no tests covering the assembler change nor the functionality change.

Experts - what is the impact of no index on performance?

vtermanis · 2022-05-31T17:37:40Z

> There are no tests covering the assembler change nor the functionality change.

I did look to see what there was, but like you say, the assembler part is not currently tested. For the assembler, I presume you mean something like this? Or would it be more appropriate to explicitly call the updated assembler's createDataset and inspect the output (e.g. dataset has no spatial index in its context)?

afs · 2022-05-31T19:44:50Z

Whatever works for the GeoSPARQL interest community.

One way, as in Fuseki main :: TestSecurityConfig, is to launch a server with a configuration and send requests to it for testing.

LorenzBuehmann · 2022-06-03T15:21:13Z

> Experts - what is the impact of no index on performance?

Not an expert, but I have been using Fuseki with GeoSPARQL for quite a while now ...

Containment checks can be much slower without the index.
For example, spatial containment queries that lead to point-in-polygon checks can currently make use of the index first: it takes the envelope of the polygon (i.e. a rectangle) to gather all points in this rectangle, followed by a second, proper point-in-polygon check to filter out points that are not in the polygon. For a large dataset and a small polygon this can be a huge performance gain. I don't have exhaustive numbers at the moment, but here is a small example on a dataset about companies (2,374,998 in total):

The query gives the number of companies (10,270) in a small part of Germany:

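PREFIX geo: <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX spatial: <http://jena.apache.org/spatial#>
# coy: is the dataset's own vocabulary prefix and needs to be declared as well

SELECT (count(?c) as ?cnt) {
  BIND("POLYGON((7.654288035299954 51.82366598560922,11.257803660299954 51.82366598560922,11.257803660299954 49.59800926392628,7.654288035299954 49.59800926392628,7.654288035299954 51.82366598560922))"^^geo:wktLiteral as ?box)
  ?c spatial:withinBoxGeom(?box) . # the explicit spatial index lookup
  ?c a coy:Company ;
     geo:hasGeometry/geo:asWKT ?lit .
  FILTER(geof:sfContains(?box, ?lit))
}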

With the index lookup triple pattern it takes 0.1s; without it, ~10s.
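
The envelope-then-exact-check idea described above can be sketched in plain Java. This is an illustration under stated assumptions, not how jena-geosparql or JTS implement it: java.awt.geom stands in for the geometry library, a linear scan stands in for the index lookup, and all class and method names are hypothetical.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: java.awt.geom stands in for JTS, and a linear
// scan stands in for the index lookup. No name here is jena-geosparql API.
class EnvelopePrefilterSketch {

    /** Hypothetical point holder (x = longitude, y = latitude). */
    static final class Pt {
        final double x, y;
        Pt(double x, double y) { this.x = x; this.y = y; }
    }

    /** Builds a closed polygon from alternating x, y coordinates. */
    static Path2D polygon(double... xy) {
        Path2D.Double path = new Path2D.Double();
        path.moveTo(xy[0], xy[1]);
        for (int i = 2; i + 1 < xy.length; i += 2) {
            path.lineTo(xy[i], xy[i + 1]);
        }
        path.closePath();
        return path;
    }

    /** Phase 1: keep only points inside the polygon's bounding rectangle. */
    static List<Pt> candidates(Path2D polygon, List<Pt> points) {
        Rectangle2D envelope = polygon.getBounds2D();
        List<Pt> out = new ArrayList<>();
        for (Pt p : points) {
            if (envelope.contains(p.x, p.y)) out.add(p);
        }
        return out;
    }

    /** Phase 2: exact point-in-polygon test, run on the candidates only. */
    static List<Pt> within(Path2D polygon, List<Pt> points) {
        List<Pt> out = new ArrayList<>();
        for (Pt p : candidates(polygon, points)) {
            if (polygon.contains(p.x, p.y)) out.add(p);
        }
        return out;
    }
}
```

With a real spatial index, phase 1 is a tree lookup rather than a scan over every point, which is where the gain for a small polygon on a large dataset comes from; the exact test then only runs on the few candidates that survive.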

neumarcx · 2022-06-03T15:27:01Z

Now you can compare that with a manual MBR.

afs · 2022-06-10T18:13:21Z

The PR will add an option to make jena-geosparql ignore any persistent index. All lookups will only look in the geosparql RDF data. This way, queries are correct with respect to data updates but slow.

Is this the right thing to include in the codebase?

@vtermanis - at what scale have you used this? Does that usage include containment queries?

I propose merging this if there is a PR to update the documentation (https://github.com/apache/jena-site/blob/main/source/documentation/geosparql/geosparql-assembler.md).

Is there a reason why the index can't be updated?

LorenzBuehmann · 2022-06-12T09:26:22Z

Nitpicking: why would we call the method prepareSpatialExtension at all if the spatial index isn't enabled? All it would do is check whether the dataset is empty (which has no benefit) and then return in the next if clause, so there is no need to call the method.

> Is there a reason why the index can't be updated?

@afs The reason is the underlying data structure from JTS, the STRtree, into which items cannot be inserted once it has been built. We could allow for an update mode and switch to a Quadtree (a bit slower, but it allows insert/remove operations).
Moreover, we will have to think carefully about updating the "other" indexing structures of the geospatial layer as well, i.e. the literal, transformation and query rewrite parts, I think.

afs · 2022-06-12T20:06:02Z

@LorenzBuehmann thank you for the background. jena-geosparql isn't an area I have looked into much and it has quite a steep learning curve.

All - what are the implications of using Quadtree? Is it a relatively contained change in class SpatialIndex or does it have wider implications? What, very roughly, is the performance difference of an STRtree and a Quadtree? What about #1327 (PR for "allow geo index search for literals")?

LorenzBuehmann · 2022-06-13T07:03:09Z

This article contains some numbers for JTS Quadtree vs STRtree: https://link.springer.com/article/10.1007/s41019-020-00147-9

It covers:

- indexing costs
- index size
- range queries
- distance queries
- point-in-polygon join query

We could keep the STRtree for read-only datasets, and I think we have to live with the Quadtree for read-write datasets. Internally, only the query operation is called on the STRtree, so changing the data structure should be trivial.

vtermanis · 2022-06-13T08:08:28Z

> at what scale have you used this? Does that usage include containment queries?

@afs , we've only used the geof:(distance|sfWithin|sfContains) functions so far, the latter two with geof:buffer only. The scale is small for now (100k geometries).

> we will have to think carefully about updating the "other" indexing structure of the geospatial layer as well, i.e. that literal, transformation and query rewrite part I think.

@LorenzBuehmann, do you mean because of the suggested QuadTree change for the spatial index or from a general performance perspective? (I saw your suggestion on using a different caching lib in Jira.)

vtermanis · 2022-06-13T08:17:57Z

(sorry, one more Q @LorenzBuehmann )

> we have to live with the Quadtree for read-write Datasets.

What would it mean for persistence? (From my understanding the current STRtree index is serialised to disk in full.)
For the case where a write-heavy dataset is only used sparingly for GeoSPARQL queries, is it still useful to offer the "no index" option also, i.e.:

1. STRtree index pre-generated either to file or into memory (current mode)
2. QuadTree index updated during writes (in memory and/or disk?)
   - Can update geometries in data & continue to perform spatial queries
   - If have existing large dataset, have to pre-generate initial index on startup
3. Spatial index disabled
   - No write & startup perf impact (if (2) persisted)
   - GeoSPARQL queries slow(er), choose option 1 or 2 if this matters

afs · 2022-06-13T08:26:05Z

> (I [@vtermanis] saw your suggestion on using a different caching lib in Jira.)

JENA-2311 and PR #1235.

LorenzBuehmann · 2022-06-13T09:46:45Z

> do you mean because of the suggested QuadTree change for the spatial index or from a general performance perspective? (I saw your suggestion on using a different caching lib in Jira.)

@vtermanis I mean that once we allow for updates, in particular removals, we might have to address the current caching, i.e. in the simplest case maybe just invalidate or empty the current cache.

> What would it mean for persistence? (From my understanding the current STRtree index is serialised to disk in full.)

Yep, that is one of the things that would have to be discussed. I don't think JTS provides any disk-mapped data structure, which means it remains open when to persist the updates; that's always the case for in-memory index structures.

afs · 2022-06-17T09:28:15Z

The persistence is part of jena-geosparql:

https://github.com/apache/jena/blob/main/jena-geosparql/src/main/java/org/apache/jena/geosparql/spatial/SpatialIndexStorage.java

jena/jena-geosparql/src/main/java/org/apache/jena/geosparql/spatial/SpatialIndex.java, line 147 in ebb8b12:

    public final void insertItems(Collection<SpatialIndexItem> indexItems) throws SpatialIndexException {

LorenzBuehmann · 2022-06-24T06:44:22Z

Well, that only adds items to the index before it is finally built; after that the index remains immutable. It then serializes the index to disk as a Java object stream. Only the collection of items is written though, not the underlying STRtree, which is rebuilt on each startup.

But there is no mechanism yet that would write changes made to a mutable R-tree index to disk: the index would only be changed in memory, and the question is how to make those changes persistent. Re-serializing the index each time the RDF graph changes seems infeasible, as it is rather slow for larger indexes and currently just dumps the whole index.
The main problem is just that JTS doesn't provide any on-disk index afaik.
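
A rough sketch of the "persist the item list, rebuild the tree on load" pattern described above. It is not the SpatialIndex code: the Item class and the save/load helpers are hypothetical, and a byte array stands in for the index file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "persist the item list, rebuild the tree on load".
// It is not the SpatialIndex code; a byte array stands in for the index file.
class ItemSnapshotSketch {

    /** Hypothetical index entry: a bounding box plus the feature's URI. */
    static final class Item implements Serializable {
        private static final long serialVersionUID = 1L;
        final double minX, minY, maxX, maxY;
        final String uri;

        Item(double minX, double minY, double maxX, double maxY, String uri) {
            this.minX = minX; this.minY = minY;
            this.maxX = maxX; this.maxY = maxY;
            this.uri = uri;
        }
    }

    /** Writes the whole list in one go; there is no incremental update. */
    static byte[] save(List<Item> items) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(items));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the list back; a real index would then bulk-load its tree from it. */
    @SuppressWarnings("unchecked")
    static List<Item> load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (List<Item>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because save always writes the complete list, persisting a single changed geometry costs as much as persisting the whole index, which is the scaling concern raised above.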

Aklakan · 2022-06-25T18:45:12Z

Ideally there would be a persistent R-Tree implementation similar to dboe's BPlusTree.

But even just serializing the in-memory data structure as a whole, rather than having to rebuild it on start-up, would be an improvement.
Also, using Kryo serialization (is BSD-3 compatible with Apache v2?) would most likely be faster than Java serialization.
I suppose parallel de-/serialization of tree data structures should be rather trivial to implement when going with the in-memory index solution for now.

One approach is also to represent grid cells (with optional nesting) as IDs and then link spatial objects to the grid cell ids - so a kind of poor man's quad tree represented in a B+ tree. This could be implemented with the TDB machinery - but not sure whether that'd be a worthwhile endeavor.
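
A minimal sketch of the grid-cell idea, under several assumptions: a java.util.TreeMap stands in for a B+ tree, the grid has a single fixed cell size (no nesting), only point geometries are handled, and every name is hypothetical rather than TDB or jena-geosparql API.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: a sorted map keyed by grid-cell id stands in for a
// B+ tree. Single-level grid, points only; not TDB or jena-geosparql code.
class GridCellIndexSketch {

    private final double cellSize;
    private final TreeMap<Long, Set<String>> cells = new TreeMap<>();

    GridCellIndexSketch(double cellSize) {
        this.cellSize = cellSize;
    }

    /** Maps a coordinate to its integer cell coordinate. */
    private long coord(double v) {
        return (long) Math.floor(v / cellSize);
    }

    /** Packs (col, row) into one sortable key; assumes both fit in 32 bits. */
    private static long key(long col, long row) {
        return (row << 32) | (col & 0xFFFFFFFFL);
    }

    /** Links a feature to the cell containing its point. */
    void insert(String uri, double x, double y) {
        cells.computeIfAbsent(key(coord(x), coord(y)), k -> new HashSet<>()).add(uri);
    }

    /** Unlinks a feature; returns false if it was not registered at that point. */
    boolean remove(String uri, double x, double y) {
        long k = key(coord(x), coord(y));
        Set<String> bucket = cells.get(k);
        if (bucket == null || !bucket.remove(uri)) return false;
        if (bucket.isEmpty()) cells.remove(k);
        return true;
    }

    /** Returns features in cells overlapping the box; an exact check must follow. */
    Set<String> candidates(double minX, double minY, double maxX, double maxY) {
        Set<String> out = new TreeSet<>();
        for (long row = coord(minY); row <= coord(maxY); row++) {
            for (long col = coord(minX); col <= coord(maxX); col++) {
                Set<String> bucket = cells.get(key(col, row));
                if (bucket != null) out.addAll(bucket);
            }
        }
        return out;
    }
}
```

Unlike a packed STRtree, this structure accepts inserts and removals at any time, at the price of coarser candidate sets: everything in an overlapping cell is returned, so the exact geometry test still has to run afterwards.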

SimonBin · 2023-01-27T12:39:07Z

@vtermanis I believe you can use the geof: functions without wrapping in a GeoDS so you don't need to add this option :-)

vtermanis · 2023-01-29T23:50:23Z

> I believe you can use the geof: functions without wrapping in a GeoDS so you don't need to add this option :-)

That's a good idea @SimonBin - but then surely Geometry Literal, Geometry Transform, Query Rewrite indexes/caching won't be available (which from my understanding are still useful for repeated queries against the same geometries).

SimonBin · 2023-01-30T09:19:13Z

I see, you're right. I guess this small addition to the code is straightforward and won't hurt.

(I also noticed that the code in fact needs to be updated because currently it uses a single Cache for all DSes)

davidmireles · 2024-01-17T14:34:12Z

What is/was the outcome of the discussion on enabling updates to the GeoSPARQL spatial index? I find this to be one limiting aspect of the Jena GeoSPARQL implementation; a number of other triple stores provide index updates out of the box, and it would be a much-desired addition.

SimonBin · 2024-01-17T14:38:28Z

Maybe we shouldn't derail this thread, but as a stop-gap solution to your concern, we have implemented a method to manually update the geospatial index, which is currently good enough for our project: https://github.com/AKSW/fuseki-mods/tree/adaptions/jena-fmod-geosparql/src/main/java/org/apache/jena/fuseki/mod/geosparql

jena-geosparql - Add assembler option to disable spatial index

be6a377

- Now have to/from-file, in-memory and no index options

afs added the GeoSPARQL label May 31, 2022

afs mentioned this pull request Jul 11, 2023

[FEATURE REQUEST] Add GeoSPARQL support on arq and tdbquery #1953

Open
