jena-geosparql - Add assembler option to disable spatial index #1344
- Now have to/from-file, in-memory and no index options
There are no tests covering either the assembler change or the functionality change. Experts - what is the impact of no index on performance? |
I did look to see what there was, but like you say, the assembler part is not currently tested. For the assembler, I presume you mean something like this? Or would it be more appropriate to explicitly call the updated assembler's |
Whatever works for the GeoSPARQL interest community. A way like Fuseki main :: TestSecurityConfig is launching a server with a configuration and sending requests for testing. |
Not an expert, but I have been using Fuseki with GeoSPARQL for quite a while now ... Containment checks can be way slower without index usage: The query gives the number of companies (
SELECT (count(?c) as ?cnt) {
  BIND("POLYGON((7.654288035299954 51.82366598560922,11.257803660299954 51.82366598560922,11.257803660299954 49.59800926392628,7.654288035299954 49.59800926392628,7.654288035299954 51.82366598560922))"^^geo:wktLiteral as ?box)
  ?c spatial:withinBoxGeom(?box) . # the explicit spatial index lookup
  ?c a coy:Company ;
     geo:hasGeometry/geo:asWKT ?lit .
  FILTER(geof:sfContains(?box, ?lit))
}
With the index lookup triple pattern it takes 0.1s; without it, it takes ~10s. |
Now you can compare that with a manual MBR.
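For illustration only, here is a minimal sketch of what a "manual MBR" check amounts to when no index is available: every candidate is scanned, and each test is just four numeric comparisons against the box. The class and method names (`Mbr`, `ManualMbrSketch`) and the sample points are hypothetical and not part of Jena or JTS; the box corners are copied from the WKT polygon in the query above, assuming x is longitude and y is latitude.

```java
/** Axis-aligned minimum bounding rectangle; a hypothetical helper, not a Jena or JTS type. */
final class Mbr {
    final double minX, minY, maxX, maxY;

    Mbr(double minX, double minY, double maxX, double maxY) {
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }

    /** True if the point lies inside or on the border of the rectangle. */
    boolean contains(double x, double y) {
        return x >= minX && x <= maxX && y >= minY && y <= maxY;
    }
}

final class ManualMbrSketch {
    /** Linear scan: every point is tested, but each test is only four comparisons. */
    static int countWithin(Mbr box, double[][] points) {
        int count = 0;
        for (double[] p : points) {
            if (box.contains(p[0], p[1])) {
                count++;
            }
        }
        return count;
    }

    /** Box corners taken from the WKT polygon in the query (assumed x = longitude, y = latitude). */
    static Mbr queryBox() {
        return new Mbr(7.654288035299954, 49.59800926392628,
                       11.257803660299954, 51.82366598560922);
    }

    public static void main(String[] args) {
        // Made-up sample points; two of them fall inside the box.
        double[][] samplePoints = { {8.68, 50.11}, {13.40, 52.52}, {9.93, 51.53}, {11.08, 49.45} };
        System.out.println(countWithin(queryBox(), samplePoints)); // prints 2
    }
}
```

The scan stays linear in the number of geometries, which is the cost an index lookup avoids; the sketch only shows that the per-geometry test itself can be cheap.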
|
The PR will add an option to make jena-geosparql ignore any persistent index. All lookups will only look in the geosparql RDF data. This way, queries are correct with respect to data updates but slow. Is this the right thing to include in the codebase? @vtermanis - at what scale have you used this? Does that usage include containment queries? I propose merging this if there is a PR to update the Is there a reason why the index can't be updated? |
Nitpicking: why would we call the method
@afs The reason is the underlying datastructure of JTS, the |
@LorenzBuehmann thank you for the background. All - what are the implications of using |
This article contains some numbers for JTS It covers
We could keep the |
@afs , we've only used the
@LorenzBuehmann, do you mean because of the suggested |
(sorry, one more Q @LorenzBuehmann )
What would it mean for persistence? (From my understanding the current
|
@vtermanis I mean, once we allow for updates, in particular for removal, we might have to address the current caching, i.e. maybe just invalidate or empty the current cache in the simplest case.
Yep, one of the things that would have to be discussed. I don't think JTS provides any disk-mapped data structure, which means it remains open when to persist the updates - that's always the case for in-memory index structures. |
The persistence is part of jena/jena-geosparql/src/main/java/org/apache/jena/geosparql/spatial/SpatialIndex.java Line 147 in ebb8b12
|
Well, that only adds items to the index before it is finally built; after that, the index remains immutable. It then serializes the index to disk as a Java object stream. Just the collection of items though, not the underlying STR-Tree - this will be rebuilt each time on startup. But there is no mechanism yet that would write changes made to a mutable R-Tree index to disk, i.e. it would only be changed in memory, and the question would be how to make those changes persistent. Re-serializing the index each time the RDF graph changes seems infeasible, as it is somewhat slow for larger indexes and it currently just dumps the whole index. |
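To make the persistence pattern described here concrete, a minimal sketch under stated assumptions: only the collection of items is written as a Java object stream, and whatever tree sits on top would have to be rebuilt from the reloaded list. `IndexItem` and `ItemPersistenceSketch` are hypothetical names for illustration; this is not the code in `SpatialIndex.java`.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical index entry: a feature URI plus its bounding box. */
final class IndexItem implements Serializable {
    private static final long serialVersionUID = 1L;
    final String uri;
    final double minX, minY, maxX, maxY;

    IndexItem(String uri, double minX, double minY, double maxX, double maxY) {
        this.uri = uri;
        this.minX = minX;
        this.minY = minY;
        this.maxX = maxX;
        this.maxY = maxY;
    }
}

final class ItemPersistenceSketch {
    /** Dumps the whole item list; there is no incremental write. */
    static void save(List<IndexItem> items, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(new ArrayList<>(items));
        }
    }

    /** Reads the item list back; a tree would then be rebuilt from it. */
    @SuppressWarnings("unchecked")
    static List<IndexItem> load(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return (List<IndexItem>) in.readObject();
        }
    }

    /** Saves n synthetic items to a temp file, reloads them, and returns how many came back. */
    static int roundTripCount(int n) {
        List<IndexItem> items = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            items.add(new IndexItem("urn:example:feature" + i, i, i, i + 1, i + 1));
        }
        try {
            Path file = Files.createTempFile("spatial-items", ".bin");
            try {
                save(items, file);
                return load(file).size();
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTripCount(3)); // prints 3
    }
}
```

Because `save` always rewrites the full list, calling it after every graph change scales with the index size rather than with the size of the change, which is the cost concern raised above.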
Ideally there would be a persistent R-Tree implementation similar to dboe's BPlusTree. But even just serializing the in-memory data structure as a whole, rather than having to rebuild it on startup, would be an improvement. Another approach is to represent grid cells (with optional nesting) as IDs and then link spatial objects to the grid cell IDs - so a kind of poor man's quad tree represented in a B+ tree. This could be implemented with the TDB machinery - but not sure whether that'd be a worthwhile endeavor. |
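A rough sketch of the grid-cell idea, with loud assumptions: a `java.util.TreeMap` stands in for a B+ tree, only points are handled, the grid is a single flat level (no nesting), and `GridIndexSketch` is a hypothetical name with no relation to TDB or dboe classes. The lookup returns candidates only, so an exact geometric test would still have to follow.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

final class GridIndexSketch {
    private final double cellSize;
    // Sorted map keyed by cell id; an in-memory stand-in for a B+ tree.
    private final TreeMap<Long, Set<String>> cells = new TreeMap<>();

    GridIndexSketch(double cellSize) {
        this.cellSize = cellSize;
    }

    /** Grid coordinate (column or row number) of a value along one axis. */
    private long coord(double v) {
        return (long) Math.floor(v / cellSize);
    }

    /** Packs column and row into one key; assumes both fit into 32 bits. */
    private static long key(long col, long row) {
        return (col << 32) | (row & 0xffffffffL);
    }

    /** Links an object id to the cell containing the point. Updates are plain map writes. */
    void insert(String id, double x, double y) {
        cells.computeIfAbsent(key(coord(x), coord(y)), k -> new HashSet<>()).add(id);
    }

    /** Unlinks an object id from its cell; returns false if it was not registered there. */
    boolean remove(String id, double x, double y) {
        long cell = key(coord(x), coord(y));
        Set<String> ids = cells.get(cell);
        if (ids == null || !ids.remove(id)) {
            return false;
        }
        if (ids.isEmpty()) {
            cells.remove(cell);
        }
        return true;
    }

    /** Ids in all cells overlapping the box: a superset of the exact answer. */
    Set<String> candidates(double minX, double minY, double maxX, double maxY) {
        Set<String> result = new HashSet<>();
        for (long col = coord(minX); col <= coord(maxX); col++) {
            for (long row = coord(minY); row <= coord(maxY); row++) {
                Set<String> ids = cells.get(key(col, row));
                if (ids != null) {
                    result.addAll(ids);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        GridIndexSketch grid = new GridIndexSketch(1.0);
        grid.insert("a", 8.68, 50.11);
        grid.insert("b", 13.40, 52.52);
        System.out.println(grid.candidates(7.65, 49.59, 11.26, 51.83)); // prints [a]
    }
}
```

Unlike a bulk-built tree, inserts and removals here are ordinary key-value writes, which is what would make such a structure a candidate for a persistent key-value backend.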
@vtermanis I believe you can use the geof: functions without wrapping in a GeoDS so you don't need to add this option :-) |
That's a good idea @SimonBin - but then surely |
I see, you're right. I guess this small addition to the code is straightforward and won't hurt. (I also noticed that the code in fact needs to be updated because currently it uses a single Cache for all DSes) |
What is/was the outcome of the discussion on enabling updates to the geo-sparql spatial index? I find the lack of updates to be one limiting aspect of the Jena geo-sparql implementation; a number of other triple stores provide this out of the box, and it would be a very welcome addition. |
Maybe we shouldn't derail this thread, but as a stop-gap solution to your concern, we have implemented a method to manually update the geospatial index, which is currently good enough for our project: https://github.com/AKSW/fuseki-mods/tree/adaptions/jena-fmod-geosparql/src/main/java/org/apache/jena/fuseki/mod/geosparql |
The Spatial Index is generated on server startup and, as per design, thereafter cannot be updated (until the next Fuseki restart).
Currently there are two options for the Spatial Index in assembler configuration:
- geosparql:spatialIndexFile set => Index loaded from / generated + written to disk on startup
- geosparql:spatialIndexFile unset => Index generated in memory
For a read + write dataset, said index is not very useful (in that startup time is wasted to re-generate the index which then is out-of-date after the next write op). This proposal adds a new assembler option, geosparql:spatialIndexEnabled (defaulting to true) so that there now is a third mode:
- geosparql:spatialIndexEnabled set to false => geosparql:spatialIndexFile is ignored and no index is loaded or generated
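The three modes can be summarized as a small decision function. This is only a sketch of the behavior described above, not the assembler code from the PR; `SpatialIndexModeSketch`, `Mode`, and `resolve` are hypothetical names, and treating an empty file value like an unset one is an assumption.

```java
final class SpatialIndexModeSketch {
    enum Mode { FILE_BACKED, IN_MEMORY, DISABLED }

    /**
     * Maps the two assembler settings to a mode.
     * spatialIndexEnabled defaults to true in the proposal; spatialIndexFile may be unset (null).
     */
    static Mode resolve(boolean spatialIndexEnabled, String spatialIndexFile) {
        if (!spatialIndexEnabled) {
            // The file setting is ignored entirely when the index is disabled.
            return Mode.DISABLED;
        }
        boolean hasFile = spatialIndexFile != null && !spatialIndexFile.isEmpty();
        return hasFile ? Mode.FILE_BACKED : Mode.IN_MEMORY;
    }

    public static void main(String[] args) {
        System.out.println(resolve(true, "spatial.index"));  // prints FILE_BACKED
        System.out.println(resolve(true, null));             // prints IN_MEMORY
        System.out.println(resolve(false, "spatial.index")); // prints DISABLED
    }
}
```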