[Discussion] Higher performance "remote" validation #226

Open
ashleysommer opened this issue Apr 10, 2024 · 0 comments
ashleysommer commented Apr 10, 2024

This is something I've been thinking about for a while, since it was originally introduced in #174

The issue is that PySHACL is primarily designed to run on graphs held in memory using the RDFLib memory store. There are two primary reasons for this:

  1. PySHACL copies the input graph into a new in-memory graph to operate on, to avoid polluting the input graph.
  2. PySHACL uses native RDFLib graph operations (e.g. graph.subjects(), graph.objects(), graph.rest()). These are atomic graph operations that read directly from the underlying graph store, and they are hand-built and hand-tweaked for each SHACL constraint to achieve maximum performance.
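To make the second point concrete, here is a toy sketch (not PySHACL's actual internals) of what a hand-built constraint check looks like against an in-memory store: a minimal triple container with atomic lookups, and an sh:minCount-style check written directly on top of them.

```python
# Toy sketch, not PySHACL internals: a minimal in-memory triple store
# exposing the kind of atomic lookup operations PySHACL relies on.
class ToyGraph:
    def __init__(self, triples):
        self._triples = set(triples)

    def objects(self, subject, predicate):
        # Atomic lookup: all objects for a (subject, predicate) pair.
        return [o for (s, p, o) in self._triples if s == subject and p == predicate]

    def subjects(self, predicate, obj):
        # Atomic lookup: all subjects for a (predicate, object) pair.
        return [s for (s, p, o) in self._triples if p == predicate and o == obj]


def check_min_count(graph, focus_nodes, path, min_count):
    """Hand-built sh:minCount-style check: one atomic lookup per focus node."""
    return [f for f in focus_nodes
            if len(graph.objects(f, path)) < min_count]


g = ToyGraph([
    ("ex:Alice", "ex:name", "Alice"),
    ("ex:Bob", "ex:age", 42),
])
# ex:Bob has no ex:name triple, so it fails minCount 1.
print(check_min_count(g, ["ex:Alice", "ex:Bob"], "ex:name", 1))  # ['ex:Bob']
```

Against a local memory store these lookups are cheap dictionary/index hits, which is why this style is fast in the normal case.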

These two concerns do not translate well to "remote" graphs, where "remote" means graphs that are not in a local RDFLib store; instead they live in a graph-store service and are accessed via a SPARQL endpoint. This can be the case if you're validating against a SPARQLStore or SPARQLConnector graph in RDFLib, or using SPARQLWrapper on your graph.

In the remote case, it is not efficient, and often not desirable (or not possible), to pull a full working copy of the remote graph into a memory-backed RDFLib graph. It is also very bad for performance to run atomic graph lookup operations through the SPARQL connector, because that results in tens or hundreds of individual synchronous SPARQL queries executed against the remote graph for each constraint evaluated.
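A toy illustration of that cost (class and attribute names here are hypothetical, not PySHACL's API): if each atomic lookup becomes one synchronous SPARQL query, the round-trip count grows linearly with the number of focus nodes, per constraint.

```python
# Toy sketch: a "remote" graph wrapper that counts how many round trips
# the atomic-lookup style would cost if each lookup were a SPARQL query.
class CountingRemoteGraph:
    def __init__(self, triples):
        self._triples = set(triples)
        self.query_count = 0

    def objects(self, subject, predicate):
        # Against a real SPARQL-backed store, each of these calls would be
        # one synchronous SELECT query against the remote endpoint.
        self.query_count += 1
        return [o for (s, p, o) in self._triples if s == subject and p == predicate]


triples = [(f"ex:node{i}", "ex:name", f"n{i}") for i in range(100)]
g = CountingRemoteGraph(triples)
focus_nodes = [f"ex:node{i}" for i in range(100)]

# One sh:minCount-style check over 100 focus nodes = 100 round trips.
failures = [f for f in focus_nodes if len(g.objects(f, "ex:name")) < 1]
print(g.query_count)  # 100
```

With many constraints and many focus nodes, the network latency alone dominates the validation time.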

So I'm proposing a new mode of operation for PySHACL: a "SPARQL-optimised" or "remote" mode that causes PySHACL to use purpose-built SPARQL queries to perform validation, instead of RDFLib graph operations. This would be an implementation of the "driver only" interpretation of PySHACL proposed in #174. The key distinction is that this new mode will not replace the normal operating mode of PySHACL, and will not affect performance for users who primarily validate in-memory graphs.
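As a sketch of what the SPARQL-optimised mode could do, a single query can evaluate an sh:minCount-style constraint over all focus nodes in one round trip. The query shape below is a hypothetical illustration only; a real implementation would need full SHACL target and property-path support.

```python
def min_count_query(target_class, path, min_count):
    # Hypothetical query builder: one SPARQL query finds every focus node
    # violating sh:minCount, instead of one lookup per focus node.
    return f"""\
SELECT ?focus WHERE {{
  ?focus a <{target_class}> .
  OPTIONAL {{ ?focus <{path}> ?value }}
}}
GROUP BY ?focus
HAVING (COUNT(?value) < {min_count})
"""


print(min_count_query("http://example.org/Person", "http://example.org/name", 1))
```

One query per constraint (or even per shape) would reduce the round-trip count from O(focus nodes) to O(1) for constraints that can be expressed this way.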

There are some questions to think about:

  1. Could this be a commandline switch or validator argument, something the user switches on manually? Or should it be auto-detected when the user passes in a SPARQLConnector, SPARQLStore, or SPARQLWrapper graph? Could we simply accept an https:// SPARQL endpoint URL as the graph argument on the commandline and have it work automatically?
  2. As we're not creating a working copy of the graph for the validation, does that mean we must avoid polluting the source graph? If so, we cannot do any OWL/RDFS inferencing, no SHACL Rules can be applied, and SHACL functions must also be turned off in remote mode (as these can pollute the graph too).
  3. Are there some cases where we do want to pollute the graph? E.g. using PySHACL as a SHACL Rule engine, where you do want the new triples to appear in the source graph. This doesn't make sense on an in-memory local graph, but I can see the utility of doing it on a remote one.
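On question 1, one possible auto-detection heuristic (a sketch only, not a proposed API): treat a bare http(s) URL string as a SPARQL endpoint, and otherwise inspect the class of the store backing the RDFLib graph.

```python
def looks_remote(data_graph):
    # Hypothetical heuristic for auto-enabling "remote" mode.
    # A bare URL string passed as the data graph is taken as a SPARQL endpoint.
    if isinstance(data_graph, str):
        return data_graph.startswith(("http://", "https://"))
    # Otherwise, check whether the graph is backed by a SPARQL store
    # (matching on the store's class name, as rdflib's stores are named).
    store = getattr(data_graph, "store", None)
    return type(store).__name__ in {"SPARQLStore", "SPARQLUpdateStore"}


# Stand-ins for rdflib classes, just to exercise the heuristic here:
class SPARQLStore:
    pass


class FakeGraph:
    def __init__(self, store):
        self.store = store


print(looks_remote("https://example.org/sparql"))   # True
print(looks_remote("data.ttl"))                     # False
print(looks_remote(FakeGraph(SPARQLStore())))       # True
```

An explicit switch could still override the heuristic in either direction, since auto-detection can guess wrong (e.g. a local file path that happens to start with http).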
@ashleysommer ashleysommer changed the title Higher performance "remote" validation [Discussion] Higher performance "remote" validation Apr 10, 2024