Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PathZero generates a _lot_ of unnecessary computation #49

Open
doriantaylor opened this issue Nov 4, 2023 · 7 comments
Open

PathZero generates a _lot_ of unnecessary computation #49

doriantaylor opened this issue Nov 4, 2023 · 7 comments
Labels
help wanted Extra attention is needed

Comments

@doriantaylor
Copy link
Contributor

Hey, trying to execute a SPARQL query with the path ?s ^(skos:memberList/rdf:rest*/rdf:first) ?o where ?s is bound, on a graph of about 35,000 statements. Presumably the query will eventually complete, but only after what looks like comparing the entire graph against itself (cartesian product). Here is a flame graph from rbspy that shows the computation is heavily concentrated under SPARQL::Algebra::Operator::PathZero#execute:

2023-11-04-HTUmhkLJcq flamegraph

I have isolated the behaviour in the following query:

select distinct ?o ?c where {
  { ?o a ?c FILTER (isIRI(?o) && isIRI(?c)) }
  { <urn:uuid:00ae5130-bb3f-478b-b97a-e868c8bb17d8> ^(skos:memberList/rdf:rest*/rdf:first) ?o }
}

Note that this semantically equivalent query:

select distinct ?o ?c where {
  { ?o a ?c FILTER (isIRI(?o) && isIRI(?c)) }
  { <urn:uuid:00ae5130-bb3f-478b-b97a-e868c8bb17d8> ^(skos:memberList/(rdf:first|rdf:rest+/rdf:first)) ?o }
}

…completes in no time at all.

I suspect there may potentially be a strategic issue with the way PathZero#execute works. When both subject and object are variables, the behaviour appears to be to scan the entire graph for every subject and object and map both subject and object variables to the same values. It also executes solutions.include? for each solution (once for the subject query and once for the objects), amounting to a sequential scan (O(n²)) every time, and invocation of RDF::Query::Solution#==, each time spurring multiple invocations of RDF::URI#== and RDF::Literal#==, the latter of those two being quite expensive. One additional remark about this situation is that as it stands, literals can't be subjects, so the vast majority of these comparisons—which also happen to be the most expensive to perform—are unnecessary.

It seems to me that (path ?x (path0 :p) ?y) is like saying "anything that is a solution to ?x is also a solution to ?y (and almost the same the other way around, modulo literals), so without any additional context that is indeed like saying "every subject in the graph which is also an object". The thing is, there should be additional context before this operator is executed, if its default behaviour is to scan the entire graph and compute its cartesian product, only to throw away the results.

@doriantaylor doriantaylor changed the title PathZero generates a *lot* of unnecessary computation PathZero generates a _lot_ of unnecessary computation Nov 4, 2023
@gkellogg
Copy link
Member

gkellogg commented Nov 4, 2023

I’m sure you’re right. Emphasis was on getting the functionality in rather than optimization. A PR would be welcome.

@gkellogg gkellogg added the help wanted Extra attention is needed label Nov 4, 2023
@doriantaylor
Copy link
Contributor Author

I can try when I have some time but I think I will have to deal with the workaround in the interim. Do you know where path0 lives in the SPARQL algebra docs?

@gkellogg
Copy link
Member

gkellogg commented Nov 5, 2023

The class/method comment references the definition in the algebra section of the grammar spec. https://www.w3.org/TR/sparql11-query/#defn_evalPP_ZeroOrOnePath. It would be the result of parsing a path expression to SSE; there may be a test for it specifically.

I can spend some time on it in the next day or so.

@gkellogg
Copy link
Member

gkellogg commented Nov 6, 2023

invocation of RDF::Query::Solution#==, each time spurring multiple invocations of RDF::URI#== and RDF::Literal#==, the latter of those two being quite expensive.

This is worth its own investigation. Literal comparison has a number of rules regarding exceptions (Type Error).

PathZero is part of PathOpt, thus the reference to the ZeroOrOne path. The problem lies in the bottom-up query execution, where Path* is Path+ union Path0 and Path0 evaluates to the set of nodes in queriable (thus the contains? to remove duplicates where a term is both subject and object). This creates a solution set which is available to the next higher operator.

Your query parses to the following expression:

(prefix
 (
  (skos: <http://www.w3.org/2004/02/skos/core#>)
  (rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>))
 (distinct
  (project
   (?o ?c)
   (join
    (filter (&& (isIRI ?o) (isIRI ?c)) (bgp (triple ?o a ?c)))
    (path <urn:uuid:00ae5130-bb3f-478b-b97a-e868c8bb17d8>
     (reverse (seq (seq skos:memberList (path* rdf:rest)) rdf:first)) ?o )) )) )

The SSE emitted is consistent with that generated by Jena ARQ, but I suspect they have an execution path that does not result in this explosion. I think the solution will be to rewrite path_opt and path_star to factor around the path0 case.

It seems to me that spec/algabra/query_spec.rb distills the important bits from the SPARQL test suite relating to this, so a correct fix will pass the existing tests. To efficiently run against the entire SPARQL test suite requires cloning the rdf-tests and rdf-star repositories so that the symlinks in /spec will resolve the URLs of the test manifests.

I'll continue to spend some time on this.

@doriantaylor
Copy link
Contributor Author

doriantaylor commented Nov 6, 2023

I actually opened a branch on my fork to try rearranging the various comparison functions, particularly in RDF::Literal, before I realized it wasn't the root cause. Will PR if my changes get anything moving markedly faster.

It makes some sense, when both subject and object are variables and there is no additional context, for path0 to select all subjects all out of the graph that are also objects and return a solution mapping to both. Though I don't see an actual SPARQL syntax for path0 so I'm wondering where it came from. With all variables it's like saying SELECT ?a WHERE { ?a ?p ?b . ?c ?q ?a }, i.e. "select all subjects which are also objects".

I agree that path? and path* can probably be rewritten without referencing path0 at all, especially since those are the latter's only consumers and there is no way to directly express it in SPARQL. Heck, you could even keep path0 if there was some way to smuggle existing solutions into it.

I am currently trying to get something out the door but will circle back once I have some spare cycles.

@gkellogg
Copy link
Member

gkellogg commented Nov 7, 2023

I think this basically comes down to the theory of execution of the SPARQL operators, which is inherently bottom-up, meaning that this bottom level (path *rdf:rest) is executed to create solutions which are filtered and merged at the next level. There may be something to be gained by query optimization, but running your query through the ARQ optimizer gives the following:

(base <http://example/base/>
 (prefix ((rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
          (skos: <http://www.w3.org/2004/02/skos/core#>))
   (distinct
     (project (?o ?c)
       (sequence
         (filter (exprlist (isIRI ?o) (isIRI ?c))
           (bgp (triple ?o rdf:type ?c)))
         (sequence
           (bgp
             (triple ??P244 rdf:first <urn:uuid:00ae5130-bb3f-478b-b97a-e868c8bb17d8>)
             (triple ?o skos:memberList ??P245)
           )
           (path ??P245 (path* rdf:rest) ??P244)))))))

While this could improve matters at higher levels, it leaves the same basic solution explosion at the lowest level. Changing the execution model to take solution bindings from the sequence to limit the space that the lowest-level query is run over is probably the way to go, but this looks like a pretty fundamental change, and it's not clear to me how we get there.

Ultimately, there is a reason that there are commercial SPARQL engines that have spent a lot on query optimization, and we simply haven't had the resources to explore this more fully.

@gkellogg
Copy link
Member

gkellogg commented Nov 8, 2023

Some more thoughts on what could eventually happen:

  1. Queries ultimately come down to a series of BGPs, which are handled using rdf::Query#execute. Although unused, this accepts solution bindings. In theory, the 2nd and later queries in a join or sequence can take this set of solution bindings which can make the iteration over the queriable/repository more efficient. Right now, these queries are done without passing bindings, and the merge is done afterwards. It could be more efficient to run with these bindings.

  2. Without doing query rewriting, the current execution path likely could not benefit from this, but a pattern-based optimizer which can turn (path <urn:uuid:00ae5130-bb3f-478b-b97a-e868c8bb17d8> (reverse (seq (seq skos:memberList (path* rdf:rest)) rdf:first)) ?o ) into (sequence (bgp (triple ??P244 rdf:first <urn:uuid:00ae5130-bb3f-478b-b97a-e868c8bb17d8>) (triple ?o skos:memberList ??P245) ) (path ??P245 (path* rdf:rest) ??P244)) could benefit from this. Right now, optimization is limited to re-ordering triple patterns into a query.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

2 participants