Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance regression for queries using OPTIONAL and VALUES in 3.7.4/4.1.0 vs 2.5.5 #4232

Open
daltontc opened this issue Oct 12, 2022 · 11 comments · May be fixed by #4330
Open

Performance regression for queries using OPTIONAL and VALUES in 3.7.4/4.1.0 vs 2.5.5 #4232

daltontc opened this issue Oct 12, 2022 · 11 comments · May be fixed by #4330
Assignees
Labels
🐞 bug issue is a bug
Milestone

Comments

@daltontc
Copy link

Current Behavior

In versions 3.7.4 and 4.1.0, iterating over a TupleQueryResult with large results is extremely slow when compared to older RDF4j versions (2.5.5).

With newer RDF4j versions, running the following query over the CHEBI Lite ontology stored in a graph takes over an hour to process results whereas in RDF4j 2.5.5 it only takes 4s.

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>

select ?parent ?child
where {
    ?parent rdf:type owl:Class .
    optional {
        ?child rdfs:subClassOf ?parent ;
               rdf:type owl:Class .
    }
}

See https://github.com/daltontc/rdf4jTest/tree/bug/tuplequeryresults_slow_iteration for an example

Expected Behavior

The time it takes to iterate over large query results should be relatively consistent between versions, if not better in later versions.

Steps To Reproduce

  1. Clone https://github.com/daltontc/rdf4jTest/tree/bug/tuplequeryresults_slow_iteration (requires Git LFS due to large trig file)
  2. Run Main with 2.5.5
  3. Results are iterated over in 4s
Total query time: 4228
Parent Count: 302455
Child Count: 122258
  1. Switch version to 3.7.4 or 4.1.0
  2. Run Main
  3. Results take ages to completely iterate over

Version

3.7.4, 4.1.0

Are you interested in contributing a solution yourself?

No response

Anything else?

No response

@daltontc daltontc added the 🐞 bug issue is a bug label Oct 12, 2022
@github-actions github-actions bot added this to 📥 Inbox in Project Progress Oct 12, 2022
@kenwenzel
Copy link
Contributor

Can you try to use the query explanation feature:
https://rdf4j.org/documentation/programming/repository/#explaining-queries

The long runtime could be related to sub-optimal query optimization.

@daltontc
Copy link
Author

daltontc commented Oct 13, 2022

I let the query actually finish out with 3.7.4 and it took nearly 24 hours. Crazy that the query used to run efficiently in 4 seconds but now takes nearly a full day.

Total query time: 83592008
Parent Count: 302455
Child Count: 122258

Here is the explanation:

Timed out while retrieving explanation! Explanation may be incomplete!
You can change the timeout by setting .setMaxExecutionTime(...) on your query.

Projection (resultSizeActual=38.2K, totalTimeActual=60.0s, selfTimeActual=12.2ms)
╠══ProjectionElemList
║     ProjectionElem "parent"
║     ProjectionElem "child"
╚══LeftJoin (LeftJoinIterator) (resultSizeActual=38.2K, totalTimeActual=59.9s, selfTimeActual=15.3ms)
   ├──Join (JoinIterator) (resultSizeActual=91, totalTimeActual=39.3ms, selfTimeActual=1.82ms)
   │  ╠══BindingSetAssignment ([[type=http://www.w3.org/2002/07/owl#Class], [type=http://www.w3.org/2000/01/rdf-schema#Class]]) (costEstimate=0, resultSizeEstimate=1, resultSizeActual=1, totalTimeActual=0.014ms, selfTimeActual=0.014ms)
   │  ╚══StatementPattern (costEstimate=430, resultSizeEstimate=185.1K, resultSizeActual=91, totalTimeActual=37.4ms, selfTimeActual=37.4ms)
   │        Var (name=parent)
   │        Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
   │        Var (name=type)
   └──Join (JoinIterator) (resultSizeActual=38.2K, totalTimeActual=59.9s, selfTimeActual=10.6s)
      ╠══StatementPattern (costEstimate=215, resultSizeEstimate=185.1K, resultSizeActual=12.1M, totalTimeActual=25.9s, selfTimeActual=25.9s)
      ║     Var (name=child)
      ║     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
      ║     Var (name=type)
      ╚══StatementPattern (costEstimate=1, resultSizeEstimate=191.4K, resultSizeActual=38.2K, totalTimeActual=23.3s, selfTimeActual=23.3s)
            Var (name=child)
            Var (name=_const_4592be07_uri, value=http://www.w3.org/2000/01/rdf-schema#subClassOf, anonymous)
            Var (name=parent)

If I switch the query to remove one of the values in the VALUES declaration, it runs just as slow:

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>

select ?parent ?child
where {
    values ?type {owl:Class}
    ?parent rdf:type ?type .
    optional {
        ?child rdfs:subClassOf ?parent ;
               rdf:type ?type .
    }
}

But by removing the VALUES declaration and subbing in just owl:Class it runs in 4s again.

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>

select ?parent ?child
where {
    ?parent rdf:type owl:Class .
    optional {
        ?child rdfs:subClassOf ?parent ;
               rdf:type owl:Class .
    }
}
Total query time: 4150
Parent Count: 302455
Child Count: 122258

Reworking the query to UNION both different VALUES has it run in the original 4s

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>

select ?parent ?child
where {
    {
        ?parent rdf:type owl:Class .
        optional {
            ?child rdfs:subClassOf ?parent ;
                   rdf:type owl:Class .
        }
    }
    UNION
    {
        ?parent rdf:type rdfs:Class .
        optional {
            ?child rdfs:subClassOf ?parent ;
                   rdf:type rdfs:Class .
        }
    }
}
Total query time: 4332
Parent Count: 302455
Child Count: 122258

It appears that there is an issue with VALUES in the query evaluation.

@hmottestad
Copy link
Contributor

Is it correct that the query you are having performance issues with is using a VALUES clause, but if you remove the VALUES clause then there is no performance difference?

@daltontc
Copy link
Author

@hmottestad I would say that is accurate. These two essentially equivalent queries run in 23 hours and 4 seconds respectively.

prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>

select ?parent ?child
where {
    values ?type {owl:Class}
    ?parent rdf:type ?type .
    optional {
        ?child rdfs:subClassOf ?parent ;
               rdf:type ?type .
    }
}
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix owl: <http://www.w3.org/2002/07/owl#>

select ?parent ?child
where {
    ?parent rdf:type owl:Class .
    optional {
        ?child rdfs:subClassOf ?parent ;
               rdf:type owl:Class .
    }
}

@hmottestad
Copy link
Contributor

@barthanssens barthanssens moved this from 📥 Inbox to 🚧 In progress in Project Progress Oct 17, 2022
@barthanssens barthanssens added this to the 4.2.1 milestone Oct 17, 2022
@hmottestad hmottestad changed the title Extreme Performance Degradation in Large TupleQueryResult Iteration in 3.7.4/4.1.0 vs 2.5.5 Performance regression for queries using OPTIONAL ans VALUES in 3.7.4/4.1.0 vs 2.5.5 Oct 18, 2022
@hmottestad hmottestad changed the title Performance regression for queries using OPTIONAL ans VALUES in 3.7.4/4.1.0 vs 2.5.5 Performance regression for queries using OPTIONAL and VALUES in 3.7.4/4.1.0 vs 2.5.5 Oct 18, 2022
@hmottestad hmottestad modified the milestones: 4.2.1, 4.2.2 Nov 8, 2022
@abrokenjester abrokenjester moved this from 🚧 In progress to 📋 Backlog in Project Progress Nov 19, 2022
@abrokenjester abrokenjester self-assigned this Nov 19, 2022
@abrokenjester abrokenjester moved this from 📋 Backlog to 🚧 In progress in Project Progress Nov 19, 2022
abrokenjester added a commit that referenced this issue Nov 20, 2022
This is the generically correct way to parse VALUES clauses. An
optimizer can potentially look at the ordering in the algebra to push
the values clause down into the join tree (by inspecting which parts of
the tree have variables bound in the VALUES clause).
@abrokenjester
Copy link
Contributor

I just spent a fruitless two hours on this, tricky to see where the problem is. TBC.

@hmottestad hmottestad modified the milestones: 4.2.2, 4.2.3 Dec 1, 2022
abrokenjester added a commit that referenced this issue Dec 17, 2022
This is the generically correct way to parse VALUES clauses. An
optimizer can potentially look at the ordering in the algebra to push
the values clause down into the join tree (by inspecting which parts of
the tree have variables bound in the VALUES clause).
@abrokenjester
Copy link
Contributor

I'm trying to construct a benchmark that reproduces the performance difference, using a synthetic dataset (that hopefully is a little easier to work with than the full dataset provided by @daltontc ). Struggling to show any significant performance difference though. Will try and look closer at the shape of the dataset. @daltontc any insights from your end on how your data is structured? Ideally I'd like to generate a dataset that we can use to reproduce the performance issue, but not quite at the scale of your original set.

abrokenjester added a commit that referenced this issue Dec 18, 2022
This benchmark uses generated data conforming to the query pattern, and
executes performance tests on both the variant with a VALUES clause, and
(as a baseline) the simple equivalent query.

Unfortunately, sofar I have been unable to reproduce any significant performance difference.
@abrokenjester abrokenjester linked a pull request Dec 18, 2022 that will close this issue
5 tasks
@hmottestad
Copy link
Contributor

There is a trig file in the repo that @daltontc linked too. It reproduces the issue for me.

I think it's an issue with the join order from the query planner. I'll explain more in a bit, will also hopefully push a commit to your branch.

@hmottestad
Copy link
Contributor

hmottestad commented Dec 18, 2022

Without VALUES clause

Projection (resultSizeActual=505)
╠══ProjectionElemList
║     ProjectionElem "parent"
║     ProjectionElem "child"
╚══LeftJoin (LeftJoinIterator) (resultSizeActual=505)
   ├──StatementPattern (resultSizeEstimate=504, resultSizeActual=504)
   │     Var (name=parent)
   │     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
   │     Var (name=_const_7e619dcc_uri, value=http://www.w3.org/2002/07/owl#Class, anonymous)
   └──Join (JoinIterator) (resultSizeActual=2)
      ╠══StatementPattern (costEstimate=50, resultSizeEstimate=9.9K, resultSizeActual=2)
      ║     Var (name=child)
      ║     Var (name=_const_4592be07_uri, value=http://www.w3.org/2000/01/rdf-schema#subClassOf, anonymous)
      ║     Var (name=parent)
      ╚══StatementPattern (costEstimate=1, resultSizeEstimate=504, resultSizeActual=2)
-->         Var (name=child) <-- The ?child variable won't be bound until the "?child rdfs:subClassOf ?parent" statement is evaluated
            Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
            Var (name=_const_7e619dcc_uri, value=http://www.w3.org/2002/07/owl#Class, anonymous)

With VALUES clause

Projection
╠══ProjectionElemList
║     ProjectionElem "parent"
║     ProjectionElem "child"
╚══LeftJoin (LeftJoinIterator)
   ├──Join (JoinIterator)
   │  ╠══BindingSetAssignment ([[type=http://www.w3.org/2002/07/owl#Class]]) (costEstimate=1, resultSizeEstimate=1)
   │  ╚══StatementPattern (costEstimate=504, resultSizeEstimate=504)
   │        Var (name=parent)
   │        Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
-->│        Var (name=type, value=http://www.w3.org/2002/07/owl#Class) <--- ?type is able to bind to the ?type below
   └──Join (JoinIterator)
      ╠══StatementPattern (costEstimate=11, resultSizeEstimate=508)
      ║     Var (name=child)
      ║     Var (name=_const_f5e5585a_uri, value=http://www.w3.org/1999/02/22-rdf-syntax-ns#type, anonymous)
-->   ║     Var (name=type) <--- This ?type!
      ╚══StatementPattern (costEstimate=1, resultSizeEstimate=9.9K)
            Var (name=child)
            Var (name=_const_4592be07_uri, value=http://www.w3.org/2000/01/rdf-schema#subClassOf, anonymous)
            Var (name=parent)

Notice that the plans swap the bottom two joins. This is because I've "helped" the join optimizer by adding a lot of [] rdfs:subClassOf [] statements. I've also added 500 [] a owl:Class statements to add some cost to the ?child a ?type query line.

The join optimizer swaps the two last joins because ?type can be bound to the outer join in the query using VALUES.

See the inline comments marked with -->

@hmottestad
Copy link
Contributor

@jeenbroekstra I'm not going to dig any further right now. If you don't have time yourself then maybe I'll pick it up in a few days.

@abrokenjester
Copy link
Contributor

@hmottestad nice work, thanks! I'll try and pick it up again later this week, but feel free to also dig in if you have the time :)

abrokenjester added a commit that referenced this issue Jan 20, 2023
This is the generically correct way to parse VALUES clauses. An
optimizer can potentially look at the ordering in the algebra to push
the values clause down into the join tree (by inspecting which parts of
the tree have variables bound in the VALUES clause).
abrokenjester added a commit that referenced this issue Jan 20, 2023
This benchmark uses generated data conforming to the query pattern, and
executes performance tests on both the variant with a VALUES clause, and
(as a baseline) the simple equivalent query.

Unfortunately, sofar I have been unable to reproduce any significant performance difference.
@hmottestad hmottestad modified the milestones: 4.2.3, 4.2.4 Feb 11, 2023
@abrokenjester abrokenjester modified the milestones: 4.2.4, 4.2.5 Apr 22, 2023
@hmottestad hmottestad modified the milestones: 4.2.5, 4.3.2, 4.3.3 Jun 1, 2023
@hmottestad hmottestad modified the milestones: 4.3.3, 4.3.4 Jul 5, 2023
@hmottestad hmottestad modified the milestones: 4.3.4, 4.3.5 Jul 22, 2023
@hmottestad hmottestad modified the milestones: 4.3.5, 4.3.6 Aug 6, 2023
@hmottestad hmottestad modified the milestones: 4.3.6, 4.3.7 Aug 27, 2023
@hmottestad hmottestad modified the milestones: 4.3.7, 4.3.8 Oct 6, 2023
@hmottestad hmottestad modified the milestones: 4.3.8, 4.3.9 Nov 7, 2023
@hmottestad hmottestad modified the milestones: 4.3.9, 4.3.10 Jan 21, 2024
@hmottestad hmottestad modified the milestones: 4.3.10, 4.3.11 Mar 6, 2024
@hmottestad hmottestad modified the milestones: 4.3.11, 4.3.12 Apr 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🐞 bug issue is a bug
Projects
No open projects
Project Progress
🚧 In progress
Development

Successfully merging a pull request may close this issue.

5 participants