Solr-specific query optimizations #96

Open
agazzarini opened this issue Jul 14, 2015 · 6 comments

@agazzarini
Member

The first implementation step of the Solr-Jena bridge is now complete: as suggested by the Jena devs, it is basically a Solr-specific implementation of the Jena graph and dataset domain model.

Now it's time to move on to non-functional requirements, efficiency first of all: the default behaviour of Op and related classes (in general, I think, a lot of the things in charge of managing the query algebra and its execution) needs to be adapted / specialized in order to provide Solr-specific optimizations.

As I am almost ignorant of those topics, I'm trying to study them, but I believe it will take me a bit of time. If there's someone who is more expert than me (very easy) or who simply wants to join this adventure, feel free to give me a shout ;)

@agazzarini
Member Author

The first step is a Solr-specific implementation of OpBGP and the corresponding execution plan.

An idea (that I'm testing) is:

  • run a (separate) filter query for each triple pattern
  • take the DocSet with the lowest cardinality
  • intersect that one with the second pattern, then the result with the third pattern, and so on

In this way the total number of operations needed should be smaller than in the current (default) implementation, because every intersection starts from the smallest candidate set.
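A minimal sketch of the idea above, under loud assumptions: `java.util.BitSet` stands in for a Solr DocSet, each input set is taken to be the result of one per-pattern filter query, and the class and method names are hypothetical. What the ids identify, and how patterns are really joined in the bridge, is left open here.

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical stand-in: BitSet plays the role of a Solr DocSet.
final class SmallestFirstIntersection {

    /** Seeds the result with the lowest-cardinality set, then intersects the others in order. */
    static BitSet intersect(List<BitSet> perPatternMatches) {
        if (perPatternMatches.isEmpty()) {
            return new BitSet();
        }
        // Pick the set with the lowest cardinality as the starting point.
        BitSet seed = perPatternMatches.get(0);
        for (BitSet candidate : perPatternMatches) {
            if (candidate.cardinality() < seed.cardinality()) {
                seed = candidate;
            }
        }
        // Work on a copy so the caller's sets are never modified.
        BitSet result = (BitSet) seed.clone();
        for (BitSet other : perPatternMatches) {
            if (other == seed) {
                continue; // already accounted for
            }
            if (result.isEmpty()) {
                break; // nothing left to intersect
            }
            result.and(other);
        }
        return result;
    }

    /** Convenience factory for building a set of ids. */
    static BitSet docs(int... ids) {
        BitSet set = new BitSet();
        for (int id : ids) {
            set.set(id);
        }
        return set;
    }
}
```

For example, intersecting `docs(1, 2, 3, 4)`, `docs(2, 4)` and `docs(2, 4, 6)` starts from the second set and yields the ids 2 and 4.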

@agazzarini
Member Author

A great step ahead: I created the first working version of the Jena StageGenerator, which is in charge of executing and resolving Basic Graph Patterns (BGPs), the SPARQL building blocks.

It leverages low-level Solr / Lucene stuff in order to speed up and optimize pattern execution. At first glance I see good results, so it seems the idea could work. However, I need

  • to structure / refactor the whole thing in order to end with a decent design
  • to get the whole integration suite working (standalone and SolrCloud mode)
  • to run some benchmarks with a consistent set of triples.

@agazzarini
Member Author

The stuff above has been committed to a dedicated branch - issue_89 - so it's not in master.

@agazzarini
Member Author

Still a lot of things to do. I'm trying to build a bridge between the Jena Op / OpExecutor framework and the Solr world. The general iterator behaviour of the Jena classes (i.e. QueryIterator) sometimes doesn't fit very well with the Solr logic, especially when a lot of members participate in the query execution plan. Something, for example, like this:

(project (?first ?last ?workTel)
  (conditional
    (filter (> ?amount 10000)
      (bgp
        (triple ?s <http://learningsparql.com/ns/addressbook#firstName> ?first)
        (triple ?s <http://learningsparql.com/ns/addressbook#lastName> ?last)
        (triple ?s <http://learningsparql.com/ns/addressbook#portfolio> ?amount)
      ))
    (bgp (triple ?s <http://learningsparql.com/ns/addressbook#workTel> ?workTel))))

(project (?first ?last ?workTel)
  (filter (> ?amount 10000)
    (leftjoin
      (bgp
        (triple ?s <http://learningsparql.com/ns/addressbook#firstName> ?first)
        (triple ?s <http://learningsparql.com/ns/addressbook#lastName> ?last)
        (triple ?s <http://learningsparql.com/ns/addressbook#portfolio> ?amount)
      )
      (bgp (triple ?s <http://learningsparql.com/ns/addressbook#workTel> ?workTel)))))

So what I'm trying to build is a new set of classes that act as reducers from a given algebra expression to a Solr DocSet. These classes also need to implement the Jena QueryIterator interface lazily, that is: when Jena asks for Bindings or QuerySolutions, they will produce them on demand. Before that, they will work only with the Solr / Lucene data model, optimizing and compacting the operations according to the capabilities of the corresponding query parser.
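To make the "lazy" part concrete, here is a rough sketch, not the code in the branch. It assumes a plain `java.util.Iterator` in place of Jena's QueryIterator, a `java.util.BitSet` in place of the Solr DocSet, and a caller-supplied function that turns a doc id into a binding; the class name is hypothetical. The point is only that nothing is converted until the consumer asks for the next element.

```java
import java.util.BitSet;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

// Hypothetical stand-in for a QueryIterator backed by an already-reduced doc set.
final class LazyDocSetIterator<B> implements Iterator<B> {
    private final BitSet docs;
    private final IntFunction<B> toBinding;
    private int nextDoc;

    LazyDocSetIterator(BitSet docs, IntFunction<B> toBinding) {
        this.docs = docs;
        this.toBinding = toBinding;
        this.nextDoc = docs.nextSetBit(0); // -1 when the set is empty
    }

    @Override
    public boolean hasNext() {
        return nextDoc >= 0;
    }

    @Override
    public B next() {
        if (nextDoc < 0) {
            throw new NoSuchElementException();
        }
        int current = nextDoc;
        // Advance the cursor; guard against overflow at the largest possible index.
        nextDoc = (current == Integer.MAX_VALUE) ? -1 : docs.nextSetBit(current + 1);
        // The binding is materialized only now, on demand.
        return toBinding.apply(current);
    }
}
```

All set-level work (intersections, filters) would happen before this iterator is handed to the consumer, so the per-element cost is just the id-to-binding conversion.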

@agazzarini
Member Author

A first implementation of Basic Graph Pattern execution seems to work. It operates directly at the low Lucene level, executing successive joins between the docsets (one resulting from each triple pattern in the graph).

Again, the underlying idea seems to work but needs some more time: I tried running the integration suite and there are some expected failures (but also a lot of green tests), so the issue_89 branch is definitely unstable.

@agazzarini changed the title from "Query Solr-specific optimizations" to "Solr-specific query optimizations" on Aug 15, 2015
agazzarini pushed a commit that referenced this issue Aug 20, 2015
agazzarini pushed a commit that referenced this issue Aug 22, 2015
agazzarini pushed a commit that referenced this issue Aug 23, 2015
failures / 8 errors, mainly expected ClassCastException)
@agazzarini
Member Author

The issue_89 branch contains a rough implementation of

  • BGP executor (QueryIterator) that works with most of the BGP integration tests
  • Filter executor that injects the filter directly into the BGP executor instead of decorating it
  • Conditional executor, which compares two lazy BGPs (not really satisfied with the implementation, but it's working)
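As an illustration of "injecting" a filter rather than decorating the executor, here is a small sketch under stated assumptions: `java.util.BitSet` again stands in for a Solr DocSet, the filter is assumed to be already translated into the set of ids it accepts, and `BgpExecutorSketch` with its methods is a hypothetical name, not the class in the issue_89 branch.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: the filter becomes one more constraint inside the executor.
final class BgpExecutorSketch {
    private final List<BitSet> constraints = new ArrayList<>();

    BgpExecutorSketch(List<BitSet> perPatternMatches) {
        constraints.addAll(perPatternMatches);
    }

    /** Injects a filter: the ids it accepts are added as an extra constraint. */
    BgpExecutorSketch withFilter(BitSet filterMatches) {
        constraints.add(filterMatches);
        return this;
    }

    /** Resolves patterns and filters together, purely at the id-set level. */
    BitSet execute() {
        if (constraints.isEmpty()) {
            return new BitSet();
        }
        BitSet result = (BitSet) constraints.get(0).clone();
        for (int i = 1; i < constraints.size(); i++) {
            result.and(constraints.get(i));
        }
        return result;
    }

    /** Convenience factory for building a set of ids. */
    static BitSet ids(int... values) {
        BitSet set = new BitSet();
        for (int value : values) {
            set.set(value);
        }
        return set;
    }
}
```

A decorating filter would instead wrap the executor's output and discard bindings one by one after they have been produced; with injection, rejected ids never reach the binding stage.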

There are still 14 failures and 8 errors in the SELECT tests. They are mainly

  • ClassCastException: I haven't implemented all Op* yet, so sometimes I (wrongly) assume the concrete instance of a given operation
  • related to filters and functions: I need a more general bridge between the Jena functions and the Solr filters.
