Justin Clark-Casey edited this page Mar 14, 2018 · 8 revisions

Current State

InterMine is currently using a very old version of Lucene (3.0.2) for its keyword search capability.

For more information on search, please see this Google doc. The relevant parts of that doc should eventually be moved to this wiki page, as this is a much better location for long-lived public technical information and discussion.

The information here is almost certainly incomplete. If you need more, ping me (justincc) on the InterMine mailing list, in Discord chat or wherever.

Discussion/comments/suggested edits to this document are very welcome.

Classes

(these paths may not be completely accurate as they will change between InterMine 1.6 and 2.0)

  • $MINE/dbmodel/resources/keyword_search.properties - per-mine Lucene keyword configuration properties
  • intermine/api/src/main/java/org/intermine/api/lucene/ - the meat of the code that copies the data in InterMine database objects into the Lucene index during the mine build phase. Particularly important classes:
    • InterMineObjectFetcher.java - plays the key role of extracting InterMine objects out of the database and building the Lucene document before placing it on a queue.
    • KeywordSearch.java - pulls documents off the indexing queue and constructs the Lucene index, before stuffing it into the intermine_metadata Postgres table. Also used during InterMine operation to pull the serialized Lucene index out of the intermine_metadata table and reinflate it into RAM. Also contains the methods to search the index (this is all terrible design, this should be 2 separate classes with separate concerns - justincc).
  • intermine/web/main/src/org/intermine/web/search/ - contains classes for displaying search results in the InterMine webapp.
  • intermine/webtasks/main/src/org/intermine/web/task/CreateSearchIndexTask.java - the database build Ant search index task, which invokes the rest of the Lucene index building code.

Operation

There are two phases to InterMine search, the index construction phase and the operational phase. We'll discuss each of these in turn.

Index construction phase

This occurs during the building of the database for a particular InterMine instance (e.g. Wormmine, Flymine, Humanmine, etc.). After the main build process, InterMine runs a series of post-processing tasks. One of these is the create-search-index task, which creates the Lucene index. There are two chief players here. InterMineObjectFetcher is an independent process that retrieves each InterMineObject and creates a Lucene document that is placed onto a queue (essentially by taking every int and string field from the InterMineObject and replicating it in the Lucene Document). KeywordSearch takes those documents and puts them into a Lucene index, before serialising that index to the intermine_metadata table in the mine's Postgres database, under the search key.
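The fetcher/indexer hand-off described above can be sketched as a plain producer-consumer queue. This is a minimal illustration using only the Java standard library, not the real InterMine or Lucene code: the class name IndexBuildSketch, the use of field maps as stand-ins for InterMineObjects and Lucene Documents, the "id" field, and the whitespace tokenisation are all assumptions made for the example.

```java
import java.util.*;
import java.util.concurrent.*;

/** Hypothetical sketch; these are not InterMine's actual classes. */
class IndexBuildSketch {
    // Sentinel that tells the indexer the fetcher has finished (compared by identity).
    private static final Map<String, String> END = new HashMap<>();

    /** Builds a term -> object ids index from objects given as field maps with an "id" entry. */
    static Map<String, Set<String>> buildIndex(List<Map<String, String>> objects)
            throws InterruptedException {
        BlockingQueue<Map<String, String>> queue = new LinkedBlockingQueue<>(100);

        // Fetcher: turns each object into a "document" and places it on the queue.
        Thread fetcher = new Thread(() -> {
            try {
                for (Map<String, String> object : objects) {
                    queue.put(new HashMap<>(object)); // copy every field into the document
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        fetcher.start();

        // Indexer: takes documents off the queue and adds their terms to the index.
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map<String, String> doc = queue.take(); doc != END; doc = queue.take()) {
            String id = doc.get("id");
            for (String value : doc.values()) {
                for (String term : value.toLowerCase().split("\\s+")) {
                    if (!term.isEmpty()) {
                        index.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
                    }
                }
            }
        }
        fetcher.join();
        return index;
    }
}
```

The bounded queue means the fetcher blocks when the indexer falls behind, so the two sides never hold more than a fixed number of documents in memory between them.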

Operational phase

When the index is first required, for example by a user entering a keyword to search or by invocation of the search webservice, the index is lazily deserialized from the intermine_metadata table and reconstituted in memory (this is why the search page takes a long time to load when the webapp has just been restarted, and why users with large indexes run into difficulties). Thereafter, in the case of a query from the webapp, the execute() method executes the search and returns the results to the user.
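The lazy loading behaviour can be illustrated with a small sketch. It is an assumption-laden stand-in rather than the real KeywordSearch code: the class LazyIndexSketch is hypothetical, a byte array plays the role of the serialized index stored in intermine_metadata, and a simple term-to-ids map plays the role of the Lucene index.

```java
import java.io.*;
import java.util.*;

/** Hypothetical sketch; this is not InterMine's actual KeywordSearch class. */
class LazyIndexSketch {
    private final byte[] storedBlob;               // stands in for the blob in intermine_metadata
    private Map<String, List<String>> index;       // stays null until the first search
    int loads = 0;                                 // how many times the blob was deserialised

    LazyIndexSketch(byte[] storedBlob) {
        this.storedBlob = storedBlob;
    }

    /** Serialises an index to bytes, as would happen at the end of the build. */
    static byte[] serialise(Map<String, List<String>> index) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new HashMap<>(index));
        }
        return bytes.toByteArray();
    }

    /** Deserialises the blob on first use only; later calls reuse the in-memory copy. */
    @SuppressWarnings("unchecked")
    private synchronized Map<String, List<String>> loadedIndex()
            throws IOException, ClassNotFoundException {
        if (index == null) {
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(storedBlob))) {
                index = (Map<String, List<String>>) in.readObject();
            }
            loads++;
        }
        return index;
    }

    /** Returns the ids of objects matching the term; the first call pays the loading cost. */
    List<String> search(String term) throws IOException, ClassNotFoundException {
        return loadedIndex().getOrDefault(term.toLowerCase(), Collections.emptyList());
    }
}
```

In this pattern the whole cost of deserialisation lands on whichever request arrives first after a restart, which matches the slow first search described above.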

Other information

Facets

Because InterMine's search code is so ancient, it doesn't use Lucene facets (which I think were introduced around v4), but rather a third-party package called Bobo facets which is no longer maintained. Any updating of the code will need to move search to using Lucene/ElasticSearch/Solr facets.
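For readers unfamiliar with the term, a facet is a count of search hits grouped by the value of some field, which the UI can then offer as a filter. The sketch below shows only that idea in plain Java. It does not use the Bobo, Lucene, ElasticSearch or Solr facet APIs, and the class name FacetSketch and the "Category" field are assumptions made for illustration.

```java
import java.util.*;

/** Hypothetical sketch of facet counting; not the Bobo or Lucene facet API. */
class FacetSketch {
    /** Counts how many hits carry each value of the given field. */
    static Map<String, Integer> countFacet(List<Map<String, String>> hits, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> hit : hits) {
            String value = hit.get(field);
            if (value != null) {              // hits without the field are not counted
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Given hits of two genes and one protein, countFacet(hits, "Category") would report a count of 2 for the gene category and 1 for the protein category.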

The future

We can't keep using this ancient version of Lucene; we have to upgrade in some way. I (justincc) currently see the following choices.

  • Update our embedded Lucene

    • pro - Simplest for out-of-the-box users - no independent process to manage.
    • pro - the simplest solution that's known to work (no need for InterMine/operator to do process management).
    • con - More sophisticated users are still stuck with an embedded Lucene, they can't put InterMine search data into their own ES/Solr instance with other biological search data, etc.
    • con - Embedding Lucene is not a common solution nowadays, and so will be more difficult to maintain.
    • con - Will prevent use of ES/Solr tooling by ourselves and mine operators (for search analysis, diagnostics, etc.)
    • con - will not prevent problems of slow startup and large memory usage, if we still need to inflate the search index to memory before use.
  • Embed ElasticSearch/Solr like we embed Lucene now

    • pro - Simplest for out-of-the-box users - no independent process to manage.
    • pro - Get to use at least some Solr/ElasticSearch facilities
    • pro - With a bit more code, makes it possible for sophisticated operators to use an external Solr/ES instance.
    • con - Not a supported solution.
    • con - will not prevent problems of slow startup and large memory usage, if we still need to inflate the search index to memory before use.
  • Run ElasticSearch/Solr as an external process

    • pro - Don’t need to deal with embedding Lucene or trying some other solution for out-of-the-box operation by unsophisticated mine operators
    • pro - If we're using an external ES/Solr process, then a user can easily move that search data to another instance later on (e.g. if they run multiple systems using es/solr search).
    • pro - Powerful ElasticSearch/Solr specific capabilities.
    • con - Operation becomes considerably more complicated, as ES/Solr now needs to be run as a separate process. There are various solutions here:
      1. Leave Solr/ES management entirely to the user. A very strong pro is that this reduces the complexity of InterMine and means we don't have to possibly cover lots of complicated situations. Running Solr/ES is fairly simple, though configuration for InterMine may then be another separate process. A very strong con is that leaving the user to manage ES/Solr degrades the simplicity of using InterMine.
      2. Have InterMine automatically configure and manage a container (e.g. Docker) running Solr/ES. Strong pro here, if we can get it right, is keeping the out-of-the-box InterMine experience simple. Strong con is that there's another moving part, and orchestration of a separate container in a way that's invisible to the operator may be pretty complicated.
  • Allow both embedded Lucene or an external Solr/ES

    • pro - user experience: out-of-the-box users can continue with Lucene whilst sophisticated operators can configure ES/Solr
    • con - high code complexity and greater maintenance costs

See #517 for a PR which replaced Lucene with Solr. Unfortunately, at the time we didn't resolve the various issues that came up, and the patch is against InterMine 1.6, whereas any new code would need to be against 2.0 onwards, which is Gradle (as opposed to pure custom Ant) based, and so has a different directory structure. Also, migration to 2.0 is ongoing right now (2018-03) which really doesn't help.