
build-mine.py

Justin Clark-Casey edited this page Mar 8, 2018 · 4 revisions

Introduction

build-mine.py is a new mine building script in development for the Gradle-based InterMine 2.0. It is intended to replace the previous project_build script.

Because it is still in development, it currently exists only in the biotestmine repository. If accepted and once debugged, it will be made available for all mines to use.

The script aims to provide a single command to integrate all the data required to produce a mine. It will:

  • Create databases if necessary (production and intermediate items databases). The userprofiles database is not created by this script (this process has yet to be documented).
  • Integrate all the sources listed in the mine's project.xml control file, and run the necessary post-processes to create the search index, etc.
  • Make checkpoint copies of the database, either within Postgres itself or on the filesystem, at points controlled by the project.xml file. These checkpoints can later be used by the script to resume an incomplete or failed build.

Usage

  1. First, you need to install the Python dependencies for build-mine.py. You can do this with the command

sudo pip3 install -r requirements.txt

  2. Usage for the script is as follows (run ./build-mine.py --help for the most up-to-date usage):
usage: Build the mine [-h] [-c CHECKPOINTS_LOCATION] [--dry-run] [--fbt]
                      mine_properties_path

positional arguments:
  mine_properties_path  path to the mine's properties file, e.g.
                        ~/.intermine/biotestmine.properties

optional arguments:
  -h, --help            show this help message and exit
  -c CHECKPOINTS_LOCATION, --checkpoints-location CHECKPOINTS_LOCATION
                        The location for reading/writing database checkpoints
  -r, --reset           Reset the build. This will delete all existing
                        checkpoints from the checkpoint location and start the
                        mine build from the beginning.
  --dry-run             Dont actually build anything, just show the commands
                        that would be executed
  --fbt, --force-backend-termination
                        If true, then we will periodically run the postgres
                        function pg_terminate_backend() to try and clear out
                        old connections. This may help if InterMine is not
                        properly closing its connections.
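
The help text above has the shape that Python's argparse module produces. As a rough sketch only (the actual build-mine.py source is not reproduced here, and the `make_parser` function and the `force_backend_termination` attribute name are assumptions made for illustration), a parser accepting the documented options could be declared like this:

```python
import argparse


def make_parser():
    # prog mirrors the "usage: Build the mine" line in the help output.
    parser = argparse.ArgumentParser(prog="Build the mine")
    parser.add_argument(
        "mine_properties_path",
        help="path to the mine's properties file, "
             "e.g. ~/.intermine/biotestmine.properties",
    )
    parser.add_argument(
        "-c", "--checkpoints-location",
        default=":database:",  # documented default: checkpoint inside Postgres
        help="The location for reading/writing database checkpoints",
    )
    parser.add_argument(
        "-r", "--reset", action="store_true",
        help="Reset the build and delete all existing checkpoints",
    )
    parser.add_argument(
        "--dry-run", action="store_true",
        help="Show the commands that would be executed without running them",
    )
    parser.add_argument(
        "--fbt", "--force-backend-termination",
        dest="force_backend_termination",  # assumed attribute name
        action="store_true",
        help="Periodically run pg_terminate_backend() to clear old connections",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = make_parser().parse_args(["--dry-run", "mine.properties"])
print(args.checkpoints_location, args.dry_run, args.reset)
```

Running `./build-mine.py --help` remains the authoritative source for the real option set.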

The most straightforward invocation is something like:

$ ./build-mine.py ~/.intermine/biotestmine.properties

This will build the mine using the database connection details in ~/.intermine/biotestmine.properties, invoking the necessary database and Gradle commands (instructions for tailoring a *.properties file in the first place are yet to be written). Checkpoints will be made as independent databases in the same Postgres server as the final production database. If any checkpoint databases are already present, then the one corresponding to the most advanced point in the mine build will be restored before any following data sources are integrated and post-processes are run.
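
The resume behaviour can be illustrated with a small sketch. It assumes the build is an ordered list of step names and that each checkpoint is identified by the step it was taken after; the function `plan_resume` and the step names are hypothetical and are not taken from build-mine.py itself.

```python
def plan_resume(build_steps, existing_checkpoints):
    """Return (checkpoint_to_restore, steps_still_to_run).

    build_steps is the ordered list of steps in the build;
    existing_checkpoints is the set of step names that have a checkpoint.
    """
    latest = None
    for index, step in enumerate(build_steps):
        if step in existing_checkpoints:
            latest = index  # keep the most advanced checkpoint seen so far
    if latest is None:
        # No usable checkpoint: build everything from the beginning.
        return None, list(build_steps)
    return build_steps[latest], list(build_steps[latest + 1:])


# Hypothetical step names, for illustration only.
steps = ["source-a", "source-b", "source-c", "post-process"]
restore, remaining = plan_resume(steps, {"source-a", "source-b"})
print(restore, remaining)
```

Here the checkpoint for source-b would be restored and only source-c and post-process would still need to run.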

Options are as follows:

  • -c, --checkpoints-location: if a filesystem path is given, then checkpoints will be written as database dumps in this path, and any existing dumps there will be re-used to resume the mine building process. If this is the special string :database: (which is the default), then the database itself will be used for checkpoint data, as discussed above.
  • -r, --reset: if set, then any previous checkpoints in the database or on the filesystem, depending on the checkpoints location setting, will be deleted before mine building begins.
  • --dry-run: if set, then commands will be shown but not executed.
  • --fbt, --force-backend-termination: as above, this will periodically run the Postgres function pg_terminate_backend() to try to close old connections (to avoid running out of connections or hitting problems when wiping databases). This may be helpful if an InterMine source is opening many connections but not closing them properly.
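
To make the last option more concrete, the sketch below builds the kind of query that could be used to terminate other connections to a single database. It is an illustration, not the script's actual query: `pg_terminate_backend()`, `pg_backend_pid()` and the `pg_stat_activity` view are standard Postgres, while the function name `terminate_backends_query`, the `%s` placeholder style (as used by drivers such as psycopg2) and the database name are assumptions.

```python
def terminate_backends_query(database_name):
    """Return (sql, params) for terminating other sessions on one database."""
    sql = (
        "SELECT pg_terminate_backend(pid) "
        "FROM pg_stat_activity "
        # Skip the session issuing the query so it does not terminate itself.
        "WHERE datname = %s AND pid <> pg_backend_pid()"
    )
    return sql, (database_name,)


# Hypothetical database name, for illustration only.
sql, params = terminate_backends_query("biotestmine")
print(sql)
print(params)
```

The pair would be passed to a database cursor's execute method; passing the name as a parameter rather than formatting it into the string avoids quoting problems.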

Future deployment

If accepted, then this script will need to be made available to all mines. At the moment, this can be done on a test basis by:

$ cp -r build-mine.py requirements.txt interminepy $NEW_MINE

Any updates (e.g. if a bug was found in build-mine.py or interminepy files) would have to be done manually. Also, if the script is being used for the first time, then one has to install the Python modules, though the old project_build Perl script had the same issue.

An alternative would be to publish build-mine.py and its modules on PyPI and have a user execute

$ sudo pip3 install interminepy

to install both build-mine.py and its modules. This would remove the separate dependency installation step and make automatic updates possible.

In principle, when first run with some other command (e.g. ./build-mine.py setup), build-mine.py could also pull down all the other necessary mine files for user configuration (most immediately project.xml and mine.properties). This has some similarities to the approach taken by projects such as Scrapy. The best approach has yet to be decided.

Development discussion

It's arguably odd to have a Gradle build controlled by a Python script. The immediate reason was problems running multiple-source integration within Gradle. Gradle runs all Ant tasks in the same process, whereas the old InterMine 1.x custom Ant-based build system ran some Ant tasks (notably integrate) in separate processes. This means that a single Gradle process integrating several sources rapidly runs out of memory. Possible solutions:

  • Keep this Python script for controlling the build. It already exists and (hopefully!) works. It's a bit easier to customize. However, it's also awkward running both Python and Gradle to control the build.
  • Fix InterMine so that it does not cause the Gradle process to run out of memory. This is probably something that can be fixed in InterMine itself, and it would make InterMine a better engineered and possibly more efficient system. However, the work may be complex and a potential source of bugs, so it might better be considered after an initial 2.0. Doing this would enable the entire build system to be in Gradle, if desired.
  • Alter Gradle to launch integration processes (though post-processes may also be necessary) as separate processes from Gradle. This is not common practice (scouring the web doesn't give any best practice or even much help) but is certainly doable, though there may be hidden issues. This would allow translation of build-mine.py into Gradle.
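
The process isolation that motivates the first option can be sketched as follows: each source is integrated in its own Gradle invocation, so memory is released when that process exits. This is a minimal illustration rather than the script's real code. The helper names are hypothetical, and the Gradle command line `./gradlew integrate -Psource=NAME` is an assumption about how a single source is integrated.

```python
import shlex
import subprocess


def run_step(command, dry_run=False):
    """Print a command and, unless dry_run is set, run it as a child process."""
    print("$ " + " ".join(shlex.quote(part) for part in command))
    if not dry_run:
        # check=True raises CalledProcessError if the command fails.
        subprocess.run(command, check=True)
    return command


def integrate_sources(sources, dry_run=False):
    """Integrate each source in a separate Gradle process, one after another."""
    commands = []
    for source in sources:
        # Assumed command line; the real invocation may differ.
        command = ["./gradlew", "integrate", "-Psource=" + source]
        commands.append(run_step(command, dry_run=dry_run))
    return commands


# Hypothetical source names; dry_run=True only prints the commands.
integrate_sources(["source-a", "source-b"], dry_run=True)
```

With dry_run set, nothing is executed, which mirrors the behaviour described for the --dry-run option.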