build-mine.py
build-mine.py is a new mine-building script in development for the Gradle-based InterMine 2.0, intended to replace the previous project_build script.
It currently exists only in the biotestmine repository. However, if accepted and once debugged, it will be made available for all mines to use.
The script aims to provide a single command to integrate all the data required to produce a mine. It will:
- Create databases if necessary (production and intermediate items databases). The userprofiles database is not created by this script (this process has yet to be documented).
- Integrate all the sources listed in the mine's project.xml control file, and run the necessary post-processes to create the search index, etc.
- Make checkpoint copies of the database, either within Postgres itself or on the filesystem, at points controlled by the project.xml file. These checkpoints can later be used by the script to resume an incomplete or failed build.
- First, you need to install the Python dependencies for build-mine.py. You can do this with the command:
sudo pip3 install -r requirements.txt
- Usage for the script is as follows (run ./build-mine.py --help for the most up-to-date usage):
usage: Build the mine [-h] [-c CHECKPOINTS_LOCATION] [--dry-run] [--fbt]
                      mine_properties_path

positional arguments:
  mine_properties_path  path to the mine's properties file, e.g.
                        ~/.intermine/biotestmine.properties

optional arguments:
  -h, --help            show this help message and exit
  -c CHECKPOINTS_LOCATION, --checkpoints-location CHECKPOINTS_LOCATION
                        The location for reading/writing database checkpoints
  -r, --reset           Reset the build. This will delete all existing
                        checkpoints from the checkpoint location and start the
                        mine build from the beginning.
  --dry-run             Dont actually build anything, just show the commands
                        that would be executed
  --fbt, --force-backend-termination
                        If true, then we will periodically run the postgres
                        function pg_terminate_backend() to try and clear out
                        old connections. This may help if InterMine is not
                        properly closing its connections.
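For readers who want to see how such an interface is typically declared, here is a minimal argparse sketch that accepts the same options as the usage text above. It is a reconstruction from the help output, not the script's actual source: the function name make_parser, the dest name force_backend_termination and the shortened help strings are assumptions made for illustration.

```python
import argparse


def make_parser():
    """Build a parser that accepts the options listed in the usage text."""
    # The first positional argument of ArgumentParser is `prog`, which is
    # the text printed after "usage:".
    parser = argparse.ArgumentParser('Build the mine')
    parser.add_argument(
        'mine_properties_path',
        help="path to the mine's properties file")
    parser.add_argument(
        '-c', '--checkpoints-location', default=':database:',
        help='The location for reading/writing database checkpoints')
    parser.add_argument(
        '-r', '--reset', action='store_true',
        help='Delete existing checkpoints and build from the beginning')
    parser.add_argument(
        '--dry-run', action='store_true',
        help='Show the commands that would be executed without running them')
    # dest is set explicitly (an assumption); otherwise argparse would
    # name the attribute after the first long option, i.e. "fbt".
    parser.add_argument(
        '--fbt', '--force-backend-termination', action='store_true',
        dest='force_backend_termination',
        help='Periodically run pg_terminate_backend() to clear old connections')
    return parser


args = make_parser().parse_args(
    ['--dry-run', '~/.intermine/biotestmine.properties'])
print(args.checkpoints_location, args.dry_run)  # prints ":database: True"
```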
The most straightforward invocation is something like:
$ ./build-mine.py ~/.intermine/biotestmine.properties
This will build the mine using the database connection details in ~/.intermine/biotestmine.properties, invoking the necessary database and Gradle commands (instructions for tailoring a *.properties file in the first place are yet to be written). Checkpoints will be made as independent databases in the same Postgres server as the final production database. If any checkpoint databases are already present, then the one corresponding to the most advanced point in the mine build will be restored, before any following data sources are integrated and post-processes run.
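The resume behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the script's actual logic: the function name plan_resume is hypothetical, build steps are represented by plain names in build order, and the step names in the example are illustrative only.

```python
def plan_resume(steps, existing_checkpoints):
    """Decide where to resume a build.

    steps: step names (sources, then post-processes) in build order.
    existing_checkpoints: set of step names that already have a checkpoint.

    Returns (checkpoint_to_restore, remaining_steps). The checkpoint is
    None when nothing can be restored and the build starts from scratch.
    """
    restore_index = None
    for index, name in enumerate(steps):
        if name in existing_checkpoints:
            # Keep overwriting so we end up with the most advanced checkpoint.
            restore_index = index
    if restore_index is None:
        return None, list(steps)
    return steps[restore_index], list(steps[restore_index + 1:])


# Illustrative step names only.
steps = ['uniprot-malaria', 'malaria-gff', 'go', 'create-references']
print(plan_resume(steps, {'uniprot-malaria', 'malaria-gff'}))
# prints ('malaria-gff', ['go', 'create-references'])
```

Scanning the whole list instead of stopping at the first match means that an early checkpoint left over from a previous run does not hide a later one.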
Options are as follows:
- -c, --checkpoints-location: if a filesystem path is given, then checkpoints will be database dumps in this path, and any existing dumps will be re-used to restart the mine building process. If this is the special string :database: (which is the default), then the database itself will be used for checkpoint data, as discussed above.
- -r, --reset: if set, then any previous checkpoints in the database or on the filesystem, depending on the checkpoints location setting, will be deleted before mine building begins.
- --dry-run: if set, then commands will be shown but not executed.
- --fbt, --force-backend-termination: as above, this will periodically run the Postgres function pg_terminate_backend to try to close connections (to avoid running out of connections or hitting issues when wiping databases). This may be helpful if an InterMine source is opening many connections but not closing them properly.
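To make the last option more concrete, the sketch below shows one common way of calling pg_terminate_backend for every other connection to a given database. It is an illustration, not the query the script actually issues: the function name terminate_other_backends is hypothetical, and the %s placeholder assumes a DB-API driver such as psycopg2. Note that Postgres only lets a role terminate backends it has sufficient privileges over.

```python
# pg_stat_activity lists one row per server connection; pg_backend_pid()
# is the pid of the connection running this query, which we must skip.
TERMINATE_SQL = (
    "SELECT pg_terminate_backend(pid) "
    "FROM pg_stat_activity "
    "WHERE datname = %s AND pid <> pg_backend_pid()"
)


def terminate_other_backends(cursor, db_name):
    """Ask Postgres to close all other connections to db_name.

    cursor: any DB-API cursor using the %s parameter style (assumption).
    Returns the rows reported by the server, one per targeted backend.
    """
    cursor.execute(TERMINATE_SQL, (db_name,))
    return cursor.fetchall()
```

A caller would run this periodically between build steps, for example with a cursor obtained from a connection to the production database.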
If accepted, then this script will need to be made available to all mines. At the moment, this can be done on a test basis by:
$ cp -r build-mine.py requirements.txt interminepy $NEW_MINE
Any updates (e.g. if a bug was found in build-mine.py or interminepy files) would have to be done manually. Also, if the script is being used for the first time, then one has to install the Python modules, though the old project_build Perl script had the same issue.
An alternative would be to put the entirety of build-mine.py into PyPI and have a user execute
$ sudo pip3 install interminepy
to install both build-mine.py and its modules. This would remove the separate dependency installation step and make automatic updates possible.
In principle, when first run with some other command (e.g. ./build-mine.py setup), build-mine.py could also pull down all the other necessary mine files for user configuration (most immediately project.xml and mine.properties). This is similar to the approach taken by projects such as Scrapy. The best approach has yet to be decided.
It's arguably odd to have a Gradle build controlled by a Python script. The immediate reason was problems running multiple-source integration within Gradle. Gradle runs all Ant tasks in the same process, whereas the old InterMine 1.x custom Ant-based build system ran some Ant tasks (notably integrate) in separate processes. This means that the Gradle build rapidly runs out of memory. Possible solutions:
- Keep this Python script for controlling the build. It already exists and (hopefully!) works. It's a bit easier to customize. However, it's also awkward running both Python and Gradle to control the build.
- Fix InterMine so that it does not cause the Gradle process to run out of memory. This is probably something that can be fixed in InterMine itself, and it would make InterMine a better-engineered and possibly more efficient system. However, the work may be complex and a potential source of bugs, so it might be better considered after an initial 2.0. Doing this would enable the entire build system to be in Gradle, if desired.
- Alter the Gradle build to launch integrations (and possibly post-processes too) as processes separate from Gradle itself. This is not common practice (scouring the web doesn't give any best practice or even much help) but is certainly doable, though there may be hidden issues. This would allow translation of build-mine.py into Gradle.
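The process-isolation idea behind the first option can be sketched in a few lines of Python: each source and post-process gets its own Gradle invocation, so memory is released when that process exits. This is a sketch under assumptions, not the script's actual code: the function names are hypothetical, and the Gradle task and property names (integrate -Psource=..., postprocess -Pprocess=...) are assumed for illustration and should be checked against the mine's Gradle setup.

```python
import shlex
import subprocess


def gradle_commands(sources, postprocesses):
    """Build one Gradle command per source and per post-process."""
    # Task and property names are assumptions, not confirmed by this document.
    commands = [['./gradlew', 'integrate', '-Psource=' + s] for s in sources]
    commands += [['./gradlew', 'postprocess', '-Pprocess=' + p]
                 for p in postprocesses]
    return commands


def run_all(commands, dry_run=False):
    """Run each command in its own process, or only show it if dry_run."""
    shown = []
    for command in commands:
        line = ' '.join(shlex.quote(part) for part in command)
        print(line)
        shown.append(line)
        if not dry_run:
            # check=True stops the build at the first failing step.
            subprocess.run(command, check=True)
    return shown


# Illustrative names only; dry_run=True prints without executing anything.
run_all(gradle_commands(['uniprot-malaria'], ['create-references']),
        dry_run=True)
```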