BHLindex is used by Biodiversity Heritage Library to create their scientific names index

Biodiversity Heritage Library Scientific Names Index (BHLindex)

Creates an index of scientific names occurring in the literature collection of the Biodiversity Heritage Library.

Performance

This application can traverse the entire digitized corpus of the Biodiversity Heritage Library in a matter of hours. On a modern high-end laptop we observed the following results:

  • name-finding in 275,000 volumes (60 million pages): 2.5 hours.
  • name-verification of 23 million unique name-strings: 3 hours.
  • preparing a CSV file with 250 million name occurrences/verification records: 40 minutes.

Installation on Linux

The BHL corpus of OCR-ed data is available as a >50 GB compressed file.

Database Preparation

Log in to the PostgreSQL server and create a database with the same name as the PgDatabase parameter in the configuration file (the default name is bhlindex).

This database will be used to store found names. The final size of the database upon completion should be in the vicinity of 50 GB.

In the following example we create the database as the postgres superuser and also create a bhl user to operate on the database.

sudo su - postgres
[postgres ~]$ psql
postgres=# create user bhl with password 'my-very-secret-password';
CREATE ROLE
postgres=# create database bhlindex;
CREATE DATABASE
postgres=# grant all privileges on database bhlindex to bhl;
GRANT
postgres=# \c bhlindex
You are now connected to database "bhlindex" as user "postgres".
bhlindex=# alter schema public owner to bhl;
ALTER SCHEMA

The last step is only needed if the bhl user is not set as a superuser. Every database has its own public schema; make sure to switch to the correct database using \c my-db-name, as shown in the example above.
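To confirm the setup worked, you can connect as the bhl user and inspect the public schema's owner. The snippet below is a sketch; it assumes PostgreSQL is listening on localhost:5432 and uses the example password from above — adjust both to your environment.

```shell
# Hypothetical connection check; host, port, and password are examples.
PGURL="postgresql://bhl:my-very-secret-password@localhost:5432/bhlindex"
# Show the connected user and the owner of the public schema.
if command -v psql >/dev/null 2>&1; then
  psql "$PGURL" -c "select current_user, nspowner::regrole from pg_namespace where nspname = 'public';" \
    || echo "could not connect to bhlindex"
else
  echo "psql client not installed"
fi
```

If the schema owner column shows bhl, the alter schema step from the example above took effect.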

Configuration

When you run the app for the first time, it creates a configuration file and reports where the file is located (usually $HOME/.config/bhlindex.yaml).

Edit the file to provide credentials for the PostgreSQL database.

Change the Jobs setting according to the amount of memory and the number of CPUs. For 32 GB of memory, Jobs: 7 works well. This parameter sets the number of concurrent jobs running for name-finding.

Set the BHLdir parameter to point to the root directory where the BHL texts are located (several hundred gigabytes of text).

Other parameters are optional.
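As an illustration, a filled-in configuration might look like the following. All values are placeholders; the exact set of keys is defined by the file the app generates.

```yaml
# Sketch of a bhlindex configuration; adjust values to your setup.
BHLdir: /data/bhl/ocr        # root directory of the BHL OCR texts
PgHost: localhost
PgPort: 5432
PgUser: bhl
PgPass: my-very-secret-password
PgDatabase: bhlindex
Jobs: 7                      # concurrent name-finding jobs (fits ~32 GB RAM)
```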

Environment Variables

It is possible to use environment variables instead of the configuration file. Environment variables override the configuration file settings. The following variables can be used:

Config          Env. Variable
BHLdir          BHLI_BHL_DIR
OutputFormat    BHLI_OUTPUT_FORMAT
PgHost          BHLI_PG_HOST
PgPort          BHLI_PG_PORT
PgUser          BHLI_PG_USER
PgPass          BHLI_PG_PASS
PgDatabase      BHLI_PG_DATABASE
Jobs            BHLI_JOBS
VerifierURL     BHLI_VERIFIER_URL
WithoutConfirm  BHLI_WITHOUT_CONFIRM
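For example, to override the database connection and concurrency settings for a single run (the host and job count below are illustrative values, not defaults):

```shell
# Override configuration via environment variables before running bhlindex.
export BHLI_PG_HOST=localhost
export BHLI_PG_USER=bhl
export BHLI_PG_DATABASE=bhlindex
export BHLI_JOBS=7
```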

Usage

Commands

Get BHLindex version

bhlindex -V

Find names in BHL

bhlindex find
# to avoid confirmation dialog (-y overrides configuration file)
bhlindex find -y

Verify detected names using the GNverifier service

bhlindex verify
# to avoid confirmation dialog (-y overrides configuration file)
bhlindex verify -y

Dump data into tab-separated files

Dump files for names and occurrences will be created, with extensions according to the selected output format (CSV is the default). If you need to filter verified results by data-sources, their list and corresponding IDs can be found on the gnverifier sources page.

Dump files take more than 30 GB of space. If the --short flag is used, the size is reduced to 13 GB.

# Dump files to a designated directory with reduced number of fields,
# and with normalization of verbatim names.
bhlindex dump -d ~/bhldump -S -N

# Dump files to a designated directory.
bhlindex dump -d ~/bhlindex-dump
# or
bhlindex dump --dir ~/bhlindex-dump

# Dump while creating reduced number of fields making output smaller.
bhlindex dump -S
bhlindex dump --short

# Clean up verbatim names from multiple spaces and characters around the name.
bhlindex dump -N
bhlindex dump --normalize-verbatim

# Dump records verified to particular data-sources of `gnverifier`.
# In this case verified names are filtered by `The Catalogue of Life` (ID=1)
# and `The Encyclopedia of Life` (ID=12).
bhlindex dump -d ~/bhlindex-dump -s 1,12
# or
bhlindex dump --dir ~/bhlindex-dump --sources 1,12

# Dump using JSON or TSV formats.
bhlindex dump -f tsv -d ~/bhlindex-dump
bhlindex dump -f json -d ~/bhlindex-dump
# or
bhlindex dump --format tsv --dir ~/bhlindex-dump

To run all commands together

bhlindex find -y && \
  bhlindex verify -y && \
  bhlindex dump -d output-dir

Filtering dumped data

There is a Ruby script, filter.rb, included in the repository. It traverses the dump files names.csv and occurrences.csv and filters out names that are more likely to be false positives. Copy the script to the directory with the dump files and run it with:

ruby ./filter.rb

Testing

Testing requires a PostgreSQL database named bhlindex_test. Running the tests deletes all data from the test database.

go test ./...
