GitHub - Public-Health-Bioinformatics/kipper: command line versioning tool for large key-value text files

Kipper v1.0.0

Kipper is a basic command-line file versioning tool, created primarily for fasta data. It is a key-value data store that keeps track of when particular items are inserted or deleted (updates are deletes followed by inserts). It can recall versions by date or version id, and it can keep versions in one or more volume files. It currently accepts two kinds of text file input:

"text": any file where each row is a tab/space delimited key-value record.
"fasta" database, see http://en.wikipedia.org/wiki/FASTA_format

Usage

Kipper works off of a [data store name].md metadata file, and one or more [data store name]_[volume id] files. Consequently any Kipper command begins with

kipper.py [data store name]

Alone, this will list the available versions found within a Kipper data store.

By default, all output goes to stdout (screen), and no changes to Kipper data store files are made. Thus by default one sees what would happen if an action were taken, but must take an additional step to affect the data store. The exception to this is with the -M regenerate metadata command described below.

To export a version of a Kipper data store, pipe the extract output to a file via '> [file name]' or use the '-o [file name]' parameter. If the "-o" parameter is a period (e.g. "-o.") rather than a file name, Kipper selects the default output file name as appropriate, namely:

For -i --import or -r --revert actions: an update of the Kipper data store.
For -e --extract action: a save of output to the version file specified in the [data store name].md metadata file.

As well, when the -o parameter is a folder path rather than a specific file name, Kipper saves the output under the appropriate file name in the given folder. This is convenient for extracting versions into separate folders.
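The three forms of the -o parameter described above (absent, a period, a folder, or a file name) can be summarized in a short sketch. This is only an illustration of the rules as stated in this README, not code from kipper.py: the function name `resolve_output` and the default file name `cpn60_v3.fasta` are hypothetical.

```python
import os
import tempfile
from typing import Optional


def resolve_output(option: Optional[str], default_name: str) -> Optional[str]:
    """Return the path to write to, or None to mean stdout (hypothetical helper)."""
    if option is None:
        return None                      # no -o given: output goes to stdout
    if option == ".":
        return default_name              # "-o.": use the default file name
    if os.path.isdir(option):
        return os.path.join(option, default_name)  # folder: default name inside it
    return option                        # anything else is taken as a file name


# The default name below is made up for the example.
print(resolve_output(None, "cpn60_v3.fasta"))   # None
print(resolve_output(".", "cpn60_v3.fasta"))    # cpn60_v3.fasta
with tempfile.TemporaryDirectory() as folder:
    print(resolve_output(folder, "cpn60_v3.fasta").startswith(folder))  # True
```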

List versions of dbFile key/value pairs (by date/time): -l --list (optional)

kipper.py [data store file]
kipper.py cpn60 -l

Initialize metadata file and kipper file: -M --rebuild

kipper.py [data store file] -M [type of database:text|fasta]
kipper.py cpn60 -M fasta

View metadata (json) file: -m --metadata

kipper.py [data store file] -m
kipper.py cpn60 -m

Import key/value inserts/deletes based on import file (current date used): -i --import

kipper.py [data store file] -i [import file] -o.

Outputs new master database to stdout; doesn't rewrite it.

kipper.py cpn60 -i sequences.fasta

Rewrites cpn60 with new version added.

kipper.py cpn60 -i sequences.fasta -o.
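To make the import step more concrete, here is a minimal sketch of how an import can be turned into version-stamped inserts and deletes, with an update recorded as a delete followed by an insert. It assumes each stored record carries the version id that created it and the version id that deleted it (if any). The `Record` class and `import_version` function are illustrative names, and the sketch holds everything in memory for clarity, which is not how Kipper itself is described as working.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    created: int            # version id in which the pair was inserted
    deleted: Optional[int]  # version id in which it was deleted, or None
    key: str
    value: str


def import_version(store: List[Record], incoming: Dict[str, str], version: int) -> None:
    """Stamp deletes and append inserts so that `store` gains `version`."""
    live = {r.key: r for r in store if r.deleted is None}
    for key, rec in live.items():
        if incoming.get(key) != rec.value:
            rec.deleted = version        # removed or changed: mark as deleted
    for key, value in incoming.items():
        old = live.get(key)
        if old is None or old.value != value:
            store.append(Record(version, None, key, value))  # new or changed: insert


store: List[Record] = []
import_version(store, {"seqA": "AAA", "seqB": "CCC"}, 1)
import_version(store, {"seqA": "AAA", "seqB": "CCT", "seqC": "GGG"}, 2)
for rec in store:
    print(rec)
```

In this example seqA is untouched, seqB is updated (its old record is marked deleted in version 2 and a new record is inserted), and seqC is a plain insert.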

Add Volume to data store (creates new volume file that receives future imports): -V --volume

kipper.py [data store file] -V -o.

Rewrites cpn60 with new volume, and new version added to that volume.

kipper.py cpn60 -V -i sequences.fasta -o.

Extract a version of the file based on given date/time: -e --extract

kipper.py [data store file] -e -d datetime -o [output file]

Extract a version of the file based on given version Id

kipper.py [data store file] -e -n [version id] -o [output file]

Have database revert to previous version. Drops future records, unmarks corresponding deletes: -r --revert

kipper.py [data store file] -r -d datetime -o [output file]
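The revert rule ("drops future records, unmarks corresponding deletes") can be sketched as follows. This assumes the same kind of record as in the Notes example, with a created-version and a deleted-version column; `Record` and `revert` are illustrative names rather than kipper.py internals.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class Record:
    created: int            # version id that inserted the record
    deleted: Optional[int]  # version id that deleted it, or None
    key: str
    value: str


def revert(records: List[Record], version: int) -> List[Record]:
    """Return the records as they stood when `version` was the latest one."""
    kept = []
    for rec in records:
        if rec.created > version:
            continue                          # inserted later: drop it
        if rec.deleted is not None and rec.deleted > version:
            rec = replace(rec, deleted=None)  # deleted later: unmark the delete
        kept.append(rec)
    return kept


history = [
    Record(1, 3, "seqA", "annexin I"),
    Record(2, None, "seqB", "actin"),
    Record(3, None, "seqA", "annexin I (revised)"),
]
for rec in revert(history, 2):
    print(rec)
```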

Return version of the kipper code: -v --version

kipper.py -v

Options

-h, --help

Show this help message and exit

-M INITIALIZE, --rebuild=INITIALIZE

(Re)generate metadata file [name of db].md . Provide the type of db [text|fasta| etc.]

-i DB_IMPORT_FILE_PATH, --import=DB_IMPORT_FILE_PATH

Import key/value inserts/deletes based on delta comparison with import file

-e, --extract

Extract a version of the file based on given date/time

-r, --revert

Have database revert to previous version (-d date/time required).  Drops future records, unmarks corresponding deletes.

-m, --metadata

View metadata file [name of db].md

-l, --list

List versions of dbFile key/value pairs (by date/time)

-v, --version

Return version of kipper.py code.

-o DB_OUTPUT_FILE_PATH, --output=DB_OUTPUT_FILE_PATH

Output to this file.  Default is to stdout

-d DATETIME, --date=DATETIME

Provide date/time for sync, extract or revert operations.  Defaults to now.

-u UNIXTIME, --unixTime=UNIXTIME

Provide Unix time (integer) for sync, extract or revert operations.

-n VERSION_ID, --number=VERSION_ID

Provide a version id to extract or revert to.
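The -d, -u and -n options all end up selecting one version. One plausible way to map a date or Unix time onto a version id is to pick the latest version created at or before that moment. The sketch below shows that lookup; it assumes the metadata provides a time-sorted list of (Unix time, version id) pairs, which is an assumption about the layout and not something this README specifies. `version_at` and `utc` are hypothetical helpers.

```python
from bisect import bisect_right
from datetime import datetime, timezone
from typing import List, Optional, Tuple


def utc(year: int, month: int, day: int) -> datetime:
    """Build a timezone-aware UTC datetime for the examples."""
    return datetime(year, month, day, tzinfo=timezone.utc)


def version_at(versions: List[Tuple[float, int]], when: datetime) -> Optional[int]:
    """Return the id of the latest version created at or before `when`."""
    times = [created for created, _ in versions]
    index = bisect_right(times, when.timestamp())
    return versions[index - 1][1] if index else None


# (Unix time, version id) pairs, sorted by time; the dates are made up.
versions = [
    (utc(2014, 1, 1).timestamp(), 1),
    (utc(2014, 2, 1).timestamp(), 2),
    (utc(2014, 3, 1).timestamp(), 3),
]
print(version_at(versions, utc(2014, 2, 15)))   # 2
print(version_at(versions, utc(2013, 12, 31)))  # None
```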

Performance

Kipper has virtually no memory requirements, regardless of the size of input files. Since fasta databases are mainly just inserts over time, a Kipper data store having many versions usually ends up being only modestly larger than the most recent fasta database version. Version extraction time is linear in the time it takes to read and write the archive file.
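Constant memory use and linear extraction time are what one would expect from a single pass that reads one line, decides whether it belongs to the requested version, and writes it out. The sketch below illustrates that pattern. It assumes tab-delimited lines of the form created-version, deleted-version (blank if still live), key, value, as suggested by the example under Notes; the `extract` function is illustrative and is not taken from kipper.py.

```python
import io
from typing import Iterable, TextIO


def extract(lines: Iterable[str], version: int, out: TextIO) -> int:
    """Write the key/value pairs visible at `version`; return how many were written."""
    written = 0
    for line in lines:                      # one line in memory at a time
        created, deleted, key, value = line.rstrip("\n").split("\t", 3)
        if int(created) > version:
            continue                        # inserted after the requested version
        if deleted and int(deleted) <= version:
            continue                        # already deleted by the requested version
        out.write(f"{key}\t{value}\n")
        written += 1
    return written


# A tiny stand-in for a volume file; real volumes would be opened from disk.
volume = io.StringIO(
    "1\t3\tseqA\tannexin I\n"
    "2\t\tseqB\tactin\n"
    "3\t\tseqA\tannexin I (revised)\n"
)
result = io.StringIO()
count = extract(volume, 2, result)
print(count)
print(result.getvalue(), end="")
```

Because the function only iterates over its input, the same code works unchanged on an open file object of any size.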

Currently we are experimenting with reading compressed files and writing compressed volume files, to see which archiving format is best.

Example Database

This repo contains a 'RDB_database_load_v10_8-v11_3.sh' script which downloads and imports the RDP RNA database (https://rdp.cme.msu.edu/) versions 10.18 to 11.3 into a kipper data store. This takes a few hours to download and process. The end result is 18 versions of the RDP database that fit into two kipper volumes. The script needs wget and gunzip, and needs a symlink from, say, '/usr/local/bin/kipper' to the 'kipper.py' executable.

Notes

A change last year in NCBI's coding of alternate descriptions for a fasta sequence affects kipper data store file sizes for archives that span that date. It looks like NCBI converted all records last year to drop the CTRL-A delimiter between alternate fasta descriptions; each alternate description now simply begins with the original chevron (">"). There are no longer any CTRL-A characters in incoming nt.fasta or nr.fasta files. Consequently one can expect to see many updates like:

1       3       gi|74|emb|CAA39971.1|   annexin I [Bos taurus]^Agi|264182|gb|AAB25084.1| annexin ...
3               gi|74|emb|CAA39971.1|   annexin I [Bos taurus] >gi|264182|gb|AAB25084.1| annexin ...
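The reason these show up as updates is that the stored value and the incoming value differ byte for byte, even though the description is semantically the same. A small sketch of the difference, using shortened versions of the two lines above; the " >" replacement is inferred from that example, and `modernize_description` is a hypothetical helper, not part of Kipper.

```python
CTRL_A = "\x01"  # the old delimiter between alternate fasta descriptions


def modernize_description(description: str) -> str:
    """Rewrite an old-style description into the newer chevron-delimited form."""
    return description.replace(CTRL_A, " >")


old = "annexin I [Bos taurus]\x01gi|264182|gb|AAB25084.1| annexin"
new = "annexin I [Bos taurus] >gi|264182|gb|AAB25084.1| annexin"

print(old == new)                         # False: a plain comparison sees an update
print(modernize_description(old) == new)  # True: only the delimiter changed
```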

Currently there is an undocumented feature in which Kipper can read versions to be imported in .gz compressed format. However (when using this tool in conjunction with Biomaj) we have found that Python's gzip library doesn't function with large archives of around 30 GB. So ensure that Biomaj does not gzip downloaded NCBI nt or nr files before passing them to Kipper.
