kmindex

kmindex is a tool for indexing and querying sequencing samples. It is built on top of kmtricks.

Given a databank $D = \{S_1, \dots, S_n\}$, where each $S_i$ is a genomic dataset (genome or raw reads), kmindex computes the percentage of k-mers shared between a query $Q$ and each $S \in D$. It supports multiple datasets and can search each sub-index $D_i \in G = \{D_1, \dots, D_m\}$. Queries benefit from the findere algorithm. In short, findere reduces the false positive rate at query time by querying $(s+z)$-mers instead of $s$-mers, which are the indexed words, usually called $k$-mers.
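The querying idea described above can be illustrated with a small Python sketch. It is not kmindex's implementation: the function names (`kmers`, `shared_fraction`) are hypothetical, and a plain Python set stands in for the probabilistic index. It assumes the usual reading of findere, where an $(s+z)$-mer is reported as present only if all $z+1$ of its constituent $s$-mers are found in the index.

```python
def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def shared_fraction(query, indexed_smers, s, z):
    """Fraction of the query's (s+z)-mers reported as present.

    Assumption: an (s+z)-mer counts as present only if all z+1 of its
    constituent s-mers are found in the index (here a plain set).
    """
    words = list(kmers(query, s + z))
    if not words:
        return 0.0
    hits = sum(
        all(smer in indexed_smers for smer in kmers(word, s))
        for word in words
    )
    return hits / len(words)


# Toy index: the 4-mers of one "sample", plus one spurious entry ("GGGG")
# that mimics a false positive of a probabilistic index.
sample = "ACGTACGGTCA"
index = set(kmers(sample, 4)) | {"GGGG"}

# With z = 0 the spurious 4-mer is counted as a hit; with z = 2 it is
# rejected because its neighbouring 4-mers are absent from the index.
without_findere = shared_fraction("TTGGGGTT", index, s=4, z=0)
with_findere = shared_fraction("TTGGGGTT", index, s=4, z=2)
```

In this toy case an isolated false positive inflates the score when $z = 0$ and disappears when $z = 2$, while a query taken from the sample itself still matches fully. A real implementation would avoid re-checking overlapping $s$-mers; the sketch favours clarity over speed.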

Indexing/querying example (can be tested in the examples directory):

Index a dataset:

kmindex build --fof fof1.txt --run-dir D1_index --index ./G --register-as D --hard-min 2 --kmer-size 25 --nb-cell 1000000

Query the index:

kmindex query --index ./G --fastx query.fasta --zvalue 3

Full documentation is available at https://tlemane.github.io/kmindex

Citation:

Lemane, Téo, et al. "Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA." Nature Computational Science 4.2 (2024): 104-109.

A preprint of the paper is available on bioRxiv.
