kmindex

kmindex is a tool for indexing and querying sequencing samples. It is built on top of kmtricks.

Given a databank $D = \{S_1, \dots, S_n\}$, where each $S_i$ is a genomic dataset (a genome or raw reads), kmindex computes the percentage of k-mers shared between a query $Q$ and each $S \in D$. It supports multiple databanks: a global index $G = \{D_1, \dots, D_m\}$ groups several sub-indexes, and each sub-index $D_i \in G$ can be searched. Queries benefit from the findere algorithm. In short, findere reduces the false positive rate at query time by querying $(s+z)$-mers instead of the $s$-mers that are actually indexed (the words usually called $k$-mers).
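The idea can be illustrated with a small Python sketch. This is not kmindex code: the function names are hypothetical, the toy index is an exact Python set rather than the structure built by kmtricks, and canonical k-mers (reverse complements) are ignored. It also assumes the usual reading of findere, namely that an $(s+z)$-mer is reported as present only when all of its $z+1$ overlapping $s$-mers are found in the index.

```python
def kmers(seq: str, k: int) -> list[str]:
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(sample: str, s: int) -> set[str]:
    """Toy stand-in for an index: the exact set of s-mers of one sample."""
    return set(kmers(sample, s))


def shared_ratio(query: str, index: set[str], s: int, z: int = 0) -> float:
    """Fraction of the query's (s+z)-mers whose z+1 s-mers are all indexed.

    With z = 0 this is the plain fraction of shared s-mers.
    Multiply by 100 to get a percentage.
    """
    words = kmers(query, s + z)
    if not words:
        return 0.0  # query shorter than s+z: nothing to compare
    hits = sum(all(sub in index for sub in kmers(w, s)) for w in words)
    return hits / len(words)


if __name__ == "__main__":
    s = 4
    index = build_index("ACGTACGTAC", s)
    # Simulate one false positive: an s-mer the sample does not contain.
    noisy_index = index | {"GTTT"}
    query = "ACGTTTT"
    print(shared_ratio(query, noisy_index, s, z=0))  # counts the false positive
    print(shared_ratio(query, noisy_index, s, z=1))  # isolated false positive is filtered
```

In this toy run the spurious `GTTT` entry inflates the result for `z=0`, whereas with `z=1` it no longer counts, because neither of its neighbouring $s$-mers in the query is indexed.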

Indexing/querying example (can be tested in the examples directory):

  1. Index a dataset:
     kmindex build --fof fof1.txt --run-dir D1_index --index ./G --register-as D --hard-min 2 --kmer-size 25 --nb-cell 1000000
  2. Query the index:
     kmindex query --index ./G --fastx query.fasta --zvalue 3

Full documentation is available at https://tlemane.github.io/kmindex

Citation

Lemane, Téo, et al. "Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA." Nature Computational Science 4.2 (2024): 104-109.

A preprint of the paper is available on bioRxiv.