kmdiff

Citation

Lemane, Téo, Rayan Chikhi, and Pierre Peterlongo. "kmdiff, large-scale and user-friendly differential k-mer analyses." Bioinformatics 38.24 (2022): 5443-5445

Rationale

kmdiff provides differential k-mer analysis between two populations (controls and cases). Each population is represented by a set of short-read sequencing samples. The output is the set of k-mers that are differentially represented between controls and cases.

Statistical methods used in kmdiff are from:

Rahman A, Hallgrímsdóttir I, Eisen M, Pachter L (2018) Association mapping from sequencing reads using k-mers. eLife 7: e32920. https://doi.org/10.7554/eLife.32920.001
Mehrab Z, Mobin J, Tahmid IA, Rahman A (2021) Efficient association mapping from k-mers—An application in finding sex-specific sequences. PLOS ONE 16(1): e0245058. https://doi.org/10.1371/journal.pone.0245058
Patterson N, Price AL, Reich D (2006) Population Structure and Eigenanalysis. PLOS Genetics 2(12): e190. https://doi.org/10.1371/journal.pgen.0020190
Price A, Patterson N, Plenge R, et al. (2006) Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics 38: 904–909. https://doi.org/10.1038/ng1847

Dependencies

Build dependencies

zlib
bzip2

Population stratification correction needs:

Ubuntu / Debian


sudo apt-get install libgsl-dev libopenblas-dev liblapacke-dev libbz2-dev zlib1g-dev zlib1g

Fedora


sudo dnf install openblas openblas-devel lapack lapack-devel gsl gsl-devel bzip2-devel

Arch


sudo pacman -S lapack lapacke openblas gsl bzip2 zlib

macOS


brew install gsl lapack openblas bzip2 zlib

For convenience, all other kmdiff build dependencies are included in the thirdparty directory.

Installation

1. Conda

conda create -p ./kmdiff-env
conda activate ./kmdiff-env
conda install -c bioconda -c tlemane kmdiff

2. Build from source

Prerequisites

cmake >= 3.13
gcc >= 8.1 or XCode >= 11.0 or clang >= 7
zlib
bzip2
GSL + Lapacke + OpenBLAS (only with -s 1, see build script).

Clone

git clone --recursive https://github.com/tlemane/kmdiff

Build script

kmdiff build script.
Usage:
  ./install.sh [-r str] [-k LIST[int]] [-t int] [-c int] [-j int] [-s int] [-p] [-e] [-d] [-h]
Options:
  -r <Release|Debug> -> build type {Release}.
  -k <LIST[INT]>     -> k-mer size {"32 64 96 128"}.
  -t <0|1|2>         -> tests: 0 = disabled, 1 = compile, 2 = compile and run {2}.
  -c <1|2|4>         -> byte per count {4}.
  -j <INT>           -> nb threads {8}.
  -s <0|1>           -> population stratification correction 0 = disabled, 1 = enabled {1}
                        (-s 1 requires GSL + lapacke + OpenBLAS)
  -p                 -> compile with plugins support {disabled}
  -e                 -> use conda to install compilers/dependencies {disabled}
  -d                 -> delete cmake cache {disabled}
  -h                 -> show help.

If you are unable to install the prerequisites on your system, use -e. Compilers and build dependencies will thus be provided by a conda environment.

Usage

If you have already built a matrix with kmtricks, you can skip the first step, `kmdiff count`. Note that the matrix must have been built with the kmtricks `--hist` option to be compatible with `kmdiff diff`.

1) `kmdiff count` - count k-mers with kmtricks

Input file: one sample per line (controls must appear first)

control1: /path/to/control1_read1.fastq ; /path/to/control1_read2.fastq
control2: /path/to/control2_read1.fastq ; /path/to/control2_read2.fastq
case1: /path/to/case1_read1.fastq ; /path/to/case1_read2.fastq
case2: /path/to/case2_read1.fastq ; /path/to/case2_read2.fastq

Supported files: fasta/fastq, gzipped or not.
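
The input file can be generated rather than written by hand. The following is a minimal sketch in Python, assuming each sample is described by an id and a list of read files. The function names (`fof_line`, `write_fof`), the output name `kmdiff.fof` and the sample paths are illustrative and not part of kmdiff. Controls are written first, as required, and the returned counts are the values later passed to `kmdiff diff` as `--nb-controls` and `--nb-cases`.

```python
import os
import tempfile
from pathlib import Path


def fof_line(sample_id, read_paths):
    """Format one sample as 'id: path1 ; path2' (one sample per line)."""
    return f"{sample_id}: " + " ; ".join(str(p) for p in read_paths)


def write_fof(path, controls, cases):
    """Write controls first, then cases; return (nb_controls, nb_cases).

    `controls` and `cases` map a sample id to its list of read files
    (hypothetical helper, not provided by kmdiff).
    """
    lines = [fof_line(s, reads) for s, reads in controls.items()]
    lines += [fof_line(s, reads) for s, reads in cases.items()]
    Path(path).write_text("\n".join(lines) + "\n")
    return len(controls), len(cases)


# Demo with placeholder paths, written to a temporary directory.
controls = {
    "control1": ["/path/to/control1_read1.fastq", "/path/to/control1_read2.fastq"],
    "control2": ["/path/to/control2_read1.fastq", "/path/to/control2_read2.fastq"],
}
cases = {
    "case1": ["/path/to/case1_read1.fastq", "/path/to/case1_read2.fastq"],
    "case2": ["/path/to/case2_read1.fastq", "/path/to/case2_read2.fastq"],
}
with tempfile.TemporaryDirectory() as tmp:
    fof = os.path.join(tmp, "kmdiff.fof")
    nb_controls, nb_cases = write_fof(fof, controls, cases)
    print(Path(fof).read_text(), end="")
    print(f"--nb-controls {nb_controls} --nb-cases {nb_cases}")
```

Keeping controls and cases in two separate mappings makes the "controls first" ordering hard to get wrong, and the two counts come from the same data as the file itself.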

Options

kmdiff count v1.1.0

DESCRIPTION
  Count k-mers with kmtricks.

USAGE
  kmdiff count -f/--file <FILE> -d/--run-dir <DIR> [-k/--kmer-size <INT>] [-c/--hard-min <INT>]
               [-r/--recurrence-min <INT>] [--minimizer-type <INT>]
               [--minimizer-size <INT>] [--repartition-type <INT>] [--nb-partitions <INT>]
               [-t/--threads <INT>] [-v/--verbose <STR>] [-h/--help] [--version]

OPTIONS
  [global]
    -f --file           - fof that contains path of read files
    -d --run-dir        - output directory.
    -k --kmer-size      - size of k-mers [8, 127] {31}
    -c --hard-min       - min abundance to keep a k-mer {1}
    -r --recurrence-min - min recurrence to keep a k-mer {1}

  [advanced performance tweaks]
     --minimizer-type   - minimizer type (0=lexi, 1=freq) {0}
     --minimizer-size   - size of minimizer {10}
     --repartition-type - minimizer repartition (0=unordered, 1=ordered) {0}
     --nb-partitions    - number of partitions (0=auto) {0}

  [common]
    -t --threads - number of threads. {8}
    -h --help    - show this message and exit. [⚑]
       --version - show version and exit. [⚑]
    -v --verbose - Verbosity level [debug|info|warning|error]. {info}

2) `kmdiff diff` - aggregate k-mers and dump the significant ones

kmdiff diff v1.1.0

DESCRIPTION
  Differential k-mers analysis.

USAGE
  kmdiff diff -d/--km-run <DIR> -1/--nb-controls <INT> -2/--nb-cases <INT> [-o/--output-dir <DIR>]
              [-s/--significance <FLOAT>] [-u/--cutoff <INT>] [-c/--correction <STR>]
              [--gender <FILE>] [--kmer-pca <FLOAT>] [--ploidy <INT>] [--n-pc <INT>]
              [-t/--threads <INT>] [-v/--verbose <STR>] [-f/--kff-output] [-m/--in-memory]
              [--keep-tmp] [--pop-correction] [-h/--help] [--version]

OPTIONS
  [global]
    -d --km-run       - kmtricks run directory.
    -o --output-dir   - output directory. {./kmdiff_output}
    -1 --nb-controls  - number of controls.
    -2 --nb-cases     - number of cases.
    -s --significance - significance threshold. {0.05}
    -u --cutoff       - Divide the significance threshold by N.
                        Since a large number of k-mers are tested, k-mers with p-values too close to the significance
                        threshold will not pass the last steps of correction.
                        It allows to discard some k-mers a bit earlier and thus save space and time. {100000}
    -c --correction   - significance correction. (bonferroni|benjamini|sidak|holm|disabled) {bonferroni}
    -f --kff-output   - output significant k-mers in kff format. [⚑]
    -m --in-memory    - in-memory correction. [⚑]
       --keep-tmp     - keep tmp files. [⚑]
       --save-sk      - build the matrix of significant k-mers. [⚑]

  [population stratification]
     --pop-correction - apply correction for population stratification. [⚑]
     --gender         - gender file, one sample per line with the id and the gender (M,F,U), space-separated.
     --kmer-pca       - proportion of k-mers used for PCA (in [0.0, 0.05]). {0.001}
     --ploidy         - ploidy level. {2}
     --n-pc           - number of principal components (in [2, 10]). {2}

  [common]
    -t --threads - number of threads. {8}
    -h --help    - show this message and exit. [⚑]
       --version - show version and exit. [⚑]
    -v --verbose - Verbosity level [debug|info|warning|error]. {info}
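
Read literally, the help text for `-u/--cutoff` says that the significance threshold is divided by N to discard k-mers early. With the defaults shown above (`-s 0.05`, `-u 100000`), that gives an early threshold of 0.05 / 100000 = 5e-7. The sketch below only illustrates this arithmetic. The function names are hypothetical, and the use of `<=` for the comparison is an assumption, not something stated in the help text.

```python
def early_threshold(significance=0.05, cutoff=100_000):
    """Early threshold as described by -u/--cutoff: significance / cutoff."""
    if cutoff <= 0:
        raise ValueError("cutoff must be a positive integer")
    return significance / cutoff


def passes_early_filter(p_value, significance=0.05, cutoff=100_000):
    """True if a k-mer's p-value would survive the early filter.

    Assumption: the comparison is 'p-value <= threshold'.
    """
    return p_value <= early_threshold(significance, cutoff)


# Defaults from the help text: 0.05 / 100000.
print(f"{early_threshold():.1e}")
# A p-value of 1e-4 is below 0.05 but above the early threshold.
print(passes_early_filter(1e-4), passes_early_filter(1e-9))
```

A smaller `-u` keeps more k-mers for the final correction step at the cost of more disk space and time.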

Outputs

control significant k-mers: <output_dir>/control_kmers.[fasta|kff]
case significant k-mers: <output_dir>/case_kmers.[fasta|kff]

--save-sk: Outputs a matrix with the significant k-mers before correction. You can dump it in text with kmtricks aggregate --run-dir <output-dir>/positive_kmer_matrix --matrix kmer --cpr-in.

Abundances and p-values are provided in fasta headers.
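
For downstream processing, the FASTA outputs can be read with a few lines of Python. This is a minimal sketch, not a kmdiff utility: `read_fasta` is a hypothetical helper that returns each header as a raw string, because the exact layout of abundances and p-values in the header is not described here. The headers in the demo are placeholders, not real kmdiff output.

```python
import os
import tempfile


def read_fasta(path):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Demo on a tiny temporary file with placeholder headers.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "case_kmers.fasta")
    with open(demo, "w") as out:
        out.write(">placeholder_header_1\nACGTACGT\n>placeholder_header_2\nTTGCAATG\n")
    for header, kmer in read_fasta(demo):
        print(header, kmer, len(kmer))
```

With the default output directory, the same function would be pointed at `./kmdiff_output/control_kmers.fasta` and `./kmdiff_output/case_kmers.fasta`.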

Testing

An example on a small dataset is available in the examples directory of the repository.

Reporting an issue

If you encounter a problem, please open an issue and include the output of `kmdiff infos`, as well as the content of kmdiff-backtrace.log if it exists.

Contact

Téo Lemane: teo[dot]lemane[at]inria[dot]fr
Rayan Chikhi: rayan[dot]chikhi[at]pasteur[dot]fr
Pierre Peterlongo: pierre[dot]peterlongo[at]inria[dot]fr
