Phylogenetic compression of extremely large genome collections [661k ↘ 16 GiB | BIGSIdata ↘ 48 GiB | AllTheBact'23 ↘ 75 GiB]

MiniPhy – Minimization via Phylogenetic compression (formerly MOF-Compress)

Workflow for phylogenetic compression of microbial genomes, producing highly compressed .tar.xz genome archives. MiniPhy first estimates the evolutionary history of the user-provided genomes and then uses it to guide their compression with XZ. The resulting archives can be distributed to users or re-compressed/indexed by other methods. For more information, see the phylogenetic compression website and the associated paper.


1. Introduction

The user provides a file of files (a plain-text list of input genome files) for each batch in the input/ directory and specifies the requested compression protocols in the configuration file. The input genomes are assumed to be grouped into batches of phylogenetically related genomes, with up to approx. 10k genomes per batch (for more information on batching strategies, see the paper). When executed via make, MiniPhy performs phylogenetic compression of the assemblies or the associated de Bruijn graphs. All compressed outputs and calculated statistics are then placed in output/.
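The principle can be illustrated by a minimal Python sketch. It is not MiniPhy's actual implementation: the function names (`newick_leaf_order`, `pack_in_tree_order`) are hypothetical, the Newick parsing handles only simple unquoted trees, and the files are archived as they are, without any of the preprocessing performed by the workflow. It only shows how the left-to-right leaf order of a tree can determine the order in which genomes enter a .tar.xz archive, so that similar genomes end up next to each other.

```python
import re
import tarfile
from typing import Dict, List


def newick_leaf_order(newick: str) -> List[str]:
    """Return leaf names in left-to-right order.

    Assumption: simple Newick without quoted labels or comments.
    A leaf label directly follows '(' or ','; internal labels follow ')'.
    """
    labels = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [label.strip() for label in labels if label.strip()]


def pack_in_tree_order(newick: str, files_by_leaf: Dict[str, str],
                       archive: str) -> List[str]:
    """Write files into a .tar.xz archive, ordered by the tree's leaves.

    files_by_leaf maps a leaf name to the path of its genome file
    (hypothetical input structure, chosen for this illustration).
    Leaves without a file are skipped. Returns the leaf order used.
    """
    order = [leaf for leaf in newick_leaf_order(newick) if leaf in files_by_leaf]
    with tarfile.open(archive, "w:xz") as tar:
        for leaf in order:
            path = files_by_leaf[leaf]
            # Store only the file name, not the full directory structure
            tar.add(path, arcname=path.replace("\\", "/").split("/")[-1])
    return order
```

With real data, `files_by_leaf` would map each leaf name to the corresponding FASTA file, and the tree would be either supplied by the user or estimated beforehand.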

2. Dependencies

2a. Essential dependencies

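The essential dependencies are GNU Make, Python (>=3.7), Snakemake (>=6.2.0), and Mamba (>=0.20.0), which can be installed with Conda by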

conda install -c conda-forge -c bioconda -c defaults \
  make "python>=3.7" "snakemake-minimal>=6.2.0" "mamba>=0.20.0"

2b. Protocol-specific dependencies

These are installed automatically by Snakemake when they are requested; for instance, ProPhyle is not installed unless Protocol 3 is used. The specifications of individual environments can be found in workflow/envs/, and they contain: Attotree, ETE 3, SeqTK, xopen, Pandas, Jellyfish 2, ProphAsm, and ProPhyle.

All non-essential dependencies across all protocols can also be installed at once by make conda.

3. Installation

Clone and enter the repository by

git clone https://github.com/karel-brinda/miniphy
cd miniphy

Alternatively, the repository can also be installed using cURL by

mkdir miniphy
cd miniphy
curl -L https://github.com/karel-brinda/miniphy/tarball/main \
    | tar xvf - --strip-components=1

4. Usage

4a. Basic example

  • Step 1: Provide lists of input files.
    For every batch, create a txt list of input files in the input/ directory (i.e., as input/{batch_name}.txt). Use either absolute paths (recommended) or paths relative to the root of the GitHub repository (not relative to the txt files).

    Such a list can be generated, for instance, using find:

    find ~/dir_with_my_genomes -name '*.fa' > input/my_first_batch.txt

    The supported input file formats are FASTA and FASTQ (possibly compressed by gzip).

  • Step 2 (optional): Provide corresponding phylogenies.
    Instead of estimating phylogenies by Attotree (which provides functionality similar to Mashtree), it is possible to supply custom phylogenies in the Newick format. The tree files should be named input/{batch_name}.nw, and the leaf names inside should correspond to the FASTA filenames (without FASTA suffixes).

  • Step 3 (optional): Adjust configuration.
    By editing config.yaml it is possible to specify compression protocols, data analyses, and low-level parameters (see below).

  • Step 4: Run the pipeline.
    Run the pipeline by make; this runs Snakemake with the corresponding parameters.

  • Step 5: Retrieve the output files.
    All output files will be located in output/.
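Steps 1 and 2 can also be scripted. The following Python sketch is illustrative only: the helper names (`sample_name`, `write_batch_list`, `compare_with_tree`) and the list of recognized suffixes are assumptions and not part of MiniPhy, and the Newick parsing covers only simple unquoted trees. It writes a batch list with absolute paths and reports leaf names that do not match the listed files.

```python
import re
from pathlib import Path

# Assumed set of FASTA/FASTQ suffixes; adjust to the naming of your data
SUFFIXES = (".fa", ".fasta", ".fna", ".fq", ".fastq")


def sample_name(path):
    """File name without a trailing .gz and without a FASTA/FASTQ suffix."""
    name = Path(path).name
    if name.endswith(".gz"):
        name = name[:-3]
    for suffix in SUFFIXES:
        if name.endswith(suffix):
            return name[:-len(suffix)]
    return name


def write_batch_list(directory, batch_name, input_dir="input"):
    """Write input_dir/{batch_name}.txt with absolute paths, one per line."""
    root = Path(directory).expanduser().resolve()
    paths = sorted(
        p for p in root.rglob("*")
        if p.is_file() and sample_name(p) != p.name.replace(".gz", "")
    )
    out = Path(input_dir) / f"{batch_name}.txt"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("".join(f"{p}\n" for p in paths))
    return out


def compare_with_tree(list_file, newick_file):
    """Return (names only in the list, names only in the tree)."""
    lines = Path(list_file).read_text().splitlines()
    listed = {sample_name(line) for line in lines if line.strip()}
    tree = Path(newick_file).read_text()
    # A leaf label directly follows '(' or ',' in simple Newick
    leaves = {m.strip() for m in re.findall(r"[(,]\s*([^(),:;]+)", tree)}
    leaves.discard("")
    return listed - leaves, leaves - listed
```

Two empty sets returned by `compare_with_tree` indicate that the tree and the list refer to the same genomes.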

4b. Adjusting configuration

The workflow can be configured via the config.yaml file, and all options are documented directly there. The configurable functionality includes:

  • switching off Conda,
  • protocols to use (asm, dBGs, dBGs with propagation),
  • analyses to include (sequence and k-mer statistics),
  • k for de Bruijn graph and k-mer counting,
  • Attotree parameters (phylogeny estimation),
  • XZ parameters (low-level compression), or
  • Jellyfish parameters (k-mer counting).

4c. List of implemented protocols

| Protocol | Representation | Description | Product |
|---|---|---|---|
| Protocol 1 (default) | Assemblies | Left-to-right reordering of the assemblies according to the phylogeny | output/asm/{batch}.tar.xz: original assemblies in FASTA (1) |
| Protocol 2 (optional) | de Bruijn graphs | Simplitigs from individual assemblies, left-to-right reordering of their files | output/pre/{batch}.tar.xz: simplitig text files, representing individual de Bruijn graphs |
| Protocol 3 (optional) | de Bruijn graphs | Bottom-up k-mer propagation using ProPhyle, simplitigs at individual nodes of the tree, and left-to-right reordering of the obtained files | output/post/{batch}.tar.xz and output/post/{batch}.nw: simplitig text files per individual nodes of the tree (2) |

(1) In FASTA 1-line format, with all sequences converted to uppercase (unless switched off in the configuration).
(2) The original de Bruijn graphs can be obtained by merging k-mer sets along the respective root-to-leaf paths.
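
Note (2) can be made concrete with a small Python sketch. It is a toy model and not ProPhyle's code: it assumes that propagation moves the k-mers shared by all children of a node up to that node, it ignores canonical k-mers and simplitig computation, and the names `propagate` and `reconstruct` are hypothetical. Under these assumptions, the union of the k-mer sets along a root-to-leaf path restores the original k-mer set of that leaf.

```python
def propagate(children, leaf_kmers, root):
    """Bottom-up propagation: k-mers shared by all children move to the parent.

    children:   dict mapping an internal node to the list of its children
    leaf_kmers: dict mapping a leaf to its set of k-mers
    Returns a dict mapping every node to its (reduced) k-mer set.
    """
    node_kmers = {}

    def visit(node):
        kids = children.get(node, [])
        if not kids:  # leaf: start from its full k-mer set
            node_kmers[node] = set(leaf_kmers[node])
            return
        for kid in kids:
            visit(kid)
        shared = set.intersection(*(node_kmers[kid] for kid in kids))
        for kid in kids:
            node_kmers[kid] -= shared  # keep only what is not shared
        node_kmers[node] = shared

    visit(root)
    return node_kmers


def reconstruct(leaf, children, node_kmers):
    """Merge the k-mer sets along the path from the leaf up to the root."""
    parent = {kid: node for node, kids in children.items() for kid in kids}
    merged, node = set(), leaf
    while node is not None:
        merged |= node_kmers[node]
        node = parent.get(node)
    return merged
```

For a tree with leaves A and B under an internal node and a third leaf C under the root, a k-mer present in all three genomes is stored once at the root, and `reconstruct` returns it again for each leaf.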

4d. List of workflow commands

MiniPhy is executed via GNU Make, which handles all parameters and passes them to Snakemake. Here's a list of all implemented commands (to be executed as make {command}):

######################
## General commands ##
######################
    all                  Run everything (the default subcommand)
    help                 Print help messages
    conda                Create the conda environments
    clean                Clean all output archives and files with statistics
    cleanall             Clean everything but Conda, Snakemake, and input files
    cleanallall          Clean completely everything
###############
## Reporting ##
###############
    viewconf             View configuration without comments
    reports              Create html report
####################
## For developers ##
####################
    test                 Run the workflow on test data (P1)
    bigtest              Run the workflow on test data (P1, P2, P3)
    format               Reformat all source code
    checkformat          Check source code format

Note: make format and make checkformat require YAPF and Snakefmt, which can be installed by conda install -c conda-forge -c bioconda yapf snakefmt.

4e. Running on a cluster

Cluster-related parameters for Snakemake can be added via the SMK_CLUSTER_ARGS Make variable.

Example:

make SMK_CLUSTER_ARGS="--profile my_snakemake_cluster_profile"

4f. Troubleshooting

Tests can be run by make test (just Protocol 1) or make bigtest (all the protocols).

5. Citation

K. Břinda, L. Lima, S. Pignotti, N. Quinones-Olvera, K. Salikhov, R. Chikhi, G. Kucherov, Z. Iqbal, and M. Baym. Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression. bioRxiv 2023.04.15.536996, 2023. https://doi.org/10.1101/2023.04.15.536996

@article {PhylogeneticCompression,
   author  = {Karel B{\v r}inda and Leandro Lima and Simone Pignotti
               and Natalia Quinones-Olvera and Kamil Salikhov and Rayan Chikhi
               and Gregory Kucherov and Zamin Iqbal and Michael Baym},
   title   = {Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression},
   journal = {bioRxiv},
   elocation-id = {2023.04.15.536996},
   year    = {2023},
   doi     = {10.1101/2023.04.15.536996},
   url     = {https://www.biorxiv.org/content/early/2023/04/16/2023.04.15.536996}
}

6. Issues

Please use GitHub issues.

7. Changelog

See Releases.

8. License

MIT

9. Contacts
