EToKi (Enterobase Tool Kit)

All methods related to Enterobase data analysis pipelines.

INSTALLATION:

EToKi was developed and tested in both Python 2.7 and Python 3.5. EToKi depends on several Python libraries:

ete3
numba
numpy
pandas
psutil
sklearn

All libraries can be installed using pip:

pip install ete3 numba numpy pandas sklearn psutil

EToKi also calls the following 3rd party programs for different pipelines:

raxml
fasttree
rapidnj
bbmap
mmseqs
ncbi-blast
usearch
spades
megahit
samtools
pilon
gatk
bwa
bowtie2
minimap2
kraken2 & minikraken2
lastal & lastdb
pilercr
trf

All 3rd party programs except for usearch can be automatically installed using configure command:

python EToKi.py configure --install --download_krakenDB

NOTE: This has only been tested in Ubutu 16.06 but is expected to run on other 64-bit Linux systems.

Usearch is a commercial program and allows free use of the 32-bit version for individuals. Please download it from https://www.drive5.com/usearch/

After it is downloaded, pass its executable file to EToKi using --usearch

python EToKi.py configure --usearch /path/to/usearch

You can also run both --install and --usearch at the same time:

python EToKi.py configure --install --download_krakenDB --usearch /path/to/usearch

Note that --download_krakenDB will download the minikraken2 database, which is about 8GB in size. Alternatively, you can use --link_krakenDB to pass a different Kraken database to EToKi.

python EToKi.py configure --install --link_krakenDB /path/to/krakenDB --usearch /path/to/usearch

You can also use pre-installed 3rd party programs in EToKi, by passing their absolute paths into the program using --path. This argument can be specified multiple times in the same command:

python EToKi.py configure --path fasttree=/path/to/fasttree --path raxml=/path/to/raxml

Quick Start (with examples)

Trim genomic reads

python EToKi.py prepare --pe examples/S_R1.fastq.gz,examples/S_R2.fastq.gz -p examples/prep_out

Merge and trim metagenomic reads

python EToKi.py prepare --pe examples/S_R1.fastq.gz,examples/S_R2.fastq.gz -p examples/meta_out --noRename --merge

Assemble genomic reads using SPAdes

python EToKi.py assemble --pe examples/prep_out_L1_R1.fastq.gz,examples/prep_out_L1_R2.fastq.gz --se examples/prep_out_L1_SE.fastq.gz -p examples/asm_out

Assemble genomic reads using MEGAHIT

python EToKi.py assemble --se examples/meta_out_L1_MP.fastq.gz \
--pe examples/meta_out_L1_R1.fastq.gz,examples/meta_out_L1_R2.fastq.gz --se examples/meta_out_L1_SE.fastq.gz \
-p examples/asm_out2 --assembler megahit

Map reads onto reference, with pre-filtering with ingroups and outgroups

python EToKi.py assemble --se examples/meta_out_L1_MP.fastq.gz --metagenome \
--pe examples/meta_out_L1_R1.fastq.gz,examples/meta_out_L1_R2.fastq.gz --se examples/meta_out_L1_SE.fastq.gz \
-p examples/map_out -r examples/GCF_000010485.1_ASM1048v1_genomic.fna.gz \
-i examples/GCF_000214765.2_ASM21476v3_genomic.fna.gz -o examples/GCF_000005845.2_ASM584v2_genomic.fna.gz

Prepare reference alleles and a local database for 7 Gene MLST scheme

python EToKi.py MLSTdb -i examples/Escherichia.Achtman.alleles.fasta -r examples/Escherichia.Achtman.references.fasta -d examples/Escherichia.Achtman.convert.tab

Calculate 7 Gene MLST genotype for a queried genome

gzip -cd examples/GCF_001566635.1_ASM156663v1_genomic.fna.gz > examples/GCF_001566635.1_ASM156663v1_genomic.fna && \
python EToKi.py MLSType -i examples/GCF_001566635.1_ASM156663v1_genomic.fna -r examples/Escherichia.Achtman.references.fasta -k G749 -o stdout -d examples/Escherichia.Achtman.convert.tab

Run EBEis (EnteroBase Escherichia in silico serotyping)

python EToKi.py EBEis -t Escherichia -q examples/GCF_000010485.1_ASM1048v1_genomic.fna -p SE15

Cluster sequences into similarity-based groups

python EToKi.py clust -p examples/Escherichia.Achtman.alleles_clust -i examples/Escherichia.Achtman.alleles.fasta -d 0.95 -c 0.95

Do a joint BLASTn-like search using BLASTn, uSearch (uBLASTp), Mimimap and mmseqs

python EToKi.py uberBlast -q examples/Escherichia.Achtman.alleles.fasta -r examples/GCF_001566635.1_ASM156663v1_genomic.fna -o examples/G749_7Gene.bsn --blastn --ublast --minimap --mmseq -s 2 -f

align multiple genomes onto one reference

python EToKi.py align -r GCF_000010485:examples/GCF_000010485.1_ASM1048v1_genomic.fna.gz -p examples/phylo_out \
GCF_000005845:examples/GCF_000005845.2_ASM584v2_genomic.fna.gz \
GCF_000214765:examples/GCF_000214765.2_ASM21476v3_genomic.fna.gz \
GCF_001566635:examples/GCF_001566635.1_ASM156663v1_genomic.fna.gz

Build ML tree using RAxML and place all SNPs onto branches in the tree

cd examples && python ../EToKi.py phylo -t snp2mut -p phylo_out -s phylo_out.matrix.gz --ng && cd ..

USAGE:

The first argument passed into EToKi specifies the command to be called and the rest are the parameters for that command. To see all the commands available in EToKi, use

python EToKi.py -h

And to see the parameters for an individual command, use:

EToKi.py <command> -h

configure - install and/or configure 3rd party programs

See the INSTALL section or the help page below.

usage: EToKi.py configure [-h] [--install] [--usearch USEARCH]
                          [--download_krakenDB]
                          [--link_krakenDB KRAKEN_DATABASE] [--path PATH]

Install or modify the 3rd party programs.

optional arguments:
  -h, --help            show this help message and exit
  --install             install 3rd party programs
  --usearch USEARCH     usearch is required for ortho and MLSType. A 32-bit
                        version of usearch can be downloaded from
                        https://www.drive5.com/usearch/
  --download_krakenDB   When specified, miniKraken2 (8GB) will be downloaded
                        into the EToKi folder. You can also use
                        --link_krakenDB to use a pre-installed kraken2
                        database.
  --link_krakenDB KRAKEN_DATABASE
                        Kraken is optional in the assemble module. You can
                        specify your own database here
  --path PATH, -p PATH  Specify path to the 3rd party programs manually.
                        Format: <program>=<path>. This parameter can be
                        specified multiple times

prepare - trim, collapse, downsize and rename the short reads

usage: EToKi.py prepare [-h] [--pe PE] [--se SE] [-p PREFIX] [-q READ_QUAL]
                        [-b MAX_BASE] [-m MEMORY] [--noTrim] [--merge]
                        [--noRename]

EToKi.py prepare
(1) Concatenates reads of the same library together.
(2) Merge pair-end sequences for metagenomic reads (bbmap).
(3) Trims sequences based on base-qualities (bbduk).
(4) Removes potential adapters and barcodes (bbduk).
(5) Limits total amount of reads to be used.
(6) Renames reads using sequential numbers.

optional arguments:
  -h, --help            show this help message and exit
  --pe PE               comma delimited files of PE reads from the same library.
                        e.g. --pe a_R1.fq.gz,a_R2.fq.gz,b_R1.fq.gz,b_R2.fq.gz
                        This can be specified multiple times for different libraries.
  --se SE               comma delimited files of SE reads from the same library.
                        e.g. --se c_SE.fq.gz,d_SE.fq.gz
                        This can be specified multiple times for different libraries.
  -p PREFIX, --prefix PREFIX
                        prefix for the outputs. Default: EToKi_prepare
  -q READ_QUAL, --read_qual READ_QUAL
                        Minimum quality to be kept in bbduk. Default: 6
  -b MAX_BASE, --max_base MAX_BASE
                        Total amount of bases (in BPs) to be kept.
                        Default as -1 for no restriction.
                        Suggest to use ~100X coverage for de novo assembly.
  -m MEMORY, --memory MEMORY
                        maximum amount of memory to be used in bbduk. Default: 30g
  --noTrim              Do not do quality trim using bbduk
  --merge               Try to merge PE reads by their overlaps using bbmap
  --noRename            Do not rename reads

assemble - de novo or reference-guided assembly for genomic or metagenomic reads

EToKi assemble is a joint method for both de novo assembly and reference-guided assembly.

de novo assembly approach calls either SPAdes (default) or MEGAHIT (default for metagenomic data) on short reads that have been cleaned up using EToKi prepare, and uses Pilon to polish the assembled scaffolds and evaluate the reliability of consensus bases of the scaffolds.
Reference-guided assembly is also called "reference mapping". Short reads are aligned to a user-specified reference genome using minimap2. Nucleotide bases of the reference genome are updated using Pilon, according to the consensus base calls of the covered reads. Non-specific metagenomic reads of closely related species can sometimes also align to the reference genome and confuse consensus calling. Two arguments, --outgroup and --ingroup, are given to pre-filter these non-specific reads and obtain clean SNP calls.

usage: EToKi.py assemble [-h] [--pe PE] [--se SE] [--pacbio PACBIO] [--ont ONT] [-p PREFIX] [-a ASSEMBLER] [-r REFERENCE] [-k KMERS] [-m MAPPER] [-d MAX_DIFF] [-i INGROUP] [-o OUTGROUP] [-S SNP] [-c CONT_DEPTH]
                         [--excluded EXCLUDED] [--metagenome] [--numPolish NUMPOLISH] [--reassemble] [--onlySNP] [--noQuality] [--onlyEval] [--kraken]

EToKi.py assemble
(1.1) Assembles short reads into assemblies, or
(1.2) Maps them onto a reference.
And
(2) Polishes consensus using polish,
(3) Removes low level contaminations.
(4) Estimates the base quality of the consensus.
(5) Predicts taxonomy using Kraken.

optional arguments:
  -h, --help            show this help message and exit
  --pe PE               comma delimited two files of PE reads.
  --se SE               one file of SE read.
  --pacbio PACBIO       one file of pacbio read.
  --ont ONT             one file of nanopore read.
  -p PREFIX, --prefix PREFIX
                        prefix for the outputs. Default: EToKi_assemble
  -a ASSEMBLER, --assembler ASSEMBLER
                        Assembler used for de novo assembly.
                        Disabled if you specify a reference.
                        Default: spades for single colony isolates, megahit for metagenome.
                         Long reads will always be assembled with Flye
  -r REFERENCE, --reference REFERENCE
                        Reference for read mapping. Specify this for reference mapping module.
  -k KMERS, --kmers KMERS
                        relative lengths of kmers used in SPAdes. Default: 30,50,70,90
  -m MAPPER, --mapper MAPPER
                        aligner used for read mapping.
                        options are: miminap (default), bwa or bowtie2
  -d MAX_DIFF, --max_diff MAX_DIFF
                        Maximum proportion of variations allowed for a aligned reads.
                        Default: 0.1 for single isolates, 0.05 for metagenome
  -i INGROUP, --ingroup INGROUP
                        Additional references presenting intra-population genetic diversities.
  -o OUTGROUP, --outgroup OUTGROUP
                        Additional references presenting genetic diversities outside of the studied population.
                        Reads that are more similar to outgroups will be excluded from analysis.
  -S SNP, --SNP SNP     Exclusive set of SNPs. This will overwrite the polish process.
                        Required format:
                        <cont_name> <site> <base_type>
                        ...
  -c CONT_DEPTH, --cont_depth CONT_DEPTH
                        Allowed range of read depth variations relative to average value.
                        Default: 0.2,2.5
                        Contigs with read depths outside of this range will be removed from the final assembly.
  --excluded EXCLUDED   A name of the file that contains reads to be excluded from the analysis.
  --metagenome          Reads are from metagenomic samples
  --numPolish NUMPOLISH
                        Number of Pilon polish iterations. Default: 1
  --reassemble          Do local re-assembly in PILON. Suggest to use this flag with long reads.
  --onlySNP             Only modify substitutions during the PILON polish.
  --noQuality           Do not estimate base qualities.
  --onlyEval            Do not run assembly/mapping. Only evaluate assembly status.
  --kraken              Run kmer based species predicton on contigs.

ortho - pan-genome (and wgMLST scheme) prediction

EToKi ortho has now been migrated to a separate repository and renamed as PEPPA.

MLSTdb - Set up exemplar alleles and database for MLST schemes

EToKi MLSTdb converts existing allelic sequences into two files: (1) a multi-fasta file of exemplar allelic sequences and (2) a lookup table for the EToKi MLSType method.

The exemplar alleles are defined as:
1. Over 40% identity to the allelic sequences of a reference genome specified by --refstrain
2. Less than 90% identity between different exemplar sequences of the same locus
3. Identity to sequences of any different locus that is at least 10% less than the similarity to sequences of the same locus.

usage: EToKi.py MLSTdb [-h] -i ALLELEFASTA [-r REFSET] [-d DATABASE]
                       [-s REFSTRAIN] [-x MAX_IDEN] [-m MIN_IDEN] [-p PARALOG]
                       [-c COVERAGE] [-e]

MLSTdb. Create reference sets of alleles for nomenclature.

optional arguments:
  -h, --help            show this help message and exit
  -i ALLELEFASTA, --input ALLELEFASTA
                        [REQUIRED] A single file contains all known alleles in
                        a MLST scheme.
  -r REFSET, --refset REFSET
                        [DEFAULT: No ref allele] Output - Reference alleles
                        used for MLSType.
  -d DATABASE, --database DATABASE
                        [DEFAULT: No allele DB] Output - A lookup table of all
                        alleles.
  -s REFSTRAIN, --refstrain REFSTRAIN
                        [DEFAULT: None] A single file contains alleles from
                        the reference genome.
  -x MAX_IDEN, --max_iden MAX_IDEN
                        [DEFAULT: 0.9 ] Maximum identities between resulting
                        refAlleles.
  -m MIN_IDEN, --min_iden MIN_IDEN
                        [DEFAULT: 0.4 ] Minimum identities between refstrain
                        and resulting refAlleles.
  -p PARALOG, --paralog PARALOG
                        [DEFAULT: 0.1 ] Minimum differences between difference
                        loci.
  -c COVERAGE, --coverage COVERAGE
                        [DEFAULT: 0.7 ] Proportion of aligned regions between
                        alleles.
  -e, --relaxEnd        [DEFAULT: False ] Allow changed ends (for pubmlst).

MLSType - MLST nomenclature using a local set of references

EToKi MLSType identities allelic sequences in a queried genome, by comparing it with the exemplar alleles generated by MLSTdb.

usage: EToKi.py MLSType [-h] -i GENOME -r REFALLELE -k UNIQUE_KEY
                       [-d DATABASE] [-o OUTPUT] [-q] [-f] [-m MIN_IDEN]
                       [-p MIN_FRAG_PROP] [-l MIN_FRAG_LEN] [-x INTERGENIC]
                       [--overlap_prop OVERLAP_PROP]
                       [--overlap_iden OVERLAP_IDEN] [--max_dist MAX_DIST]
                       [--diag_diff DIAG_DIFF] [--max_diff MAX_DIFF]

MLSType. Find and designate MLST alleles from a queried assembly.

optional arguments:
 -h, --help            show this help message and exit
 -i GENOME, --genome GENOME
                       [REQUIRED] Input - filename for genomic assembly.
 -r REFALLELE, --refAllele REFALLELE
                       [REQUIRED] Input - fasta file for reference alleles.
 -k UNIQUE_KEY, --unique_key UNIQUE_KEY
                       [REQUIRED] An unique identifier for the assembly.
 -d DATABASE, --database DATABASE
                       [OPTIONAL] Input - lookup table of existing alleles.
 -o OUTPUT, --output OUTPUT
                       [DEFAULT: No output] Output - filename for the
                       generated alleles. Specify to STDOUT for screen
                       output.
 -q, --query_only      [DEFAULT: False] Do not submit new allele, only query.
 -f, --force           [DEFAULT: False] Force to accept low quality alleles.
 -m MIN_IDEN, --min_iden MIN_IDEN
                       [DEFAULT: 0.65 ] Minimum identities between refAllele
                       and genome.
 -p MIN_FRAG_PROP, --min_frag_prop MIN_FRAG_PROP
                       [DEFAULT: 0.6 ] Minimum covereage of a fragment.
 -l MIN_FRAG_LEN, --min_frag_len MIN_FRAG_LEN
                       [DEFAULT: 50 ] Minimum length of a fragment.
 -x INTERGENIC, --intergenic INTERGENIC
                       [DEFAULT: -1,-1 ] Call alleles in intergenic region if
                       the distance between two closely located loci fall
                       within the range defined by the two numbers. Suggest
                       to use 50,500. This is diabled by default with minus
                       numbers.
 --overlap_prop OVERLAP_PROP
                       [DEFAULT: 0.5 ] Given two hits, if <overlap_prop> of
                       their regions overlap, and the sequence identities of
                       one hits is <overlap_iden> lower than the other. The
                       hit with lower identities will be removed.
 --overlap_iden OVERLAP_IDEN
                       [DEFAULT: 0.05 ] Given two hits, if <overlap_prop> of
                       their regions overlap, and the sequence identities of
                       one hits is <overlap_iden> lower than the other. The
                       hit with lower identities will be removed.
 --max_dist MAX_DIST   [DEFAULT: 300 ] Consider two closely located hits as a
                       synteny block if their coordinates in both queried
                       genomes and reference gene are seperated by no more
                       than <max_dist> bps.
 --diag_diff DIAG_DIFF
                       [DEFAULT: 1.2 ] Consider two closely located hits as a
                       synteny block if, after merged, its covered region in
                       the queried genome is no more than <diag_diff> folds
                       of the region in the reference gene.
 --max_diff MAX_DIFF   [DEFAULT: 200 ] Consider two closely located hits as a
                       synteny block if, after merged, the lengths of its
                       covered regions in the queried genome and the
                       reference gene are differed by no more than <max_diff>
                       bps.

align - align multiple queried genomes to a single reference

usage: EToKi.py align [-h] -r REFERENCE [-p PREFIX] [-a] [-m] [-l] [-c CORE]
                      [-n N_PROC]
                      queries [queries ...]

Align multiple genomes onto a single reference.

positional arguments:
  queries               queried genomes. Use <Tag>:<Filename> format to feed
                        in a tag for each genome. Otherwise filenames will be
                        used as tags for genomes.

optional arguments:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        [REQUIRED; INPUT] reference genomes to be aligned
                        against. Use <Tag>:<Filename> format to assign a tag
                        to the reference.
  -p PREFIX, --prefix PREFIX
                        [OUTPUT] prefix for all outputs.
  -a, --alignment       [OUTPUT] Generate core genomic alignments in FASTA
                        format
  -m, --matrix          [OUTPUT] Do not generate core SNP matrix
  -l, --last            Activate to use LAST as aligner. [DEFAULT: minimap2]
  -c CORE, --core CORE  [PARAM] percentage of presences for core genome.
                        [DEFAULT: 0.95]
  -n N_PROC, --n_proc N_PROC
                        [PARAM] number of processes to use. [DEFAULT: 5]

phylo - infer phylogeny and ancestral states from genomic alignments

usage: EToKi.py phylo [-h] [--tasks TASKS] --prefix PREFIX
                      [--alignment ALIGNMENT] [--snp SNP] [--tree TREE]
                      [--ancestral ANCESTRAL] [--core CORE] [--n_proc N_PROC]

EToKi phylo runs to:
(1) Generate SNP matrix from alignment (-t matrix)
(2) Calculate ML phylogeny from SNP matrix using RAxML (-t phylogeny)
(3) Workout the nucleotide sequences of internal nodes in the tree using ML estimation (-t ancestral or -t ancestral_proportion for ratio frequencies)
(4) Place mutations onto branches of the tree (-t mutation)

optional arguments:
  -h, --help            show this help message and exit
  --tasks TASKS, -t TASKS
                        Tasks to call. Allowed tasks are:
                        matrix: generate SNP matrix from alignment.
                        phylogeny: generate phylogeny from SNP matrix.
                        ancestral: generate AS (ancestral state) matrix from SNP matrix and phylogeny
                        ancestral_proportion: generate possibilities of AS for each site
                        mutation: assign SNPs into branches from AS matrix

                        You can run multiple tasks by sending a comma delimited task list.
                        There are also some pre-defined task combo:
                        all: matrix,phylogeny,ancestral,mutation
                        aln2phy: matrix,phylogeny [default]
                        snp2anc: phylogeny,ancestral
                        mat2mut: ancestral,mutation
  --prefix PREFIX, -p PREFIX
                        prefix for all outputs.
  --alignment ALIGNMENT, -m ALIGNMENT
                        aligned sequences in either fasta format or Xmfa format. Required for "matrix" task.
  --snp SNP, -s SNP     SNP matrix in specified format. Required for "phylogeny" and "ancestral" if alignment is not given
  --tree TREE, -z TREE  phylogenetic tree. Required for "ancestral" task
  --ancestral ANCESTRAL, -a ANCESTRAL
                        Inferred ancestral states in a specified format. Required for "mutation" task
  --core CORE, -c CORE  Core genome proportion. Default: 0.95
  --n_proc N_PROC, -n N_PROC
                        Number of processes. Default: 7.

EBEis - in silico serotype prediction for Escherichia coli & Shigella spp.

EBEis is a BLASTn based prediction tool for the O and H antigens of Escherichia coli and Shigella. It uses essential genes (wzx, wzy, wzt & wzm for O; fliC for H) as markers. EBEis uses a database built from two sources:

SeroTypeFinder
O-antigen gene sequences reported in DebRoy et al., PLoS ONE, 2016

usage: EToKi.py EBEis [-h] -q QUERY [-t TAXON] [-p PREFIX]

EnteroBase Escherichia in silico serotyping

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        file name for the queried assembly in multi-FASTA format.
  -t TAXON, --taxon TAXON
                        Taxon database to compare with. 
                        Only support Escherichia (default) for the moment.
  -p PREFIX, --prefix PREFIX
                        prefix for intermediate files. Default: EBEis

isCRISPOL - in silico prediction of CRISPOL array for Salmonella enterica serovar Typhimurium

CRISPOL is an oligo based Typhimurium sub-typing method described in (Fabre et al., PLoS ONE, 2012). We use the direct repeats (DRs) and spacers in the Typhimurium CPRISR array to predict CRISPOL types from genomic assemblies.

usage: EToKi.py isCRISPOL [-h] [N [N ...]]

in silico Typhimurium subtyping using CRISPOL scheme (Fabre et al., PLoS ONE, 2012)

positional arguments:
  N           FASTA files containing assemblies of S. enterica Typhimurium.

optional arguments:
  -h, --help  show this help message and exit

uberBlast - Use BLASTn, uBLASTp, minimap2 and/or mmseqs to identify similar sequences

EToKi uberBlast is also internally called by EToKi ortho to align exemplar genes to queried genomes, using both BLASTn and uSearch-uBLASTp. Amino acid alignments are converted back to nucleotide sequences, meaning that genome coordinates remain consistent across different methods.

minimap2 --- Fastest alignment in nucleotide level. High accuracy in identities >= 90%, but lose sensitivity quickly for lower identities.
blastn --- Fast alignment in nucleotide level. Lose sensitivity for identities < 80%
mmseqs --- Amino acid based alignment for identities >= 70% (open source)
uBLASTp --- Amino acid based alignment for identities < 50% (commercial software)

usage: EToKi.py uberBlast [-h] -r REFERENCE -q QUERY [-o OUTPUT] [--blastn]
                          [--ublast] [--ublastSELF] [--minimap] [--minimapASM]
                          [--mmseq] [--min_id MIN_ID] [--min_cov MIN_COV]
                          [--min_ratio MIN_RATIO] [-s RE_SCORE] [-f]
                          [--filter_cov FILTER_COV]
                          [--filter_score FILTER_SCORE] [-m]
                          [--merge_gap MERGE_GAP] [--merge_diff MERGE_DIFF]
                          [-O] [--overlap_length OVERLAP_LENGTH]
                          [--overlap_proportion OVERLAP_PROPORTION]
                          [-e FIX_END] [-t N_THREAD] [-p]

Five different alignment methods.

optional arguments:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        [INPUT; REQUIRED] filename for the reference. This is
                        normally a genomic assembly.
  -q QUERY, --query QUERY
                        [INPUT; REQUIRED] filename for the query. This can be
                        short-reads or genes or genomic assemblies.
  -o OUTPUT, --output OUTPUT
                        [OUTPUT; Default: None] save result to a file or to
                        screen (stdout). Default do nothing.
  --blastn              Run BLASTn. Slowest. Good for identities between [80,
                        100]
  --ublast              Run uBLAST in tBLASTn mode. Fast. Good for identities
                        between [30-100]
  --ublastSELF          Run uBLAST in tBLASTn mode. Fast. Good for identities
                        between [30-100]
  --minimap             Run minimap. Fast. Good for identities between
                        [90-100]
  --minimapASM          Run minimap on assemblies. Fast. Good for identities
                        between [90-100]
  --mmseq               Run mmseq2 in tBLASTn mode. Fast. Good for identities
                        between [70-100]
  --min_id MIN_ID       [DEFAULT: 0.3] Minimum identity before reScore for an
                        alignment to be kept
  --min_cov MIN_COV     [DEFAULT: 40] Minimum length for an alignment to be
                        kept
  --min_ratio MIN_RATIO
                        [DEFAULT: 0.05] Minimum length for an alignment to be
                        kept, proportional to the length of the query
  -s RE_SCORE, --re_score RE_SCORE
                        [DEFAULT: 0] Re-interpret alignment scores and
                        identities. 0: No rescore; 1: Rescore with
                        nucleotides; 2: Rescore with amino acid; 3: Rescore
                        with codons
  -f, --filter          [DEFAULT: False] Remove secondary alignments if they
                        overlap with any other regions
  --filter_cov FILTER_COV
                        [DEFAULT: 0.9]
  --filter_score FILTER_SCORE
                        [DEFAULT: 0]
  -m, --linear_merge    [DEFAULT: False] Merge consective alignments
  --merge_gap MERGE_GAP
                        [DEFAULT: 300]
  --merge_diff MERGE_DIFF
                        [DEFAULT: 1.2]
  -O, --return_overlap  [DEFAULT: False] Report overlapped alignments
  --overlap_length OVERLAP_LENGTH
                        [DEFAULT: 300] Minimum overlap to report
  --overlap_proportion OVERLAP_PROPORTION
                        [DEFAULT: 0.6] Minimum overlap proportion to report
  -e FIX_END, --fix_end FIX_END
                        [FORMAT: L,R; DEFAULT: 0,0] Extend alignment to the
                        edges if the un-aligned regions are <= [L,R]
                        basepairs.
  -t N_THREAD, --n_thread N_THREAD
                        [DEFAULT: 8] Number of threads to use.
  -p, --process         [DEFAULT: False] Use processes instead of threads.

clust - linear-time clustering of short sequences using mmseqs linclust

EToKi clust is called internally by EToKi ortho to cluster seed genes into gene clusters. Given its linear-time complexity, it can cluster millions of gene sequences in minutes.

usage: EToKi.py clust [-h] -i INPUT -p PREFIX [-d IDENTITY] [-c COVERAGE]
                      [-t N_THREAD]

Get clusters and exemplars of clusters from gene sequences using mmseqs linclust.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        [INPUT; REQUIRED] name of the file containing gene sequneces in FASTA format.
  -p PREFIX, --prefix PREFIX
                        [OUTPUT; REQUIRED] prefix of the outputs.
  -d IDENTITY, --identity IDENTITY
                        [PARAM; DEFAULT: 0.9] minimum intra-cluster identity.
  -c COVERAGE, --coverage COVERAGE
                        [PARAM; DEFAULT: 0.9] minimum intra-cluster coverage.
  -t N_THREAD, --n_thread N_THREAD
                        [PARAM; DEFAULT: 8] number of threads to use.

Name		Name	Last commit message	Last commit date
Latest commit History 531 Commits
.idea		.idea
bin		bin
examples		examples
externals		externals
modules		modules
.gitignore		.gitignore
EToKi.py		EToKi.py
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
example.bash		example.bash
requirement.txt		requirement.txt

License

zheminzhou/EToKi

Folders and files

Latest commit

History

Repository files navigation