
seq_typing

Determines which reference sequence is more likely to be present in a given sample


Rationale

seq_typing is a tool that determines the type of a given sample using either a read mapping approach or a Blast search against a set of reference sequences.
For the read mapping approach, the sample's reads are mapped to the given reference sequences using Bowtie2, parsed with Samtools and analysed via ReMatCh. Based on the length of the sequence covered and its depth of coverage, seq_typing returns the type associated with the reference sequence that is most likely to be present. Among the sequences that pass the defined thresholds, the selected one is the sequence covered to the greatest extent, then with the highest depth of coverage, and then with the highest identity (criteria applied hierarchically, in this order).
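
The hierarchical selection described for the read mapping approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not seq_typing's own code: the `Candidate` record, its field names and the `select_reference` function are hypothetical, and the identity threshold of 80 is an arbitrary placeholder (the coverage and depth thresholds reuse the documented defaults of `--minGeneCoverage` and `--minDepthCoverage`).

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    """Hypothetical per-reference summary (names are illustrative only)."""
    name: str
    coverage: float  # percentage of the reference sequence covered (0-100)
    depth: float     # mean depth of coverage
    identity: float  # percentage of identity (0-100)


def select_reference(candidates: Sequence[Candidate],
                     min_coverage: float = 60.0,
                     min_depth: float = 2.0,
                     min_identity: float = 80.0) -> Optional[Candidate]:
    """Return the best candidate passing all thresholds, or None."""
    passing = [c for c in candidates
               if c.coverage >= min_coverage
               and c.depth >= min_depth
               and c.identity >= min_identity]
    if not passing:
        return None
    # Tuples compare element by element, which yields the hierarchical
    # order: coverage first, then depth of coverage, then identity.
    return max(passing, key=lambda c: (c.coverage, c.depth, c.identity))
```

With two candidates that are both fully covered, the one with the higher depth of coverage is returned, and identity is only consulted if depth is also tied.
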
For the Blast approach (when using sequence fasta files), the hit selected for each DB sequence is the best Blast hit. The best hit is defined by the largest alignment length, highest similarity, lowest E-value, lowest number of gaps, and largest reference sequence length (applied hierarchically, in this order). The sequence selection criteria are the same as in the read mapping approach (although the depth of coverage will always be 1).
In both cases, producing the reference sequences database requires manual curation and sequence type definition.
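
As a rough illustration of the best-hit ordering described for the Blast approach, the sketch below ranks hits with a single composite key. The `BlastHit` record and the `best_hit` function are hypothetical names introduced here for illustration; how seq_typing parses and stores Blast results is not shown in this document.

```python
from typing import NamedTuple, Sequence


class BlastHit(NamedTuple):
    """Hypothetical Blast hit summary (field names are illustrative only)."""
    subject: str
    align_length: int
    similarity: float
    evalue: float
    gaps: int
    reference_length: int


def best_hit(hits: Sequence[BlastHit]) -> BlastHit:
    """Pick the best hit following the hierarchical order of the criteria.

    Values where "larger is better" are negated so that min() can be used
    for every criterion at once. Raises ValueError if hits is empty.
    """
    return min(hits, key=lambda h: (-h.align_length,
                                    -h.similarity,
                                    h.evalue,
                                    h.gaps,
                                    -h.reference_length))
```

Because the key is a tuple, a later criterion (for example the number of gaps) only matters when all earlier ones are tied.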

Input requirements

  • Illumina Fastq files
    OR
  • Sequence fasta file

Dependencies

For get_stx_db.py script:

Install dependencies

ReMatCh:

git clone https://github.com/B-UMMI/ReMatCh.git
cd ReMatCh
python3 setup.py install

NOTE:
If you don't have permission for global system installation, try the following install command instead:
python3 setup.py install --user

Blast+:

wget ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ncbi-blast-*-x64-linux.tar.gz
tar xf ncbi-blast-*-x64-linux.tar.gz
rm ncbi-blast-*-x64-linux.tar.gz
cd ncbi-blast-*/bin

# Temporarily add Blast binaries to the PATH
export PATH="$(pwd -P):$PATH"

# Permanently add Blast binaries to the PATH
echo "export PATH=\"$(pwd -P):\$PATH\"" >> ~/.profile

Install seq_typing

git clone https://github.com/B-UMMI/seq_typing.git
cd seq_typing
python3 setup.py install

NOTE:
If you don't have permission for global system installation, try the following install command instead:
python3 setup.py install --user

Usage

General use

General info

usage: seq_typing.py [-h] [--version] {reads,index,assembly,blast} ...

Determines which reference sequence is more likely to be present in a given
sample

optional arguments:
  -h, --help            Show this help message and exit
  --version             Version information

Subcommands:
  Valid subcommands

  {reads,index,assembly,blast}
                        Additional help
    reads               reads --help
    index               index --help
    assembly            assembly --help
    blast               blast --help

  • index module:
    Creates a Bowtie2 index. This is useful when the same reference sequences file is used for several read datasets.
  • reads module:
    Runs seq_typing.py using fastq files. If running multiple samples with the same reference sequences file, consider running the seq_typing.py index module first.
  • blast module:
    Creates a Blast DB. This is useful when the same DB sequence file is used for several assemblies.
  • assembly module:
    Runs seq_typing.py using a fasta file. If running multiple samples with the same DB sequence file, consider running the seq_typing.py blast module first.

index module

Creates Bowtie2 index.
This is useful when the same reference sequences file is used for several read datasets.

usage: seq_typing.py index [-h]
                           -r /path/to/reference.fasta ... | --org escherichia coli
                           [-o /path/to/output/directory/] [-j N]

Creates Bowtie2 index. This is useful when running the same reference
sequences file for different reads dataset.

optional arguments:
  -h, --help            show this help message and exit

Required one of the following options:
  -r --reference /path/to/reference.fasta  ...
                        Path to reference sequences files. If more than one
                        file is passed, a Bowtie2 index for each file will be
                        created. (default: None)
  --org escherichia coli
                        Organism option with reference sequences provided
                        ("seqtyping/reference_sequences/" folder) together
                        with seq_typing.py for typing (default: None)

General facultative options:
  -o --outdir /path/to/output/directory/
                        Path to the directory where the information will be
                        stored (default: ./) (default: .)
  -j N, --threads N     Number of threads to use (default: 1) (default: 1)

reads module

Run seq_typing.py using fastq files.

usage: seq_typing.py reads [-h]
                           -f /path/to/input/file.fq.gz ...
                           -r /path/to/reference_sequence.fasta ... | --org escherichia coli
                           [-s sample-ID] [-o /path/to/output/directory/] [-j N]
                           [--typeSeparator _]
                           [--extraSeq N] [--minCovPresence N]
                           [--minCovCall N] [--minGeneCoverage N]
                           [--minDepthCoverage N] [--minGeneIdentity N]
                           [--bowtieAlgo="--very-sensitive-local"] [--maxNumMapLoc N]
                           [--doNotRemoveConsensus] [--saveNewAllele] [--typeNotInNew]
                           [--debug] [--resume]

Run seq_typing.py using fastq files. If running multiple samples using the
same reference sequences file, consider use first "seq_typing.py index"
module.

optional arguments:
  -h, --help            show this help message and exit

Required options:
  -f --fastq /path/to/input/file.fq.gz ...
                        Path to single OR paired-end fastq files. If two files
                        are passed, they will be assumed as being the paired
                        fastq files

Required one of the following options:
  -r --reference /path/to/reference_sequence.fasta ...
                        Path to reference sequences files. If Bowtie2 index was
                        already produced, only provide the file name that ends
                        with ".1.bt2", but without this termination (for
                        example, for a Bowtie2 index
                        "/file/sequences.fasta.1.bt2", only provide
                        "/file/sequences.fasta"). If no Bowtie2 index files
                        are found, those will be created in --outdir. If more
                        than one file is passed, a type for each file will be
                        determined. Give the files name in the same order that
                        the type must be determined. (default: None)
  --org escherichia coli
                        Organism option with reference sequences provided
                        together with seq_typing.py for typing
                        ("seqtyping/reference_sequences/" folder)

General facultative options:
  -s --sample sample-ID
                        Sample name (default: sample)
  -o --outdir /path/to/output/directory/
                        Path to the directory where the information will be
                        stored (default: ./)
  -j --threads N        Number of threads to use (default: 1)
  --typeSeparator _     Last single character separating the general sequence
                        header from the last part containing the type (default: _)
  --extraSeq N          Sequence length added to both ends of target sequences
                        (usefull to improve reads mapping to the target one)
                        that will be trimmed in ReMatCh outputs
                        (default when not using --org: 0)
  --minCovPresence N    Reference position minimum coverage depth to consider
                        the position to be present in the sample
                        (default when not using --org: 5)
  --minCovCall N        Reference position minimum coverage depth to perform a
                        base call (default when not using --org: 10)
  --minGeneCoverage N   Minimum percentage of target reference sequence
                        covered to consider a sequence to be present (value
                        between [0, 100]) (default when not using --org: 60)
  --minDepthCoverage N  Minimum depth of coverage of target reference sequence
                        to consider a sequence to be present (default: 2)
  --minGeneIdentity N   Minimum percentage of identity of reference sequence
                        covered to consider a gene to be present (value
                        between [0, 100]). One INDEL will be considered as one
                        difference
  --bowtieAlgo="--very-sensitive-local"
                        Bowtie2 alignment mode. It can be an end-to-end
                        alignment (unclipped alignment) or local alignment
                        (soft clipped alignment). Also, can choose between
                        fast or sensitive alignments. Please check Bowtie2
                        manual for extra information:
                        http://bowtie-bio.sourceforge.net/bowtie2/index.shtml .
                        This option should be provided between quotes and
                        starting with an empty space
                        (like --bowtieAlgo " --very-fast") or using equal
                        sign (like --bowtieAlgo="--very-fast")
                        (default when not using --org: "--very-sensitive-local")
  --maxNumMapLoc N      Maximum number of locations to which a read can map
                        (sometimes useful when mapping against similar sequences)
                        (default when not using --org: 1)
  --saveNewAllele       Save the new allele found for the selected type
                        (default: false)
  --typeNotInNew        Do not save the type of the selected sequence in the header
                        of the new allele (when writing uses the "--typeSeparator").
                        (default: false)
  --doNotRemoveConsensus
                        Do not remove ReMatCh consensus sequences
  --debug               Debug mode: do not remove temporary files
  --resume              Resume seq_typing.py reads
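
The --reference help text above asks for the Bowtie2 index path without its ".1.bt2" termination. A minimal sketch of that path handling, assuming a hypothetical helper name (`bowtie2_reference_argument` is not part of seq_typing):

```python
def bowtie2_reference_argument(index_file: str) -> str:
    """Return the value to pass to --reference for an existing Bowtie2 index.

    If the path ends with ".1.bt2" the termination is stripped; any other
    path (for example a plain fasta file) is returned unchanged.
    """
    suffix = ".1.bt2"
    if index_file.endswith(suffix):
        return index_file[:-len(suffix)]
    return index_file
```

For the example given in the help text, "/file/sequences.fasta.1.bt2" becomes "/file/sequences.fasta".
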
blast module

Creates Blast DB.
This is useful when running the same DB sequence file for different assemblies.

usage: seq_typing.py blast [-h]
                           -t nucl
                           -f /path/to/db.sequences.fasta ... | --org escherichia coli
                           [-o /path/to/output/directory/] [--extraSeq N]

Creates Blast DB. This is useful when running the same DB sequence file for
different assemblies.

optional arguments:
  -h, --help            show this help message and exit

Required one of the following options:
  -f --fasta /path/to/db.sequences.fasta ...
                        Path to DB sequences files. If more than one file is
                        passed, a Blast DB for each file will be created.
  --org escherichia coli
                        Organism option with DB sequences files provided
                        ("seqtyping/reference_sequences/" folder) together with
                        seq_typing.py for typing

Required option for --fasta:
  -t nucl, --type nucl  Blast DB type (available options: nucl, prot)

General facultative options:
  -o --outdir /path/to/output/directory/
                        Path to the directory where the information will be
                        stored (default: ./)
  --extraSeq N          Sequence length added to both ends of target sequences
                        (usefull when analysing data by reads mapping)
                        that will be trimmed for Blast analysis.
                        (default when not using --org: 0)
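
The effect of --extraSeq, as described in the help text above, can be pictured as removing the same number of positions from both ends of each target sequence. The following is a sketch of that idea only; `trim_extra_sequence` is a hypothetical name, and the way seq_typing performs the trimming internally is not shown in this document.

```python
def trim_extra_sequence(sequence: str, extra_seq: int) -> str:
    """Remove extra_seq positions from both ends of a sequence.

    With extra_seq equal to 0 the sequence is returned unchanged; if the
    sequence is shorter than twice extra_seq, an empty string is returned.
    """
    if extra_seq < 0:
        raise ValueError("extra_seq must be zero or a positive integer")
    return sequence[extra_seq:len(sequence) - extra_seq]
```
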
assembly module

Run seq_typing using a fasta file.
If running multiple samples with the same DB sequence file, consider running the seq_typing.py blast module first.

usage: seq_typing.py assembly [-h]
                              -f /path/to/query/assembly_file.fasta
                              -b /path/to/Blast/db.sequences.file ... -t nucl | --org escherichia coli
                              [-s sample-ID] [-o /path/to/output/directory/] [-j N]
                              [--typeSeparator _] [--extraSeq N] [--minGeneCoverage N]
                              [--minGeneIdentity N] [--saveNewAllele] [--typeNotInNew]
                              [--debug] [--resume]

Run seq_typing.py using a fasta file. If running multiple samples using the
same DB sequence file, consider use first "seq_typing.py blast"
module.

optional arguments:
  -h, --help            show this help message and exit

Required options:
  -f /path/to/query/assembly_file.fasta, --fasta /path/to/query/assembly_file.fasta
                        Path to fasta file containing the query sequences from
                        which the types should be assessed

Required one of the following options:
  -b --blast /path/to/Blast/db.sequences.file ...
                        Path to DB sequences files. If Blast DB was already
                        produced, only provide the file that do not end with
                        ".n*" something (do not use for example
                        /blast_db.sequences.fasta.nhr). If no Blast DB is
                        found for the DB sequence file, one will be created in
                        --outdir. If more than one Blast DB file is passed, a
                        type for each file will be determined. Give the files
                        in the same order that the type must be determined.
  --org escherichia coli
                        Organism option with DB sequences files provided
                        ("seqtyping/reference_sequences/" folder) together with
                        seq_typing.py for typing

Required option for --blast:
  -t --type nucl        Blast DB type (available options: nucl, prot)

General facultative options:
  -s --sample sample-ID
                        Sample name (default: sample)
  -o --outdir /path/to/output/directory/
                        Path to the directory where the information will be
                        stored (default: ./)
  -j --threads N        Number of threads to use (default: 1)
  --typeSeparator _     Last single character separating the general sequence
                        header from the last part containing the type (default: _)
  --extraSeq N          Sequence length added to both ends of target sequences
                        (usefull when analysing data by reads mapping)
                        that will be trimmed for Blast analysis.
  --minGeneCoverage N   Minimum percentage of target reference sequence
                        covered to consider a sequence to be present (value
                        between [0, 100]) (default when not using --org: 60)
  --minGeneIdentity N   Minimum percentage of identity of reference sequence
                        covered to consider a gene to be present (value
                        between [0, 100])
  --saveNewAllele       Save the new allele found for the selected type
                        (default: false)
  --typeNotInNew        Do not save the type of the selected sequence in the header
                        of the new allele (when writing uses the "--typeSeparator").
                        (default: false)
  --debug               Debug mode: do not remove temporary files
  --resume              Resume seq_typing.py assembly
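
The --typeSeparator option states that the type is the part of the sequence header after the last occurrence of the separator character. A small sketch of that rule, assuming a hypothetical helper name and hypothetical header strings (real headers in the provided reference files may carry additional fields):

```python
def type_from_header(header: str, type_separator: str = "_") -> str:
    """Return the last separator-delimited part of a sequence header.

    A leading ">" (fasta header marker) and surrounding whitespace are
    ignored. If the separator does not occur, the whole header is returned.
    """
    cleaned = header.strip()
    if cleaned.startswith(">"):
        cleaned = cleaned[1:]
    return cleaned.rsplit(type_separator, 1)[-1]
```

Only the last separator is used, so a header such as "gene_variant_O157" yields "O157" with the default separator.
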

Organisms typing

For the following organisms, reference sequences are provided.

  • Serotyping:
    • Escherichia coli
    • Staph agr (Staphylococcus aureus, agr typing)
    • Haemophilus influenzae
    • GBS sero (Group B Streptococcus, Streptococcus agalactiae, serotype)
    • Dengue virus (with genotype information)
  • Other types:
    • GBS pili (Group B Streptococcus, Streptococcus agalactiae, pili typing)
    • GBS surf (Group B Streptococcus, Streptococcus agalactiae, surface protein typing)
    • stx subtyping (Escherichia coli stx subtyping)

Use the --org option with one of these organism options.

Usage examples

Reads

Serotyping Haemophilus influenzae using the provided reference sequences (which use only one reference sequences file):

seq_typing.py reads --org Haemophilus influenzae \
                    --fastq sample_1.fq.gz sample_2.fq.gz \
                    --outdir sample_out/ \
                    --threads 2

Serotyping Escherichia coli using the provided reference sequences (which use two reference sequences files):

seq_typing.py reads --org Escherichia coli \
                    --fastq sample_1.fq.gz sample_2.fq.gz \
                    --outdir sample_out/ \
                    --threads 2

Type one sample with the user's own set of reference sequences (using, for example, single-end reads):

seq_typing.py reads --reference references/Ecoli/O_type.fasta references/Ecoli/H_type.fasta \
                    --fastq sample.fq.gz \
                    --outdir sample_out/ \
                    --threads 2

When the same reference sequences files are used for several read datasets, the Bowtie2 index files can be produced beforehand to speed up the analysis.
Example using the provided Dengue virus reference sequences (which use only one reference sequences file):

seq_typing.py index --org Dengue virus \
                    --outdir index_out/ \
                    --threads 2

# Run seq_typing using the created Bowtie2 index
seq_typing.py reads --reference index_out/1_GenotypesDENV_14-05-18.fasta \
                    --fastq sample_1.fq.gz sample_2.fq.gz \
                    --outdir sample_out/ \
                    --threads 2

The following examples show how to use the user's own reference sequences files. If many samples will be analysed using the same reference sequences file, running a preliminary seq_typing.py index step is advisable.

Run seq_typing without building the Bowtie2 index beforehand:

seq_typing.py reads --reference references/O_type.fasta references/H_type.fasta \
                    --fastq sample_1.fq.gz sample_2.fq.gz \
                    --outdir sample_out/ \
                    --threads 2

Run seq_typing with a preliminary step for Bowtie2 index production (useful when running multiple samples with the same reference sequences file):

# Preliminary step for Bowtie2 index construction.
seq_typing.py index --reference references/O_type.fasta references/H_type.fasta \
                    --outdir index_out/ \
                    --threads 2

# Run seq_typing using the created Bowtie2 index
seq_typing.py reads --reference index_out/O_type.fasta index_out/H_type.fasta \
                    --fastq sample_1.fq.gz sample_2.fq.gz \
                    --outdir sample_out/ \
                    --threads 2
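
When many samples share the same pre-built index, the reads command above can be generated per sample. The sketch below only builds the command lines (it does not run them) and uses flags documented in the reads module help; the function names, sample IDs and file names are hypothetical.

```python
import shlex
from typing import Dict, List, Sequence


def build_reads_commands(samples: Dict[str, Sequence[str]],
                         references: Sequence[str],
                         outdir_root: str = ".",
                         threads: int = 1) -> List[List[str]]:
    """Build one "seq_typing.py reads" command per sample.

    samples maps a sample ID to its one (single-end) or two (paired-end)
    fastq files; references lists the reference files (or index prefixes)
    in the order in which the types must be determined.
    """
    commands = []
    for sample_id, fastq_files in sorted(samples.items()):
        commands.append([
            "seq_typing.py", "reads",
            "--reference", *references,
            "--fastq", *fastq_files,
            "--sample", sample_id,
            "--outdir", f"{outdir_root.rstrip('/')}/{sample_id}/",
            "--threads", str(threads),
        ])
    return commands


def to_shell(command: Sequence[str]) -> str:
    """Render a command as a single, safely quoted shell line."""
    return " ".join(shlex.quote(part) for part in command)
```

Each returned list can be passed to `subprocess.run` as is, or printed with `to_shell` to produce a script of commands to review before running.
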
Assemblies

Type Dengue virus using assemblies with provided reference sequences (uses only one reference sequences file):

seq_typing.py assembly --org Dengue virus \
                       --fasta sample.fasta \
                       --outdir sample_out/ \
                       --threads 2

When the same database is used for several samples, a single Blast database should be produced first to speed up the analysis.
Example using the provided Escherichia coli reference sequences (which use two reference sequences files):

seq_typing.py blast --org Escherichia coli \
                    --outdir db_out/

# Run seq_typing using created database
seq_typing.py assembly --blast db_out/1_O_type.fasta db_out/2_H_type.fasta \
                       --type nucl \
                       --fasta sample.fasta \
                       --outdir sample_out/ \
                       --threads 2

For the user's own reference sequences files, seq_typing requires the construction of the reference database. seq_typing will construct the reference DB while analysing the sample's sequences. If many samples will be analysed using the same reference sequences file, running a preliminary seq_typing.py blast step is advisable.

Run seq_typing without previous construction of reference database:

seq_typing.py assembly --blast references/O_type.fasta references/H_type.fasta \
                       --type nucl \
                       --fasta sample.fasta \
                       --outdir sample_out/ \
                       --threads 2

Run seq_typing with a preliminary step for reference DB construction (useful when running multiple samples with the same reference sequences file):

# Preliminary step for reference DB construction.
seq_typing.py blast --fasta references/O_type.fasta references/H_type.fasta \
                    --type nucl \
                    --outdir db_out/

# Run seq_typing using created database
seq_typing.py assembly --blast db_out/O_type.fasta db_out/H_type.fasta \
                       --type nucl \
                       --fasta sample.fasta \
                       --outdir sample_out/ \
                       --threads 2
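
To decide whether the preliminary blast step has already been run for a given DB sequence file, one can look for companion files whose names extend the fasta name with ".n*" (such as the ".nhr" file mentioned in the --blast help text). This is a sketch under that assumption for nucleotide DBs only; `has_nucleotide_blast_db` is a hypothetical helper, and how seq_typing itself detects an existing DB is not shown in this document.

```python
import glob


def has_nucleotide_blast_db(db_sequence_file: str) -> bool:
    """Return True if files named "<db_sequence_file>.n*" exist.

    glob.escape protects special characters in the path itself so that
    only the trailing ".n*" part is treated as a pattern.
    """
    pattern = glob.escape(db_sequence_file) + ".n*"
    return len(glob.glob(pattern)) > 0
```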

E. coli stx subtyping

A specific script (ecoli_stx_subtyping.py) was created for E. coli stx subtyping in order to accommodate the possible existence of stx2 paralogs.
It works very similarly to seq_typing.py.

General usage

usage: ecoli_stx_subtyping.py [-h] [--version] {reads,assembly,blast} ...

Gets E. coli stx subtypes

optional arguments:
  -h, --help            Show this help message and exit
  --version             Version information

Subcommands:
  Valid subcommands

  {reads,assembly}
                        Additional help
    reads               reads --help
    assembly            assembly --help

ecoli_stx_subtyping Reads

Run ecoli_stx_subtyping.py using fastq files.

usage: ecoli_stx_subtyping.py reads [-h]
                                    -f /path/to/input/file.fq.gz ...
                                    -r /path/to/reference_sequence.fasta ... | --org stx subtyping
                                    [--stx2covered N] [--stx2identity N]
                                    [--sample sample-ID] [-o /path/to/output/directory/] [-j N]
                                    [--typeSeparator _]
                                    [--extraSeq N] [--minCovPresence N]
                                    [--minCovCall N] [--minGeneCoverage N]
                                    [--minDepthCoverage N] [--minGeneIdentity N]
                                    [--bowtieAlgo="--very-sensitive-local"] [--maxNumMapLoc N]
                                    [--doNotRemoveConsensus] [--saveNewAllele] [--typeNotInNew]
                                    [--debug] [--resume]

Run ecoli_stx_subtyping.py using fastq files

optional arguments:
  -h, --help            show this help message and exit

Required options:
  -f --fastq /path/to/input/file.fq.gz ...
                        Path to single OR paired-end fastq files. If two files
                        are passed, they will be assumed as being the paired
                        fastq files

Required one of the following options:
  -r --reference 1_virulence_db.stx1_subtyping.fasta 2_virulence_db.stx2_subtyping.fasta
                        Path to stx subtyping reference sequences (if not want to use
                        the ones provided together with seq_typing.py)
  --org stx subtyping   To use stx subtyping reference sequences provided
                        together with seq_typing.py

ecoli_stx_subtyping specific facultative options:
  --stx2covered N       Minimal percentage of sequence covered to consider
                        extra stx2 subtypes (value between [0, 100]) (default: 100)
  --stx2identity N      Minimal sequence identity to consider extra stx2
                        subtypes (value between [0, 100]) (default: 99.5)

General facultative options:
  -s --sample sample-ID
                        Sample name (default: sample)
  -o --outdir /path/to/output/directory/
                        Path to the directory where the information will be
                        stored (default: ./)
  -j --threads N        Number of threads to use (default: 1)
  --typeSeparator _     Last single character separating the general sequence
                        header from the last part containing the type (default: _)
  --extraSeq N          Sequence length added to both ends of target sequences
                        (usefull to improve reads mapping to the target one)
                        that will be trimmed in ReMatCh outputs (default: 0)
  --minCovPresence N    Reference position minimum coverage depth to consider
                        the position to be present in the sample (default: 5)
  --minCovCall N        Reference position minimum coverage depth to perform a
                        base call (default: 10)
  --minGeneCoverage N   Minimum percentage of target reference sequence
                        covered to consider a sequence to be present (value
                        between [0, 100]) (default: 60)
  --minDepthCoverage N  Minimum depth of coverage of target reference sequence
                        to consider a sequence to be present (default: 2)
  --minGeneIdentity N   Minimum percentage of identity of reference sequence
                        covered to consider a gene to be present (value
                        between [0, 100]). One INDEL will be considered as one
                        difference
  --bowtieAlgo="--very-sensitive-local"
                        Bowtie2 alignment mode. It can be an end-to-end
                        alignment (unclipped alignment) or local alignment
                        (soft clipped alignment). Also, can choose between
                        fast or sensitive alignments. Please check Bowtie2
                        manual for extra information:
                        http://bowtie-bio.sourceforge.net/bowtie2/index.shtml .
                        This option should be provided between quotes and
                        starting with an empty space
                        (like --bowtieAlgo " --very-fast") or using equal
                        sign (like --bowtieAlgo="--very-fast")
                        (default when not using --org: "--very-sensitive-local")
  --maxNumMapLoc N      Maximum number of locations to which a read can map
                        (sometimes useful when mapping against similar sequences)
                        (default when not using --org: 1)
  --saveNewAllele       Save the new allele found for the selected type
                        (default: false)
  --typeNotInNew        Do not save the type of the selected sequence in the header
                        of the new allele (when writing uses the "--typeSeparator").
                        (default: false)
  --doNotRemoveConsensus
                        Do not remove ReMatCh consensus sequences
  --debug               Debug mode: do not remove temporary files
  --resume              Resume seq_typing.py reads
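
The two stx2-specific options above set the thresholds for reporting additional stx2 subtypes. A minimal sketch of such a filter follows, assuming inclusive comparisons and a hypothetical input of (subtype, percentage covered, percentage identity) tuples; the actual logic in ecoli_stx_subtyping.py may differ.

```python
from typing import Iterable, List, Tuple


def extra_stx2_subtypes(candidates: Iterable[Tuple[str, float, float]],
                        stx2covered: float = 100.0,
                        stx2identity: float = 99.5) -> List[str]:
    """Return the stx2 subtypes meeting both thresholds.

    The default values mirror the documented defaults of --stx2covered
    and --stx2identity. Comparisons are assumed to be inclusive.
    """
    return [subtype
            for subtype, covered, identity in candidates
            if covered >= stx2covered and identity >= stx2identity]
```
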
ecoli_stx_subtyping Assembly

Run ecoli_stx_subtyping using a fasta file.

usage: ecoli_stx_subtyping.py assembly [-h]
                                       -f /path/to/query/assembly_file.fasta
                                       -b /path/to/Blast/db.sequences.file ... -t nucl | --org stx subtyping
                                       [--stx2covered N] [--stx2identity N]
                                       [--sample sample-ID] [-o /path/to/output/directory/] [-j N]
                                       [--typeSeparator _] [--extraSeq N] [--minGeneCoverage N]
                                       [--minGeneIdentity N] [--saveNewAllele] [--typeNotInNew]
                                       [--debug] [--resume]

Run ecoli_stx_subtyping.py using a fasta file. If running multiple samples using the
same DB sequence file, consider use first "seq_typing.py blast"
module.

optional arguments:
  -h, --help            show this help message and exit

Required options:
  -f /path/to/query/assembly_file.fasta, --fasta /path/to/query/assembly_file.fasta
                        Path to fasta file containing the query sequences from
                        which the stx subtypes should be assessed

Required one of the following options:
  -b --blast 1_virulence_db.stx1_subtyping.fasta 2_virulence_db.stx2_subtyping.fasta
                        Path to stx subtyping DB sequence file (if not want to use
                        the ones provided together with seq_typing.py).
                        If Blast DB was already produced (using "seq_typing.py blast"
                        module) only provide the file that do not end with ".n*"
                        something (do not use for example
                        /blast_db.sequences.fasta.nhr). If no Blast DB is
                        found for the DB sequence file, one will be created in
                        --outdir. If more than one Blast DB file is passed, a
                        type for each file will be determined. Give the files
                        in the same order that the type must be determined.
  --org stx subtyping   To use stx subtyping reference sequences provided
                        together with seq_typing.py

Required option for --blast:
  -t --type nucl        Blast DB type (available options: nucl, prot)

ecoli_stx_subtyping specific facultative options:
  --stx2covered 95      Minimal percentage of sequence covered to consider
                        extra stx2 subtypes (value between [0, 100]) (default: 100)
  --stx2identity 95     Minimal sequence identity to consider extra stx2
                        subtypes (value between [0, 100]) (default: 99.5)

General facultative options:
  -s --sample sample-ID
                        Sample name (default: sample)
  -o --outdir /path/to/output/directory/
                        Path to the directory where the information will be
                        stored (default: ./)
  -j --threads N        Number of threads to use (default: 1)
  --typeSeparator _     Last single character separating the general sequence
                        header from the last part containing the type (default: _)
  --extraSeq N          Sequence length added to both ends of target sequences
                        (usefull when analysing data by reads mapping)
                        that will be trimmed for Blast analysis.
  --minGeneCoverage N   Minimum percentage of target reference sequence
                        covered to consider a sequence to be present (value
                        between [0, 100]) (default: 60)
  --minGeneIdentity N   Minimum percentage of identity of reference sequence
                        covered to consider a gene to be present (value
                        between [0, 100])
  --saveNewAllele       Save the new allele found for the selected type
                        (default: false)
  --typeNotInNew        Do not save the type of the selected sequence in the header
                        of the new allele (when writing uses the "--typeSeparator").
                        (default: false)
  --debug               Debug mode: do not remove temporary files
  --resume              Resume seq_typing.py reads

Blast

To construct the stx subtyping Blast DB, use the seq_typing.py blast module:
seq_typing.py blast --org stx subtyping

Update stx references

Updated stx subtyping reference sequences can be obtained from the VirulenceFinder DB Bitbucket account. A specific script was created to get the most recent stx reference sequences.

usage: get_stx_db.py [-h] [--version]
                     [-o /path/to/output/directory/]

Gets STX sequences from virulencefinder_db to produce a STX subtyping DB.

optional arguments:
  -h, --help            show this help message and exit
  --version             Version information

General facultative options:
  -o --outdir /path/to/output/directory/
                        Path to the directory where the sequences will be
                        stored (default: ./)

Usage example

get_stx_db.py --outdir /path/output/directory/

Container

What is a (Docker) container?

"(...) is a tool that can package an application and its dependencies in a virtual container that can run on any Linux server," Lyman explained. "This helps enable flexibility and portability on where the application can run, whether on premise, public cloud, private cloud, bare metal, etc." From here.

Why are containers useful?

"(...) Docker containers technology allows you to write self-contained and truly reproducible computational pipelines." From here.

For detailed information on how to run seq_typing using containers, please check here.

Outputs

seq_typing.py

seq_typing.report.txt
Text file with the typing result. If a type cannot be determined for a given reference file, NT (for Non-Typeable) is returned for that file.

Example of E. coli serotyping (two reference files):
O157:H7
Example of Dengue virus serotyping and genotyping (only one reference file):
3-III
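
For downstream scripts, the report can be read back with a few lines of Python. This is only a sketch: it assumes that the results for the different reference files are joined with ":" (inferred from the O157:H7 example above, not from a documented format), and the function names are hypothetical.

```python
from pathlib import Path

NOT_TYPEABLE = "NT"  # value reported when no type could be determined


def parse_typing_result(text):
    """Split a typing result into one entry per reference file.

    Assumption: entries are joined with ':' as in 'O157:H7'.
    """
    return text.strip().split(":")


def read_typing_report(report_path):
    """Read seq_typing.report.txt and return the list of reported types."""
    return parse_typing_result(Path(report_path).read_text())


def untyped_positions(types):
    """Return the positions of the entries reported as NT."""
    return [i for i, t in enumerate(types) if t == NOT_TYPEABLE]


print(parse_typing_result("O157:H7"))      # two reference files
print(parse_typing_result("3-III"))        # one reference file
print(untyped_positions(["O157", "NT"]))   # second entry could not be typed
```

If a type name could itself contain ":", this split would be wrong, so check it against your own reference files first.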

seq_typing.report_types.tab
Tabular file with detailed results:

  • General fields
    • sequence_type: type of the result reported. Three values are possible: selected, for the reference sequences selected for the reported typing result; other_probable_type, for other reference sequences that could have been selected because they fulfill the selection thresholds; most_likely, for the most likely reference sequences when no reference sequence fulfills the selection thresholds.
    • reference_file: the reference file the sequence came from
    • type: the type associated with the reference sequence
    • sequence: name of the reference sequence
    • sequenced_covered: percentage of the reference sequence covered
    • coverage_depth: mean depth of coverage of the reference sequence over the positions present (1 if an assembly was used)
    • sequence_identity: percentage identity of the covered part of the reference sequence
  • Assembly fields (filled with NA if reads were used)
    • query: name of the provided sequence that produced a hit against the given reference sequence
    • q_start: hit starting position of the provided sequence
    • q_end: hit ending position of the provided sequence
    • s_start: hit starting position of the reference sequence
    • s_end: hit ending position of the reference sequence
    • evalue: hit E-value
    • gaps: number of gaps in the hit

Example of E. coli serotyping (two reference files) using reads:

#sequence_type reference_file type sequence sequenced_covered coverage_depth sequence_identity query q_start q_end s_start s_end evalue gaps
selected O_type.fasta O26 wzy_192_AF529080_O26 100.0 281.95405669599216 100.0 NA NA NA NA NA NA NA
selected H_type.fasta H11 fliC_269_AY337465_H11 99.4546693933197 51.76490747087046 99.86291980808772 NA NA NA NA NA NA NA
other_probable_type O_type.fasta O26 wzx_208_AF529080_O26 100.0 223.3072050673001 100.0 NA NA NA NA NA NA NA
other_probable_type H_type.fasta H11 fliC_276_AY337472_H11 98.84117246080436 37.52551724137931 99.86206896551724 NA NA NA NA NA NA NA

Example of Dengue virus serotyping and genotyping (only one reference file) using assembly:

#sequence_type reference_file type sequence sequenced_covered coverage_depth sequence_identity query q_start q_end s_start s_end evalue gaps
selected 1_GenotypesDENV_14-05-18.fasta 3-III gb:EU529683#...#Subtype:3-III#Host:Human#seqTyping_3-III 100.0 1 99.223 NODE_1_length_10319_cov_2021.782660 138 10307 10170 1 0.0 0
other_probable_type 1_GenotypesDENV_14-05-18.fasta 1-V gb:GQ868570#...#Subtype:1-V#Host:Human#seqTyping_1-V 100.0 1 99.479 NODE_2_length_10199_cov_229.028848 13 10188 1 10176 0.0 0
other_probable_type 1_GenotypesDENV_14-05-18.fasta 4-II gb:GQ868585#...#Subtype:4-II#Host:Human#seqTyping_4-II 100.0 1 99.38 NODE_4_length_10182_cov_29.854132 13 10173 1 10161 0.0 3
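
The tabular report can be loaded with Python's csv module, as in the sketch below. It assumes the file is tab-delimited with a single header line starting with "#" (as the examples above suggest). The function names are hypothetical, and the two rows are adapted from the E. coli example with values shortened for illustration.

```python
import csv
import io


def read_report_types(handle):
    """Return one dict per row of a seq_typing.report_types.tab file.

    Assumption: tab-delimited, with a header line whose first name
    starts with '#'.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = [name.lstrip("#") for name in next(reader)]
    return [dict(zip(header, row)) for row in reader if row]


def group_by_sequence_type(rows):
    """Group rows by the sequence_type field (selected, other_probable_type, most_likely)."""
    groups = {}
    for row in rows:
        groups.setdefault(row["sequence_type"], []).append(row)
    return groups


# Illustrative data only: values shortened from the E. coli example
columns = ["#sequence_type", "reference_file", "type", "sequence",
           "sequenced_covered", "coverage_depth", "sequence_identity",
           "query", "q_start", "q_end", "s_start", "s_end", "evalue", "gaps"]
lines = [
    columns,
    ["selected", "O_type.fasta", "O26", "wzy_192_AF529080_O26",
     "100.0", "281.95", "100.0"] + ["NA"] * 7,
    ["other_probable_type", "H_type.fasta", "H11", "fliC_276_AY337472_H11",
     "98.84", "37.53", "99.86"] + ["NA"] * 7,
]
example = "\n".join("\t".join(fields) for fields in lines) + "\n"

rows = read_report_types(io.StringIO(example))
groups = group_by_sequence_type(rows)
for row in groups.get("selected", []):
    # Numeric fields are read as text; convert them where needed
    print(row["reference_file"], row["type"], float(row["sequenced_covered"]))
```

With a real file, pass an open file handle instead of the io.StringIO object. The assembly fields contain the string NA when reads were used, so they should be checked before any numeric conversion.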

new_allele/
Folder containing one subfolder for each reference file in which a new allele was found, named after that reference file. The novel allele is stored inside a file named after the selected type. If the entire sequence of the new allele cannot be retrieved, the "_partial" string is added to the header. The header of the sequence contains the sample name (the default is "sample") and the selected type, separated by the --typeSeparator character (this behaviour can be deactivated with the --typeNotInNew option).
When extra/flanking sequences are used around the target sequence, and their full length could be retrieved, a new file ending with ".extra_seq.fasta" is created (not yet implemented for reads).

Example
For Dengue virus serotyping and genotyping:

/outdir/
        seq_typing.report.txt
        seq_typing.report_types.tab
        
        new_allele/
                   1_GenotypesDENV_14-05-18.fasta/
                                                  3-III.fasta
                                                             >sample_partial_3-III
                                                             ATGTAAGCATGAGGTCACCAT ...
                                                  3-III.extra_seq.fasta
                                                             >sample_partial_3-III
                                                             CCCCCTTTTTATGTAAGCATGAGGTCACCAT ...

        run.20190131-162341.log
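
The layout above can be walked with pathlib to collect the new alleles of a run. This is a sketch under stated assumptions: it relies only on the folder structure shown in the example and on the "_partial" marker in the header, and the function names are hypothetical.

```python
from pathlib import Path


def fasta_headers(text):
    """Return the header lines of a FASTA text, without the leading '>'."""
    return [line[1:].strip() for line in text.splitlines() if line.startswith(">")]


def is_partial(header):
    """True if the header carries the '_partial' marker."""
    return "_partial" in header


def list_new_alleles(outdir):
    """Return (reference_file, fasta_name, header, partial) for each new allele.

    Assumption: files are stored as <outdir>/new_allele/<reference file>/<type>.fasta,
    as in the example tree.
    """
    found = []
    for fasta in sorted(Path(outdir, "new_allele").glob("*/*.fasta")):
        if not fasta.is_file():
            continue  # the subfolders themselves are named like FASTA files
        for header in fasta_headers(fasta.read_text()):
            found.append((fasta.parent.name, fasta.name, header, is_partial(header)))
    return found


print(fasta_headers(">sample_partial_3-III\nATGTAAGCATGAGGTCACCAT\n"))
print(is_partial("sample_partial_3-III"))
```

Files ending with ".extra_seq.fasta" are also matched by this pattern; filter on the file name if only the target sequences are wanted.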

run.*.log
Running log file.

ecoli_stx_subtyping.py

seq_typing.ecoli_stx_subtyping.txt
Text file with the typing result. Secondary results for stx2 genes are given in parentheses.
Example:
stx1a:stx2c(stx2d)
NOTE: For the stx2 gene, the stx2a, stx2c and stx2d variants are grouped together as stx2acd, because these subtypes are the most potent in causing HUS and are difficult to separate from each other with the methods currently in use.
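
The result string can be split into primary and secondary results as sketched below. The format is inferred from the single example above (stx1 and stx2 results separated by ":", secondary results in parentheses) and is an assumption, not a documented specification. The function name is hypothetical, and the content of the parentheses is returned as-is because the separator between several secondary results is not described here.

```python
import re

# One part of the result: a main subtype, optionally followed by "(secondary)"
_PART = re.compile(r"^(?P<main>[^()]*)(?:\((?P<secondary>[^()]*)\))?$")


def parse_stx_result(result):
    """Return a list of (main, secondary) tuples, one per ':'-separated part.

    secondary is None when no parenthesised result is present.
    """
    parsed = []
    for part in result.strip().split(":"):
        match = _PART.match(part)
        if match is None:
            raise ValueError(f"Unexpected result format: {part!r}")
        parsed.append((match.group("main"), match.group("secondary")))
    return parsed


print(parse_stx_result("stx1a:stx2c(stx2d)"))
```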

seq_typing.ecoli_stx_subtyping.report_types.tab
Tabular file with detailed results, in the same format as the seq_typing.report_types.tab file described above.
Example (using reads):

#sequence_type reference_file type sequence sequenced_covered coverage_depth sequence_identity query q_start q_end s_start s_end evalue gaps
selected 1_virulence_db.stx1_subtyping.fasta stx1a stx1A:15:AF461168:A:seqTyping_stx1a 100.0 65.37447257383967 100.0 NA NA NA NA NA NA NA
selected 2_virulence_db.stx2_subtyping.fasta stx2c stx2B:15:AB071845:C:seqTyping_stx2c 100.0 19.377777777777776 100.0 NA NA NA NA NA NA NA
other_probable_type 1_virulence_db.stx1_subtyping.fasta stx1c stx1B:11:AB071620:C:seqTyping_stx1c 100.0 21.64814814814815 99.25925925925925 NA NA NA NA NA NA NA
other_probable_type 1_virulence_db.stx1_subtyping.fasta stx1a stx1B:14:AM230663:A:seqTyping_stx1a 100.0 45.06666666666667 100.0 NA NA NA NA NA NA NA
other_probable_type 2_virulence_db.stx2_subtyping.fasta stx2c stx2B:10:EF441604:C:seqTyping_stx2c 100.0 17.2 99.25925925925925 NA NA NA NA NA NA NA
other_probable_type 2_virulence_db.stx2_subtyping.fasta stx2d stx2B:11:FM998840:D:seqTyping_stx2d 100.0 9.996296296296297 99.62962962962963 NA NA NA NA NA NA NA

new_allele/
Folder containing one subfolder for each reference file in which a new allele was found, named after that reference file. The novel allele is stored inside a file named after the selected type. If the entire sequence of the new allele cannot be retrieved, the "_partial" string is added to the header. The header of the sequence contains the sample name (the default is "sample") and the selected type, separated by the --typeSeparator character (this behaviour can be deactivated with the --typeNotInNew option).
When extra/flanking sequences are used around the target sequence, and their full length could be retrieved, a new file ending with ".extra_seq.fasta" is created (not yet implemented for reads).

Example:

/outdir/
        seq_typing.ecoli_stx_subtyping.txt
        seq_typing.ecoli_stx_subtyping.report_types.tab
        
        new_allele/
                   2_virulence_db.stx2_subtyping.fasta/
                                                       stx2c.fasta
                                                                  >sample_stx2c
                                                                  ATGTAAGCATGAGGTCACCAT ...
                                                       stx2c.extra_seq.fasta
                                                                  >sample_stx2c
                                                                  CCCCCTTTTTATGTAAGCATGAGGTCACCAT ...
                   1_virulence_db.stx1_subtyping.fasta/
                                                       stx1a.extra_seq.fasta
                                                                  >sample_partial
                                                                  CCCCCTTTTTATGTAAGCATGAGGTCACCAT ...

        run.20190131-162341.log

run.*.log
Running log file.

Citation

MP Machado, J Halkilahti, I Mendes, M Pinto, E Lizarazo, JP Gomes, M Ramirez, M Rossi, JA Carrico. seq_typing GitHub https://github.com/B-UMMI/seq_typing

Contact

Miguel Machado
mpmachado@medicina.ulisboa.pt