Code repository for the support project P_Snoeijs-Leijonmalm_2205.
This workflow generates taxonomic and genomic profiles of metagenomic and metatranscriptomic samples. It consists of two main tracks: genome mapping and marker gene taxonomic assignments.
In the genome mapping track, preprocessed reads are mapped against a set of target species genomes, followed by a filtering step using nucleotide BLAST searches against the `nt` database. Read counts are collated and normalized to counts per million (CPM) reads.
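As an illustration of the CPM normalization (a conceptual sketch, not the workflow's actual code; the two-column counts file is hypothetical), each count is divided by the sample's total mapped read count and multiplied by one million:

```shell
# Hypothetical input: tab-separated "name<TAB>read_count" lines.
printf 'speciesA\t900\nspeciesB\t100\n' > counts.tsv
# Two passes over the file: first sum the counts, then scale each one.
awk 'NR==FNR {total += $2; next}
     {printf "%s\t%.1f\n", $1, $2 / total * 1e6}' counts.tsv counts.tsv
```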
In the marker gene track, preprocessed reads are first mapped against a database of cytochrome oxidase subunit 1 (COI) reference sequences, followed by filtering away reads with a low percent identity match or a short alignment length. In addition, a filtering step is performed to remove potential prokaryotic reads before the remaining reads are classified taxonomically using the sintax algorithm (implemented in `vsearch`) and a COI reference database.
- Clone the repository:

  ```
  git clone git@github.com:NBISweden/LTS-BiodiversityAnalysisCAO.git
  ```

- Install the software environment:

  ```
  conda env create -f environment.yml
  ```

- Activate the environment:

  ```
  conda activate biodivcao
  ```
The workflow behaviour can be set using an external config file in YAML format. A default config file is supplied under `config/config.yml`.
The two primary input parameters for the workflow are:

- a samples file listing the samples to be analysed, and
- a `mappings:` entry in the config file specifying which reference databases (genomes and/or marker genes) to use for mapping and taxonomic profiling.
The samples to be used in the workflow should be specified in a comma-separated file that you point to with the config parameter `sample_list` in the config file, e.g.:

```yaml
sample_list: "mysample_list.csv"
```
The samples file should have the following format:
```
sample_name,fwd_libs,rev_libs,lib_type
mysample1,path/to/mysample1_R1.fastq.gz,/path/to/mysample1_R2.fastq.gz,DNA
mysample2,path/to/mysample2_R1.fastq.gz,/path/to/mysample2_R2.fastq.gz,RNA
```
The `sample_name` column specifies the sample names used in the workflow; these do not have to match the file names of the actual fastq files. The `fwd_libs` and `rev_libs` columns should contain the paths to the fastq files for the forward and reverse reads of each sample, respectively. If there are several fastq files for a sample you can specify these by separating them with a semicolon (`;`) in the `fwd_libs` and `rev_libs` columns, e.g.:
```
sample_name,fwd_libs,rev_libs,lib_type
mysample3,path/to/mysample3-1_R1.fastq.gz;path/to/mysample3-2_R1.fastq.gz,/path/to/mysample3-1_R2.fastq.gz;/path/to/mysample3-2_R2.fastq.gz,DNA
```
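To list the individual files in such a semicolon-separated field yourself (a quick sketch using standard shell tools, not part of the workflow):

```shell
# Print each fastq path in a semicolon-separated field on its own line.
fwd_libs="path/to/mysample3-1_R1.fastq.gz;path/to/mysample3-2_R1.fastq.gz"
echo "$fwd_libs" | tr ';' '\n'
```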
The `lib_type` column specifies whether the sample contains reads from `DNA` or `RNA` sequencing. Currently, the workflow only maps `DNA` reads against the reference genomes database in the Target species track.
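A quick sanity check of the samples file can catch formatting mistakes early. This is a hypothetical helper, not part of the workflow; it assumes the `mysample_list.csv` file from above:

```shell
# Hypothetical check: every data row must have exactly four comma-separated
# fields and a lib_type of DNA or RNA; prints offending rows and exits
# non-zero if any are found.
awk -F',' '
  NR > 1 && (NF != 4 || ($4 != "DNA" && $4 != "RNA")) { print "bad row " NR ": " $0; bad = 1 }
  END { exit bad }
' mysample_list.csv && echo "samples file OK"
```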
The config entry `mappings:` specifies which reference databases to use. This config entry can have two types of nested entries: `genomes` or `marker_genes`. The `genomes` entry specifies the reference genomes database to use in the Target species track and the `marker_genes` entry specifies the reference marker genes database to use in the Marker gene track. An example is shown below:
```yaml
mappings:
  genomes:
    target_species:
      fasta: "resources/genome_index/all_verts_whuman/target_species.fasta"
      taxon_table: "resources/genome_index/all_verts_whuman/taxon_table.csv"
      to_include: "resources/genome_index/all_verts_whuman/to_include.bed"
      mapq: 30
  marker_genes:
    coinr:
      fasta: "resources/coinr/COInr.fasta"
      ani_cutoff: 90
      min_len: 80
```
In this example, `target_species` is the name used for the genomes database. It has four parameters: `fasta`, `taxon_table`, `to_include` and `mapq`. The `fasta` parameter specifies the path to the fasta file with the reference genomes. The `taxon_table` parameter specifies the path to the taxon table file with contigs and their taxonomic information for genomes in the database. The `to_include` parameter specifies the path to the bed file with the contigs of the actual target species included in the reference genomes database. Finally, the `mapq` parameter is the minimum mapping quality required for reads at the mapping stage of the workflow. For more details see Target species track.
For the marker gene track, the example above specifies a marker gene database named `coinr`. Each database entry has three parameters: `fasta`, `ani_cutoff` and `min_len`. The `fasta` parameter specifies the path to the fasta file with the reference marker genes. The `ani_cutoff` parameter specifies the minimum average nucleotide identity (ANI) between a read and a reference marker gene to be kept for downstream analysis. The `min_len` parameter specifies the minimum alignment length between a read and a reference marker gene to be kept for downstream analysis. For more details see Marker gene track.
Note: You can use any name for the databases. The names `target_species` and `coinr` are just examples. You can also include any number of database entries under the `genomes:` and `marker_genes:` entries. The workflow will run the mapping and taxonomic profiling for each database entry.
The part of the workflow that maps reads to genomes of target species (e.g. fish) requires a genomes database (for initial mapping) and a nucleotide BLAST database (for filtering).
Target species genomes database

To identify reads originating from target species of interest (e.g. fish) you must supply a fasta file with these genomes. The relevant config parameters are:

```yaml
mappings:
  genomes:
    target_species:
      fasta: "resources/genome_index/all_verts_whuman/target_species.fasta"
      taxon_table: "resources/genome_index/all_verts_whuman/taxon_table.csv"
      to_include: "resources/genome_index/all_verts_whuman/to_include.bed"
```
The `fasta` and `taxon_table` parameters specify paths to the fasta file and the seqid->taxid map file, respectively. The `taxon_table` file should be a comma-separated file with sequence ids (matching the ids in the fasta headers) in the first column, species names in the second column, species taxon ids in the third column and family taxon ids in the fourth column. Examples of the two file contents are shown below:
`fasta`:

```
>AESE010000001.1 Leucoraja erinacea LER_WGS_1_CONTIG_1, whole genome shotgun sequence
ACCATCATCAGCGATGATAGGGTTTACTTTAGAAATTATTCTGAGCAGAAACTTGAAGCTAAACCTCTGAAAATTTGCAA
ATCCTTTTGTATCTTTTGCAGCTGAGTTTAGAATTGTACCAAAGCACTCTAATCTTTTTTCGCAATTGATAAAACTGATG
CTTCTGTAGCACGTGGAGTAAACACTTCCTCTGAATTAAAGTCCTGTTGAAATGGTACAGAAACACTGCCAGAAGAACAC
TTGGCACCTCAATCACAATATTCTTGTAGCACTTGGTGCGTTTA
>CAKAOH010043426.1 Sebastiscus tertius genome assembly, contig: S_tertius_139529_jcf7180000782172, whole genome shotgun sequence
tcgggagcaatttggggttaagtgtcttgctcaaggacacatcgacatgtgaccggagcaaagggatcgaaccaccgacc
ttccagttgatggacgacctgctctacctacctctgagccacagtcgccCCTGAGGTACCTGAGGTAGATGAGGAtggag
gtgggggtggaggaaggGTGGAGACCTAGAGGGTGATGGGGGTGAGAGGGAGGtcgagagagaggaagggttaGTAAATg
gttattcacacacacattcacacactgatggcagaggctgccatgcaaggtgccaacctgcccat
```
`taxon_table`:

```
CM054007.1,Leucoraja erinacea,7782,30475
CAKAOH010000016.1,Sebastiscus tertius,1472224,274692
```
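To verify that the two files agree, you can cross-check their sequence ids (a hypothetical helper, not part of the workflow; the file names are examples):

```shell
# Hypothetical consistency check: every sequence id in the fasta (the first
# whitespace-separated token after '>') should appear in column 1 of the
# taxon table. Any ids printed by the last command are missing from the table.
grep '^>' target_species.fasta | sed 's/^>//; s/ .*//' | sort > fasta_ids.txt
cut -d',' -f1 taxon_table.csv | sort > table_ids.txt
comm -23 fasta_ids.txt table_ids.txt
```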
The `to_include` bed file lists all of the contigs that are of interest for downstream analysis. It can include an entry for every sequence in the target species genome database or just a subset. Here, it is used to filter out alignments to human and herring: the bed file contains all of the contigs in the target species genomes database except those from human and herring.
`to_include`:

```
KB228878.1 0 17703537
KB228879.1 0 17817800
KB228880.1 0 9835568
```
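One way to generate such a bed file is from a `samtools faidx` index of the genomes fasta, where column 1 is the contig name and column 2 its length (a sketch; the contig names to exclude here are hypothetical):

```shell
# Build whole-contig bed entries ("name<TAB>0<TAB>length") from a .fai index,
# dropping the (hypothetical) human and herring contigs.
grep -v -w -e 'human_contig_1' -e 'herring_contig_1' target_species.fasta.fai \
  | awk -v OFS='\t' '{print $1, 0, $2}' > to_include.bed
```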
`mapq`: The `mapq` parameter specifies the minimum mapping quality required for reads to be kept for downstream analysis.
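Conceptually, this filter keeps alignments whose MAPQ value (the fifth column of a SAM record) is at least the configured threshold; the workflow's actual implementation may differ. An illustration on SAM-formatted text:

```shell
# Keep SAM header lines (starting with '@') and alignments with MAPQ >= 30.
awk '/^@/ || $5 >= 30' alignments.sam > filtered.sam
```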
The target species genome database can be built in any way you like, as long as it conforms to the file format outlined here. A GitHub repository with code to generate such a database can be found here and you're free to use that repo as a starting point.
BLAST database

Reads mapped to the target species database are used as queries in a `blastn` search against a BLAST-formatted database and the results are used as an additional filtering step to keep only reads with a high likelihood of originating from fish genomes. Specifically, in this step reads must have a match to sequences in the BLAST database originating from ray-finned (Actinopterygii) or cartilaginous (Chondrichthyes) fish.
The relevant config parameters are:
```yaml
blast:
  db: "/sw/data/blast_databases/nt"
  evalue: 1e-30
  max_target_seqs: 200
```
The `db` parameter points to an 'alias' of the BLAST database (in this case the actual database consists of several files `nt.*.nhd`, `nt.*.nhi`, `nt.*.nhr`, `nt.*.nin`, `nt.*.nnd`, `nt.*.nni`, `nt.*.nog` and `nt.*.nsq` under `/sw/data/blast_databases/`).

The `evalue` parameter specifies the maximum e-value of hits to keep in the `blastn` search.

The `max_target_seqs` parameter specifies the maximum number of hits to keep for each query.
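The e-value threshold is applied by `blastn` itself, but to see what it means you can filter pre-computed tabular results the same way. A hedged sketch, assuming tabular (`-outfmt 6`) output, where column 11 of the default 12-column format is the e-value:

```shell
# Hypothetical: keep tabular BLAST hits with e-value <= 1e-30.
# The "+ 0" forces awk to compare the column numerically.
awk '$11 + 0 <= 1e-30' blast_results.tsv > filtered_hits.tsv
```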
The part of the workflow that maps reads against a marker gene database (e.g. the cytochrome c oxidase subunit I (COI) gene) requires:
- a fasta file of reference marker genes (for initial mapping),
- a KrakenUniq database (for filtering) and
- a SINTAX-formatted fasta file (for taxonomic assignments)
Marker gene fasta file

The marker gene track of the workflow requires a fasta file with reference sequences. The relevant config parameters are:

```yaml
mappings:
  marker_genes:
    coinr:
      fasta: "resources/coinr/COInr.fasta"
      ani_cutoff: 90
      min_len: 80
```
The `fasta` parameter specifies the path to the fasta file with the reference marker genes. The marker gene track of the workflow starts with mapping preprocessed reads against sequences in this fasta file. The `ani_cutoff` and `min_len` parameters specify the minimum average nucleotide identity (ANI) and minimum alignment length between a read and a reference marker gene to be kept for downstream analysis. Reads passing these filters are then passed to KrakenUniq to filter out reads that are likely to originate from contaminants. This second filtering can be modified by the `krakenuniq:` config entry (see below). The remaining reads are then classified taxonomically using the sintax algorithm (implemented in `vsearch`).
KrakenUniq database
KrakenUniq is used in the workflow primarily to filter sequences that are likely to originate from contaminants such as prokaryotes or humans. The following config parameters control the behaviour of this filtering:
```yaml
krakenuniq:
  db: "/sw/data/KrakenUniq_data/prebuilt/krakendb-2022-06-16-STANDARD"
  exclude: [2,2157,9604]
```
The `db` parameter specifies the path to a KrakenUniq database. You can either download prebuilt databases from here or build a database on your own (see the KrakenUniq and Kraken2 manuals). The path specified with `db` must contain the files `database.idx`, `database.kdb` and `database.kdb.counts`.
The `exclude` parameter is a list of taxonomic ids to exclude from downstream analyses. These are passed to `krakenuniq-extract-reads` and only reads not assigned to these taxonomic ids (or their children) are kept. By default, the taxa excluded are Bacteria (taxid: 2), Archaea (taxid: 2157) and Hominidae (taxid: 9604).
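For illustration only: the real filtering is done by `krakenuniq-extract-reads`, which also excludes all child taxa of the listed ids. This awk sketch drops only exact matches from kraken-style per-read output, where column 3 is the assigned taxid:

```shell
# Hypothetical sketch: remove reads assigned exactly to taxids 2, 2157 or 9604.
# Unlike krakenuniq-extract-reads, this does NOT exclude child taxa.
awk 'BEGIN {excl[2] = excl[2157] = excl[9604] = 1} !($3 in excl)' \
    kraken_output.tsv > kept_reads.tsv
```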
SINTAX database

In the marker gene mapping track of the workflow, reads are classified taxonomically using the `sintax` algorithm implemented in vsearch. This requires that a fasta file is supplied and formatted in a specific way. You may use the same fasta file as in the marker gene mapping step (described above) as long as it conforms to the format required by `sintax` (see below).
The workflow includes steps to download a COI reference database using a non-redundant database created using the mkCOInr tool. To use this database, simply set the `fasta:` parameter under your marker gene database entry to `resources/coinr/COInr.sintax.fasta`. Also use the same setting for the `db:` parameter under the sintax config entry (see below). The workflow will then automatically download the database and format it for use with `sintax`.
If you want to use a custom database for the sintax classifications the fasta file should have taxonomic information in sequence headers formatted as in the example below:
```
>KJ592702_1;tax=k:Eukaryota,p:Rhodophyta,c:Florideophyceae,o:Hapalidiales,f:Mesophyllumaceae,g:Mesophyllum,s:Mesophyllum_lichenoides
```
This format is required by the sintax algorithm and allows sequences to be classified at different ranks (here the ranks kingdom (k), phylum (p) etc. are used).
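A quick way to inspect the rank prefixes in such a header (a sketch using standard shell tools, not part of the workflow):

```shell
# Extract the tax= string from a sintax-style header and list its rank
# prefixes (k, p, c, o, f, g, s).
header='>KJ592702_1;tax=k:Eukaryota,p:Rhodophyta,c:Florideophyceae,o:Hapalidiales,f:Mesophyllumaceae,g:Mesophyllum,s:Mesophyllum_lichenoides'
echo "$header" | sed 's/.*tax=//' | tr ',' '\n' | cut -d: -f1 | tr '\n' ' '
```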
The behaviour of the `sintax` algorithm can be controlled using the following config entry:

```yaml
sintax:
  cutoff: 0.8
  ranks: ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
  db: "resources/coinr/COInr.sintax.fasta"
```
The `cutoff` parameter specifies the minimum confidence cutoff used by sintax to assign taxonomy to sequences (see the original usearch docs).
The `ranks` parameter specifies which ranks to use in the taxonomic classification output. Make sure to use the same ranks as those given in the fasta headers of the SINTAX fasta file.
When you've set up the required resources you're ready to start the workflow with the command:
```
snakemake --configfile yourconfigfile.yml -j 1
```
This will start the workflow using settings supplied in `yourconfigfile.yml` (update this to match your setup) and using at most one CPU core.
The workflow comes with two configuration profiles, `local` and `slurm`, for running locally (e.g. on your laptop) or on a compute cluster with the SLURM workload manager, respectively. You specify which of these profiles to use by supplying either `--profile local` or `--profile slurm` to the command line call, e.g.:
```
snakemake --configfile yourconfigfile.yml --profile slurm
```
We strongly suggest that you use these profiles as they contain several useful Snakemake command line settings.
In order to use the `slurm` profile you should update the file `slurm/config.yaml` with your SLURM account id. Open the file in your favourite text editor and make sure that the `default-resources` entry has the correct SLURM account id:

```yaml
default-resources: "slurm_account=naiss2023-1-000"
```
Now you can run the workflow as:
```
snakemake --profile slurm
```