NGS-4-ECOPROD wrapper/pipeline collection

v0.1

The NGS-4-ECOPROD wrapper/pipeline collection is primarily dedicated to metagenome data processing and analysis. The installation script sets up a Miniconda folder and Conda environments in which all necessary tools are installed; it does not interfere with the Linux system. ngs4ecoprod aims to simplify the often complex tasks associated with this kind of data (especially for beginners) by automating key steps in the processing of raw sequence data into human-interpretable data. The pipeline provides basic analysis scripts and publicly available tools. The overarching goal is to save time by streamlining data workflows, allowing researchers to devote more time to the substantive biological analysis.

This repository is developed in the framework of NGS-4-ECOPROD at the University of Göttingen. The pipeline aims to automate and simplify metagenomic workflows, including 16S/18S rRNA gene amplicon analysis, metagenomes derived from Illumina paired-end sequencing, and metagenomes derived from Nanopore long-reads.

The pipeline was tested under Linux (Ubuntu 20.04 & 22.04 LTS) and is encapsulated in a Miniconda environment so that it does not affect the Linux operating system it is installed on.

Installation

NOTE: After the recent updates of miniconda and mamba (12th of July), installation seems to work only on recent Linux distributions (2020 and up).

You can either install the pipeline as a user into your home directory or, as server admin, into a shared location such as /opt and make it accessible to every user via an alias in each user's .bashrc or the corresponding file for their shell (e.g., zsh, ksh, tcsh).

The current disk space requirement for the installation is approximately 23 GB without the databases. However, when including all databases needed, the total disk space needed increases to roughly 1 TB (July 2023), with SILVA requiring an additional 1 GB, kraken2 database (nt) requiring 676 GB, kaiju database (nr) requiring 187 GB, and GTDB-Tk & PLSDB requiring 80 GB.
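
Before installing, it can help to confirm that the target disk has enough free space. The following is a minimal sketch and not part of ngs4ecoprod: the helper name `has_free_space` is hypothetical, and the sizes are simply the July 2023 figures quoted above.

```shell
# Sum of the sizes quoted above (GB): installation, SILVA, kraken2 nt,
# kaiju nr, GTDB-Tk & PLSDB.
required_gb=$((23 + 1 + 676 + 187 + 80))   # 967 GB, i.e. roughly 1 TB

# has_free_space DIR GB: succeed if the file system holding DIR has at least
# GB gigabytes available (hypothetical helper, not shipped with ngs4ecoprod).
# df -Pk reports the available space in 1 KiB blocks in column 4.
has_free_space() {
    hfs_avail_kb=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
    [ "${hfs_avail_kb:-0}" -ge $(( $2 * 1024 * 1024 )) ]
}

if has_free_space "$HOME" "$required_gb"; then
    echo "Enough space for ngs4ecoprod and all databases"
else
    echo "Less than ${required_gb} GB free in $HOME"
fi
```

If the databases will live on a different disk than the installation, run the check per mount point with the corresponding share of the total.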

User installation, local

# 1. Download installation script
wget https://raw.githubusercontent.com/dschnei1/ngs4ecoprod/main/install_ngs4ecoprod

# 2. Install ngs4ecoprod (in this example into your home ~/ngs4ecoprod)
bash install_ngs4ecoprod -i ~/ngs4ecoprod

# 3. Restart terminal or type
source ~/.bashrc

# 4. Activate environment
activate_ngs4ecoprod

# 5. Remove installer
rm -f install_ngs4ecoprod

Admin installation, system wide

# 1. Download bash installation script
wget https://raw.githubusercontent.com/dschnei1/ngs4ecoprod/main/install_ngs4ecoprod

# 2. Install ngs4ecoprod (in this example into /opt/ngs4ecoprod)
sudo bash install_ngs4ecoprod -i /opt/ngs4ecoprod

# 3. To activate the environment ensure every user has the following alias in .bashrc: 
# alias activate_ngs4ecoprod='source /opt/ngs4ecoprod/bin/activate ngs-4-ecoprod'
# Example command
echo "alias activate_ngs4ecoprod='source /opt/ngs4ecoprod/bin/activate ngs-4-ecoprod'" >> ~/.bashrc

# 4. Restart terminal or type
source ~/.bashrc

# 5. Activate environment
activate_ngs4ecoprod

# 6. Remove installer
rm -f install_ngs4ecoprod
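
Running the `echo ... >> ~/.bashrc` command from step 3 more than once appends the alias repeatedly. As a hedged sketch (the helper `add_alias_once` is hypothetical and not provided by ngs4ecoprod), the alias can be appended only when it is not already present:

```shell
# add_alias_once FILE: append the activation alias to FILE unless an
# activate_ngs4ecoprod alias is already defined there (hypothetical helper).
add_alias_once() {
    aao_line="alias activate_ngs4ecoprod='source /opt/ngs4ecoprod/bin/activate ngs-4-ecoprod'"
    if ! grep -q '^alias activate_ngs4ecoprod=' "$1" 2>/dev/null; then
        printf '%s\n' "$aao_line" >> "$1"
    fi
}

# Usage (commented out so the sketch does not modify your shell config):
# add_alias_once ~/.bashrc
```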

Note: Before first use, please run GNU parallel once after activating your conda environment and accept its citation notice (you agree to cite GNU parallel or pay for it):

parallel --citation

Here is a list of all software installed by install_ngs4ecoprod via conda. In addition, NanoPhase, metaWRAP, GTDB-Tk, BLCA, and sra-toolkit are installed alongside.

Install databases

SILVA database for ngs4_16S, ngs4_16S_blca & ngs4_18S

ngs4_download_silva_db -i ~/ngs4ecoprod/ngs4ecoprod/db

Download GTDB-tk and PLSDB databases for ngs4_np_assembly

ngs4_download_nanophase -i ~/ngs4ecoprod/ngs4ecoprod/db

Note: The GTDB-Tk database download is very slow; a mirror of the database will be available soon.

Download precompiled kraken2 (nt) and kaiju (nr) databases for ngs4_tax & ngs4_np_tax

ngs4_download_tax_k2k -i ~/ngs4ecoprod/ngs4ecoprod/db

Note: The download is performed in the current directory. Make sure you have ~580 GB of free disk space (+676 GB if you install/extract to the same disk) before starting the download.

Uninstall NGS-4-ECOPROD

To remove the pipeline do the following (adapt .bashrc to your shell)

# 1. Remove conda folder
rm -rf ~/ngs4ecoprod

# 2. Remove alias from .bashrc
sed -i -E "/^alias activate_ngs4ecoprod=.*/d" ~/.bashrc

Usage

So far the repository contains the following data processing scripts:

Amplicon analysis pipeline (16S rRNA gene: Bacteria and Archaea; 18S rRNA gene: Eukaryota):
ngs4_16S
ngs4_16S_blca
ngs4_16S_blca_ncbi
ngs4_18S
ngs4_18S_blca

Nanopore metagenome analysis (under active development!):
ngs4_np_qf
ngs4_np_tax
ngs4_np_assembly

Illumina metagenome analysis (under active development!):
ngs4_qf
ngs4_tax

1. Amplicon analysis pipeline

→ 16S rRNA genes (bacteria and archaea)

Bacterial/archaeal 16S rRNA gene amplicon data processing pipeline

ngs4_16S is a 16S rRNA gene amplicon analysis pipeline providing processing of raw reads to amplicon sequence variant (ASV) sequences, read count table and a phylogenetic tree of ASV sequences. The pipeline uses the tools fastp, cutadapt, vsearch, mafft, FastTree, NCBI blast, BLCA and R.

In principle the following steps are performed by the pipeline:

all raw reads are quality filtered
all primer sequences are removed
paired-end reads are stitched together
reads are sorted by length
reads are dereplicated (by default sorted by decreasing abundance)
reads are denoised, see UNOISE3
de novo chimera removal
reference-based chimera removal (reference SILVA NR99 138.1)
final set of ASVs
quality filtered reads are mapped back against the ASVs
blastn against SILVA
ASV count table generation
phylogenetic tree from ASVs
data formatting and curation (minimum of 85% query coverage + lineage correction for 16S rRNA gene amplicons)
final ASV count table
Optional: for a more robust classification, use BLCA against SILVA or NCBI's 16S rRNA gene database

The default configuration of the pipeline is for Illumina MiSeq paired-end reads using reagent kit v3 (2x 300 bp, 600 cycles) with the primer pair S-D-Bact-0341-b-S-17 and S-D-Bact-0785-a-A-21 proposed by Klindworth et al. (2013). However, by changing the primer sequences, minimum sequence length, and minimum ASV length, this pipeline can be used for any overlapping paired-end bacterial or archaeal amplicon raw sequence data (see options). The script also performs a lineage correction as proposed by Yarza et al. (2014) to avoid over- or misinterpretation of the blast classification: uncertain assignments are removed rank by rank, from species to phylum, when the percent identity falls below the threshold for that rank (<98.7 species, <94.5 genus, <86.5 family, <82.0 order, <78.5 class, <75 phylum).
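
To make the identity thresholds concrete, here is a minimal sketch of the rank cut-off logic. It is an illustration only, not the code used by ngs4_16S: the function name `lowest_reliable_rank` and the label `none` for hits below 75% identity are assumptions.

```shell
# lowest_reliable_rank IDENTITY: deepest taxonomic rank still supported by a
# blast hit with the given percent identity, using the thresholds quoted above.
lowest_reliable_rank() {
    awk -v id="$1" 'BEGIN {
        id += 0                                   # force numeric comparison
        if      (id >= 98.7) print "species"
        else if (id >= 94.5) print "genus"
        else if (id >= 86.5) print "family"
        else if (id >= 82.0) print "order"
        else if (id >= 78.5) print "class"
        else if (id >= 75.0) print "phylum"
        else                 print "none"         # assumed label, see above
    }'
}

lowest_reliable_rank 96.2   # prints genus: the species label would be dropped
```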

A very basic R script (based on ampvis2) is provided to start your analyses. I highly recommend filling in the metadata file (metadata.tsv) with all the information you have about your samples. For more information, microsud has compiled an extensive overview of available microbiome analysis tools.

Before you start you need demultiplexed forward and reverse paired-end reads, placed in one folder, and sample names must meet the following naming convention:

<Sample_name>_<forward=R1_or_reverse=R2>.fastq.gz

# Example
Sample_1_R1.fastq.gz
Sample_1_R2.fastq.gz
Sample_2_R1.fastq.gz
Sample_2_R2.fastq.gz
etc.
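
A quick way to verify the naming convention before starting a run is to check that every R1 file has a matching R2 file and vice versa. This is a sketch under the convention shown above; `check_pairs` is a hypothetical helper and not part of ngs4ecoprod.

```shell
# check_pairs DIR: report every read file whose mate is missing. Fails if any
# file is unpaired or if DIR contains no *_R1/_R2.fastq.gz files at all.
check_pairs() {
    cp_status=0
    cp_found=0
    for cp_file in "$1"/*_R1.fastq.gz "$1"/*_R2.fastq.gz; do
        [ -e "$cp_file" ] || continue            # glob matched nothing
        cp_found=1
        case "$cp_file" in
            *_R1.fastq.gz) cp_mate="${cp_file%_R1.fastq.gz}_R2.fastq.gz" ;;
            *)             cp_mate="${cp_file%_R2.fastq.gz}_R1.fastq.gz" ;;
        esac
        if [ ! -e "$cp_mate" ]; then
            echo "Missing mate for $(basename "$cp_file")"
            cp_status=1
        fi
    done
    if [ "$cp_found" -ne 1 ]; then
        echo "No *_R1.fastq.gz or *_R2.fastq.gz files in $1"
        cp_status=1
    fi
    return "$cp_status"
}

# Usage: check_pairs ~/my_raw_reads && echo "All samples are paired"
```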

Afterwards you can start the pipeline (here with example data) to process your 16S rRNA gene amplicon data.

Run `ngs4_16S` on your data

ngs4_16S \
-i ~/ngs4ecoprod/ngs4ecoprod/example_data/16S \
-o ~/ngs4_16S \
-d ~/ngs4ecoprod/ngs4ecoprod/db/silva \
-p 3 -t 8
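
Because -t is the number of threads per process, the call above can occupy up to 3 x 8 = 24 threads, assuming the processes run concurrently. A small sketch for choosing -t from the available threads and the desired -p (the helper `threads_per_process` is hypothetical):

```shell
# threads_per_process TOTAL_THREADS PROCESSES: largest -t value such that
# PROCESSES * t does not exceed TOTAL_THREADS (never less than 1).
threads_per_process() {
    tpp_t=$(( $1 / $2 ))
    [ "$tpp_t" -ge 1 ] || tpp_t=1
    echo "$tpp_t"
}

# Example: on a 24-thread machine, -p 3 leaves 8 threads per process,
# which matches the "-p 3 -t 8" call above.
threads_per_process 24 3
```

On Linux, `threads_per_process "$(nproc)" 3` uses the thread count of the current machine.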

Options for `ngs4_16S`

         -i     Input folder containing paired-end fastq.gz
                Note: files must be named according to the following scheme
                Sample_name_R1.fastq.gz
                Sample_name_R2.fastq.gz
         -o     Output folder
         -d     Path to SILVA database
         -l     Optional: Minimum length of forward and reverse sequence in bp [default: 200]
         -q     Optional: Minimum phred score [default: 20]
         -p     Number of processes [default: 1]
         -t     Number of CPU threads per process [default: 1]
         -f     Forward primer [default: CCTACGGGNGGCWGCAG]
         -r     Reverse primer [default: GGATTAGATACCCBDGTAGTC]
                Note: Use the reverse complement sequence of your 16S rRNA gene reverse primer
         -a     Optional: Minimum length of amplicon [default: 400]
         -u     Optional: minsize of UNOISE [default: 8]
                Note: Only change under special circumstances, i.e., very low sample number
         -h     Print this help
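
The note on -r asks for the reverse complement of the reverse primer. As a sketch, the following helper (hypothetical name `revcomp`, uppercase IUPAC codes only) converts a primer written 5' to 3'. The input used here is the sequence commonly published for S-D-Bact-0785-a-A-21; verify it against your own primer sheet.

```shell
# revcomp SEQ: reverse complement of a primer, including IUPAC ambiguity codes.
# awk reverses the string, tr swaps each base for its complement.
revcomp() {
    printf '%s\n' "$1" |
        awk '{ out = ""; for (i = length($0); i > 0; i--) out = out substr($0, i, 1); print out }' |
        tr 'ACGTRYSWKMBDHVN' 'TGCAYRSWMKVHDBN'
}

revcomp GACTACHVGGGTATCTAATCC   # prints GGATTAGATACCCBDGTAGTC, the -r default
```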

Taxonomy assignment via BLCA using SILVA or NCBI's 16S rRNA gene database

Optional: Since ngs4_16S uses a "simple" blastn (megablast, best hit) to infer the taxonomy of the ASVs, you might want to use a more sophisticated approach for taxonomic assignment. You can use the Bayesian-based lowest common ancestor (BLCA) classification method on your data. This will take more computation time (depending on the diversity and number of ASVs in your samples and on your hardware), mainly because BLCA performs a blastn search and a clustalo alignment of the ASV sequences.

There are two scripts: ngs4_16S_blca, which uses BLCA with the SILVA 138.1 database, and ngs4_16S_blca_ncbi, which uses BLCA against NCBI's 16S rRNA gene database.

To run BLCA with SILVA on your data after ngs4_16S has finished, process your data with ngs4_16S_blca as follows:

ngs4_16S_blca \
-i ~/ngs4_16S \
-d ~/ngs4ecoprod/ngs4ecoprod/db/silva \
-t 8

To run BLCA with NCBI's 16S rRNA gene database on your data after ngs4_16S has finished, process your data with ngs4_16S_blca_ncbi as follows (note: every time you start the script, the most recent version of the database is downloaded):

ngs4_16S_blca_ncbi -i ~/ngs4_16S -t 8

Output

`ngs4_16S`

ASV_sequences.fasta → FASTA file containing all ASVs from your dataset
ASV_table.tsv → ASV read count table including blast classification
ASV.tre → Phylogenetic tree of the ASV sequences
markergene_16S.R → Basic R-script to visualize and analyze your data
metadata.tsv → Template metadata file including SampleID
ngs4_16S_DATE_TIME.log → Pipeline log file

`ngs4_16S_blca`

ASV_table_BLCA.tsv → ASV read count table including BLCA SILVA classification
ngs4_16S_blca_DATE_TIME.log → Pipeline log file

`ngs4_16S_blca_ncbi`

ASV_table_BLCA_ncbi.tsv → ASV read count table including BLCA NCBI classification
ngs4_16S_blca_ncbi_DATE_TIME.log → Pipeline log file

→ 18S rRNA genes (eukaryotes)

This pipeline is intended for use with 18S rRNA gene amplicons and is very similar to the 16S rRNA gene pipeline, except that the Yarza correction is not applied to the blastn hits. The default settings currently match the primer pair TAReuk454FWD1 and TAReukREV3 designed by Stoeck et al. (2015), but by tweaking the settings (primer sequences, amplicon size) the pipeline can be adapted to other primers (paired-end sequences must overlap).

Run `ngs4_18S` on your data

ngs4_18S \
-i ~/raw_18S_data \
-o ~/ngs4_18S \
-d ~/ngs4ecoprod/ngs4ecoprod/db/silva \
-p 3 -t 8

Options for `ngs4_18S`

         -i     Input folder containing paired-end fastq.gz
                Note: files must be named according to the following scheme
                Sample_name_R1.fastq.gz
                Sample_name_R2.fastq.gz
         -o     Output folder
         -d     Path to SILVA database
         -l     Optional: Minimum length of forward and reverse sequence in bp [default: 200]
         -q     Optional: Minimum phred score [default: 20]
         -p     Number of processes [default: 1]
         -t     Number of CPU threads per process [default: 1]
         -f     Optional: Forward primer [default: CCAGCASCYGCGGTAATTCC]
         -r     Optional: Reverse primer [default: TYRATCAAGAACGAAAGT]
                Note: Use the reverse complement sequence of your 18S rRNA gene reverse primer
         -a     Optional: Minimum length of amplicon [default: 350]
         -u     Optional: minsize parameter of UNOISE [default: 8]
                Note: Only change under special circumstances, i.e., very low sample number
         -h     Print this help

Taxonomy assignment via BLCA using the SILVA database

ngs4_18S_blca \
-i ~/ngs4_18S \
-d ~/ngs4ecoprod/ngs4ecoprod/db/silva \
-p 3 -t 8

`ngs4_18S`

ASV_sequences.fasta → FASTA file containing all ASVs from your dataset
ASV_table.tsv → ASV read count table including blast classification
ASV.tre → Phylogenetic tree of the ASV sequences
markergene_18S.R → Basic R-script to visualize and analyze your data
metadata.tsv → Template metadata file including SampleID
ngs4_18S_DATE_TIME.log → Pipeline log file

`ngs4_18S_blca`

ASV_table_BLCA.tsv → ASV read count table including BLCA SILVA classification
ngs4_18S_blca_DATE_TIME.log → Pipeline log file

2. Metagenomics with Nanopore data

1. Quality filter long-read data

To ensure high-quality long-reads, the first step is to filter your data with ngs4_np_qf, which applies a general quality filter with fastp and then removes barcode leftovers at the ends and/or in the middle of the long-reads with Porechop_ABI (an extension of Porechop).

Before you start you need your basecalled long-reads in one folder and your file names must meet the following naming convention:

<Sample_name>.fastq.gz

# Example
Sample_1.fastq.gz
Sample_2.fastq.gz
Sample_3.fastq.gz
etc.

Note: Before you can perform a test run with the example data, you have to download the example data (Zymo-gut-mock-Kit20 (SRR17913199) described in the NanoPhase paper):

ngs4_download_np_example -i ~/ngs4ecoprod/ngs4ecoprod/example_data

Quality filter your reads with:

ngs4_np_qf -i ~/ngs4ecoprod/ngs4ecoprod/example_data/nanopore -o ~/ngs4_np -p 3 -t 12

Options for `ngs4_np_qf`

         -i     Input folder containing nanopore raw data as fastq.gz
                Note: files must be named according to the following scheme (ending with .fastq.gz)
                SampleName.fastq.gz
         -o     Output folder
         -q     Optional: Minimum phred score [default: 15]
                Note: you might have to lower these for old chemistry/flow cells (<R10.4)
         -l     Optional: Minimum length of nanopore read [default: 500]
         -p     Number of processes [default: 1]
         -t     Number of CPU threads per process [default: 1]
         -h     Print this help
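
To preview how many reads would survive a given -l value before running the filter, the read lengths can be counted directly. This sketch only applies the length criterion with gzip and awk; it is not how fastp or ngs4_np_qf work internally and it ignores the quality filter. The helper name `count_long_reads` is hypothetical.

```shell
# count_long_reads FILE MIN_LEN: number of reads in a gzipped FASTQ file whose
# sequence is at least MIN_LEN bp long. Assumes four lines per record.
count_long_reads() {
    gzip -cd "$1" |
        awk -v min="$2" 'NR % 4 == 2 && length($0) >= min + 0 { n++ } END { print n + 0 }'
}

# Usage: count_long_reads Sample_1.fastq.gz 500
```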

2. Taxonomic composition of long-reads (optional)

To get a "rough" estimate of the taxonomic composition of your metagenomes, you can use ngs4_np_tax, which combines Kraken2 and Kaiju against NCBI's nt and nr databases, respectively. These tools use large databases and need a lot of RAM (up to 670 GB per process); with -m this can be reduced to 187 GB. The script produces a read count table which you can then analyze in R.

Taxonomic classification of quality-filtered long-reads:

ngs4_np_tax -i ~/ngs4_np -d ~/ngs4ecoprod/ngs4ecoprod/db/ -p 1 -t 20 -m

Options for `ngs4_np_tax`

         -i     Folder containing quality filtered fastq.gz
         -d     Path to databases (kraken2 & kaiju)
         -p     Number of processes [default: 1]
                Note: per process you need 187-670 Gb of RAM
         -t     Number of CPU threads per process [default: 1]
         -m     Reduce RAM requirements to 187 Gb (--memory-mapping for kraken2), slower
                Note: If your database is NOT located on a SSD expect long processing times
         -h     Print this help
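
Whether -m is needed depends on the RAM of the machine. The following sketch encodes the 187 GB and 670 GB figures from the option text for a single process; the helper names `tax_memory_mode` and `total_ram_gb` are hypothetical, and the actual requirement depends on the database versions you downloaded.

```shell
# tax_memory_mode RAM_GB: suggest how to run one ngs4_np_tax process on a
# machine with RAM_GB gigabytes of RAM.
tax_memory_mode() {
    if   [ "$1" -ge 670 ]; then echo "full"          # no -m needed
    elif [ "$1" -ge 187 ]; then echo "use -m"        # memory mapping for kraken2
    else                        echo "insufficient"
    fi
}

# total_ram_gb: total RAM in GB on Linux, read from /proc/meminfo (given in kB).
total_ram_gb() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

# Usage: tax_memory_mode "$(total_ram_gb)"
```

With -p greater than 1, compare the RAM divided by the number of processes instead, since the requirement applies per process.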

3. Assembly of long reads & generating metagenome assembled genomes

Now to the interesting part: assemble your quality filtered long-reads and generate metagenome assembled genomes (MAGs). This task is performed by NanoPhase which uses several tools to complete this task: metaWRAP, maxbin2, metabat2, semibin, checkm, GTDB-tk among others.

Assembly and binning of long-reads:

ngs4_np_assembly -i ~/ngs4_np -p 1 -t 20

Options for `ngs4_np_assembly`

         -i     Folder containing quality filtered fastq.gz
         -p     Number of processes [default: 1]
                Note: Better only use one process here - depending on your system
         -t     Number of CPU threads per process [default: 1]
         -h     Print this help

3. Metagenomics with Illumina data

1. Quality filtering of paired-end sequences

This script performs quality filtering on your raw sequence data. In detail, low-quality sequences are removed, sequences are trimmed where the quality drops below the threshold, and overlapping reads are polished according to their consensus. In addition, adapter leftovers and possible phiX leftovers are removed.

Note:
The script has one requirement (see example files): your file names have to follow this scheme:

<Sample_name>_<forward=R1_or_reverse=R2>.fastq.gz

# Example
Sample_1_R1.fastq.gz
Sample_1_R2.fastq.gz
Sample_2_R1.fastq.gz
Sample_2_R2.fastq.gz
etc.

Run the quality filter `ngs4_qf`

ngs4_qf -i ~/ngs4ecoprod/ngs4ecoprod/example_data -o ~/ngs4_test_run -d ~/ngs4ecoprod/ngs4ecoprod/db -p 3 -t 14

Options for `ngs4_qf`

         -i     Input folder containing paired-end fastq.gz
                Note: files must be named according to the following scheme
                Sample_name_R1.fastq.gz
                Sample_name_R2.fastq.gz
         -o     Output folder
         -d     Path to databases
         -l     Optional: Minimum length of sequence in bp [default: 50]
         -q     Optional: Minimum phred score [default: 20]
         -p     Number of processes [default: 1]
         -t     Number of CPU threads per process [default: 1]
         -h     Print this help

2. Taxonomic classification of quality-filtered paired-end reads

With this script you assign taxonomy to your data with Kraken2 and Kaiju. Both classifications are merged, with the Kraken2 annotation (higher precision) prioritized over the Kaiju annotation (higher sensitivity). In the end you will have a relative abundance table with taxonomic assignments.
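
The prioritization can be pictured with a small sketch. It assumes two simplified, hypothetical tables (one line per read: read ID, tab, taxon, with `unclassified` for reads without a hit); these are not the native Kraken2 or Kaiju output formats, and the helper `merge_tax` is not the code used by ngs4_tax.

```shell
# merge_tax KRAKEN2_TSV KAIJU_TSV: for every read in the Kraken2 table keep the
# Kraken2 taxon if there is one, otherwise fall back to the Kaiju taxon.
# Output columns: read ID, taxon, source of the assignment.
merge_tax() {
    awk -F '\t' -v OFS='\t' '
        NR == FNR { kaiju[$1] = $2; next }        # first file: Kaiju table
        {
            taxon = $2; src = "kraken2"
            if (taxon == "unclassified") {
                if (($1 in kaiju) && kaiju[$1] != "unclassified") {
                    taxon = kaiju[$1]; src = "kaiju"
                } else {
                    src = "none"
                }
            }
            print $1, taxon, src
        }' "$2" "$1"
}

# Usage: merge_tax kraken2_reads.tsv kaiju_reads.tsv > merged.tsv
```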

Note:
This step is RAM-intensive: per process you need at least 187 GB (with -m) or 670 GB of RAM.
In addition, make sure both databases (nt & nr) are located on an SSD!

ngs4_tax -i ~/ngs4_illumina -d ~/ngs4ecoprod/ngs4ecoprod/db -p 1 -t 10 -m

Options for `ngs4_tax`

         -i     Folder containing quality filtered fastq.gz
         -d     Path to databases (kraken2 & kaiju)
         -p     Number of processes (default: 1)
                Note: per process you need 187-670 Gb of RAM
         -t     Number of CPU threads per process (default: 1)
         -m     Reduce RAM requirements to 187 Gb (--memory-mapping for kraken2), slower
                Note: If you use -m and your database is NOT located on a SSD expect long processing times
         -h     Print this help

3. Assembly of quality filtered short-reads (on hold)

#ngs4_assemble -i ~/ngs4_illumina

Author

Dominik Schneider (dschnei1@gwdg.de)

Citation

Please cite all the sophisticated software tools and databases that are incorporated into ngs4ecoprod that you used in your analysis: software ngs4ecoprod environment

Since this repository currently has no associated publication, please cite via the GitHub link: https://github.com/dschnei1/ngs4ecoprod

tl;dr

→ Install ngs4ecoprod & download SILVA database

wget https://raw.githubusercontent.com/dschnei1/ngs4ecoprod/main/install_ngs4ecoprod
bash install_ngs4ecoprod -i ~/ngs4ecoprod
source ~/.bashrc
activate_ngs4ecoprod
rm -f install_ngs4ecoprod
ngs4_download_silva_db -i ~/ngs4ecoprod/ngs4ecoprod/db

→ 16S rRNA gene amplicon pipeline on example (or your data)

ngs4_16S -i ~/ngs4ecoprod/ngs4ecoprod/example_data/16S -o ~/ngs4_16S -d ~/ngs4ecoprod/ngs4ecoprod/db/silva -p 3 -t 8
