A pipeline for variant calling from P. falciparum short reads generated from Illumina and ONT libraries

kevin-wamae/PlasmoSeq-DualTech

A snakemake pipeline for variant calling from P. falciparum short amplicon reads

Motivation

PS: The pipeline is still in its infancy

We sequenced an Illumina sequencing library on the Oxford Nanopore MinION (ONT) to evaluate the cost of this approach.

  • PCR amplicons from Plasmodium falciparum drug resistance markers (ama1, k13, dhps, dhfr and mdr1) were generated in duplicate.
  • Illumina sequencing libraries were generated using KAPA reagents and KAPA indexes.
  • Finally, ONT sequencing libraries were generated using a single set of ONT adapters and sequenced on the MinION using an R9.4.1 flow cell.
  • Because only one set of ONT adapters was used, the sequences cannot be demultiplexed into individual samples, so further analyses were done at the population level.

Below are the project dependencies:

     Package management

  • conda - an open-source package and environment management system that runs on various platforms, including Windows, macOS and Linux.

     Workflow management

  • snakemake - a workflow management system that aims to reduce the complexity of creating workflows by providing a fast and comfortable execution environment, together with a clean and modern specification language in Python style.

     Bioinformatics tools (packages)

  • fastqc - a quality control tool for high-throughput sequence data
  • multiqc - a tool for aggregating bioinformatics analysis reports across many samples and tools
  • porechop - a tool for finding and removing adapters from Oxford Nanopore reads. Adapters on the ends of reads are trimmed off, and when a read has an adapter in its middle, it is treated as chimeric and chopped into separate reads. Porechop performs thorough alignments to effectively find adapters, even at low sequence identity.
  • cutadapt - a tool that finds and removes adapter sequences, primers, poly-A tails and other types of unwanted sequence from your high-throughput sequencing reads.
  • bwa - an aligner for short-read alignment (see minimap2 for long-read alignment)
  • bedtools - allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF
  • bcftools - a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF.
  • snpEff - a genetic variant annotation and effect prediction toolbox
  • SnpSift - a toolbox that allows you to filter and manipulate annotated files.
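
Once the conda environment described under "Running the analysis" is active, it can be useful to confirm that the tools listed above are actually on your PATH. The following is a minimal sketch, not part of the pipeline. The executable names (in particular `snpEff` and `SnpSift`) are assumptions based on the usual Bioconda packages, and `missing_tools` is a hypothetical helper name.

```python
import shutil

# Executable names are assumptions based on the usual Bioconda packages;
# adjust them if your environment exposes different names.
PIPELINE_TOOLS = [
    "snakemake", "fastqc", "multiqc", "porechop", "cutadapt",
    "bwa", "bedtools", "bcftools", "snpEff", "SnpSift",
]


def missing_tools(tools=PIPELINE_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All pipeline tools were found on PATH.")
```

An empty result only shows that the executables can be found; it does not check their versions.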

Where to start

  • Clone this project onto your computer using Git (installation instructions) with the following command:
    • git clone https://github.com/kevin-wamae/PlasmoSeq-DualTech.git
  • Navigate into the cloned directory using the following command:
    • cd PlasmoSeq-DualTech

Directory structure

  • Below is the default directory structure:
    • config/ - contains the workflow configuration files
    • env/ - contains the Conda environment files
    • input/ - contains fastq, adaptors and genome files
    • output/ - contains the output from the analysis
    • workflow/ - contains the Snakemake script (snakefile) and additional scripts
.
├── LICENSE
├── README.md
├── config
│   └── config.yaml
├── env
│   └── environment.yml
├── input
│   ├── 01_fastq
│   │   ├── file-1_0.fastq.gz
│   │   └── file-2_1.fastq.gz
│   ├── 02_adapters
│   │   ├── illumina-TruSeq-adapters.fasta
│   │   └── illumina-indexes.txt
│   └── 03_genome
│       ├── genome.fasta
│       └── genome_annotations.gff
├── output
└── workflow
    ├── scripts
    │   └── create_snpeff_db.sh
    └── snakefile
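
Before running anything, you may want to check that your clone matches the layout above. The sketch below is an illustration only: `EXPECTED_PATHS` is copied from the tree (the example fastq file names are left out because yours will differ), and `missing_paths` is a hypothetical helper, not something the repository provides.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the repository root.
# The individual fastq files are omitted because their names are examples.
EXPECTED_PATHS = [
    "config/config.yaml",
    "env/environment.yml",
    "input/01_fastq",
    "input/02_adapters",
    "input/03_genome/genome.fasta",
    "input/03_genome/genome_annotations.gff",
    "workflow/snakefile",
    "workflow/scripts/create_snpeff_db.sh",
]


def missing_paths(repo_root="."):
    """Return the expected paths that do not exist under `repo_root`."""
    root = Path(repo_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_paths():
        print("missing:", rel)
```

Run it from inside the cloned `PlasmoSeq-DualTech` directory; no output means every listed path was found.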

Running the analysis

Install conda and execute the following commands:

1 - Create the conda analysis environment and install the dependencies listed in env/environment.yml by running the following command in your terminal:

  • conda env create --file env/environment.yml

2 - Activate the conda environment:

  • PS - This needs to be done every time you want to execute this pipeline:
  • conda activate ampseq-analysis

3 - Create the snpEff database by executing the bash script below. This script will download P. falciparum genome files from PlasmoDB and create a snpEff database:

  • PS - This analysis uses genome data release-51 from PlasmoDB, and the script only needs to be run once:
  • bash workflow/scripts/create_snpeff_db.sh

4 - Finally, execute the whole Snakemake pipeline by running the following command in your terminal:

  • PS - Replace 4 in the command with the number of CPUs you wish to use
  • snakemake -c4

5 - Alternatively, you can execute a specific rule by running the following command in your terminal:

  • PS - Replace rule in the command with the respective rule name from workflow/snakefile
  • snakemake -c4 rule (for example snakemake -c4 qc_raw_files)
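
If you prefer to launch the workflow from a script, steps 4 and 5 can be wrapped roughly as below. This is a sketch under assumptions: `build_snakemake_command` and `run_pipeline` are hypothetical helper names, and the wrapper defaults to Snakemake's `--dry-run` option, which lists the jobs that would run without executing them. `--cores 4` is the long form of `-c4`.

```python
import subprocess


def build_snakemake_command(cores, rule=None, dry_run=False):
    """Build the snakemake command line as a list of arguments."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    cmd = ["snakemake", "--cores", str(cores)]
    if dry_run:
        # --dry-run only reports what would be executed.
        cmd.append("--dry-run")
    if rule is not None:
        # A rule name restricts the run to that target rule.
        cmd.append(rule)
    return cmd


def run_pipeline(cores, rule=None, dry_run=True):
    """Run snakemake from the repository root; raises if it exits non-zero."""
    return subprocess.run(build_snakemake_command(cores, rule, dry_run), check=True)
```

For example, `run_pipeline(4, rule="qc_raw_files")` would preview the `qc_raw_files` rule, and passing `dry_run=False` would execute it.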

Expected output

Below is the expected directory structure of the output/ directory:

  • 01_snpeff_database/ - contains the snpEff database used for variant annotation
  • 02_qc_raw/ - contains the fastqc QC reports from the raw fastq files
  • 03_multiqc_raw/ - contains the aggregated fastqc QC reports
  • 04_trim_fastq_ont/ - contains fastq files after trimming ONT adaptors
  • 05_trim_fastq_illumina/ - contains fastq files after trimming Illumina adaptors
  • 06_qc_trimmed_files/ - contains the fastqc QC reports from the fastq files after quality trimming
  • 07_read_mapping/ - contains genome mapping files (index, bam and bed)
  • 08_variant_calling/ - contains variant calling files
output/
├── 01_snpeff_database
│   ├── P.falciparum
│   └── genomes
├── 02_qc_raw
├── 03_multiqc_raw
│   └── multiqc_data
├── 04_trim_fastq_ont
├── 05_trim_fastq_illumina
├── 06_qc_trimmed_files
├── 07_read_mapping
│   └── genomeIndex
└── 08_variant_calling
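
After a run, a quick way to see which stages produced files is to count the files under each expected directory. The sketch below assumes the directory names listed above; `summarise_output` is a hypothetical helper and is not part of the pipeline.

```python
from pathlib import Path

# Stage directories as listed in the expected output above.
STAGE_DIRS = [
    "01_snpeff_database",
    "02_qc_raw",
    "03_multiqc_raw",
    "04_trim_fastq_ont",
    "05_trim_fastq_illumina",
    "06_qc_trimmed_files",
    "07_read_mapping",
    "08_variant_calling",
]


def summarise_output(output_dir="output"):
    """Map each stage directory to its file count, or None if it is absent."""
    summary = {}
    for name in STAGE_DIRS:
        stage = Path(output_dir) / name
        if stage.is_dir():
            # Count regular files at any depth below the stage directory.
            summary[name] = sum(1 for p in stage.rglob("*") if p.is_file())
        else:
            summary[name] = None
    return summary


if __name__ == "__main__":
    for name, count in summarise_output().items():
        print(f"{name}: {'missing' if count is None else count}")
```

A stage reported as missing or with a count of 0 is a reasonable place to start when checking the Snakemake log.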
