DupFinder: a tool for detecting duplicated genes using Illumina and Nanopore sequencing data

DupFinder is a gene duplication detection tool that combines several variant calling tools with efficient filtering methods. Its aim is to generate the broadest possible duplication catalogue while minimising the false positive rate. DupFinder supports both short-read data from Illumina sequencing and long-read data from Nanopore sequencing.

Introduction

DupFinder is a tool developed for the detection of gene duplications from next-generation sequencing (NGS) data, using paired-end Illumina reads or Nanopore long reads. It was designed for plant data but also works with human data, provided a reference genome and a gene annotation file are available.

The pipeline is built with Nextflow, a workflow tool that makes it easy to run tasks across multiple computational infrastructures. Nextflow supports containers such as Docker or Singularity and cross-platform package and environment managers such as Conda, all of which make a workflow more reproducible. This pipeline uses the Conda package manager, which handles the installation and updating of the software used by the pipeline and of its dependencies.

Workflow of DupFinder

Aligning reads to a reference genome using bwa mem for Illumina data (short-read sequencing) and minimap2 for Nanopore data (long-read sequencing)
Calling CNVs on Illumina data using the structural variant callers Delly, Dysgu, Lumpy-sv and smoove
Calling CNVs on Nanopore data using the structural variant callers Sniffles, Svim and cuteSV
Post-processing each set of CNVs to keep the duplications and remove false positives (Duphold, Bcftools)
Merging all sets of duplications into one large set (SURVIVOR)
Detecting duplicated genes using the annotation file (Bedtools)
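
The last step can be pictured as an interval overlap between the merged duplications and the gene annotation. The sketch below is not DupFinder's code. It uses plain awk instead of Bedtools, assumes both inputs are tab-delimited BED files, and assumes, as one possible criterion, that a gene is reported only when it lies entirely inside a duplicated interval. The function name genes_within_duplications is hypothetical.

```shell
# Illustrative only: print the names of genes fully contained in a duplication.
# $1: duplications BED (chrom, start, end)        -- assumed non-empty
# $2: genes BED        (chrom, start, end, name)
genes_within_duplications() {
    awk -F '\t' '
        # First file: remember every duplicated interval.
        NR == FNR {
            n++
            dchrom[n] = $1; dstart[n] = $2 + 0; dstop[n] = $3 + 0
            next
        }
        # Second file: report a gene once if any interval contains it.
        {
            for (i = 1; i <= n; i++) {
                if ($1 == dchrom[i] && ($2 + 0) >= dstart[i] && ($3 + 0) <= dstop[i]) {
                    print $4
                    break
                }
            }
        }
    ' "$1" "$2"
}
```

This linear scan is only meant to show the idea on small files. Bedtools performs the same kind of comparison far more efficiently and offers other overlap criteria.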

Installation

Prerequisites

DupFinder can only be installed on Linux systems and requires Anaconda/Miniconda (Python 3.9+) to be present on the system.

All steps of DupFinder are run using the Nextflow (>=22.10) workflow language.

Getting Started

Quick installation using conda

#Step 1. Download DupFinder:

git clone https://github.com/assane-mbodj/dupfinder

#Step 2. Go to the dupfinder folder:

cd dupfinder

#Step 3. Create the Conda environment from the YAML file in the folder, then run the install script:

conda env create -f dupfinder_env.yml

bash install.sh

#Step 4. Activate the dupfinder_env environment:

conda activate dupfinder_env

DupFinder test data

Finally, run the test.sh script with the command below to check that DupFinder has been installed correctly on your machine.

   bash test/test.sh

Index Reference genome for Illumina data

Before starting, create the index files for the reference genome with the following command; this reduces mapping time.

# build the BWA index for the reference

bwa index reference.fa
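
bwa index writes five files next to the reference (.amb, .ann, .bwt, .pac and .sa). As a small sketch, the helper functions below check whether those files are already present so that the index is rebuilt only when needed. Both function names are hypothetical and are not part of DupFinder.

```shell
# Print the BWA index files that are missing (or empty) for a reference FASTA.
missing_bwa_index_files() {
    local ref ext
    ref="$1"
    for ext in amb ann bwt pac sa; do
        [ -s "${ref}.${ext}" ] || printf '%s\n' "${ref}.${ext}"
    done
}

# Run bwa index only when at least one index file is absent.
ensure_bwa_index() {
    if [ -n "$(missing_bwa_index_files "$1")" ]; then
        bwa index "$1"
    fi
}

# Example: ensure_bwa_index reference.fa
```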

Usage

DupFinder: Tool for detecting duplicated genes using Illumina and Nanopore sequencing data.

  DupFinder version: v2.0.0
 
   Usage:
	For Illumina data:
	nextflow run dupfinder.nf --sr --c file.config --genome_file reference.fa --reads_sr "pair_id_{1,2}.fastq" --annot file.bed --out Output_DupFinder

	For Nanopore data:
	nextflow run dupfinder.nf --lr --c file.config --genome_file reference.fasta --reads_lr "pair_id.fastq" --annot file.bed --out Output_DupFinder

    DupFinder command arguments: the following parameters need to be specified when running DupFinder
    
	    --genome_file: Reference genome in FASTA format

	    --reads_sr: set of paired-end short reads in FASTQ format. Gzipped FASTQ files are allowed

	    --reads_lr: set of single-end long reads in FASTQ format. Gzipped FASTQ files are allowed

	    --sr: run the short-read (Illumina) version of the pipeline

	    --lr: run the long-read (Nanopore) version of the pipeline

	    --annot: file containing the gene annotation, in GFF or BED format; it must be tab-delimited

	    --out: Output directory to which all results will be written

	    --c: Config file specifying the number of CPU cores and memory that will be assigned to DupFinder
	   	   	    
   Optional arguments:

	    -w: Working directory to which intermediate results will be written. Default: work

	    -v: Version
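
Because the annotation passed to --annot must be tab-delimited, it can help to check the file before launching a run. The sketch below is a hypothetical helper, not part of DupFinder. It reports the line numbers of records with fewer tab-separated fields than expected, assuming at least 3 columns for BED and 9 for GFF.

```shell
# Print the line numbers of records with fewer than $2 tab-separated fields.
# Comment lines (starting with #) and blank lines are skipped.
# Usage: find_untabbed_lines file.bed 3
#        find_untabbed_lines file.gff 9
find_untabbed_lines() {
    awk -F '\t' -v min="$2" '
        /^#/ || /^[ \t]*$/ { next }
        NF < min { print NR }
    ' "$1"
}
```

No output means every record has the expected number of columns. A typical cause of reported lines is a file whose columns are separated by spaces rather than tabs.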

Running multiple samples at once

DupFinder can process multiple samples with a single command. For example, if there are several paired-end Illumina samples or several single-end Nanopore samples, they can all be processed using:

	For Illumina data:
	    nextflow run dupfinder.nf --sr --c file.config --genome_file reference.fa --reads_sr "*_{1,2}.fastq" --annot file.bed --out Output_DupFinder

	For Nanopore data:
	    nextflow run dupfinder.nf --lr --c file.config --genome_file reference.fa --reads_lr "*.fastq" --annot file.bed --out Output_DupFinder
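
With the Illumina pattern "*_{1,2}.fastq", each sample is expected to provide both a _1 and a _2 file. As a hedged pre-flight check, the hypothetical function below lists sample IDs in a directory that are missing one mate. It assumes uncompressed files ending in _1.fastq and _2.fastq and is not part of DupFinder.

```shell
# Print the ID of every sample in directory $1 that lacks its _1 or _2 file.
list_unpaired_samples() {
    local dir f id
    dir="$1"
    for f in "$dir"/*_1.fastq "$dir"/*_2.fastq; do
        [ -e "$f" ] || continue            # the glob matched nothing
        id="${f%_[12].fastq}"              # strip the _1.fastq / _2.fastq suffix
        if [ ! -e "${id}_1.fastq" ] || [ ! -e "${id}_2.fastq" ]; then
            basename "$id"
        fi
    done
}

# Example: list_unpaired_samples /path/to/reads
```

An empty result means every sample in the directory has both mates.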

OUTPUT

The outputs are listed in the table below: the variant_calls folder contains the CNV calls of the individual callers, the duplicate_annot_calls folder contains the annotated duplications, the merge_vcf folder contains the merged calls, and the duplicated_gene folder contains the gene duplications.

Overview

Col	Type	Name	Description
1	folder	Bam_files	Alignment files
2	folder	variant_calls	Variant calling files
3	folder	duplication_annot_calls	Filtered duplicate regions
4	folder	detected_gene	Detected gene duplications
5	folder	merge_vcf	Merged calls from all duplication callers

Contact

DupFinder

Any question, concern, or bug report about the program should be posted as an Issue on GitHub. Before posting, please check previous issues (both Open and Closed) to see if your issue has been addressed already. Also, please follow these good GitHub practices.
