
DupFinder: A tool to detect duplicated genes using Illumina and Nanopore sequencing data


assane-mbodj/dupfinder


DupFinder: a tool for detecting duplicated genes using Illumina and Nanopore sequencing data

DupFinder is a gene duplication detection tool that combines several variant calling tools with efficient filtering methods. Its aim is to generate the broadest possible duplication catalogue while minimising the false positive rate. DupFinder handles both short-read data from Illumina sequencing and long-read data from Nanopore sequencing.


Introduction

DupFinder is a tool for detecting gene duplications from next-generation sequencing (NGS) data, using paired-end Illumina reads or Nanopore long reads. It is designed primarily for plant data but can also be applied to human data, provided a reference genome and a gene annotation file are available.

The pipeline is built with Nextflow, a workflow manager that makes it easy to run tasks across different computational infrastructures. Nextflow can use containers such as Docker or Singularity, or cross-platform package and environment managers such as Conda, which makes workflows more reproducible. This pipeline uses the Conda package manager, which simplifies installing and updating the software used by the pipeline and its dependencies.

Workflow of DupFinder

  • Aligning reads to a reference genome with bwa mem for Illumina data (short reads) and minimap2 for Nanopore data (long reads)
  • Calling CNVs on Illumina data with the structural variant callers Delly, Dysgu, Lumpy-sv and smoove
  • Calling CNVs on Nanopore data with the structural variant callers Sniffles, Svim and cuteSV
  • Post-processing each set of CNVs to keep the duplications and remove false positives (Duphold, Bcftools)
  • Merging all sets of duplications into one large set (SURVIVOR)
  • Detecting duplicated genes using the annotation file (Bedtools)
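
The last step of the workflow can be pictured as an interval-overlap query between gene coordinates and duplication coordinates. The following is a minimal sketch of that idea in plain awk, not the Bedtools call the pipeline actually runs. The function name, the four-column gene BED layout, and the "any overlap counts" rule are assumptions made for illustration; the pipeline may apply a stricter criterion.

```shell
# Print the names of genes that overlap at least one duplication interval.
#   $1: genes BED file (chrom, start, end, name), tab-delimited  [assumed layout]
#   $2: duplications BED file (chrom, start, end), tab-delimited [assumed layout]
# BED intervals are 0-based and half-open, so two intervals on the same
# chromosome overlap when start_a < end_b and start_b < end_a.
genes_overlapping_dups() {
    genes_bed="$1"
    dups_bed="$2"
    awk -F'\t' '
        # First file: remember every duplication interval.
        FILENAME == ARGV[1] { n++; c[n] = $1; s[n] = $2 + 0; e[n] = $3 + 0; next }
        # Second file: report a gene once if it overlaps any stored interval.
        {
            for (i = 1; i <= n; i++)
                if ($1 == c[i] && ($2 + 0) < e[i] && s[i] < ($3 + 0)) { print $4; break }
        }
    ' "$dups_bed" "$genes_bed"
}
```

Calling `genes_overlapping_dups genes.bed dups.bed` prints one gene name per line. The quadratic loop is fine for a sketch; dedicated tools use indexed or sorted sweeps for genome-scale inputs.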

Installation

Prerequisites

DupFinder can only be installed on Linux systems and requires Anaconda/Miniconda (Python 3.9+) to be present on the system.

All steps of DupFinder are run using the Nextflow (>=22.10) workflow language.
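
Before installing, it can help to confirm that the prerequisites are in place. The sketch below is not part of DupFinder; the function names are hypothetical, and the version comparison assumes GNU `sort -V`, which is available on typical Linux systems.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Uses GNU sort -V (version sort): the smaller version sorts first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Print each prerequisite tool that is not on PATH; fail if any is missing.
check_prerequisites() {
    missing=0
    for tool in conda nextflow; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `version_ge "22.10.6" "22.10"` succeeds while `version_ge "22.04.5" "22.10"` fails; pass in whatever version string your Nextflow installation reports.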

Getting Started

Quick installation using conda

#Step 1. Download DupFinder:

git clone https://github.com/assane-mbodj/dupfinder

#Step 2. Go to dupfinder folder

cd dupfinder

#Step 3. Create the conda environment from the yaml file in the folder, then run the install script:

conda env create -f dupfinder_env.yml

bash install.sh

#Step 4. Activate the dupfinder_env environment:

conda activate dupfinder_env

DupFinder test data

Finally, run the test.sh script with the command below to check that DupFinder has been installed correctly on your machine.

   bash test/test.sh

Index Reference genome for Illumina data

Before starting, index the reference genome with the following command to reduce mapping time.

# build the BWA index for the reference genome

bwa index reference.fa 
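
To avoid rebuilding the index on every run, one option is to check for the index files first. This is a small sketch rather than something DupFinder ships; the function name is hypothetical. It relies on `bwa index` writing five files next to the reference, with the suffixes `.amb`, `.ann`, `.bwt`, `.pac` and `.sa`.

```shell
# Return success if all five BWA index files exist and are non-empty.
#   $1: path to the reference FASTA that was (or will be) indexed
bwa_index_present() {
    ref="$1"
    for ext in amb ann bwt pac sa; do
        [ -s "$ref.$ext" ] || return 1
    done
    return 0
}

# Typical use (left commented out so the sketch has no side effects):
# bwa_index_present reference.fa || bwa index reference.fa
```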

Usage

DupFinder: Tool for detecting duplicated genes using Illumina and Nanopore sequencing data.

  DupFinder version: v2.0.0
 
   Usage:
	For Illumina data:
	nextflow run dupfinder.nf --sr --c file.config --genome_file reference.fa --reads_sr "pair_id_{1,2}.fastq" --annot file.bed --out Output_DupFinder

	For Nanopore data:
	nextflow run dupfinder.nf --lr --c file.config --genome_file reference.fasta --reads_lr "pair_id.fastq" --annot file.bed --out Output_DupFinder

    DupFinder command arguments: the following parameters must be specified when running DupFinder
    
	    --genome_file: Reference genome in FASTA format

	    --reads_sr: Set of paired-end short reads in FASTQ format. Gzipped FASTQ files are allowed

	    --reads_lr: Set of single-end long reads in FASTQ format. Gzipped FASTQ files are allowed

	    --sr: Run the short-read (Illumina) version of the pipeline

	    --lr: Run the long-read (Nanopore) version of the pipeline

	    --annot: Gene annotation file in GFF or BED format; it must be tab-delimited

	    --out: Output directory to which all results will be written

	    --c: Config file specifying the number of CPU cores and memory that will be assigned to DupFinder
	   	   	    
   Optional arguments:

	    -w: Working directory to which intermediate results will be written. Default: work

	    -v: Version
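
The help text does not show what the file passed to `--c` looks like. The sketch below writes one plausible version, assuming the file follows the standard Nextflow configuration syntax, where the `process` scope accepts `cpus` and `memory`. That assumption, the function name, and the example values are not confirmed by DupFinder; check the config file shipped in the repository before relying on it.

```shell
# Write a minimal Nextflow-style config file.
# ASSUMPTION: DupFinder's --c file uses the standard Nextflow 'process' scope.
#   $1: output path, $2: number of CPU cores, $3: memory (e.g. "16 GB")
write_dupfinder_config() {
    out="$1"
    cpus="$2"
    mem="$3"
    cat > "$out" <<EOF
process {
    cpus = $cpus
    memory = '$mem'
}
EOF
}
```

For instance, `write_dupfinder_config file.config 8 "32 GB"` produces a file that could then be passed as `--c file.config`.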

Running multiple samples at once

DupFinder can process multiple samples with a single command. For example, if there are several paired-end Illumina samples or single-end Nanopore samples, they can all be processed using:

	For Illumina data:
	    nextflow run dupfinder.nf --sr --c file.config --genome_file reference.fa --reads_sr "*_{1,2}.fastq" --annot file.bed --out Output_DupFinder

	For Nanopore data:
	    nextflow run dupfinder.nf --lr --c file.config --genome_file reference.fa --reads_lr "*.fastq" --annot file.bed --out Output_DupFinder
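
Note that the glob patterns are quoted so that the shell passes them to Nextflow unexpanded. To preview which Illumina samples a pattern such as `"*_{1,2}.fastq"` would pick up as complete pairs, something like the following sketch can be used. It assumes files are named `<id>_1.fastq` and `<id>_2.fastq`, as in the example above; the function name is hypothetical, and how DupFinder itself groups the pairs is not shown in this document.

```shell
# Print the sample IDs for which both mates exist in a directory,
# and warn on stderr about forward reads that have no mate.
#   $1: directory containing <id>_1.fastq and <id>_2.fastq files
list_paired_samples() {
    dir="$1"
    for r1 in "$dir"/*_1.fastq; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        id=$(basename "$r1" _1.fastq)
        if [ -e "$dir/${id}_2.fastq" ]; then
            echo "$id"
        else
            echo "warning: no mate found for $r1" >&2
        fi
    done
}
```

Running `list_paired_samples .` before launching the pipeline gives a quick check that every sample has both read files.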

Output

The outputs are summarised in the table below: the variant_calls folder contains the CNV calls of the individual callers, the duplicate_annot_calls folder contains the annotated duplications, the merge_vcf folder contains the merged calls, and the duplicated_gene folder contains the gene duplications.

Overview

Col  Type    Folder                   Description
1    folder  Bam_files                Alignment files
2    folder  variant_calls            Variant calling files
3    folder  duplication_annot_calls  Filtered duplicated-region files
4    folder  detected_gene            Detected gene duplication files
5    folder  merge_vcf                Merged files from all duplication callers
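
After a run, a quick sanity check is to confirm that the expected folders were created under the `--out` directory. The sketch below is illustrative only: the function name is hypothetical, and the folder names are copied from the overview table, so adjust them if your run produces different names.

```shell
# Print the expected result folders that are missing under an output directory.
# ASSUMPTION: folder names are those listed in the overview table.
#   $1: the directory that was passed to --out
missing_output_folders() {
    out_dir="$1"
    for d in Bam_files variant_calls duplication_annot_calls detected_gene merge_vcf; do
        [ -d "$out_dir/$d" ] || echo "$d"
    done
}
```

An empty result from `missing_output_folders Output_DupFinder` means every listed folder is present.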

Contact

DupFinder

Copyright © 2023 Assane Mbodj (assanembodj11@gmail.com)

Any questions, concerns, or bug reports about the program should be posted as an Issue on GitHub. Before posting, please check previous issues (both Open and Closed) to see whether your issue has already been addressed. Also, please follow these good GitHub practices.