Detection of Saccharomyces paradoxus DNA across Saccharomyces cerevisiae, and vice versa.

nicolo-tellini/intropipeline

intropipeline logo

NEWS:

🚀 v1.1 has been released, with several improvements in stability, speed and memory consumption.

intropipeline

An automated computational framework for detecting Saccharomyces paradoxus introgressions in Saccharomyces cerevisiae strains from paired-end Illumina sequencing data.

Sublime's custom image

Description

v1.0 is described in Tellini et al. 2024 (Nat. Ecol. Evol.) and detects S.par introgressions in S.cer strains.

v1.1 contains the following additions and changes:

  • minimap2 replaces bwa mem, almost halving the running time while achieving comparable results (see Heng Li 2018, Bioinformatics);

    sample: ERR3010122

    threads: 2

    Architecture: x86_64

    CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz

    script                        Elapsed time (m:ss)   Maximum resident set size (GB)
    bwa mem + samtools (v1)       6:21                  1.3
    minimap2 + samtools (v1.1)    3:36                  1.3
  • improved the reproducibility of the mapping by implementing the standard samtools workflow, following the samtools guidelines;

  • improved the robustness of the mapping by appending the name of each strain to a checkpoint (cps) file (./cps/cps.txt). Strains whose names are stored in ./cps/cps.txt are not mapped again;

  • introduced data.table, lapply and custom functions for large-file manipulation, reducing runtime and RAM load. Example:

    sample: ERR3010122

    threads: 2

    Architecture: x86_64

    CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz

    script                    Elapsed time (m:ss)   Maximum resident set size (GB)
    parser_marker.r (v1)      0:17                  0.8
    parser_marker.r (v1.1)    0:06                  0.5
    clrs.r (v1)               0:49                  1.9
    clrs.r (v1.1)             0:17                  0.7
  • introduced the variables nSamples and nThreads inside runner.sh. The first controls the number of samples run in parallel and the second the number of threads per sample. nSamples guarantees a constant number of samples running in parallel: as soon as one sample finishes, another starts. These variables affect the scripts minimap2.sh (which replaces bwa.sh), bcftools_markers.sh (which replaces samtools_marker.sh) and freec.sh;

  • corrected an error that prevented the detection of the CNVs;

  • added a new approach for merging markers into blocks:

    In v1 the markers are (1) genotyped, (2) filtered and (3) joined as long as they are consecutive and carry the same information. These steps are retained in v1.1.

    In v1.1 the markers are (1) ranked, (2) genotyped, (3) filtered and (4) joined as long as they are consecutive in the ranking and carry the same information. v1 did not use the ranking. Inevitably, this results in a more fragmented signal, but it gives a more realistic and faithful representation of the introgression because it reflects regions where the genotyping was either discordant or failed. The ranking is also what allowed the speedup of clrs.r (the script that generates the blocks).

    Sublime's custom image
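
The rank-aware merging described above can be illustrated with a minimal sketch. This is not the code in clrs.r: the table layout (rank, chromosome, position, genotype) and the function name merge_blocks are assumptions made for the example, and missing ranks stand for markers removed by filtering.

```shell
#!/bin/sh
# Hypothetical marker table: rank, chromosome, position, genotype.
# Ranks 4 and 5 are absent: they stand for markers lost to filtering.
markers='1 chrI 100 Scer
2 chrI 250 Scer
3 chrI 400 Spar
6 chrI 900 Spar
7 chrI 1100 Spar
8 chrI 1300 Het'

# Join markers into blocks while the rank is consecutive and the
# chromosome and genotype are unchanged.
# Output columns: chromosome, start, end, genotype, number of markers.
merge_blocks() {
  awk '
    function emit() { if (n) print chrom, start, end, geno, n }
    {
      if (n && ($1 != prev + 1 || $2 != chrom || $4 != geno)) { emit(); n = 0 }
      if (!n) { chrom = $2; start = $3; geno = $4 }
      end = $3; prev = $1; n++
    }
    END { emit() }
  '
}

blocks=$(printf '%s\n' "$markers" | merge_blocks)
printf '%s\n' "$blocks"
```

In this toy input the three Spar markers form two blocks (400 to 400 and 900 to 1100) because ranks 4 and 5 are missing. Dropping the rank check would join them into a single block from 400 to 1100, which is the less fragmented behaviour described for v1.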

Download

:octocat: :

git clone --recursive https://github.com/nicolo-tellini/intropipeline.git

Content

📂 :

.
├── rep
│   ├── Ann
│   └── Asm
├── runner.sh
├── scr
└── seq

5 directories, 1 file
  • rep : repository with assemblies, annotations and pre-computed marker table,
  • runner.sh : the script you edit and run,
  • scr : scripts,
  • seq : put the FASTQ files here.

Before starting

gzip -d ./rep/mrktab.gz

gzip -d ./rep/Asm/*gz

About the fastqs

Move the FASTQs inside ./seq/

Paired-end FASTQ data must be gzipped and suffixed with .R1.fastq.gz and .R2.fastq.gz.
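
Before launching the pipeline it can help to confirm that every R1 file has its R2 mate. The following is a minimal sketch of such a check, not part of the pipeline; the function name find_unpaired is hypothetical, and the demo runs on a temporary directory of empty placeholder files rather than on ./seq/.

```shell
#!/bin/sh
# Print the sample names whose .R1.fastq.gz file has no matching .R2.fastq.gz.
# The directory argument is illustrative; the pipeline expects ./seq/.
find_unpaired() {
  for r1 in "$1"/*.R1.fastq.gz; do
    [ -e "$r1" ] || continue            # the glob matched nothing
    sample=$(basename "$r1" .R1.fastq.gz)
    [ -e "$1/$sample.R2.fastq.gz" ] || printf '%s\n' "$sample"
  done
}

# Demo on a throwaway directory with empty placeholder files.
demo=$(mktemp -d)
touch "$demo/strainA.R1.fastq.gz" "$demo/strainA.R2.fastq.gz" "$demo/strainB.R1.fastq.gz"
unpaired=$(find_unpaired "$demo")
printf 'missing R2 for: %s\n' "$unpaired"
rm -rf "$demo"
```

On real data the call would be `find_unpaired ./seq`; an empty result means every sample is paired.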

Default

./scr/bwa.sh uses 2 threads per sample (n.samples = 2).

./scr/samtools_markers.sh uses 1 thread per sample (n.samples = 4).

./scr/gem.sh uses 2 threads.

./scr/freec.sh uses 4 threads.

These values can be changed by editing the scripts.
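
The combination of a fixed number of concurrent samples and a per-sample thread count (nSamples and nThreads in v1.1) amounts to a bounded pool of jobs. A minimal sketch of that scheduling idea with `xargs -P` follows. It is not necessarily how runner.sh implements it; the sample names are made up and the echo is a placeholder for the real per-sample command.

```shell
#!/bin/sh
nSamples=2   # samples processed at the same time
nThreads=2   # threads given to each sample
export nThreads

samples='strainA strainB strainC strainD'   # hypothetical sample names

# xargs -P keeps up to nSamples jobs alive and starts the next one as soon
# as a slot frees up. Each job receives one sample name as $1.
log=$(printf '%s\n' $samples |
  xargs -P "$nSamples" -n 1 sh -c 'echo "sample $1 done with $nThreads threads"' _)
printf '%s\n' "$log"
```

The order of the output lines depends on which job finishes first, so it can differ between runs. The total load is roughly nSamples × nThreads threads, which is worth keeping at or below the number of available cores.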

How to run

Edit runner.sh 📃

#!/bin/bash

#####################
### user settings ###
#####################

## S. paradoxus reference assembly

ref2Label="CBS432" ## choose the Spar assembly that you think best fits the origin of your samples

## short labels (used to name files)

ref2="EU" ## choose a short name for Spar

# STEP 1
fastqQC="yes" ## FastQC control (required) ("yes", "no" or "-"; "-" skips the step)

# STEP 2
shortReadMapping="yes" ## ("yes","no")

# STEP 3
mrkgeno="yes" ## ("yes","no")

# STEP 4
cnv="yes" ## ("yes","no")

# STEP 5
intro="yes" ## ("yes","no")

#####################
### settings' end ###
#####################

Run runner.sh 🏃

nohup bash runner.sh &
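
If a run is interrupted and restarted, the checkpoint file introduced in v1.1 (./cps/cps.txt) determines which strains are skipped at the mapping step. A minimal sketch of that skip logic follows, assuming one strain name per line in the checkpoint file. That format, the helper name is_done and the strain names are assumptions for the example, and a temporary file stands in for ./cps/cps.txt.

```shell
#!/bin/sh
# Illustrative checkpoint file; the pipeline uses ./cps/cps.txt.
cps=$(mktemp)
printf '%s\n' strainA > "$cps"       # strainA was mapped in an earlier run

# Succeeds when the strain name is already recorded as a whole line.
is_done() { grep -qxF -- "$1" "$cps"; }

mapped=""
for strain in strainA strainB; do
  if is_done "$strain"; then
    echo "skip $strain"
    continue
  fi
  echo "map $strain"                 # placeholder for the real mapping step
  printf '%s\n' "$strain" >> "$cps"  # record the strain once mapping is done
  mapped="$mapped$strain "
done

recorded=$(sort "$cps" | tr '\n' ' ')
rm -f "$cps"
```

Under the same assumption, removing a strain's line from the checkpoint file would presumably cause it to be mapped again on the next run.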

The result

The results concerning the introgressions are stored in ./int

Ex.

An Alpechin strain:

res

How to interpret the result

Blue-red plots provide an overview of potentially introgressed DNA across the genome. Interpreting the results requires integrating the different data the pipeline produces.

Sublime's custom image

❗ Reminder: blocks are defined as consecutive markers bearing the same genomic info (Homo S.cer, Homo S.par, Het).


How are markers distributed inside the S.par block?

A couple of possible scenarios:

Case 1: abundant markers supporting the block

Sublime's custom image

❗ Note: Only a few markers in the figure above are represented in the cartoon;

Case 2: few markers supporting the block

Sublime's custom image

❗ Note: you should not exclude the possibility that a large event is supported by a small number of markers, as in the example.

The number of markers supporting the blocks, the marker density and the info concerning the genotype are stored in int and int/AllSegments.

Dependencies

Software

  • FastQC
  • minimap2
  • samtools
  • bcftools
  • GEM v. 1.315 (beta). Note: the GEM version used for the analyses is 1.759 (no longer available).
  • Control-FREEC v. 11.6. The makeGraph.R script was renamed makeplotcnv.R, and a copy of all the scripts in FREEC/scripts/ is in scr. Control-FREEC must nevertheless be installed.
  • A copy of sambamba v. 0.6.5 is provided with the pipeline (no installation required)

R libraries

Find out more

Marker definition Methods

Citations

Please cite this paper when using intropipeline for your publications.

Ancient and recent origins of shared polymorphisms in yeast
Nicolò Tellini, Matteo De Chiara, Simone Mozzachiodi, Lorenzo Tattini, Chiara Vischioni, Elena S. Naumova, Jonas Warringer, Anders Bergström & Gianni Liti
Nature Ecology and Evolution, 2024, https://doi.org/10.1038/s41559-024-02352-5

@article{tellini2024ancient,
  title={Ancient and recent origins of shared polymorphisms in yeast},
  author={Tellini, Nicol{\`o} and De Chiara, Matteo and Mozzachiodi, Simone and Tattini, Lorenzo and Vischioni, Chiara and Naumova, Elena S and Warringer, Jonas and Bergstr{\"o}m, Anders and Liti, Gianni},
  journal={Nature Ecology \& Evolution},
  pages={1--16},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

Release history

  • v1.0 released in 2023
  • v1.1 released in 2024