intropipeline

NEWS:

🚀 v1.1 has been released, with several improvements in stability, speed and memory consumption.

An automated computational framework for detecting Saccharomyces paradoxus introgressions in Saccharomyces cerevisiae strains from paired-end Illumina sequencing data.

Description

v1.0 is described in Tellini et al. 2024, Nat. Ecol. Evol., for detecting S.par introgressions in S.cer strains.

v1.1 contains the following changes:

minimap2 replaced bwa mem, almost halving the running time (see Heng Li 2018, Bioinformatics) while achieving comparable results. Example:

sample: ERR3010122

threads: 2

Architecture: x86_64

CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz

script	Elapsed time (m:ss)	Maximum resident set size (GB)
bwa mem + samtools (v1)	6:21	1.3
minimap2 + samtools (v1.1)	3:36	1.3

improved the reproducibility of the mapping by implementing the standard samtools workflow according to the samtools guidelines;
improved the robustness of the mapping by appending the name of each mapped strain to a checkpoint (cps) file (./cps/cps.txt). Strains whose names are stored in ./cps/cps.txt will not be mapped again;
introduced data.table, lapply and custom functions for large-file manipulation to reduce runtime and RAM load. Example:

sample: ERR3010122

threads: 2

Architecture: x86_64

CPU: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz

script	Elapsed time (m:ss)	Maximum resident set size (GB)
parser_marker.r (v1)	0:17	0.8
parser_marker.r (v1.1)	0:06	0.5
clrs.r (v1)	0:49	1.9
clrs.r (v1.1)	0:17	0.7

introduced the variables nSamples and nThreads inside runner.sh. The first controls the number of samples to run in parallel and the second the number of threads per sample. nSamples guarantees a constant number of samples running in parallel: as soon as one sample finishes, another starts. These variables affect the scripts minimap2.sh (which replaces bwa.sh), bcftools_markers.sh (which replaces samtools_marker.sh) and freec.sh;
corrected an error that prevented the detection of the CNVs;
Added a new approach for merging markers in blocks:

In v1 the markers are (1) genotyped, (2) filtered and (3) joined as long as they are consecutive and carry the same information. These steps are retained in v1.1.

In v1.1 the markers are (1) ranked, (2) genotyped, (3) filtered and (4) joined as long as they are consecutive in the ranking and carry the same information; v1 did not use the ranking. Inevitably, this results in a more fragmented signal, but it gives a more realistic and faithful representation of the introgression, reflecting regions where the genotyping was either discordant or failed. The ranking is also what allowed the speedup of clrs.r (the script that generates the blocks).
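
The v1.1 joining rule can be illustrated with a small sketch. It assumes a hypothetical four-column table (rank, chromosome, position, genotype) with made-up genotype labels, and it is not the code in clrs.r. Markers are joined into a block only while their ranks are consecutive and their genotype is unchanged, so a marker lost to filtering splits the block.

```shell
#!/bin/sh
# Hypothetical input: rank, chromosome, position, genotype (tab-separated).
# Rank 4 is absent, as if that marker had been removed by filtering.
markers=$(printf '%s\t%s\t%s\t%s\n' \
  1 chrI 100 Spar \
  2 chrI 250 Spar \
  3 chrI 400 Spar \
  5 chrI 900 Spar \
  6 chrI 950 Scer)

# Extend the current block while the rank increases by exactly 1 and the
# chromosome and genotype stay the same; otherwise close it and open a new one.
# Output columns: chromosome, start, end, genotype, number of markers.
blocks=$(printf '%s\n' "$markers" | awk -F'\t' -v OFS='\t' '
  function emit() { if (n > 0) print chr, start, end, geno, n }
  {
    if (n > 0 && $1 == prev + 1 && $2 == chr && $4 == geno) {
      end = $3; n++
    } else {
      emit()
      chr = $2; start = $3; end = $3; geno = $4; n = 1
    }
    prev = $1
  }
  END { emit() }')

printf '%s\n' "$blocks"
```

With this input the missing rank 4 separates the first three S.par markers from the fourth, which is the fragmentation described above.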

Download

git clone --recursive https://github.com/nicolo-tellini/intropipeline.git

Content

📂 :

.
├── rep
│   ├── Ann
│   └── Asm
├── runner.sh
├── scr
└── seq

5 directories, 1 file

rep : repository with assemblies, annotations and the pre-computed marker table
runner.sh : the script you edit and run
scr : scripts
seq : put the FASTQ files here

Before starting

gzip -d ./rep/mrktab.gz

gzip -d ./rep/Asm/*gz

About the fastqs

Move the FASTQs inside ./seq/

Paired-end FASTQ files must be gzipped and suffixed with .R1.fastq.gz and .R2.fastq.gz.
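
Before launching a run, it can help to confirm that every sample in ./seq has both mates. A minimal sketch, assuming the suffixes above; the function name check_pairs is hypothetical and not part of the pipeline:

```shell
#!/bin/sh
# check_pairs DIR: print the name of every sample in DIR that has an
# .R1.fastq.gz file but no matching .R2.fastq.gz file.
check_pairs() {
  for r1 in "$1"/*.R1.fastq.gz; do
    [ -e "$r1" ] || continue            # no R1 files: the glob stayed literal
    sample=$(basename "$r1" .R1.fastq.gz)
    [ -e "$1/$sample.R2.fastq.gz" ] || printf '%s\n' "$sample"
  done
}

check_pairs ./seq
```

An empty output means no R1 file is missing its R2 mate.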

Default

./scr/bwa.sh uses 2 threads per sample (n.samples = 2).

./scr/samtools_markers.sh uses 1 thread per sample (n.samples = 4).

./scr/gem.sh uses 2 threads.

./scr/freec.sh uses 4 threads.

These values can be changed by editing the scripts.
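
The combination of a fixed number of samples in flight and a per-sample thread count can be pictured with xargs -P. This is a sketch of the general technique, not the scheduler used by the pipeline. The sample names are hypothetical, the echo stands in for a real per-sample command, and -P is a GNU/BSD xargs extension rather than POSIX.

```shell
#!/bin/sh
nSamples=2   # samples processed at the same time
nThreads=2   # threads given to each sample

# Hypothetical sample list, one name per line.
samples=$(printf '%s\n' S1 S2 S3 S4)

# xargs keeps up to nSamples jobs running and starts the next sample
# as soon as one of them finishes.
log=$(printf '%s\n' "$samples" |
  xargs -P "$nSamples" -I{} sh -c 'echo "sample=$1 threads=$2"' _ {} "$nThreads")

# Completion order is not deterministic, so sort before printing.
printf '%s\n' "$log" | sort
```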

How to run

Edit runner.sh 📃

#!/bin/bash

#####################
### user settings ###
#####################

## S. paradoxus reference assembly

ref2Label="CBS432" ## choose the Spar assembly that best fits the origin of your samples

## short labels (used to name file)

ref2="EU" ## choose a short name for Spar

# STEP 1
fastqQC="yes" ## fastqc control (required) ("yes", "no" or "-"; "-" skips the step)

# STEP 2
shortReadMapping="yes" ## ("yes","no")

# STEP 3
mrkgeno="yes" ## ("yes","no")

# STEP 4
cnv="yes" ## ("yes","no")

# STEP 5
intro="yes" ## ("yes","no")

#####################
### settings' end ###
#####################

Run runner.sh 🏃

nohup bash runner.sh &
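
Since v1.1 records mapped strains in ./cps/cps.txt, launching runner.sh again should skip them. That checkpoint logic can be sketched as follows, assuming one strain name per line (the file format is not documented here). map_once is a hypothetical helper, and the echo stands in for the real mapping command.

```shell
#!/bin/sh
# map_once STRAIN CPS_FILE: skip STRAIN if it is already listed in CPS_FILE,
# otherwise "map" it and append its name to CPS_FILE.
map_once() {
  strain=$1
  cps=$2
  mkdir -p "$(dirname "$cps")"
  touch "$cps"
  # -x: whole-line match, -F: fixed string, -q: quiet
  if grep -qxF -- "$strain" "$cps"; then
    echo "skip $strain"
  else
    echo "map $strain"                  # placeholder for the mapping step
    printf '%s\n' "$strain" >> "$cps"   # record the strain only after mapping
  fi
}

cps_demo=$(mktemp -d)/cps/cps.txt
map_once ERR3010122 "$cps_demo"   # first call maps the sample
map_once ERR3010122 "$cps_demo"   # second call skips it
```

Under the same assumption, deleting a strain's line from ./cps/cps.txt would make it eligible for mapping again.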

The result

The results concerning the introgressions are stored in ./int

[Figure: example result for an Alpechin strain]

How to interpret the result

Blue-Red plots provide an overview of potentially introgressed DNA across the genome. Interpreting the results requires integrating the different data the pipeline produces.

❗ Reminder: blocks are defined as consecutive markers bearing the same genomic info (Homo S.cer, Homo S.par, Het).

How are markers distributed inside the S.par block?

A couple of possible scenarios:

Case 1: abundant markers supporting the block

❗ Note: only a few of the markers in the figure above are represented in the cartoon.

Case 2: few markers supporting the block

❗ Note: you should not exclude the possibility that a large event is supported by a low number of markers, as in the example.

The number of markers supporting the blocks, the marker density and the info concerning the genotype are stored in int and int/AllSegments.

Dependencies

Software

FastQC
minimap2
samtools
bcftools
GEM v. 1.315 (beta). Note: the GEM version used for the analyses is 1.759 (no longer available).
Control-FREEC v. 11.6; the makeGraph.R script was renamed makeplotcnv.R. A copy of all the scripts in FREEC/scripts/ is in scr; nevertheless, freec has to be installed
A copy of sambamba v. 0.6.5 is provided with the pipeline (no installation required)
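
A quick way to confirm that the command-line tools are reachable before a run is sketched below. The executable names fastqc, minimap2, samtools, bcftools and Rscript are the usual ones for these tools but are assumed here. GEM and Control-FREEC are left out because their binary names depend on the installation. missing_tools is a hypothetical helper.

```shell
#!/bin/sh
# missing_tools NAME...: print each NAME that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

missing=$(missing_tools fastqc minimap2 samtools bcftools Rscript)
if [ -n "$missing" ]; then
  printf 'not found on PATH:\n%s\n' "$missing"
else
  echo "all listed tools found"
fi
```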

R libraries

data.table
ggplot2
rtracklayer
R.filesets
GenomicRanges
purrr
dplyr
R.utils

Find out more

Marker definition Methods

Citations

Please cite this paper when using intropipeline for your publications.

Ancient and recent origins of shared polymorphisms in yeast
Nicolò Tellini, Matteo De Chiara, Simone Mozzachiodi, Lorenzo Tattini, Chiara Vischioni, Elena S. Naumova, Jonas Warringer, Anders Bergström & Gianni Liti
Nature Ecology and Evolution, 2024, https://doi.org/10.1038/s41559-024-02352-5

@article{tellini2024ancient,
  title={Ancient and recent origins of shared polymorphisms in yeast},
  author={Tellini, Nicol{\`o} and De Chiara, Matteo and Mozzachiodi, Simone and Tattini, Lorenzo and Vischioni, Chiara and Naumova, Elena S and Warringer, Jonas and Bergstr{\"o}m, Anders and Liti, Gianni},
  journal={Nature Ecology \& Evolution},
  pages={1--16},
  year={2024},
  publisher={Nature Publishing Group UK London}
}

Release history

v1.0 released in 2023
v1.1 released in 2024
