Skip to content
/ CoalQC Public

CoalQC:Quality assessment of genomic datasets for Coalescent inferences

Notifications You must be signed in to change notification settings

ceglab/CoalQC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

coalQC : Quality assessment of genomic datasets for Coalescent inferences.

Setting Up the Environment for running the wrapper and scripts.

Follow these steps:
 cd coalqc/scripts 
 pwd 

copy the path where all the scripts are present and use export path to "coalpath" to save in your working environment. i.e. path which you will obtain after executing pwd in your coalqc/scripts folder.

 export coalpath={paste the path here} 

Now, you have setup the environment for the scripts. Add the location of the coalqc folder to your PATH variable, you can then execute coalqc from anywhere on your computer.

Step 1: Preparing the bam alignment file.

Prior to assessing the quality of genomic regions for performing coalescent analysis, we need to prepare an alignment of the short read data against a reference genome in the bam format. While this can be done using a short-read alignment program of your choice, we recommend using the coalmap command that uses the bwa read mapper to perform the read mapping. The following programs need to be installed and present in the PATH: bwa, samtools

coalqc map -g genome -f fastqread1 -r fastqread2 -p prefix -n threads

The coalmapscript will perform indexing of genome fasta file using the bwa index command and will also align the short read data to the genome using the bwa mem command. You may use the coalmapmulti.sh command to map the reads from multiple individuals.

As an example, we will download the genome assembly of the worm Caenorhabditis elegans from the NCBI website and use the publicly available short read data from the wild isolate PB306.

#Download genome reference file


wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/985/GCF_000002985.6_WBcel235/GCF_000002985.6_WBcel235_genomic.fna.gz

#Uncompress the reference fasta file


gunzip GCF_000002985.6_WBcel235_genomic.fna.gz

#Download short read data files


wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR306/006/ERR3063486/ERR3063486_1.fastq.gz


wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR306/006/ERR3063486/ERR3063486_2.fastq.gz

#Use coalmap command to index the genome and map the short read data.


coalqc map -g GCF_000002985.6_WBcel235_genomic.fna -f ERR3063486_1.fastq.gz -r ERR3063486_2.fastq.gz -p cel_PB306_ERR3063486 -n 10

The following information about the genome assembly used and percent of reads mapped to the genome are reported.

Percent of N's or hardmasked bases is 0.

Analysis will be performed by using this fasta file but make sure you are not using hardmasked genome.

Now, mapping the reads to the reference genome.

Mapping finished

The mapped percentage for this assembly is 98.91%.

Now, calculating coverage.

The mean coverage of reads is 27.1474

Step 2: Repeat Identification and creating input for repeat analysis.

#Download the repeat masker output file from NCBI (or generate your own by running repeat masker).

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/985/GCF_000002985.6_WBcel235/GCF_000002985.6_WBcel235_rm.out.gz

Uncompress the repeat masker output and convert it to a BED format file. Note that the repeat type is encoded as a the fourth column.

gunzip GCF_000002985.6_WBcel235_rm.out.gz
cat GCF_000002985.6_WBcel235_rm.out|tail -n +4|awk '{print $5,$6,$7,$11}'|sed 's/ /\t/g' > GCF_000002985.6_WBcel235_rm.bed

Having created the read alignment file in bam format and the repeat location and type file in bed format, we are ready to start running the coalqc repeat module.

Step 3: Performing Repeat Effect Analysis.

coalqc repeat -g GCF_000002985.6_WBcel235_genomic.fna -b cel_PB306_ERR3063486/cel_PB306_ERR3063486.sort.bam -p cel_PB306_ERR3063486_repeats -n 10 -m GCF_000002985.6_WBcel235_rm.bed -t 0.1 -u 2.7e-09 -P 1*20+10*5+5*2+1 

Scripts:

coalqc - Wrapper Script.

coalmap - Single mapping module of coalqc.

coalmapmulti - Multiple mapping module of coalqc.

coalrep - Repeat effect module of coalqc.

coalcov - Coverage effect module of coalqc.

coalcall - Callability effect module of coalqc.

About

CoalQC:Quality assessment of genomic datasets for Coalescent inferences

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published