Using gc_extrap on mitochondrial target enrichment data #73

eugeniosss · 2024-04-18T09:21:20Z

Hello!
Im working with gc_extrap with data has been enriched for mitochondrial genome (MT) and Im unsure what is the best way to proceed, or if is even adequate to use preseq with enriched data. I have bam files that have been generated by mapping to the full genome (i.e. nuclear + mitochondrial). So firstly, what Im doing:

Creata mr file from bam file: to-mr -o ${WHATEVER}.mr ${WHATEVER}.bam
Sort mr file: sort -k 1,1 -k 2,2n -k 3,3n ${WHATEVER}.mr -o ${WHATEVER}_sorted.mr
Intersect the sorted mr file with MT, to get mr file with reads only from MT: bedtools intersect -a ${WHATEVER}_sorted.mr -b MT.bed > ${WHATEVER}_MT.mr
Run preseq gc_extrap on the intersected mr file: preseq gc_extrap -v ${WHATEVER}_MT.mr -o ${WHATEVER}_gc_extrap.txt

Secondly, if I can do this or if there is any other way possible to run gc_extrap with enriched data, can I transform the output to be mean coverage across the mitochondria and number of raw sequenced reads, instead of total number of unique base pairs and total number of sequenced base pairs. I was trying to divide the number of unique base pairs by the size of the mitochondrial genome and the total number of sequenced base pairs by the reads length. However, for one sample its coverage is suppose to saturate at ~2 coverage. However, I have ~110 coverage, empirically calculated with GATK...

Am I doing something wrong, or am I unable to gc_extrap on my data?

andrewdavidsmith · 2024-04-18T14:36:47Z

@eugeniosss I'm not entirely sure if using preseq with enriched data makes sense. Usually for mitochondria (in data sets I see, where mitochondria is undesirable) the coverage of chrM is so much that it would overwhelm gc_extrap anyway and just indicate that you will saturate relatively early, and that might even cause numerical issues. But there is no need in your steps to do any "intersection", just do grep chrM ${WHATEVER}.mr > ${WHATEVER}.mr.chrM and then skip step 3. You might even just do this with samtools before you even start, so that your BAM format file only includes chrM.

eugeniosss · 2024-04-18T15:37:18Z

Thank you very much for your answer, it was helpful! Actually when running some tests, I had some problems with overwhelming preseq, so even had to do subsets. I would like to make a second question then, what about exome enriched data? I forgot to mention that I work with ancient DNA. And also most of the times, for the exome enriched samples, we do captures, sequence at low depth, do an evaluation of the reactions and, if the reactions were successful, we do a high depth sequencing. I was thinking preseq could be a cool thing to integrate in our pipeline. This way we could do an evaluation of the low depth captures and see how much more we could sequenced these. Any thoughts?

andrewdavidsmith · 2024-04-18T16:01:54Z

The issue with overwhelming preseq emerges when there is saturation and high coverage. I know preseq has been used successfully with exam data using test runs at lower coverage. This has been done with heavy multiplexing so that among a large number of libraries, the majority that "look good" are sequenced deeply, and the rest are re-made. So just like your situation.

But the best tool for this job might be the functionality that's only currently in preseqR. I can't remember right now if all the functions from preseqR have been released in the regular preseq. I had done some coding on that, and I just can't remember if I had finished it. Take a look and I'll try to respond in more detail later when I have more time.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using gc_extrap on mitochondrial target enrichment data #73

Using gc_extrap on mitochondrial target enrichment data #73

eugeniosss commented Apr 18, 2024

andrewdavidsmith commented Apr 18, 2024

eugeniosss commented Apr 18, 2024

andrewdavidsmith commented Apr 18, 2024

Using gc_extrap on mitochondrial target enrichment data #73

Using gc_extrap on mitochondrial target enrichment data #73

Comments

eugeniosss commented Apr 18, 2024

andrewdavidsmith commented Apr 18, 2024

eugeniosss commented Apr 18, 2024

andrewdavidsmith commented Apr 18, 2024