Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using gc_extrap on mitochondrial target enrichment data #73

Open
eugeniosss opened this issue Apr 18, 2024 · 3 comments
Open

Using gc_extrap on mitochondrial target enrichment data #73

eugeniosss opened this issue Apr 18, 2024 · 3 comments

Comments

@eugeniosss
Copy link

Hello!
Im working with gc_extrap with data has been enriched for mitochondrial genome (MT) and Im unsure what is the best way to proceed, or if is even adequate to use preseq with enriched data. I have bam files that have been generated by mapping to the full genome (i.e. nuclear + mitochondrial). So firstly, what Im doing:

  1. Creata mr file from bam file: to-mr -o ${WHATEVER}.mr ${WHATEVER}.bam
  2. Sort mr file: sort -k 1,1 -k 2,2n -k 3,3n ${WHATEVER}.mr -o ${WHATEVER}_sorted.mr
  3. Intersect the sorted mr file with MT, to get mr file with reads only from MT: bedtools intersect -a ${WHATEVER}_sorted.mr -b MT.bed > ${WHATEVER}_MT.mr
  4. Run preseq gc_extrap on the intersected mr file: preseq gc_extrap -v ${WHATEVER}_MT.mr -o ${WHATEVER}_gc_extrap.txt

Secondly, if I can do this or if there is any other way possible to run gc_extrap with enriched data, can I transform the output to be mean coverage across the mitochondria and number of raw sequenced reads, instead of total number of unique base pairs and total number of sequenced base pairs. I was trying to divide the number of unique base pairs by the size of the mitochondrial genome and the total number of sequenced base pairs by the reads length. However, for one sample its coverage is suppose to saturate at ~2 coverage. However, I have ~110 coverage, empirically calculated with GATK...

Am I doing something wrong, or am I unable to gc_extrap on my data?

@andrewdavidsmith
Copy link
Contributor

@eugeniosss I'm not entirely sure if using preseq with enriched data makes sense. Usually for mitochondria (in data sets I see, where mitochondria is undesirable) the coverage of chrM is so much that it would overwhelm gc_extrap anyway and just indicate that you will saturate relatively early, and that might even cause numerical issues. But there is no need in your steps to do any "intersection", just do grep chrM ${WHATEVER}.mr > ${WHATEVER}.mr.chrM and then skip step 3. You might even just do this with samtools before you even start, so that your BAM format file only includes chrM.

@eugeniosss
Copy link
Author

Thank you very much for your answer, it was helpful! Actually when running some tests, I had some problems with overwhelming preseq, so even had to do subsets. I would like to make a second question then, what about exome enriched data? I forgot to mention that I work with ancient DNA. And also most of the times, for the exome enriched samples, we do captures, sequence at low depth, do an evaluation of the reactions and, if the reactions were successful, we do a high depth sequencing. I was thinking preseq could be a cool thing to integrate in our pipeline. This way we could do an evaluation of the low depth captures and see how much more we could sequenced these. Any thoughts?

@andrewdavidsmith
Copy link
Contributor

The issue with overwhelming preseq emerges when there is saturation and high coverage. I know preseq has been used successfully with exam data using test runs at lower coverage. This has been done with heavy multiplexing so that among a large number of libraries, the majority that "look good" are sequenced deeply, and the rest are re-made. So just like your situation.

But the best tool for this job might be the functionality that's only currently in preseqR. I can't remember right now if all the functions from preseqR have been released in the regular preseq. I had done some coding on that, and I just can't remember if I had finished it. Take a look and I'll try to respond in more detail later when I have more time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants