
Long read bam processing #43

Open
SHuang-Broad opened this issue Sep 8, 2022 · 6 comments

@SHuang-Broad

Hi,

We're experimenting with VerifyBamID2 for contamination estimation on long reads.
The test sample is a 30X PacBio CCS BAM aligned to HG38.

We found that the process is taking a bit longer than expected:

  • it seems to process the BAM using a single thread, and
  • it took approximately 3 hours to process the chr1 reads

I know this is an off-label use, so I would love any suggestions on how to carry out this experiment (including by submitting PRs).

Thank you!
Steve

@Griffan (Owner) commented Sep 9, 2022

Hi @SHuang-Broad,
There was an old ticket for ONT reads that touches on parts of your issue: #32

It's known to be slow (potentially with large memory consumption) to read long-read BAMs with htslib under default settings, due to the calibration of base qualities. If possible, you can provide a pileup of a subset of the BAM file as the input with the "--Pileup" argument, which bypasses the BAM input processing step.
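
Roughly, the pileup-based run would look like this (resource paths here are placeholders, and the exact flag spelling, e.g. --PileupFile, may differ between releases, so please check VerifyBamID --help for your version):

    # Run VB2 on a pre-computed pileup instead of the BAM
    VerifyBamID \
        --PileupFile HG002.100k.pileup \
        --SVDPrefix resource/1000g.phase3.100k.b38.vcf.gz.dat \
        --Reference GRCh38.fa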

Let me know if this workaround resolved your issue.

@SHuang-Broad (Author)

Thanks for pointing me in that direction!


One follow-up question: how do I generate the pileup file, with samtools mpileup?
There's a flag (--OutputPileup) in VerifyBamID, but the intended workflow can't be to run VerifyBamID with that option first, right?


In the meantime, I also ran the following experiment:

  1. subset the 30X CCS BAM using each of the 4 BED files in the resources folder, then
  2. run VerifyBamID on each subset BAM against its matching resource set
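
Concretely, each pair of steps looked roughly like this (paths and resource file names are abbreviated from memory):

    # 1. subset the CCS BAM to the sites BED for one resource set
    samtools view -b -L resource/1000g.phase3.10k.b38.vcf.gz.dat.bed \
        HG002.ccs.30x.bam > HG002.1000g.10k.bam
    samtools index HG002.1000g.10k.bam

    # 2. run VerifyBamID on the subset BAM with the matching SVD prefix
    VerifyBamID \
        --BamFile HG002.1000g.10k.bam \
        --SVDPrefix resource/1000g.phase3.10k.b38.vcf.gz.dat \
        --Reference GRCh38.fa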

It took approximately 4 hours using the 10k sites (1000g and hgdp).
The two runs with the 100k sites are still running (9 hours now).
All runs are single-threaded.

Thanks!
Steve

@SHuang-Broad (Author)

One more question out of curiosity: since you mention the slowness possibly comes from htslib parsing the long reads due to the calibration of qualities, I wonder:

  • would updating to the latest version of htslib help (I'm using the v2.0.1 docker image)?
  • could you point me to where I can read more about this calibration of qualities?

Thank you!
Steve

@Griffan (Owner) commented Sep 11, 2022

"—OutputPileup" is used for debugging purposes in VB2 when the input file is a bam file. I don't think there will be a difference when the linked htslib in samtools and VB2 are the same.

I think you procedure is worth exploring, but you need to disable the quality calibration argument as disscussed in the referenced issue ticket #32 or refer to http://www.htslib.org/doc/samtools-mpileup.html BAQ related options.
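
For example, something along these lines for generating the pileup (file names are illustrative):

    # -B disables BAQ, the per-base quality recalibration that is slow on long reads;
    # -l restricts the pileup to the SVD reference sites so the file stays small
    samtools mpileup -B \
        -f GRCh38.fa \
        -l resource/1000g.phase3.100k.b38.vcf.gz.dat.bed \
        HG002.ccs.bam > HG002.100k.pileup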

@SHuang-Broad (Author)

Hi Fan,

Sorry, I forgot to provide an update.

I ran an experiment with different options, and here's what I found:

VerifyBamID2 on a 50X CCS BAM (GIAB HG002, GRCh38)

setup                   wall-clock time                                   Alpha
no BAQ, 10K sites        35 minutes  (15 minutes prep,  20 minutes VB)    0.00217684
BAQ, 10K sites          100 minutes  (80 minutes prep,  20 minutes VB)    0.00155812
no BAQ, 100K sites      110 minutes  (45 minutes prep,  55 minutes VB)    0.00220078
BAQ, 100K sites         520 minutes (460 minutes prep,  60 minutes VB)    0.00181208

So based on this little experiment,

  • the contamination estimate (Alpha) seems to be affected by BAQ (though the difference does not appear material),
  • both the BAQ option and the number of sites affect CPU time (thread availability was the same throughout)

The WDL pipeline I used is coded here.
Briefly, it does the BAM-to-mpileup conversion per chromosome in parallel and merges the per-chromosome text files afterwards.
VerifyBamID2 then picks up the merged pileup file and does its work with 4 threads.
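
In shell form, the scatter/gather amounts to roughly the following (a sketch of what the WDL does; file names and resource paths are placeholders):

    # Scatter: one mpileup job per chromosome, BAQ disabled, restricted to the SVD sites
    for chr in chr{1..22} chrX chrY; do
        samtools mpileup -B -r "$chr" \
            -f GRCh38.fa \
            -l resource/1000g.phase3.100k.b38.vcf.gz.dat.bed \
            HG002.ccs.50x.bam > "pileup.${chr}.txt" &
    done
    wait

    # Gather: concatenate the per-chromosome pileups in order
    cat pileup.chr{1..22}.txt pileup.chrX.txt pileup.chrY.txt > merged.pileup

    # VerifyBamID2 consumes the merged pileup with 4 threads
    VerifyBamID \
        --PileupFile merged.pileup \
        --SVDPrefix resource/1000g.phase3.100k.b38.vcf.gz.dat \
        --Reference GRCh38.fa \
        --NumThread 4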

Thanks,
Steve

@Griffan (Owner) commented Sep 18, 2022

@SHuang-Broad Thank you very much! This profiling is useful for future reference.
