Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PrintReads introduces N bases when encoding some CRAMs and changes sequence #8768

Open
1 task
ilyasoifer opened this issue Apr 9, 2024 · 14 comments
Open
1 task

Comments

@ilyasoifer
Copy link
Collaborator

Bug Report

Affected tool(s) or class(es)

gatk PrintReads, picard MarkDuplicates

Affected version(s)

  • 4.4.0.0 (htsjdk 3.0.5, picard 3.0.0)

Description

I apologize in advance if this issue belongs to htsjdk. When we work with some of the CRAMs and pass them through PrintReads or picard MarkDuplicates, "N" bases get introduced.

We think that the problem happens when PrintReads write the CRAM rather than reading it, because if the output of PrintReads is a BAM, it does not happen.

We also noticed that this issue does not happen with earlier GATK (4.2.6.1), HTSJDK 2.24.1.

Happy to share the input files

Steps to reproduce

gatk PrintReads --input ultMerge.mt.cram --output ultMerge.mt.printreads.cram -R /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta
gatk PrintReads --input ultMerge.mt.cram --output ultMerge.mt.printreads.bam -R /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta
samtools view ultMerge.mt.gatk_printreads.cram | grep "038958_1-Z0011-5346565226"
038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	AS:i:54	X1:i:0	XD:Z:CATGAGG_GGTGATC	a3:i:92	bi:Z:5346565226	f1:Z:CATGAGG	f2:Z:GGTGATC	i1:Z:Q44	i2:Z:Q27	pr:i:22	pt:i:15	px:i:3813	py:i:1262	rq:f:0.03	si:i:3750	tm:Z:AQtq:i:195	MD:Z:0T0C0T0T0C0A0C0C0A0A0A0G0A0G0C0C0C0C0T0A0A0A0A0C0C0C0G0C0C0A0C0A0T0C0T0A0C0C0A0T0C0A0C0C0C0T0C0T0A0C0A0T0C0A0	NM:i:54	RG:Z:Z0011


samtools view ultMerge.mt.gatk_printreads.bam | grep "038958_1-Z0011-5346565226"

038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	X1:i:0	f1:Z:CATGAGG	i1:Z:Q44	f2:Z:GGTGATC	i2:Z:Q27	a3:i:92	XD:Z:CATGAGG_GGTGATC	RG:Z:Z0011	AS:i:54	bi:Z:5346565226	si:i:3750	tm:Z:AQ	rq:f:0.03	tq:i:195	pr:i:22pt:i:15	px:i:3813	py:i:1262

Expected behavior

BAM and CRAM outputs should behave the same

Actual behavior

BAM and CRAM outputs behave differently

@gokalpcelik
Copy link
Contributor

These reads don't even look the same ?

GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT
TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA

@ilyasoifer
Copy link
Collaborator Author

That is true! Should I update the title of the issue to reflect this?

@gokalpcelik
Copy link
Contributor

Can you check the REF_CACHE? Or alternatively you may provide samtools the reference fasta during cram view.

@ilyasoifer
Copy link
Collaborator Author

Thanks @gokalpcelik!
Seems that -T does not affect much:

samtools view -T /data2/reference/Homo_sapiens_assembly19_1000genomes_decoy/Homo_sapiens_assembly19_1000genomes_decoy.fasta ultMerge.mt.gatk_printreads.cram | grep "038958_1-Z0011-5346565226"
038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	GTGGTTTTTTTNTNTTTTGTTTTTTTNTTTTTGTGTTTTGTTTTTGTGTTTGTT	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	AS:i:54	X1:i:0	XD:Z:CATGAGG_GGTGATC	a3:i:92	bi:Z:5346565226f1:Z:CATGAGG	f2:Z:GGTGATC	i1:Z:Q44	i2:Z:Q27	pr:i:22	pt:i:15	px:i:3813	py:i:1262	rq:f:0.03	si:i:3750	tm:Z:AQ	tq:i:195	MD:Z:0T0C0T0T0C0A0C0C0A0A0A0G0A0G0C0C0C0C0T0A0A0A0A0C0C0C0G0C0C0A0C0A0T0C0T0A0C0C0A0T0C0A0C0C0C0T0C0T0A0C0A0T0C0A0	NM:i:54	RG:Z:Z0011

@ilyasoifer ilyasoifer changed the title PrintReads introduces N bases when encoding some CRAMs PrintReads introduces N bases when encoding some CRAMs and changes sequence Apr 9, 2024
@cmnbroad
Copy link
Collaborator

cmnbroad commented Apr 9, 2024

@ilyasoifer Is there any way I can access the original cram (or better yet, a small subset thereof consisting of just MT) that illustrates this issue) and the reference ? It might be hard to debug without that. If thats not possible, a few suggestions: can you try using PrintReads to write the original cram (I would try just MT) first to a cram, then to a sam, and also the original cram to a sam, and see how those compare? It would also be useful to see what that read looks like if you use samtools view on the ORIGINAL cram. Do you know what software/version was used to write the original cram ?

@ilyasoifer
Copy link
Collaborator Author

Hi @cmnbroad - thanks! I can upload to one of the buckets that are shared between Ultima and the Broad.
Could you provide your gcp account so I can give you permissions?
I am guessing that Megan Shand can share it through our joint slack channel if you prefer.

The original file was created using samtools v1.17.
I provided above the result of writing BAM and Cram with PrintReads. Is it helpful or you prefer SAM? And if I just do
samtools view ultMerge.mt.cram I get

038958_1-Z0011-5346565226	0	MT	3470	60	54M	*	0	0	TCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCA	DDDDDDDDDDDDDDDDDDD:DD:DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD	bi:Z:5346565226	rq:f:0.03	pr:i:22	pt:i:15	px:i:3813	py:i:1262	si:i:3750	tq:i:195	tm:Z:AQ	i1:Z:Q44	f1:Z:CATGAGG	i2:Z:Q27	f2:Z:GGTGATC	a3:i:92	XD:Z:CATGAGG_GGTGATC	X1:i:0	AS:i:54	MD:Z:54	NM:i:0	RG:Z:Z0011

@cmnbroad
Copy link
Collaborator

cmnbroad commented Apr 9, 2024

@ilyasoifer cnorman@broadinstitute.org. And don't worry about doing the PrintReads conversions I requested - if I have access to the original file and the reference I can debug this directly.

@ilyasoifer
Copy link
Collaborator Author

@cmnbroad, OK, great. Shared the files with Megan, she will transfer them over!

@ilyasoifer
Copy link
Collaborator Author

@cmnbroad - the reference is from here: gs://gcp-public-data--broad-references/Homo_sapiens_assembly19_1000genomes_decoy/

@cmnbroad
Copy link
Collaborator

cmnbroad commented Apr 9, 2024

Thanks for reporting this @ilyasoifer - it looks like a serious bug. I see whats going on and am working on a fix.

@ilyasoifer
Copy link
Collaborator Author

Thank you, good to hear that it was useful

@ilyasoifer
Copy link
Collaborator Author

@cmnbroad - hope all is well. I was wondering if there is any ETA when the fix that you made will be released?

Thanks!
Ilya

@droazen
Copy link
Collaborator

droazen commented May 10, 2024

@ilyasoifer It will be part of the GATK 4.6 release, which should come out this month. We've also implemented a cram scanning tool that users can use to scan their crams to see if any are affected.

@ilyasoifer
Copy link
Collaborator Author

Ah, great, thanks for updating me!

@droazen droazen self-assigned this May 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants