Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF row validation error on gCNV results #8834

Open
MattWellie opened this issue May 15, 2024 · 9 comments
Open

VCF row validation error on gCNV results #8834

MattWellie opened this issue May 15, 2024 · 9 comments

Comments

@MattWellie
Copy link

MattWellie commented May 15, 2024

if (g.getAlleles().size() == 1 && g.getAllele(0).isNoCall()) {

I'm running into a recurrent issue in JointGermlineCNVSegmentation, running after PostprocessGermlineCNVCalls in a gCNV pipeline. A number of batches are being merged in parallel - some of those succeed, some fail. It's not clear just yet if this is a deterministic failure, I'll re-run a few times and see if I can answer that.

org.broadinstitute.hellbender.exceptions.GATKException: Exception thrown at chrX:6383391 [VC SAMPLE_ID.segments.vcf.gz @ chrX:6383391-17732942 Q3076.53 of type=NO_VARIATION alleles=[N*] attr={END=17732942} GT=GT:CN:NP:QA:QS:QSE:QSS	0:1:581:1:3077:4:20 filters=

...

Caused by: java.lang.IllegalStateException: Encountered genotype with ploidy 1 but 2 alleles.
	at org.broadinstitute.hellbender.utils.Utils.validate(Utils.java:814)
	at org.broadinstitute.hellbender.tools.walkers.sv.JointGermlineCNVSegmentation.correctGenotypePloidy(JointGermlineCNVSegmentation.java:701)
	at org.broadinstitute.hellbender.tools.walkers.sv.JointGermlineCNVSegmentation.prepareGenotype(JointGermlineCNVSegmentation.java:682)

The VCF row in question is

chrX	6383391	CNV_chrX_6383391_17732942	N	.	3076.53	.	END=17732942	GT:CN:NP:QA:QS:QSE:QSS	0:1:581:1:3077:4:20

The characterisation of this row as type=NO_VARIATION alleles=[N*] seems... partially correct? There is no variation at this locus, but I'm not sure why alleles is N*.

In this situation, as I read it, the first clause should be satisfied: 1 allele, and allele is no-call. Instead the variant process is dying in the else side of the condition. Could you clarify if I'm interpreting this correctly?

Relevant versioning:

13:18:38.320 INFO  JointGermlineCNVSegmentation - ------------------------------------------------------------
13:18:38.321 INFO  JointGermlineCNVSegmentation - The Genome Analysis Toolkit (GATK) v4.2.6.1-57-g9e03432-SNAPSHOT
13:18:38.321 INFO  JointGermlineCNVSegmentation - For support and documentation go to https://software.broadinstitute.org/gatk/
13:18:38.321 INFO  JointGermlineCNVSegmentation - Executing as root@hostname-2559a32a6e on Linux v5.19.0-1030-gcp amd64
13:18:38.321 INFO  JointGermlineCNVSegmentation - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_242-8u242-b08-0ubuntu3~18.04-b08
13:18:38.322 INFO  JointGermlineCNVSegmentation - Start Date/Time: March 19, 2024 1:18:37 PM GMT
13:18:38.322 INFO  JointGermlineCNVSegmentation - ------------------------------------------------------------
13:18:38.322 INFO  JointGermlineCNVSegmentation - ------------------------------------------------------------
13:18:38.343 INFO  JointGermlineCNVSegmentation - HTSJDK Version: 2.24.1
13:18:38.343 INFO  JointGermlineCNVSegmentation - Picard Version: 2.27.1
13:18:38.343 INFO  JointGermlineCNVSegmentation - Built for Spark Version: 2.4.5
@gokalpcelik
Copy link
Contributor

Hi @MattWellie
We had a fix for JointGermlineCNVSegmentation in version 4.3.0.0 here in the PR #7779
Can you check if 4.3.0.0 or later solves your problem?

@MattWellie
Copy link
Author

@gokalpcelik you might be my hero ❤️

@gokalpcelik
Copy link
Contributor

If the issue is solved I am closing it. If you still observe problems you may reopen it and we can take a look.

@MattWellie
Copy link
Author

Hi @gokalpcelik
I've re-run the same operation on 4.3, and on 4.5, with the exact same error :/
I've been pointed to #8164 which is the same issue

@mwalker174
Copy link
Contributor

@MattWellie What I suspect is happening is that the ploidy call from gCNV does not match your ped file. Is this sample a sex aneuploidy or have an incorrect ploidy assigned by gCNV? If so, try setting sex to unknown in your ped file (i.e. changing the 1 or 2 to a 0). We should probably improve the default behavior here however.

@MattWellie
Copy link
Author

Thanks @mwalker174 , we did a little digging and may have found a sample swap in the sample set being processed, but the swap is not the sample identified in the error log. I've got a couple more runs going now, hopefully I'll have more information soon.

@MattWellie
Copy link
Author

MattWellie commented May 22, 2024

Hi again, some updates! I've updated our PED with inferred rather than stated sex, which has resolved a number of these issues. As stated before, the samples which had incorrectly assigned sexes were not the named samples appearing the error logs, which made troubleshooting a challenge.

We're now running into issues with true sex aneuploidy, which the tool doesn't work for, as stated in your docs page. Do you have a way to cleanly handle these sex aneuploidies so that we can retain calls for the rest of the genome, rather than excluding the sample(s) entirely? This might be a process question, so I'm going to raise separately on the GATK-SV repo

FYI @cassimons

@MattWellie
Copy link
Author

We'd also be interested in finding out how the tool supports arbitrary autosomal ploidies, if you have any information on that we'd be keen to have a look!

@MattWellie
Copy link
Author

MattWellie commented May 22, 2024

I've pulled the problem VCF and a couple of successful ones locally and I can confirm that when running with 4 VCFs:

  • VCFs that succeeded through JointGermlineCNVSegmentation in our pipeline succeeded for me locally
  • The VCF that was flagged in the error message Exception thrown at chrX:6383391 [VC SAMPLE_ID.segments.vcf.gz ... completes just fine with other successful partners
  • The VCF that was not identified in the error message, but was inferred to be a sex chromosome aneuploidy causes a failure with any combination of other VCFs
  • If there are more than 2 VCFs run together, including the failing VCF/aneuploid sample, the error message indicates the problem originates in a non-aneuploid VCF, which misleading and makes this hard to treat. This behaviour was consistent in 4.5.0.0

Command used in my toy dataset:

gatk --java-options "-Xms4000M -Xmx6000M" JointGermlineCNVSegmentation -R /data/Homo_sapiens_assembly38_masked.fasta -O /data/out.vcf.gz -V /data/SAM1.segments.vcf.gz -V /data/SAM2.segments.vcf.gz -V /data/SAM3.segments.vcf.gz -V /data/SAM4.segments.vcf.gz --model-call-intervals /data/preprocessed.interval_list -ped /data/inferred_sex_pedigree.ped
  • In this configuration, SAM4 is aneuploid, and SAM1 is always the flagged VCF
  • If I remove SAM1 and re-run with 3 VCFs, SAM3 is mentioned in the error message. It's not derived from alphabetical order, first argument specified with -V, or first in the PED file

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants