Graphtyper may introduce overlapped SVs in one individual #128

Open
jxcao98 opened this issue May 18, 2023 · 0 comments
jxcao98 commented May 18, 2023

Hi,

Thanks for your wonderful tool.

I plan to perform joint calling on 1300 individuals (30X WGS) with the Manta + Svimmer + Graphtyper pipeline. I have run Manta on each individual and merged the Manta VCFs using Svimmer. Before running the whole cohort, I tested Graphtyper on chromosome 1 of 300 individuals. Everything seems to work fine except for two minor issues:

1. --avg_cov_by_readlen does not seem to work for me
I first ran Graphtyper without any additional parameters. It was fast in most genomic regions but very slow in others. The "problem regions" usually contained bases with extremely deep coverage (e.g. 3000x), so I concluded that --avg_cov_by_readlen is needed to save time. Unfortunately, the running speed did not seem to improve when I provided avg_cov_by_readlen as you suggested. I would not expect it to take so long after downsampling. Could you kindly explain how the downsampling works? Does it downsample to the average sequencing depth, or is the retained depth still very high?
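A minimal sketch of how the per-sample value could be derived, assuming that --avg_cov_by_readlen expects average coverage divided by read length, which is approximately the total number of reads divided by the total reference length. The function name and the toy input are hypothetical; the input format is the four-column text printed by samtools idxstats (name, length, mapped reads, unmapped reads).

```python
def avg_cov_by_readlen(idxstats_text: str) -> float:
    """Approximate coverage / read length as total reads / total reference length.

    Assumes `idxstats_text` is tab-separated `samtools idxstats` output:
    sequence name, sequence length, mapped reads, unmapped reads.
    """
    total_reads = 0
    total_length = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        if name == "*":
            # The final "*" row has no reference sequence; skip it.
            continue
        total_reads += int(mapped) + int(unmapped)
        total_length += int(length)
    if total_length == 0:
        raise ValueError("no reference sequences found")
    return total_reads / total_length


# Hypothetical toy input, not real data.
example = "chr1\t1000\t180\t20\nchr2\t1000\t200\t0\n*\t0\t0\t50\n"
value = avg_cov_by_readlen(example)
print(value)
```

For a 30X sample with 150 bp reads the expected value is about 30 / 150 = 0.2, which is a quick sanity check on whatever file is passed to the option.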

2. Overlapping SVs are reported in one individual
I combined the Graphtyper results and kept only high-quality SVs according to the SI of your NC paper. I then wanted to assess how much re-genotyping improves on per-individual SV detection. In the process, I found that two overlapping SVs are sometimes reported for the same individual.

Take individual B2128 as an example. The following two records from the joint VCF overlap (GT_DEL2 lies entirely within GT_DEL1), yet both are genotyped as 0/1 in this individual.

GT_DEL1:
chr1 14109813 chr1:14109813:DG N <DEL:SVSIZE=1933:AGGREGATED> 61898 PASS ABHet=0.3929;ABHom=0.9957;AC=1;AF=0.5634;AN=2;END=14111746;MaxAAS=38;MaxAASR=1;MaxAltPP=19;NHet=215;NHomAlt=128;NHomRef=75;NUM_MERGED_SVS=257;PASS_AC=423;PASS_AN=760;PASS_ratio=0.9091;QD=19.39;RefLen=1;SVLEN=1933;SVMODEL=AGGREGATED;SVSIZE=1933;SVTYPE=DEL;SV_ID=3;SeqDepth=10798;VarType=DG;NumCollapsed=1;NumConsolidated=0;CollapseId=31.0 GT:FT:AD:MD:DP:RA:PP:GQ:PL 0/1:PASS:11,8:0:19:11,8:8:99:125,0,200

GT_DEL2:
chr1 14109855 chr1:14109855:DG N <DEL:SVSIZE=305:AGGREGATED> 52871 PASS ABHet=0.5124;ABHom=0.9929;AC=1;AF=0.5598;AN=2;END=14110160;MaxAAS=51;MaxAASR=1;MaxAltPP=51;NHet=212;NHomAlt=128;NHomRef=78;NUM_MERGED_SVS=277;PASS_AC=438;PASS_AN=766;PASS_ratio=0.9163;QD=15.46;RefLen=1;SVLEN=305;SVMODEL=AGGREGATED;SVSIZE=305;SVTYPE=DEL;SV_ID=4;SeqDepth=12914;VarType=DG;PctSeqSimilarity=0;PctSizeSimilarity=0.1578;PctRecOverlap=0.1582;SizeDiff=1628;StartDistance=-42;EndDistance=1586;GTMatch=.;TruScore=10;MatchId=31.0 GT:FT:AD:MD:DP:RA:PP:GQ:PL 0/1:PASS:10,21:0:31:10,21:21:99:150,0,125
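The relationship between the two records can be checked with a short sketch. It uses only POS, END and GT from the records above; the `SV` type and the helper names are hypothetical and are not part of Graphtyper.

```python
from typing import NamedTuple


class SV(NamedTuple):
    name: str
    start: int  # VCF POS
    end: int    # INFO/END
    gt: str     # genotype of the sample of interest


def overlap_bp(a: SV, b: SV) -> int:
    """Number of shared bases between two deletions (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def is_nested(inner: SV, outer: SV) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def carries_alt(gt: str) -> bool:
    """True if the genotype contains at least one non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


gt_del1 = SV("GT_DEL1", 14109813, 14111746, "0/1")
gt_del2 = SV("GT_DEL2", 14109855, 14110160, "0/1")

shared = overlap_bp(gt_del1, gt_del2)
fraction_of_larger = round(shared / (gt_del1.end - gt_del1.start), 4)
both_called = carries_alt(gt_del1.gt) and carries_alt(gt_del2.gt)

print(shared)                         # 305
print(fraction_of_larger)             # 0.1578
print(is_nested(gt_del2, gt_del1))    # True
print(both_called)                    # True
```

The 305 shared bases are the whole of GT_DEL2, and 305 / 1933 matches the PctSizeSimilarity=0.1578 annotation on the second record.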

I checked the raw Manta results for this individual with tabix ./Manta/B2128/results/variants/diploidSV.vcf.gz chr1:14109000-14112000. There are three deletions: one very long deletion and two short ones:

Manta_DEL1:
chr1 789425 MantaDEL:13:8098:8099:0:0:0 T <DEL> 25 MaxDepth END=224012321;SVTYPE=DEL;SVLEN=-223222896;IMPRECISE;CIPOS=-224,224;CIEND=-116,117 GT:FT:GQ:PL:PR 0/1:PASS:25:75,0,388:26,8

Manta_DEL2:
chr1 14109855 MantaDEL:2519:2:3:0:0:0 AGTTTCCCTTTACTTTTCTGATAGCTGTAAAAATCTGTTCTAAAAGGATGGAGGCATTTTTTTCCCCTACCATTTCTCTATAGCCTATGTTAATTTTGCTCTTTTCTTGCCACCCAATTTTGTTCTCTTCAGTCCTGTTCTCAGCTGGATCCTGGTGTTTCACTCCACATATTGAATAAGCAAAGCAATAATATGTTGTGATTAATAAGTGGCTTGACAGGCAGGAAAAAAGAAAATCTTATTCATTGCATCAGTGGTGCTGTGCAAATGCACTGTTTTTGAAAAATGCATTTAGTCAGCAGGGTG A 423 PASS END=14110160;SVTYPE=DEL;SVLEN=-305;CIGAR=1M305D;CIPOS=0,4;HOMLEN=4;HOMSEQ=GTTT GT:FT:GQ:PL:PR:SR 0/1:PASS:174:473,0,171:11,0:13,11

Manta_DEL3:
chr1 14110577 MantaDEL:2519:1:4:0:0:0 G <DEL> 168 PASS END=14112444;SVTYPE=DEL;SVLEN=-1867;CIPOS=0,2;CIEND=0,2;HOMLEN=2;HOMSEQ=GT GT:FT:GQ:PL:PR:SR 0/1:PASS:168:218,0,496:19,6:26,4
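The region query above returns Manta_DEL1 even though it starts at 789425, because its span reaches past the queried window. A minimal sketch of that interval logic follows, assuming the index treats INFO/END as the end of a symbolic record; the function name is hypothetical and the coordinates are copied from the three records.

```python
def overlaps_region(pos: int, end: int, region_start: int, region_end: int) -> bool:
    """True if the closed interval [pos, end] intersects [region_start, region_end]."""
    return pos <= region_end and end >= region_start


# (POS, END) taken from the three Manta records above.
manta_calls = {
    "Manta_DEL1": (789425, 224012321),
    "Manta_DEL2": (14109855, 14110160),
    "Manta_DEL3": (14110577, 14112444),
}

region_start, region_end = 14109000, 14112000  # chr1:14109000-14112000

hits = [
    name
    for name, (pos, end) in manta_calls.items()
    if overlaps_region(pos, end, region_start, region_end)
]
print(hits)  # ['Manta_DEL1', 'Manta_DEL2', 'Manta_DEL3']
```

Note that Manta_DEL1 carries the MaxDepth filter rather than PASS, so a PASS-only comparison would leave just the two short deletions.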

Here, GT_DEL2 is inherited from Manta_DEL2, but GT_DEL1, a 1933 bp deletion, is absent from the initial Manta results. I think this is reasonable, because GT_DEL1 comes from other individuals in my cohort. In other words, this may be an advantage of joint genotyping. However, these two nested SVs do appear in the same individual, so how should they be interpreted? Do they represent a complex event? Or is this simply an artifact of genotyping, in which case one of the SVs should be excluded?

This is a bit like Figure 2 of your NC paper, where Graphtyper substantially improves on the recall of Manta, but at the expense of accuracy (of course, an overlapping SV is not necessarily a false positive). However, overlapping SVs within the same sample make it hard to compare detection tools fairly and to run cohort-level statistical tests. What do you think about this?

This is just my humble opinion. Graphtyper is great, and thanks again for the useful tool!

Sincerely,
Jixin
