VBID2 may be underestimating intra-family contamination when applied to long reads (CCS) #46

SHuang-Broad · 2022-10-03T14:50:43Z

Hello Fan,

Continuing our effort to incorporate VBID2 into our pipeline for contamination estimation (see #43), we performed some experiments to evaluate its performance, given the fact that the tool was developed with short reads in mind.

Let me state the summary first, and post the experiment in subsequence posts.

Summary

In summary, it seems that VBID2 is underestimating contamination levels when

the contamination is intra-family, and
data type is CCS long reads,
and this may be caused by coverage estimation (or read picking) issues.

I am not sure how much time you have for addressing this issue.
But if you do, I can do more experiments as needed.

Thank you!

Steve

SHuang-Broad · 2022-10-03T14:51:07Z

Initial discovery: mostly fine, but seems to underestimate familial mixtures

In the initial experiment, we created in siloco mixture of CCS BAM files with different mixture levels.
See the table here.
As you can see, most of the estimations from VBID2 are close to the expected contamination levels.

mixture	expected_minor_pct	vbid2_contam_est	cov	major	minor
diff.pop.FF.0.1-1	0.0878	0.0878324	11.63	NA19238	HG00732
diff.pop.FF.0.2-1	0.1621	0.15659	12.65	NA19238	HG00732
diff.pop.FF.0.5-1	0.3317	0.315044	15.86	NA19238	HG00732
diff.pop.FF.1-1	0.4986	0.5	21.14	NA19238	HG00732
diff.pop.MF.0.1-1	0.0909	0.0866854	11.66	NA19238	HG00731
diff.pop.MF.0.2-1	0.1667	0.158202	12.72	NA19238	HG00731
diff.pop.MF.0.5-1	0.3333	0.317689	15.9	NA19238	HG00731
diff.pop.MF.1-1	0.4998	0.434521	21.19	NA19238	HG00731
diff.pop.MM.0.1-1	0.0946	0.0935125	9.52	NA19239	HG00731
diff.pop.MM.0.2-1	0.1641	0.158815	10.31	NA19239	HG00731
diff.pop.MM.0.5-1	0.33	0.314071	12.86	NA19239	HG00731
diff.pop.MM.1-1	0.4962	0.5	17.09	NA19239	HG00731
father.daughter.0.1-1	0.0914	0.045551	11.6	HG00733	HG00731
father.daughter.0.2-1	0.1675	0.086958	12.66	HG00733	HG00731
father.daughter.0.5-1	0.3346	0.230075	15.84	HG00733	HG00731
father.daughter.1-1	0.5012	0.360033	21.13	HG00733	HG00731
father.son.0.03-1	0.0294	0.0308993	38.04	NA24385	NA24149
father.son.0.1-1	0.0908	0.0911196	5.84	NA24385	NA24149
father.son.0.2-1	0.1664	0.159776	6.38	NA24385	NA24149
father.son.0.5-1	0.3338	0.320519	7.97	NA24385	NA24149
father.son.1-1	0.5005	0.498786	10.64	NA24385	NA24149
mother.daughter.0.1-1	0.0882	0.0862523	11.56	HG00733	HG00732
mother.daughter.0.2-1	0.1628	0.154435	12.59	HG00733	HG00732
mother.daughter.0.5-1	0.3329	0.314413	15.8	HG00733	HG00732
mother.daughter.1-1	0.5	0.426094	21.08	HG00733	HG00732
mother.son.0.03-1	0.0289	0.0323633	38.03	NA24385	NA24143
mother.son.0.1-1	0.091	0.0916356	9.34	NA24385	NA24143
mother.son.0.2-1	0.1668	0.167038	10.19	NA24385	NA24143
mother.son.0.5-1	0.3336	0.317348	12.74	NA24385	NA24143
mother.son.1-1	0.5003	0.418631	16.99	NA24385	NA24143
pure.HG00731		0.00157375	10.59	HG00731
pure.HG00732		0.00281968	14.64	HG00732
pure.HG00733		0.00241158	76.37	HG00733
pure.NA19238		0.00195742	10.6	NA19238
pure.NA19239		0.000970027	8.61	NA19239
pure.NA19240		0.00257564	31.16	NA19240
pure.NA24143		0.000532009	8.5	NA24143
pure.NA24149		0.000589194	5.32	NA24149
pure.NA24385		0.00208284	36.92	NA24385
same.pop.unrelated.0.1-1	0.0879	0.0888993	11.61	HG00731	HG00732
same.pop.unrelated.0.2-1	0.1622	0.158542	12.64	HG00731	HG00732
same.pop.unrelated.0.5-1	0.3319	0.318406	15.85	HG00731	HG00732
same.pop.unrelated.1-1	0.4988	0.5	21.13	HG00731	HG00732

However, notice the mixture that involves parental-daughter mixtures, where VBID2 meaningfully underestimates the contamination levels.

This table was generated in the following procedure

the BAM files that created the mixtures were downsampled to various levels with samtools.
VBID2 was run in the pipeline documented here, with
- the BAQ generation turned off, and
- the 1KG 10k sites BED file used.

SHuang-Broad · 2022-10-03T14:51:28Z

Maybe the downsampling is wrong? Merging the BAMs directly

In order to avoid any artifacts that could have been introduced in the downsampling step and the pileup conversion step in the pipeline, we simply merged the bams for the three HGSVC families (CHS, PUR, and YRI) files without any downsampling, and ran VBID2 directly on the merged BAMs (re-header-ed by changing the sample names to avoid an exception).

The results are similar but also different: notice the PUR family has VBID2 overestimates the contamination.

family	total_cov	expected_minor_pct	vbid2_contam_est
CHS_father_daughter	47.97	0.138	0.119902
CHS_mother_daughter	49.05	0.157	0.120124
PUR_father_daughter	86.96	0.122	0.195355
PUR_mother_daughter	91.01	0.161	0.189379
YRI_father_daughter	39.77	0.216	0.110855
YRI_mother_daughter	41.76	0.254	0.111905

And I started to notice that the coverage calculation by VBID2 is a little strange: for all the 6 mixtures in this section, here's the VBID2 coverage estimation output

family	total_cov	vbid2_cov_mean	vbid2_cov_sd
CHS_father_daughter	47.97	28.39	15.45
CHS_mother_daughter	49.05	29.39	15.46
PUR_father_daughter	86.96	42.62	31.37
PUR_mother_daughter	91.01	42.69	21.32
YRI_father_daughter	39.77	21.84	11.29
YRI_mother_daughter	41.76	21.93	11.30

SHuang-Broad · 2022-10-03T14:52:04Z

Using BAMs from HGSVC2

Suspecting that the BAM files we used may have some issues, we performed the above step using the data generated by HGSVC2.
The results are similar too

family	total_cov	expected_minor_pct	vbid2_contam_est
CHS_father_daughter	54.15	0.453	0.166
CHS_mother_daughter	65.21	0.376	0.165
PUR_father_daughter	66.45	0.499	0.158
PUR_mother_daughter	57.72	0.423	0.175
YRI_father_daughter	54.41	0.477	0.162
YRI_mother_daughter	53.44	0.468	0.163

family	total_cov	vbid2_cov_mean	vbid2_cov_sd
CHS_father_daughter	54.15	37.05	17.92
CHS_mother_daughter	65.21	44.09	21.29
PUR_father_daughter	66.45	46.58	21.37
PUR_mother_daughter	57.72	40.40	18.13
YRI_father_daughter	54.41	38.00	17.76
YRI_mother_daughter	53.44	37.42	17.72

SHuang-Broad · 2022-10-03T14:52:43Z

Using gNomAD short reads BAMs for the PUR family

In this case, contamination estimation from VBID2 is concordant with the expected levels.

family	parent_cov	daughter_cov	total_cov	vbid2_cov_mean	vbid2_cov_sd	expected_minor_pct	vbid2_contam_est
PUR_father_daughter	34.67	43.57	78.22	79.70	11.21	0.443	0.449
PUR_mother_daughter	38.9	43.57	82.46	83.60	11.56	0.472	0.464

Griffan · 2022-10-16T23:35:42Z

Hi, @SHuang-Broad
Sorry for the late reply, I've been really busy these days. Thank you very much for these detailed experiments!

"Initial discovery: mostly fine, but seems to underestimate familial mixtures"
Yes, I think this behavior is expected because the underlying assumption is based on four independent haplotypes among two unrelated individuals.
"Coverage estimation"
I think there are two points that could contribute to the difference in VB2: 1) I only take into account bases with sequencing quality large than the min_baseQ(13); 2) skipping variants that are not overlapping with the resource files. Maybe adjusting these two factors could explain part of the difference.

Thank you again for these experiments. I wonder if you can share with me part of your simulation data, I can try to do some experiments to adjust the underlying model to take these issues into account.

Fan

SHuang-Broad · 2022-10-17T02:25:43Z

Thanks for the reply, Fan!

I'll ask and see if I can share the data (they are on public samples, so it should be OK, but I need to confirm).

In the meantime, please let us know where we could contribute code (we are happy to see VBID2 work).

Thanks!
Steve

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VBID2 may be underestimating intra-family contamination when applied to long reads (CCS) #46

VBID2 may be underestimating intra-family contamination when applied to long reads (CCS) #46

SHuang-Broad commented Oct 3, 2022 •

edited

SHuang-Broad commented Oct 3, 2022

SHuang-Broad commented Oct 3, 2022

SHuang-Broad commented Oct 3, 2022

SHuang-Broad commented Oct 3, 2022

Griffan commented Oct 16, 2022

SHuang-Broad commented Oct 17, 2022

VBID2 may be underestimating intra-family contamination when applied to long reads (CCS) #46

VBID2 may be underestimating intra-family contamination when applied to long reads (CCS) #46

Comments

SHuang-Broad commented Oct 3, 2022 • edited

Summary

SHuang-Broad commented Oct 3, 2022

Initial discovery: mostly fine, but seems to underestimate familial mixtures

SHuang-Broad commented Oct 3, 2022

Maybe the downsampling is wrong? Merging the BAMs directly

SHuang-Broad commented Oct 3, 2022

Using BAMs from HGSVC2

SHuang-Broad commented Oct 3, 2022

Using gNomAD short reads BAMs for the PUR family

Griffan commented Oct 16, 2022

SHuang-Broad commented Oct 17, 2022

SHuang-Broad commented Oct 3, 2022 •

edited