Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VBID2 may be underestimating intra-family contamination when applied to long reads (CCS) #46

Open
SHuang-Broad opened this issue Oct 3, 2022 · 6 comments

Comments

@SHuang-Broad
Copy link

SHuang-Broad commented Oct 3, 2022

Hello Fan,

Continuing our effort to incorporate VBID2 into our pipeline for contamination estimation (see #43), we performed some experiments to evaluate its performance, given the fact that the tool was developed with short reads in mind.

Let me state the summary first, and post the experiment in subsequence posts.

Summary

In summary, it seems that VBID2 is underestimating contamination levels when

  • the contamination is intra-family, and
  • data type is CCS long reads,
  • and this may be caused by coverage estimation (or read picking) issues.

I am not sure how much time you have for addressing this issue.
But if you do, I can do more experiments as needed.

Thank you!

Steve

@SHuang-Broad
Copy link
Author

Initial discovery: mostly fine, but seems to underestimate familial mixtures

In the initial experiment, we created in siloco mixture of CCS BAM files with different mixture levels.
See the table here.
As you can see, most of the estimations from VBID2 are close to the expected contamination levels.

mixture expected_minor_pct vbid2_contam_est cov major minor
diff.pop.FF.0.1-1 0.0878 0.0878324 11.63 NA19238 HG00732
diff.pop.FF.0.2-1 0.1621 0.15659 12.65 NA19238 HG00732
diff.pop.FF.0.5-1 0.3317 0.315044 15.86 NA19238 HG00732
diff.pop.FF.1-1 0.4986 0.5 21.14 NA19238 HG00732
diff.pop.MF.0.1-1 0.0909 0.0866854 11.66 NA19238 HG00731
diff.pop.MF.0.2-1 0.1667 0.158202 12.72 NA19238 HG00731
diff.pop.MF.0.5-1 0.3333 0.317689 15.9 NA19238 HG00731
diff.pop.MF.1-1 0.4998 0.434521 21.19 NA19238 HG00731
diff.pop.MM.0.1-1 0.0946 0.0935125 9.52 NA19239 HG00731
diff.pop.MM.0.2-1 0.1641 0.158815 10.31 NA19239 HG00731
diff.pop.MM.0.5-1 0.33 0.314071 12.86 NA19239 HG00731
diff.pop.MM.1-1 0.4962 0.5 17.09 NA19239 HG00731
father.daughter.0.1-1 0.0914 0.045551 11.6 HG00733 HG00731
father.daughter.0.2-1 0.1675 0.086958 12.66 HG00733 HG00731
father.daughter.0.5-1 0.3346 0.230075 15.84 HG00733 HG00731
father.daughter.1-1 0.5012 0.360033 21.13 HG00733 HG00731
father.son.0.03-1 0.0294 0.0308993 38.04 NA24385 NA24149
father.son.0.1-1 0.0908 0.0911196 5.84 NA24385 NA24149
father.son.0.2-1 0.1664 0.159776 6.38 NA24385 NA24149
father.son.0.5-1 0.3338 0.320519 7.97 NA24385 NA24149
father.son.1-1 0.5005 0.498786 10.64 NA24385 NA24149
mother.daughter.0.1-1 0.0882 0.0862523 11.56 HG00733 HG00732
mother.daughter.0.2-1 0.1628 0.154435 12.59 HG00733 HG00732
mother.daughter.0.5-1 0.3329 0.314413 15.8 HG00733 HG00732
mother.daughter.1-1 0.5 0.426094 21.08 HG00733 HG00732
mother.son.0.03-1 0.0289 0.0323633 38.03 NA24385 NA24143
mother.son.0.1-1 0.091 0.0916356 9.34 NA24385 NA24143
mother.son.0.2-1 0.1668 0.167038 10.19 NA24385 NA24143
mother.son.0.5-1 0.3336 0.317348 12.74 NA24385 NA24143
mother.son.1-1 0.5003 0.418631 16.99 NA24385 NA24143
pure.HG00731   0.00157375 10.59 HG00731  
pure.HG00732   0.00281968 14.64 HG00732  
pure.HG00733   0.00241158 76.37 HG00733  
pure.NA19238   0.00195742 10.6 NA19238  
pure.NA19239   0.000970027 8.61 NA19239  
pure.NA19240   0.00257564 31.16 NA19240  
pure.NA24143   0.000532009 8.5 NA24143  
pure.NA24149   0.000589194 5.32 NA24149  
pure.NA24385   0.00208284 36.92 NA24385  
same.pop.unrelated.0.1-1 0.0879 0.0888993 11.61 HG00731 HG00732
same.pop.unrelated.0.2-1 0.1622 0.158542 12.64 HG00731 HG00732
same.pop.unrelated.0.5-1 0.3319 0.318406 15.85 HG00731 HG00732
same.pop.unrelated.1-1 0.4988 0.5 21.13 HG00731 HG00732

However, notice the mixture that involves parental-daughter mixtures, where VBID2 meaningfully underestimates the contamination levels.

This table was generated in the following procedure

  • the BAM files that created the mixtures were downsampled to various levels with samtools.
  • VBID2 was run in the pipeline documented here, with
    • the BAQ generation turned off, and
    • the 1KG 10k sites BED file used.

@SHuang-Broad
Copy link
Author

Maybe the downsampling is wrong? Merging the BAMs directly

In order to avoid any artifacts that could have been introduced in the downsampling step and the pileup conversion step in the pipeline, we simply merged the bams for the three HGSVC families (CHS, PUR, and YRI) files without any downsampling, and ran VBID2 directly on the merged BAMs (re-header-ed by changing the sample names to avoid an exception).

The results are similar but also different: notice the PUR family has VBID2 overestimates the contamination.

family total_cov expected_minor_pct vbid2_contam_est
CHS_father_daughter 47.97 0.138 0.119902
CHS_mother_daughter 49.05 0.157 0.120124
PUR_father_daughter 86.96 0.122 0.195355
PUR_mother_daughter 91.01 0.161 0.189379
YRI_father_daughter 39.77 0.216 0.110855
YRI_mother_daughter 41.76 0.254 0.111905

And I started to notice that the coverage calculation by VBID2 is a little strange: for all the 6 mixtures in this section, here's the VBID2 coverage estimation output

family total_cov vbid2_cov_mean vbid2_cov_sd
CHS_father_daughter 47.97 28.39 15.45
CHS_mother_daughter 49.05 29.39 15.46
PUR_father_daughter 86.96 42.62 31.37
PUR_mother_daughter 91.01 42.69 21.32
YRI_father_daughter 39.77 21.84 11.29
YRI_mother_daughter 41.76 21.93 11.30

@SHuang-Broad
Copy link
Author

Using BAMs from HGSVC2

Suspecting that the BAM files we used may have some issues, we performed the above step using the data generated by HGSVC2.
The results are similar too

family total_cov expected_minor_pct vbid2_contam_est
CHS_father_daughter 54.15 0.453 0.166
CHS_mother_daughter 65.21 0.376 0.165
PUR_father_daughter 66.45 0.499 0.158
PUR_mother_daughter 57.72 0.423 0.175
YRI_father_daughter 54.41 0.477 0.162
YRI_mother_daughter 53.44 0.468 0.163
family total_cov vbid2_cov_mean vbid2_cov_sd
CHS_father_daughter 54.15 37.05 17.92
CHS_mother_daughter 65.21 44.09 21.29
PUR_father_daughter 66.45 46.58 21.37
PUR_mother_daughter 57.72 40.40 18.13
YRI_father_daughter 54.41 38.00 17.76
YRI_mother_daughter 53.44 37.42 17.72

@SHuang-Broad
Copy link
Author

Using gNomAD short reads BAMs for the PUR family

In this case, contamination estimation from VBID2 is concordant with the expected levels.

family parent_cov daughter_cov total_cov vbid2_cov_mean vbid2_cov_sd expected_minor_pct vbid2_contam_est
PUR_father_daughter 34.67 43.57 78.22 79.70 11.21 0.443 0.449
PUR_mother_daughter 38.9 43.57 82.46 83.60 11.56 0.472 0.464

@Griffan
Copy link
Owner

Griffan commented Oct 16, 2022

Hi, @SHuang-Broad
Sorry for the late reply, I've been really busy these days. Thank you very much for these detailed experiments!

  1. "Initial discovery: mostly fine, but seems to underestimate familial mixtures"
    Yes, I think this behavior is expected because the underlying assumption is based on four independent haplotypes among two unrelated individuals.
  2. "Coverage estimation"
    I think there are two points that could contribute to the difference in VB2: 1) I only take into account bases with sequencing quality large than the min_baseQ(13); 2) skipping variants that are not overlapping with the resource files. Maybe adjusting these two factors could explain part of the difference.

Thank you again for these experiments. I wonder if you can share with me part of your simulation data, I can try to do some experiments to adjust the underlying model to take these issues into account.

Fan

@SHuang-Broad
Copy link
Author

Thanks for the reply, Fan!

I'll ask and see if I can share the data (they are on public samples, so it should be OK, but I need to confirm).

In the meantime, please let us know where we could contribute code (we are happy to see VBID2 work).

Thanks!
Steve

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants