Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VerifyBamID: Ensuring Accurate .mu and .UD File Creation #61

Open
shadizaheri opened this issue Feb 21, 2024 · 1 comment
Open

VerifyBamID: Ensuring Accurate .mu and .UD File Creation #61

shadizaheri opened this issue Feb 21, 2024 · 1 comment
Assignees

Comments

@shadizaheri
Copy link

We aim to create the .mu and .UD resource files, which are auxiliary files for the SVDPrefix, using the CHM13 variant call format files (VCFs) available at this link. Our objective is to conduct contamination analysis on a Telomere-to-Telomere (T2T) reference sequence that has been adapted through a lift-over process.

As a preliminary step, we attempted to regenerate these resource files specifically for chromosome 19, following the instructions provided for VerifyBamID under the --RefVCF option, as detailed in the VerifyBamID GitHub repository. Chromosome 19 source vcf has 219 sites, but while running the –RefVCF command we were warned that two indel sites were skipped (please see the warning below).
skip indel at chr19:6785595 skip indel at chr19:17225618 NOTICE - Number of Markers:217 NOTICE - Number of Individuals:3202 NOTICE - Success!
As a result, we see 217 sites in our .mu and .UD generated files.

  • .mu: Mean Matrix Shape: (217,)
  • .UD: UD Matrix Shape: (217, 10)
    Below is the header information from the resulting files for your review:
    Screenshot 2024-02-21 at 5 26 47 PM

Could you advise on the following:

  1. Are there best practices or steps we should follow to ensure the .mu and .UD files are generated correctly or doing these quick checks are enough?
  2. How can we confirm that the missing indel sites won't affect our analysis, or how to include them if necessary?
    Thank you in advance for your help!
@Griffan Griffan self-assigned this Feb 24, 2024
@Griffan
Copy link
Owner

Griffan commented Feb 24, 2024

Hi @shadizaheri ,

VB2 was designed to work with common SNPs in the reference panel VCF which usually should be quite abundant and easy to randomly downsample. On Chromosome 19, there should be way more than 219 sites available(I understand you were trying to extract the intersection between your vcf and the original resource markers).

The recipe to create a new set of resource files is either to:

  1. Liftover existing resource files to the new reference coordinate system (for example, from hg19 -> hg38, and you will notice that's exactly what I did to the multiple reference versions in the existing resource files, provided that those markers do exist in both VCFs)
  2. Generate resource files directly from reference panel VCF that is called based on new reference system

In your case, I would suggest to

  1. preprocess the reference panel VCF file to only include common biallelic site, for example:
bcftools view -M2 -q 0.05  -v snps -f 'PASS,' 1KGP.CHM13v2.0.chr19.recalibrated.snp_indel.pass.vcf.gz
  1. randomly downsample to 10k markers(or 100k) across the entire genome(chr1-chr22 for 1KGP.CHM13v2.0.chr*.recalibrated.snp_indel.pass.vcf.gz)
  2. concatenate these individual chromosome-wise vcfs into a single vcf file
  3. fed the giant VCF file into VB2 —RefVCF

Let me know if this works and I'm happy to help.

Fan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants