A WDL-based workflow for extraction of variants and their associated info from large VCF files

anand-imcm/get-variant-info

WDL Workflow for Extracting Variant Information

Open   GitHub Workflow Status (with event)   GitHub release (with filter)  

Tip

To import the workflow into your Terra workspace, click on the above Dockstore badge, and select 'Terra' from the 'Launch with' widget on the Dockstore workflow page.

This repository contains a WDL (Workflow Description Language) workflow for extracting information from a set of imputed VCF files using a list of query variants or sample IDs.

The workflow extracts the following information:

  • Chromosome
  • Position
  • Reference allele
  • Alternate allele
  • Allele frequency (AF)
  • Minor allele frequency (MAF)
  • Imputation accuracy (R2)
  • Empirical R-square (ER2)
  • Genotype (GT)
  • Estimated Alternate Allele Dosage (DS)
  • Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 (GP)

The output is a set of files containing the extracted information.
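Several of the fields listed above are related by simple arithmetic. The sketch below is not part of the workflow, which reads these values directly from the VCF files; it only illustrates the conventional relationships for a biallelic, diploid site. The function names are hypothetical, and reported values in a real VCF may differ slightly because of rounding.

```python
from typing import Tuple


def minor_allele_frequency(af: float) -> float:
    """MAF at a biallelic site: the smaller of the ALT and REF allele frequencies."""
    if not 0.0 <= af <= 1.0:
        raise ValueError(f"AF must lie in [0, 1], got {af}")
    return min(af, 1.0 - af)


def dosage_from_gp(gp: Tuple[float, float, float]) -> float:
    """Expected ALT allele count given P(0/0), P(0/1), P(1/1)."""
    _p_ref_ref, p_ref_alt, p_alt_alt = gp
    # A heterozygote carries one ALT allele, an ALT homozygote carries two.
    return p_ref_alt + 2.0 * p_alt_alt
```

Under this convention a dosage (DS) always falls between 0 and 2, and MAF never exceeds 0.5.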

Workflow Inputs

  • query_variants: A tab-delimited file listing the query variants, one per line. Each line holds five tab-separated fields in this order: Chromosome, Pos, ID, Ref, Alt. The Chromosome field must carry a "chr" prefix (e.g., chr1, chr2). (required)
  • query_samples: A file with a list of sample IDs. Each line should contain one sample ID. (optional)
  • imputed_vcf: Array of imputed VCF files and their indices. VCF files should be in .vcf.gz format and indices in CSI or TBI format. (required)
  • prefix: Prefix for the output files. (required)
  • extract_item: The items to extract from the FORMAT field of the VCF files, given as a comma-separated string. The available choices are GT, DS, and GP. Example: GT,DS (required)
  • use_GT_from_PED: A boolean flag. If set to true, the genotype encoding is sourced from a PED file generated by Plink2. If not specified or set to false, the genotype encoding is extracted from the VCF file. (optional)
  • match_pos_only: A boolean flag. If set to true, variants are matched on chromosome and position only. If not specified or set to false, variants are matched on chromosome, position, reference allele, and alternate allele from the query_variants file. (optional)
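The input conventions above can be checked before launching the workflow. The following is a minimal sketch, not the workflow's own code: the names QueryVariant, parse_query_line, match_key, and parse_extract_item are hypothetical, and the strictness of the checks is an assumption based only on the descriptions of query_variants, match_pos_only, and extract_item.

```python
from typing import List, NamedTuple, Tuple

ALLOWED_ITEMS = ("GT", "DS", "GP")  # choices documented for extract_item


class QueryVariant(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: str


def parse_query_line(line: str) -> QueryVariant:
    """Parse one query_variants line: Chromosome, Pos, ID, Ref, Alt (tab-separated)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    chrom, pos, variant_id, ref, alt = fields
    if not chrom.startswith("chr"):
        raise ValueError(f"chromosome {chrom!r} lacks the 'chr' prefix")
    return QueryVariant(chrom, int(pos), variant_id, ref, alt)


def match_key(variant: QueryVariant, match_pos_only: bool = False) -> Tuple:
    """Key used to compare a query variant with a VCF record."""
    if match_pos_only:
        return (variant.chrom, variant.pos)
    return (variant.chrom, variant.pos, variant.ref, variant.alt)


def parse_extract_item(value: str) -> List[str]:
    """Split a comma-separated extract_item string and reject unknown items."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in items if item not in ALLOWED_ITEMS]
    if not items or unknown:
        raise ValueError(f"extract_item must list GT, DS and/or GP, got {value!r}")
    return items
```

With match_pos_only set to true the key drops the alleles, so a query line matches any record at the same chromosome and position regardless of its REF and ALT.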

Workflow Outputs

  • SNP_INFO: *_extracted_SNP_INFO.tsv file contains the following columns:

    • CHROM:POS:REF:ALT: A combination of chromosome, position, reference allele, and alternate allele
    • CHROM: Chromosome
    • POS: Position
    • REF: Reference allele
    • ALT: Alternate allele
    • AF: Allele frequency
    • MAF: Minor allele frequency
    • R2: Imputation accuracy
    • ER2: Empirical R-square
    • INFO: Additional information indicating if the variant was imputed, typed, or typed only
  • genotype_info: *_extracted_GT.csv file contains the following columns (only generated if GT is specified in the extract_item parameter of the workflow):

    • IID: Sample ID
    • CHROM:POS:REF:ALT: A combination of chromosome, position, reference allele, and alternate allele. The values correspond to the genotype for each sample, following a custom order: Both 0/1 and 1/0 are represented as Ref/Alt, 0/0 is represented as Ref/Ref, and 1/1 is represented as Alt/Alt. If the use_GT_from_PED flag is set to true, the genotype encoding will be sourced from a PED file generated by Plink2 software.
  • dosage_info: *_extracted_DS.csv file contains the following columns (only generated if DS is specified in the extract_item parameter of the workflow):

    • IID: Sample ID
    • CHROM:POS:REF:ALT: A combination of chromosome, position, reference allele, and alternate allele, with the values corresponding to the estimated alternate allele dosage for each sample.
  • geno_prob_info: *_extracted_GP.csv file contains the following columns (only generated if GP is specified in the extract_item parameter of the workflow):

    • IID: Sample ID
    • CHROM:POS:REF:ALT: A combination of chromosome, position, reference allele, and alternate allele, with the values corresponding to the estimated posterior probabilities for genotypes 0/0, 0/1, and 1/1 for each sample.
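The column naming and the genotype encoding described above can be illustrated with a short sketch. It rests on several assumptions that the text does not confirm: that "Ref" and "Alt" stand for the actual allele strings of the variant, that phased genotypes (0|1) are treated like unphased ones, and that the * in the file names is the prefix input. All function names are hypothetical, and missing genotypes are simply rejected here because their handling is not documented.

```python
from typing import List


def variant_label(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Column label in the CHROM:POS:REF:ALT form."""
    return f"{chrom}:{pos}:{ref}:{alt}"


def encode_genotype(gt: str, ref: str, alt: str) -> str:
    """Map a biallelic VCF genotype to Ref/Ref, Ref/Alt or Alt/Alt allele strings."""
    alleles = gt.replace("|", "/").split("/")  # assumption: phase is ignored
    if len(alleles) != 2 or any(a not in ("0", "1") for a in alleles):
        raise ValueError(f"unsupported genotype {gt!r}")
    alt_count = alleles.count("1")
    # 0/1 and 1/0 both have one ALT allele, so both become Ref/Alt.
    return (f"{ref}/{ref}", f"{ref}/{alt}", f"{alt}/{alt}")[alt_count]


def expected_outputs(prefix: str, extract_items: List[str]) -> List[str]:
    """File names expected for a given prefix and list of extracted items."""
    names = [f"{prefix}_extracted_SNP_INFO.tsv"]
    names += [f"{prefix}_extracted_{item}.csv" for item in extract_items]
    return names
```

For example, under these assumptions a variant with Ref A and Alt G would show A/G for a sample whose VCF genotype is either 0/1 or 1/0.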

Components

  • Python packages

    • pysam==0.22.0
    • pandas==2.2.1
  • Tools

  • Containers

    • ghcr.io/anand-imcm/get-variant-info