SVJedi-graph: long-read SV genotyper with a variation graph

SVJedi-graph is a structural variation (SV) genotyper for long-read data. It takes as input a variant file (VCF), a reference genome (FASTA) and a long-read file (FASTA/FASTQ), and outputs the input variant file with an additional column containing genotyping information (VCF).

SVJedi-graph represents the genome and the different SV alleles in a variation graph. It first builds this graph from the reference genome sequence and the input variant file, then maps the long reads onto the graph using minigraph [1]. Finally, it estimates the genotype of each variant in a given individual sample from allele-specific alignment counts.
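The last step can be pictured with a small example. The sketch below is not SVJedi-graph's own code: it assumes a simple binomial likelihood model with a hypothetical per-alignment error rate `err`, and the function name `call_genotype` is purely illustrative. It only shows how allele-specific alignment counts could be turned into a genotype once a minimum number of alignments is available.

```python
import math

def call_genotype(ref_count, alt_count, min_support=3, err=0.05):
    """Return '0/0', '0/1', '1/1' or './.' from allele-specific alignment counts.

    Assumption: binomial likelihoods with a fixed error rate `err`;
    the real tool may use a different model.
    """
    total = ref_count + alt_count
    if total < min_support:
        return "./."  # too few informative alignments to make a call

    # Probability that a single alignment supports the ALT allele, per genotype.
    p_alt = {"0/0": err, "0/1": 0.5, "1/1": 1.0 - err}

    def log_likelihood(p):
        # The binomial coefficient is the same for all genotypes, so it is omitted.
        return alt_count * math.log(p) + ref_count * math.log(1.0 - p)

    # Pick the genotype with the highest likelihood.
    return max(p_alt, key=lambda gt: log_likelihood(p_alt[gt]))
```

With these assumptions, balanced counts favour the heterozygous genotype, while counts dominated by one allele favour the corresponding homozygous genotype.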

Currently, SVJedi-graph can genotype five types of SVs: deletions, insertions, duplications, inversions and translocations (intra- and inter-chromosomal).

Installation

SVJedi-graph requires :

With Conda

conda install -c bioconda svjedi-graph

Or

git clone https://gitlab.inria.fr/sromain/svjedi-graph.git

Usage

./svjedi-graph.py -v <inputVCF> -r <refFA> -q <longreadsFQ> [ -p <output_prefix> -t <threads> -ms <minsupport> ]

Input VCF requirements

For all variants, the SVTYPE tag must be present in the INFO field (SVTYPE=DEL or SVTYPE=INS or SVTYPE=INV or SVTYPE=BND). Insertions must be sequence-resolved, with the full inserted sequence reported in the ALT field of the VCF file. Because duplications are a special case of insertions, SVJedi-graph also supports duplications, as long as the duplicated sequence is reported in the same way as for insertions. More details are given in SV representation in VCF.

Test with a small dataset

To check that SVJedi-graph behaves as expected on your device, you can run:

cd test-dir/
./run_test.sh

To explore the output files on a small dataset, run:

mkdir outputfiles
cd outputfiles
./../svjedi-graph.py -v ../test-dir/test.vcf -r ../test-dir/reference_genome.fasta -q ../test-dir/simulated_reads.fastq.gz -p test

Parameters

  • -v --vcf VCF file containing the set of SVs to genotype.
  • -r --ref FASTA file containing the reference genome (on which the SVs have been identified).
  • -q --reads FASTQ file containing the long reads used for genotyping. If you have multiple FASTQ files for one individual, separate the filenames with commas.
  • -p --prefix Prefix of output files.
  • -t --threads Number of threads to use for the mapping step.
  • -ms --minsupport Minimum number of alignments required to genotype an SV (default: 3).
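When SVJedi-graph is launched from a pipeline, the options above can be assembled programmatically. The following is a minimal sketch using only the flags documented here; the helper names `build_command` and `run_svjedi_graph`, the default script path and the file names in the usage note are assumptions made for illustration.

```python
import subprocess

def build_command(vcf, ref, reads, prefix=None, threads=None,
                  min_support=None, script="./svjedi-graph.py"):
    """Assemble the argument list for svjedi-graph.py from the documented options."""
    if isinstance(reads, str):
        reads = [reads]
    # Multiple read files for one individual are joined with commas.
    cmd = [script, "-v", vcf, "-r", ref, "-q", ",".join(reads)]
    if prefix is not None:
        cmd += ["-p", prefix]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if min_support is not None:
        cmd += ["-ms", str(min_support)]
    return cmd

def run_svjedi_graph(**kwargs):
    """Run the tool and raise an exception if it exits with an error."""
    subprocess.run(build_command(**kwargs), check=True)
```

For example, `build_command("sv.vcf", "ref.fasta", ["run1.fastq.gz", "run2.fastq.gz"], prefix="sample1", threads=4)` yields a command whose `-q` value is `run1.fastq.gz,run2.fastq.gz`.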

Output files

Main output file:

  • <prefix>_genotype.vcf Genotyped SVs set in VCF format.

Intermediate output files:

  • <prefix>.gfa Variation graph in GFA format.
  • <prefix>.gaf Mapping results from minigraph in GAF format.
  • <prefix>_informative_aln.json JSON dictionary of read supports for the alleles of each input SV.
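A quick way to inspect the main output file is to count how often each genotype occurs. The sketch below assumes the genotyped VCF follows the standard VCF layout, with a FORMAT column that includes a `GT` key and the sample values in the last column; the exact content of the genotyping column is not described here, so treat this as an assumption. The function name `tally_genotypes` is illustrative.

```python
import collections

def tally_genotypes(vcf_lines):
    """Count the genotype strings found in the sample column of each VCF record."""
    counts = collections.Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # record has no FORMAT/sample columns
        keys = fields[8].split(":")     # FORMAT column (assumed standard VCF)
        values = fields[-1].split(":")  # sample column, assumed to be the last one
        if "GT" in keys and keys.index("GT") < len(values):
            counts[values[keys.index("GT")]] += 1
    return counts
```

After running the small test dataset with `-p test`, it could be used as `with open("test_genotype.vcf") as fh: print(tally_genotypes(fh))`.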

SV representation in VCF

SVJedi-graph needs the following information to genotype each SV type. All variants must have the CHROM and POS fields defined, and the chromosome names must be identical in the reference genome file and the variant file. The SVTYPE tag must be present in the INFO field (SVTYPE=DEL or SVTYPE=INS or SVTYPE=INV or SVTYPE=BND). Additional information is then required depending on the SV type:

  • Deletion

    • INFO field must contain SVTYPE=DEL
    • INFO field must contain END=pos (with pos being the end position of the deleted segment)
  • Insertion

    • INFO field must contain SVTYPE=INS
    • ALT field must contain the sequence of the insertion
  • Duplication

    • must be defined as an insertion event with CHROM and POS corresponding to the insertion position of the novel copy
    • INFO field must contain SVTYPE=INS
    • ALT field must contain the sequence of the duplication
  • Inversion

    • INFO field must contain SVTYPE=INV
    • INFO field must contain END=pos tag, with pos being the second breakpoint position
  • Intra-chromosomal translocation

    • INFO field must contain SVTYPE=BND
    • ALT field must be formatted as t[pos[, t]pos], ]pos]t or [pos[t, with pos indicating the second breakpoint position and the bracket directions indicating which parts of the two chromosomes should be joined together
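The requirements listed above can be checked before running the genotyper. The following is a minimal sketch of such a check for a single tab-separated VCF record. The function names `parse_info` and `check_record`, the error messages and the regular expression are illustrative and not part of SVJedi-graph; the check covers the format rules only and does not compare chromosome names against the reference.

```python
import re

# One bracketed mate position: [pos[ or ]pos]
_MATE = r"(?:\[[^\[\]]+\[|\][^\[\]]+\])"
# The four breakend layouts: t[pos[, t]pos], ]pos]t, [pos[t
BND_ALT = re.compile(rf"^(?:[ACGTNacgtn]+{_MATE}|{_MATE}[ACGTNacgtn]+)$")

def parse_info(info):
    """Turn 'SVTYPE=DEL;END=200' into {'SVTYPE': 'DEL', 'END': '200'}."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out

def check_record(line):
    """Return a list of problems for one VCF record (empty list if it looks fine)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return ["fewer than 8 columns"]
    alt, info = fields[4], parse_info(fields[7])
    svtype = info.get("SVTYPE")
    problems = []
    if svtype not in {"DEL", "INS", "INV", "BND"}:
        problems.append("SVTYPE missing or unsupported")
    elif svtype in {"DEL", "INV"} and not info.get("END", "").isdigit():
        problems.append("END=pos missing")
    elif svtype == "INS" and not re.fullmatch(r"[ACGTNacgtn]+", alt):
        problems.append("ALT is not a resolved sequence")
    elif svtype == "BND" and not BND_ALT.match(alt):
        problems.append("ALT is not a breakend string")
    return problems
```

A symbolic allele such as `<INS>` is reported as a problem because insertions and duplications need the full sequence in the ALT field.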

Citation

Sandra Romain, Claire Lemaitre, SVJedi-graph: improving the genotyping of close and overlapping structural variants with long reads using a variation graph, Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i270–i278, https://doi.org/10.1093/bioinformatics/btad237

Contact

SVJedi-graph is a Genscale tool developed by Sandra Romain and Claire Lemaitre. For bug reports or feedback, please use the GitHub Issues form.


Footnotes

  1. Li, H., Feng, X. & Chu, C. The design and construction of reference pangenome graphs with minigraph. Genome Biol 21, 265 (2020). https://doi.org/10.1186/s13059-020-02168-z