Variant calls after local realignment : is it the most accurate ? #763

Fred-07 · 2024-01-10T14:05:24Z

Dear all,

I would like to share an observation with you.
Deepvariant reported a lot of variants in a region where no reads are mapped.
It is the consequence of Deepvariant local realignment as expected.
But the realigned reads do not seem to belong to this region : the orignal alignment shows less variants than the local realignment.
The genome contains a big region repeated several times (see on the screenshots below). The repeats could be the reason why some reads are moved to the left. The given example is for X chromosome but another region in chromosome 1 (FLG2 gene region) showed a lot of variants.

Why the reads are moved to the left?
Are the reads moved to the most possible left position?

Original alignment and detected variants (bwa-mem)

Local realignment made by DeepVariant (X_140993145_140994144)

Example of reads that were moved:
Original alignment
NB501857:464:HH7FWBGXV:2:23210:26812:14806 99 X 140993994 50 79M = 140994064 149 CCAGATTCCTGTGAGCCGCTCCTTCTCCTCCACTTTAGTGAGTCTTTTCCAGAGTTCCCCTGAGAGAACTCAGAGTACT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEEAEEEEE XA:Z:X,+140993784,79M,2; PG:Z:MarkDuplicates AS:i:74 XS:i:69 MD:Z:17C61 NM:i:1 RG:Z:DM_23_2198
NB501857:464:HH7FWBGXV:2:23210:26812:14806 147 X 140994064 57 79M = 140993994 -149 CAGAGTACTTTTGAGGGTTTTCCCCAGTCTCCTCTCCAGATTCCTGTGAGCTCCTCCTCCTCCTCCACTTTATTGAGTC AAEEEEEEEEEEEEEE<66EEEEEEEEEE/EAEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA XA:Z:X,-140994589,50M3D29M,4; PG:Z:MarkDuplicates AS:i:79 XS:i:67 MD:Z:79 NM:i:0 RG:Z:DM_23_2198
Local realignment
X:140993145-140994144/ X_140993145_140994144realigned_reads.bam X_140993145_140994144realigned_reads.bam.bai
frmascla@frt:DeepV-TEST$ samtools view Local/X_140993145_140994144realigned_reads.bam | grep NB501857:464:HH7FWBGXV:2:23210:26812:14806
NB501857:464:HH7FWBGXV:2:23210:26812:14806 99 X 140993784 50 79M = 140994064 149 CCAGATTCCTGTGAGCCGCTCCTTCTCCTCCACTTTAGTGAGTCTTTTCCAGAGTTCCCCTGAGAGAACTCAGAGTACT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEEAEEEEE
NB501857:464:HH7FWBGXV:2:23210:26812:14806 19 X 140993854 57 79M = 140993994 -149 CAGAGTACTTTTGAGGGTTTTCCCCAGTCTCCTCTCCAGATTCCTGTGAGCTCCTCCTCCTCCTCCACTTTATTGAGTC AAEEEEEEEEEEEEEE<66EEEEEEEEEE/EAEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA

Original alignment, bam file
https://www.dropbox.com/scl/fi/c9tc01sdtf2sroxj3u3bj/original_alignment.bam?rlkey=jgxnyhyse2ekcu6t1s3l3lnnl&dl=0
Local realignment, bam file
https://www.dropbox.com/scl/fi/oqhny0s7h9hu3zcyrprig/X_140993145_140994144realigned_reads.bam?rlkey=zmbon72t19vjlcdht1zt6m5xg&dl=0

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.6/docs/FAQ.md:
Yes

Setup

Operating system: CentOS7
DeepVariant version: 1.5.0
Installation method (Docker, built from source, etc.): Singularity container
Type of data: (sequencing instrument, reference genome, anything special that is unlike the case studies?) Illumina 2x79bp paired-end sequencing, WES, reference genome=hg19

lucasbrambrink · 2024-01-16T21:21:55Z

Hi,

Thanks for pointing this out! We are going to investigate this issue more closely and provide updates as we make progress. Thank you for providing links to the files, those are extremely helpful.

akolesnikov · 2024-05-16T22:59:23Z

Local realignment algorithm works as following:

Candidate windows across the genome are selected for reassembly by looking for any evidence of possible genetic variation, such as mismatching or soft clipped bases. Window cannot exceed maximum size, usually 1000 bases.
De Bruijn graph is constructed using multiple fixed k-mer sizes (from 20 to 75, inclusive, with increments of 5) out of the reference genome bases for the candidate window, as well as all overlapping reads.
Candidate haplotypes are generated by traversing the assembly graphs and the top two most likely haplotypes are selected that best explain the read evidence.
Finally, each read is then realigned to its most likely haplotype.

To answer your question. Reads can move either direction as a result of the realignment. To investigate it further it would be helpful to see what haplotype was created as a result of the realignment. Would you be able to provide the reference file that you used?

Fred-07 · 2024-05-25T16:07:47Z

Thank you for these explanations.
Please find below the link for the reference genome (hg19, slightly modified)
https://www.dropbox.com/scl/fi/e9imci4i4rkv5vcpgujus/genome_PAR_masked.fasta.gz?rlkey=5fxpz3fbzy70glish6xwhckc9&dl=0

akolesnikov · 2024-06-04T22:32:36Z

Thank you for providing the reference. We were able to reproduce the issue locally.
Meanwhile one workaround I'd like to suggest to analyze the specific position or small region where local realignment doesn't work is to run DeepVariant without the local realignment. It can be done by adding --realign_reads=false to --make_examples_extra_args flag and specifying a region with --regions flag.

AndrewCarroll · 2024-06-04T22:52:34Z

Hi @Fred-07,

As @akolesnikov mentions, the reason that the reads are re-aligned in this way has to do with some fundamental aspects of how alignment penalties work. The alignment method, Smith Waterman, has alignment score penalties for gap open and gap extend. These penalties make the alignments you see as scoring higher than ones that open and extend a gap early in the repeat.

Those alignment penalties are determined from looking at mismatch, insertion, and deletion rates across large amounts of homologous sequences. They work very well for high sequence complexity regions which have standard types of DNA replication error. In this highly repetitive context, the ways that sequence (specifically repeats) can be extended or deleted is a bit different. It might, in theory, be possible to derive better gap extend scores for regions like this, but this is both very hard and a more fundamental problem. I don't think this case is one we can easily solve without very extensive work.

I hope that the @akolesnikov suggestion of turning off realigner in regions of interest will be satisfactory.

Thank you,
Andrew

pichuan assigned lucasbrambrink Jan 30, 2024

lucasbrambrink assigned akolesnikov and unassigned lucasbrambrink May 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Variant calls after local realignment : is it the most accurate ? #763

Variant calls after local realignment : is it the most accurate ? #763

Fred-07 commented Jan 10, 2024

lucasbrambrink commented Jan 16, 2024

akolesnikov commented May 16, 2024

Fred-07 commented May 25, 2024

akolesnikov commented Jun 4, 2024

AndrewCarroll commented Jun 4, 2024

Variant calls after local realignment : is it the most accurate ? #763

Variant calls after local realignment : is it the most accurate ? #763

Comments

Fred-07 commented Jan 10, 2024

lucasbrambrink commented Jan 16, 2024

akolesnikov commented May 16, 2024

Fred-07 commented May 25, 2024

akolesnikov commented Jun 4, 2024

AndrewCarroll commented Jun 4, 2024