Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Variant calls after local realignment : is it the most accurate ? #763

Open
Fred-07 opened this issue Jan 10, 2024 · 5 comments
Open

Variant calls after local realignment : is it the most accurate ? #763

Fred-07 opened this issue Jan 10, 2024 · 5 comments
Assignees

Comments

@Fred-07
Copy link

Fred-07 commented Jan 10, 2024

Dear all,

I would like to share an observation with you.
Deepvariant reported a lot of variants in a region where no reads are mapped.
It is the consequence of Deepvariant local realignment as expected.
But the realigned reads do not seem to belong to this region : the orignal alignment shows less variants than the local realignment.
The genome contains a big region repeated several times (see on the screenshots below). The repeats could be the reason why some reads are moved to the left. The given example is for X chromosome but another region in chromosome 1 (FLG2 gene region) showed a lot of variants.

Why the reads are moved to the left?
Are the reads moved to the most possible left position?

Original alignment and detected variants (bwa-mem)
image

Local realignment made by DeepVariant (X_140993145_140994144)
image

Example of reads that were moved:
Original alignment
NB501857:464:HH7FWBGXV:2:23210:26812:14806 99 X 140993994 50 79M = 140994064 149 CCAGATTCCTGTGAGCCGCTCCTTCTCCTCCACTTTAGTGAGTCTTTTCCAGAGTTCCCCTGAGAGAACTCAGAGTACT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEEAEEEEE XA:Z:X,+140993784,79M,2; PG:Z:MarkDuplicates AS:i:74 XS:i:69 MD:Z:17C61 NM:i:1 RG:Z:DM_23_2198
NB501857:464:HH7FWBGXV:2:23210:26812:14806 147 X 140994064 57 79M = 140993994 -149 CAGAGTACTTTTGAGGGTTTTCCCCAGTCTCCTCTCCAGATTCCTGTGAGCTCCTCCTCCTCCTCCACTTTATTGAGTC AAEEEEEEEEEEEEEE<66EEEEEEEEEE/EAEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA XA:Z:X,-140994589,50M3D29M,4; PG:Z:MarkDuplicates AS:i:79 XS:i:67 MD:Z:79 NM:i:0 RG:Z:DM_23_2198
Local realignment
X:140993145-140994144/ X_140993145_140994144realigned_reads.bam X_140993145_140994144realigned_reads.bam.bai
frmascla@frt:DeepV-TEST$ samtools view Local/X_140993145_140994144realigned_reads.bam | grep NB501857:464:HH7FWBGXV:2:23210:26812:14806
NB501857:464:HH7FWBGXV:2:23210:26812:14806 99 X 140993784 50 79M = 140994064 149 CCAGATTCCTGTGAGCCGCTCCTTCTCCTCCACTTTAGTGAGTCTTTTCCAGAGTTCCCCTGAGAGAACTCAGAGTACT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEEAEEEEE
NB501857:464:HH7FWBGXV:2:23210:26812:14806 19 X 140993854 57 79M = 140993994 -149 CAGAGTACTTTTGAGGGTTTTCCCCAGTCTCCTCTCCAGATTCCTGTGAGCTCCTCCTCCTCCTCCACTTTATTGAGTC AAEEEEEEEEEEEEEE<66EEEEEEEEEE/EAEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAA

Original alignment, bam file
https://www.dropbox.com/scl/fi/c9tc01sdtf2sroxj3u3bj/original_alignment.bam?rlkey=jgxnyhyse2ekcu6t1s3l3lnnl&dl=0
Local realignment, bam file
https://www.dropbox.com/scl/fi/oqhny0s7h9hu3zcyrprig/X_140993145_140994144realigned_reads.bam?rlkey=zmbon72t19vjlcdht1zt6m5xg&dl=0

Have you checked the FAQ? https://github.com/google/deepvariant/blob/r1.6/docs/FAQ.md:
Yes

Setup

  • Operating system: CentOS7
  • DeepVariant version: 1.5.0
  • Installation method (Docker, built from source, etc.): Singularity container
  • Type of data: (sequencing instrument, reference genome, anything special that is unlike the case studies?) Illumina 2x79bp paired-end sequencing, WES, reference genome=hg19
@lucasbrambrink
Copy link
Collaborator

Hi,

Thanks for pointing this out! We are going to investigate this issue more closely and provide updates as we make progress. Thank you for providing links to the files, those are extremely helpful.

@akolesnikov
Copy link
Collaborator

Local realignment algorithm works as following:

  • Candidate windows across the genome are selected for reassembly by looking for any evidence of possible genetic variation, such as mismatching or soft clipped bases. Window cannot exceed maximum size, usually 1000 bases.
  • De Bruijn graph is constructed using multiple fixed k-mer sizes (from 20 to 75, inclusive, with increments of 5) out of the reference genome bases for the candidate window, as well as all overlapping reads.
  • Candidate haplotypes are generated by traversing the assembly graphs and the top two most likely haplotypes are selected that best explain the read evidence.
  • Finally, each read is then realigned to its most likely haplotype.

To answer your question. Reads can move either direction as a result of the realignment. To investigate it further it would be helpful to see what haplotype was created as a result of the realignment. Would you be able to provide the reference file that you used?

@Fred-07
Copy link
Author

Fred-07 commented May 25, 2024

Thank you for these explanations.
Please find below the link for the reference genome (hg19, slightly modified)
https://www.dropbox.com/scl/fi/e9imci4i4rkv5vcpgujus/genome_PAR_masked.fasta.gz?rlkey=5fxpz3fbzy70glish6xwhckc9&dl=0

@akolesnikov
Copy link
Collaborator

Thank you for providing the reference. We were able to reproduce the issue locally.
Meanwhile one workaround I'd like to suggest to analyze the specific position or small region where local realignment doesn't work is to run DeepVariant without the local realignment. It can be done by adding --realign_reads=false to --make_examples_extra_args flag and specifying a region with --regions flag.

@AndrewCarroll
Copy link
Collaborator

Hi @Fred-07,

As @akolesnikov mentions, the reason that the reads are re-aligned in this way has to do with some fundamental aspects of how alignment penalties work. The alignment method, Smith Waterman, has alignment score penalties for gap open and gap extend. These penalties make the alignments you see as scoring higher than ones that open and extend a gap early in the repeat.

Those alignment penalties are determined from looking at mismatch, insertion, and deletion rates across large amounts of homologous sequences. They work very well for high sequence complexity regions which have standard types of DNA replication error. In this highly repetitive context, the ways that sequence (specifically repeats) can be extended or deleted is a bit different. It might, in theory, be possible to derive better gap extend scores for regions like this, but this is both very hard and a more fundamental problem. I don't think this case is one we can easily solve without very extensive work.

I hope that the @akolesnikov suggestion of turning off realigner in regions of interest will be satisfactory.

Thank you,
Andrew

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants