Very long gap in an alignment, due to extreme mononlucleotide repeat in profile #201

traviswheeler · 2020-07-23T05:29:41Z

The attached hmm and target sequence (both with .txt extension to force github to allow them to be uploaded) pair to yield a >9Kb long gap in an alignment.

target.fa.txt
DR0224948.hmm.txt

% nhmmer DR0224948.hmm target.fa > test.out

produces a single alignment with these summary stats:

>> DR0054876.1/3700-4000
     score  bias    Evalue   hmmfrom    hmm to     alifrom    ali to      envfrom    env to       sq len      acc
    ------ ----- ---------   -------   -------    --------- ---------    --------- ---------    ---------    ----
  !   97.6   1.4     7e-35      3319     12518 ..       106       182 ..        86       201 ..       301    0.41

In other words: >9Kb of model aligns to 76 target nucleotides - the alignment is basically just one big gap.

The query hmm contains several very long runs of G, including one that's almost 4KB long. That's a strange model, but still shouldn't produce this kind of alignment. I haven't identified a fix yet, but want to be sure this is documented, and am planning to investigate.

The text was updated successfully, but these errors were encountered:

cryptogenomicon · 2020-07-23T12:28:05Z

I've seen this sort of thing when an insert-insert transition parameter is 0, which can happen after rounding off to a scaled integer. There's code to prevent this: scaled int insert parameters are bounded at -1, so there's always a penalty for an insertion. Maybe nhmmer is somehow overriding that?

traviswheeler · 2020-07-24T23:21:02Z

Poking around a bit - this is not specific to nhmmer. I get the same crazy deletion running hmmsearch:

% hmmsearch  DR0224948.hmm target.fa  
.
.
.
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !  104.8   0.0   1.6e-35   1.6e-35    3319   12517 ..     106     181 ..      86     201 .. 0.41

That's >9000 model states aligned to 75 target sequence characters, which means a whole lot of delete states are getting used. In the HMM, all the delete transitions are the same, roughly 33% and 67%:

    d->m     d->d
   1.09861  0.40547

What I see: p7_domaindef_ByPosteriorHeuristics() identifies a single envelope on the target sequence (86..201). Based on the end of the alignment, this seems probably reasonable (there's a really good alignment from 107..181).

Then in rescore_isolated_domain(), p7_OptimalAccuracy() is asked to fill in the 18309x116 DP matrix (the entire model vs this little target sequence envelope), and then p7_OATrace() traces through those ~75 p7T_M states (sequence positions 181 back to 107) ... followed by ~9000 p7T_D states, then picking up one last p7T_M state. The DP matrix fill+trace is complicated enough that it's a struggle for me to follow - I'm not sure if it's a problem in how ox is being filled, or in how OATrace is reading through that matrix.

At this point, I've run out of juice, and won't have the free time to hunt it any further for a while. Since this is generic to p7_domaindef_ByPosteriorHeuristics(), and impacts hmmsearch (and presumably phmmer) in addition to nhmmer, maybe it'll make its way onto your to-do list?

cryptogenomicon · 2023-06-06T01:49:53Z

I've looked into this (at last). This is an unusual new case where H3's domain definition heuristics are failing. I won't be able to fix this in HMMER3, but I believe it has been solved by new algorithms that will be in HMMER4. I'm going to leave the issue open, tag it for H4, and we'll revisit it in the future.

The problem arises from the fact that the profile here consists of ~30 multiple tandem repeats of ~380nt (plus a few kb of polyG; the polyG mononuc repeats are a red herring, they are not the problem). The target consists of a piece of one repeat. HMMER isn't designed for this; it's designed for the reverse, for your profile to be one "domain" or repeat, and for the target to consist of one or more domains/repeats. It would usually still work anyway (some Pfam models consist of two tandem domains, to increase signal/noise on some short tandem domain families), but this is a major factor in the problem. So one workaround is to define your profile to be a single ~380nt repeat.

In more detail: the target aligns to this profile in ~30 different places. The marginalized posterior probability matrix, projected onto the target, says that the target definitely has a homologous region (an "envelope"). But the model is uncertain where the alignment is on the profile; with the posterior probability spread over ~30 possible alignments, all places have low posterior probability. You can see this uncertainty in the per-residue posterior probabilities, which are very low (2's; ~20% probable). A Viterbi optimal alignment would picks the highest-scoring possibility. But we do "optimal accuracy" alignment, not Viterbi, by default; OA looks for an alignment that has the greatest sum of posterior decoding probabilities. Because there's ~30 different places where the target envelope can align more or less as well to the profile, it just happens that in this particular example it can cobble together an OA alignment that picks up the first part of a repeat at one place in the profile and the rest of the repeat at a distant place in the profile. The result is a stupid alignment, but it is a "correct" consequence of the H3 domain definition heuristics interacting with OA alignment, as I defined the procedure - i.e. it's not a "bug" per se, I can't do anything about it without changing the algorithm. And that's what H4 will do. One of the main goals of H4 has been to get rid of the domain definition heuristics, which I've seen fail on some other rare cases too.

Biologically, your profile is interesting and unusual. Although you have it annotated as a "putative TE family", this is not likely to be a transposon; it's an array of at least 30 tRNA genes, consisting of 4+ repeats of an Arg/iMet{x5-6} unit of an Arg tRNA followed by 5-6 iMet tRNAs. tRNAscan-SE output on the 18309nt consensus of your profile:

Sequence            		tRNA 	Bounds	tRNA	Anti	Intron Bounds	Inf	      
Name                	tRNA #	Begin	End  	Type	Codon	Begin	End	Score	Note
--------            	------	-----	------	----	-----	-----	----	------	------
DR0224948-consensus 	1	53   	124  	iMet	CAT	0	0	60.4	
DR0224948-consensus 	2	438  	509  	iMet	CAT	0	0	60.4	
DR0224948-consensus 	3	815  	886  	iMet	CAT	0	0	60.4	
DR0224948-consensus 	4	1235 	1307 	Arg	CCT	0	0	67.5	
DR0224948-consensus 	5	1314 	1385 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	6	1691 	1762 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	7	2068 	2139 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	8	2443 	2514 	iMet	CAT	0	0	58.4	
DR0224948-consensus 	9	2820 	2891 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	10	3240 	3312 	Arg	CCT	0	0	67.5	
DR0224948-consensus 	11	3319 	3390 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	12	3694 	3765 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	13	4071 	4142 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	14	4457 	4528 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	15	4847 	4918 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	16	5224 	5295 	iMet	CAT	0	0	60.4	
DR0224948-consensus 	17	10058	10130	Arg	CCT	0	0	67.5	
DR0224948-consensus 	18	10137	10208	iMet	CAT	0	0	60.4	
DR0224948-consensus 	19	10521	10592	iMet	CAT	0	0	60.4	
DR0224948-consensus 	20	10917	10988	iMet	CAT	0	0	60.4	
DR0224948-consensus 	21	11289	11360	iMet	CAT	0	0	60.4	
DR0224948-consensus 	22	11664	11735	iMet	CAT	0	0	47.4	
DR0224948-consensus 	23	12041	12112	iMet	CAT	0	0	51.0	
DR0224948-consensus 	24	12361	12435	Arg	CCT	0	0	33.5	pseudo
DR0224948-consensus 	25	12442	12513	iMet	CAT	0	0	60.4	
DR0224948-consensus 	26	16593	16664	iMet	CAT	0	0	60.4	
DR0224948-consensus 	27	16971	17042	iMet	CAT	0	0	60.4	
DR0224948-consensus 	28	17348	17419	iMet	CAT	0	0	60.4	
DR0224948-consensus 	29	17720	17791	iMet	CAT	0	0	60.4	
DR0224948-consensus 	30	18105	18176	iMet	CAT	0	0	47.4

The metadata on the profile say there were 164 sequences in the alignment, all from one genome assembly of the redlip mullet Planiliza haematocheilus (a teleost fish; GCA_005024645.1). Are there really 164 copies of this ~18kb unit in that genome? That'd be ~4800 tRNAs; nobody needs that many iMet tRNAs, and (assuming the assembly is correct) there must be an interesting reason they're there.

traviswheeler · 2023-06-06T04:39:05Z

Thanks - the role of OA in these shenanigans makes sense.

Your question about the tRNAs is a really good one. I'll pass it on to the developers of the model.

cryptogenomicon added wontfix H4 In HMMER4 development labels Jun 6, 2023

cryptogenomicon self-assigned this Jun 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Very long gap in an alignment, due to extreme mononlucleotide repeat in profile #201

Very long gap in an alignment, due to extreme mononlucleotide repeat in profile #201

traviswheeler commented Jul 23, 2020

cryptogenomicon commented Jul 23, 2020

traviswheeler commented Jul 24, 2020

cryptogenomicon commented Jun 6, 2023

traviswheeler commented Jun 6, 2023

Very long gap in an alignment, due to extreme mononlucleotide repeat in profile #201

Very long gap in an alignment, due to extreme mononlucleotide repeat in profile #201

Comments

traviswheeler commented Jul 23, 2020

cryptogenomicon commented Jul 23, 2020

traviswheeler commented Jul 24, 2020

cryptogenomicon commented Jun 6, 2023

traviswheeler commented Jun 6, 2023