Feature request: Limit the MSA depth in Jackhmmer/Nhmmer/Phmmer/Hmmsearch #323

Augustin-Zidek · 2024-04-10T11:03:24Z

Would it be possible to add a flag that limits the number of sequences returned by Jackhmmer/Nhmmer/Phmmer/Hmmsearch?

When HMMER is used in the context of protein structure prediction, we usually care only about the top n hits (usually ~5k) sorted by e-value and the rest are discarded.

Therefore aligning them, outputting them to stdout, and saving them to disk is a wasteful operation, as we then read just the top n and discard the rest.

Moreover, queries with extreme number of hits lead to unnecessary long run times and need a lot of RAM to store the results.

We added our own limited and brittle workaround for this in Jackhmmer, but that only affects the output MSA, but doesn't address the unnecessary computation that happens before:

 p7_tophits_SortBySortkey(info->th);
 p7_tophits_Threshold(info->th, info->pli);
+if (esl_opt_IsOn(go, "--seq_limit"))
+{
+    int seq_limit = esl_opt_GetInteger(go, "--seq_limit");
+    info->th->N = ESL_MIN(info->th->N, seq_limit);
+}
  p7_tophits_CompareRanking(info->th, kh, &nnew_targets);

A better and more robust solution would be probably to modify P7_TOPHITS to have a max_depth attribute that doesn't allow top hits to grow beyond that number.

The text was updated successfully, but these errors were encountered:

cryptogenomicon · 2024-04-10T12:46:11Z

It's a high priority for us to make better ways of dealing with tons of hits and large alignments.

Unlikely that we'll do something like first-N hits, though, because it makes search results dependent on the order of sequences in the target db.

Or if you want the top-N, not just first-N, then that doesn't save you computation, only space - and you should be able to achieve a top-N strategy now by filtering or downsamplng an output tabular data file any way you want, and immediately discarding the original large output file.

We're working on other downsampling methods, and also on highly compressed alignment representations.

Augustin-Zidek · 2024-04-10T13:13:04Z

Oh sorry I wasn't clear: I meant top-N sorted by e-value, not first-N. Thinking of P7_TOPHITS essentially being something like a min-heap of size N.

Or if you want the top-N, not just first-N, then that doesn't save you computation, only space - and you should be able to achieve a top-N strategy now by filtering or downsamplng an output tabular data file any way you want, and immediately discarding the original large output file.

Yes, good point, you have to align to query and sort anyway. But it would still be useful if this functionality was built in, as multi-GB MSA files could be produced, and saving those could take non-trivial amount of time, especially with remote filesystems.

We observed this with a sequence that has over 1M hits in UniRef90 with e-value limit of 0.001 and produces 45 GB Stockholm MSA and 256 MB tblout files.

cryptogenomicon · 2024-04-10T13:16:03Z

One way you can go is to generate the tblout file (and not the full MSA); filter or downsample the lines in the tblout file; fetch those subsequences (esl-sfetch for instance); and use hmmalign to align them to the query model. This saves storing the 45G MSA.

Augustin-Zidek · 2024-04-10T13:25:57Z

Thanks, that is an interesting thing to try.

However, even with -o /dev/null won't it do a lot of unnecessary IO?

cryptogenomicon · 2024-04-10T13:29:23Z

Without -A <msafile> it won't even generate the multiple alignment internally (saving a ton of RAM and some computation), and -o /dev/null saves a lot of storage and probably a bunch of output time too. I definitely focus on the tabular output files when I'm manipulating large output volumes (which these days is nigh-always).

Augustin-Zidek · 2024-04-10T14:18:31Z

Here are timings with and without -A:

Without -A

$ jackhmmer \
-o /dev/null --tblout tblout.txt \
--noali \
--cpu 22 \
-N 1 \
-E 0.001 --incE 0.001 \
query.fasta uniref90.fasta

real    18m26.775s
user    332m53.777s
sys     3m47.448s

With -A

$ jackhmmer \
-o /dev/null -A out.a3m --tblout tblout.txt \
--noali \
--cpu 22 \
-N 1 \
-E 0.001 --incE 0.001 \
query.fasta uniref90.fasta

real    20m1.193s
user    332m53.121s
sys     4m5.238s

I am testing with this sequence:

> query
PNPCPESSASFLSRITFWWITGMMVQGYRQPLESTDLWSLNKEDTSEQVVPVLVKNWKKECAKSRKQPVKIVYSSKDPAK
PKGSSKVDVNEEAEALIVKCPQKERDPSLFKVLYKTFGPYFLMSFLFKAVHDLMMFAGPEILKLLINFVNDKKAPEWQGY
FYTALLFISACLQTLVLHQYFHICFVSGMRIKTAVIGAVYRKALVITNAARKSSTVGEIVNLMSVDAQRFMDLATYINMI
WSAPLQVILALYLLWLNLGPSVLAGVAVMVLMVPLNAVMAMKTKTYQVAHMKSKDNRIKLMNEILNGIKVLKLYAWELAF
KDKVLAIRQEELKVLKKSAYLAAVGTFTWVCTPFLVALSTFAVYVTVDENNILDAQKAFVSLALFNILRFPLNILPMVIS
SIVQASVSLKRLRVFLSHEDLDPDSIQRRPIKDAGATNSITVKNATFTWARNDPPTLHGITFSVPEGSLVAVVGQVGCGK
SSLLSALLAEMDKVEGHVTVKGSVAYVPQQAWIQNISLRENILFGRQLQERYYKAVVEACALLPDLEILPSGDRTEIGEK
GVNLSGGQKQRVSLARAVYCDSDVYLLDDPLSAVDAHVGKHIFENVIGPKGLLKNKTRLLVTHAISYLPQMDVIIVMSGG
KISEMGSYQELLARDGAFAEFLRTYASAEQEQGQPEDGLAGVGGPGKEVKQMENGMLVTDTAGKQMQRQLSSSSSYSRDV
SQHHTSTAELRKPGPTEETWKLVEADKAQTGQVKLSVYWDYMKAIGLFISFLSIFLFLCNHVASLVSNYWLSLWTDDPIV
NGTQEHTQVRLSVYGALGISQGITVFGYSMAVSIGGIFASRRLHLDLLHNVLRSPISFFERTPSGNLVNRFSKELDTVDS
MIPQVIKMFMGSLFNVIGACIIILLATPMAAVIIPPLGLIYFFVQRFYVASSRQLKRLESVSRSPVYSHFNETLLGVSVI
RAFEEQERFIRQSDLKVDENQKAYYPSIVANRWLAVRLECVGNCIVLFASLFAVISRHSLSAGLVGLSVSYSLQVTTYLN
WLVRMSSEMETNIVAVERLKEYSETEKEAPWQIQDMAPPKDWPQVGRVEFRDYGLRYREDLDLVLKHINVTIDGGEKVGI
VGRTGAGKSSLTLGLFRIKESAEGEIIIDDINIAKIGLHDLRFKITIIPQDPVLFSGSLRMNLDPFSQYSDEEVWTSLEL
AHLKGFVSALPDKLNHECAEGGENLSVGQRQLVCLARALLRKTKILVLDEATAAVDLETDDLIQSTIRTQFDDCTVLTIA
HRLNTIMDYTRVIVLDKGEIQEWGSPSDLLQQRGLFYSMAKDSGLV

cryptogenomicon · 2024-04-10T15:13:43Z

(Pleasantly surprised to see the multithreaded parallelization scaling well to 22 cores; normally jackhmmer starts becoming input-bound very quickly, even at 2 cores or so. You must have some fast storage.)

Augustin-Zidek · 2024-04-10T18:29:34Z

A few more observations.

We have one further patch to output a3m instead of Stockholm from Jackhmmer:

-if (textw > 0) esl_msafile_Write(afp, msa, eslMSAFILE_STOCKHOLM);
+// Write the output as A2M unless a '.sto' file extension is provided.
+if (textw > 0) {
+    if (afp_ext && strcmp(afp_ext, "sto") == 0) {
+        esl_msafile_Write(afp, msa, eslMSAFILE_STOCKHOLM);
+    } else if (strcmp(esl_opt_GetString(go, "-A"), "/dev/stdout") == 0 ||
+               (afp_ext && (strcmp(afp_ext, "a3m") == 0 || strcmp(afp_ext, "a2m") == 0))) {
+        esl_msafile_Write(afp, msa, eslMSAFILE_A2M);
+    } else {
+        p7_Fail("Bad output MSA format %s, only a2m/a3m and sto are supported\n", afp_ext);
+    }
+}
 else           esl_msafile_Write(afp, msa, eslMSAFILE_PFAM);

This patch reduces the output MSA size from 45 GB (Stockholm) to 1.6 GB (a3m) and slightly reduces the runtime to:

real    19m2.367s
user    324m10.399s
sys     2m23.680s

This patch together with the --seq_limit patch, we see significantly better runtime (and only 1.6 GB MSA a3m output file size):

real    15m22.416s
user    321m15.616s
sys     1m49.013s

It therefore seems to me that there must me some non-trivial amount of work that is eliminated thanks to applying the naive sequence limit patch.

Augustin-Zidek · 2024-04-19T08:55:04Z

I did a bit of profiling and here is the profile for the offending query with --cpu 1 (and with our --seq_limit and a3m-output patches applied).

For comparison, here is profile for a query that runs 2.5x faster because it doesn't have that many hits:

Does the profile look reasonable to you? Is it reasonable for p7_domaindef_ByPosteriorHeuristics to account for most of the run time?

npcarter · 2024-04-19T20:39:14Z

The time breakdown of a hmmsearch run will vary tremendously depending on the fraction of sequences in the database that are detected as homologs. While I haven't done much profiling of hmmsearch with the -A option on, those profiles show a much higher fraction of time being spent in the later stages of our comparison pipeline, which would include p7_domaindef_ByPosteriorHeuristics, than the profiles I've seen. A more typical run would have the SSV filter, which you're seeing as calc_step_14 taking 40-60% of the run time of the compute thread, and the final stage, which calls p7_domaindef_ByPosteriorHeuristics, taking more like 10%. For example, here's the top of the output of a perf report on the compute thread when I searched the Caudal_act HMM from Pfam against a version of the Uniprot TrEMBL database, with the -A option turned on: + 72.02% 60.86% hmmsearch hmmsearch [.] calc_band_9 + 13.78% 13.17% hmmsearch hmmsearch [.] p7_ViterbiFilter + 10.59% 5.68% hmmsearch hmmsearch [.] p7_pli_NewSeq + 8.52% 6.89% hmmsearch hmmsearch [.] esl_hmm_Forward + 6.01% 0.00% hmmsearch [unknown] [.] 0x8000800080008000 + 4.41% 0.00% hmmsearch [unknown] [.] 0x8000800080008000 4.39% 4.34% hmmsearch hmmsearch [.] forward_engine + 2.35% 0.00% hmmsearch [unknown] [.] 0x00005740f62e3120 + 2.35% 0.00% hmmsearch [unknown] [.] 0x00005740f62e3ad0 1.21% 1.00% hmmsearch hmmsearch [.] p7_MSVFilter Here, the left-hand column of percentages is runtime fraction of the function and all its children, while the column to the right is the run-time fraction of the function itself. In this search, hmmsearch finds about 2k homologs in searching 200M+ sequences, so spends most of its time in the higher stages of the comparison pipeline. (Note that the number after calc_band_ is the number of SIMD vectors used in computing a row of the DP matrix, so varies with HMM length.) If you have the outputs of these searches, could you send me the pipeline statistics information that appears at the very end of the output? That'd let me see what fraction of the HMM-sequence comparisons are making it to each stage of the pipeline. Also, can you tell me what tool you used to generate those performance profiles? I haven't been happy with the ones I have access to, so would appreciate a pointer to any good ones you know of.

…

--Nick

On Fri, Apr 19, 2024 at 4:55 AM Augustin Zidek ***@***.***> wrote: I did a bit of profiling and here is the profile for the offending query with --cpu 1 (and with our --seq_limit and a3m-output patches applied). image.png (view on web) <https://github.com/EddyRivasLab/hmmer/assets/6466058/3ab3902c-fce0-46ed-b0f0-55d2d9ecde11> For comparison, here is profile for a query that runs 2.5x faster because it doesn't have that many hits: image.png (view on web) <https://github.com/EddyRivasLab/hmmer/assets/6466058/b3b55f5d-58cc-496c-83c6-638cc43f5c1c> Does the profile look reasonable to you? Is it reasonable for p7_domaindef_ByPosteriorHeuristics to account for most of the run time? — Reply to this email directly, view it on GitHub <#323 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABDJBZENH6LN7JHSQZW6W7LY6DLX5AVCNFSM6AAAAABGAG6PVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRWGEZTCNRRGQ> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

Augustin-Zidek · 2024-04-22T09:39:30Z

Thanks for the extra details, Nick!

The time breakdown of a hmmsearch run will vary tremendously depending on the fraction of sequences in the database that are detected as homologs.

Indeed -- this query (see #323 (comment)) has over 1M homologs in UniRef90.

For example, here's the top of the output of a perf report on the compute thread when I searched the Caudal_act HMM from Pfam against a version of the Uniprot TrEMBL database

Could run this report with the sequence and database I reported above to check if you get similar numbers to what I am seeing when profiling?

If you have the outputs of these searches, could you send me the pipeline statistics information that appears at the very end of the output?

Absolutely, here they are:

Internal pipeline statistics summary:
-------------------------------------
Query model(s):                            1  (1326 nodes)
Target sequences:                  135301051  (45360760554 residues searched)
Passed MSV filter:                   5218733  (0.0385713); expected 2706021.0 (0.02)
Passed bias filter:                  4358682  (0.0322147); expected 2706021.0 (0.02)
Passed Vit filter:                   1763645  (0.013035); expected 135301.1 (0.001)
Passed Fwd filter:                   1501121  (0.0110947); expected 1353.0 (1e-05)
Initial search space (Z):          135301051  [actual number of targets]
Domain search space  (domZ):         1061640  [number of targets reported over threshold]
# CPU time: 19273.26u 351.44s 05:27:04.69 Elapsed: 00:22:30.69
# Mc/sec: 44531.39

@@ New targets included:   1061640
@@ New alignment includes: 1072120 subseqs (was 1), including original query
# Alignment of 1072120 hits satisfying inclusion thresholds saved to: out.a3m

Also, can you tell me what tool you used to generate those performance profiles?

I used the perf record command (https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record):

perf record -g <jackhmmer command>

I then visualiased the generated perf.data using pprof (https://github.com/google/pprof):

pprof -flame perf.data

This might require an extra conversion step (see https://github.com/google/pprof#using-pprof-with-linux-perf).

npcarter · 2024-04-22T15:49:11Z

Thanks for the pointers on profiling tools. I've used perf, but hadn't seen the pprof visualizer. Looking at the pipeline statistics you sent, you're seeing a very large fraction of both hits and comparisons that make it to the later stages of the pipeline, which I think explains the profiling results you're seeing. One of our rules of thumb is that the most extreme searches will find 1% of the database as hits, and you're seeing close to that. You're also seeing significantly more comparisons making it through each of the filter stages than would normally be expected, even after subtracting the comparisons that become hits. For example, the MSV filter is tuned such that we expect 2% of non-homologous sequences to pass, and you're seeing almost 4% (5M that pass MSV - 1M that are detected as homologs). Given all that, the profile you sent doesn't surprise me that much. I'm pretty busy today but will try to run your query and database on our machines sometime in the next few days to compare.

…

-Nick

On Mon, Apr 22, 2024 at 5:39 AM Augustin Zidek ***@***.***> wrote: Thanks for the extra details, Nick! The time breakdown of a hmmsearch run will vary tremendously depending on the fraction of sequences in the database that are detected as homologs. Indeed -- this query (see #323 (comment) <#323 (comment)>) has over 1M homologs in UniRef90. For example, here's the top of the output of a perf report on the compute thread when I searched the Caudal_act HMM from Pfam against a version of the Uniprot TrEMBL database Could run this report with the sequence and database I reported above to check if you get similar numbers to what I am seeing when profiling? If you have the outputs of these searches, could you send me the pipeline statistics information that appears at the very end of the output? Absolutely, here they are: Internal pipeline statistics summary: ------------------------------------- Query model(s): 1 (1326 nodes) Target sequences: 135301051 (45360760554 residues searched) Passed MSV filter: 5218733 (0.0385713); expected 2706021.0 (0.02) Passed bias filter: 4358682 (0.0322147); expected 2706021.0 (0.02) Passed Vit filter: 1763645 (0.013035); expected 135301.1 (0.001) Passed Fwd filter: 1501121 (0.0110947); expected 1353.0 (1e-05) Initial search space (Z): 135301051 [actual number of targets] Domain search space (domZ): 1061640 [number of targets reported over threshold] # CPU time: 19273.26u 351.44s 05:27:04.69 Elapsed: 00:22:30.69 # Mc/sec: 44531.39 @@ New targets included: 1061640 @@ New alignment includes: 1072120 subseqs (was 1), including original query # Alignment of 1072120 hits satisfying inclusion thresholds saved to: out.a3m Also, can you tell me what tool you used to generate those performance profiles? I used the perf record command ( https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record ): perf record -g '<jackhmmer command>' I then visualiased the generated perf.data using pprof ( https://github.com/google/pprof): pprof -flame perf.data This might require an extra conversion step (see https://github.com/google/pprof#using-pprof-with-linux-perf). — Reply to this email directly, view it on GitHub <#323 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABDJBZBSM76OU76FKZZIQVLY6TLGRAVCNFSM6AAAAABGAG6PVSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANRYHE2DIMBSGE> . You are receiving this because you commented.Message ID: ***@***.***>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: Limit the MSA depth in Jackhmmer/Nhmmer/Phmmer/Hmmsearch #323

Feature request: Limit the MSA depth in Jackhmmer/Nhmmer/Phmmer/Hmmsearch #323

Augustin-Zidek commented Apr 10, 2024

cryptogenomicon commented Apr 10, 2024

Augustin-Zidek commented Apr 10, 2024

cryptogenomicon commented Apr 10, 2024

Augustin-Zidek commented Apr 10, 2024

cryptogenomicon commented Apr 10, 2024

Augustin-Zidek commented Apr 10, 2024 •

edited

cryptogenomicon commented Apr 10, 2024

Augustin-Zidek commented Apr 10, 2024 •

edited

Augustin-Zidek commented Apr 19, 2024

npcarter commented Apr 19, 2024 via email

Augustin-Zidek commented Apr 22, 2024 •

edited

npcarter commented Apr 22, 2024 via email

Feature request: Limit the MSA depth in Jackhmmer/Nhmmer/Phmmer/Hmmsearch #323

Feature request: Limit the MSA depth in Jackhmmer/Nhmmer/Phmmer/Hmmsearch #323

Comments

Augustin-Zidek commented Apr 10, 2024

cryptogenomicon commented Apr 10, 2024

Augustin-Zidek commented Apr 10, 2024

cryptogenomicon commented Apr 10, 2024

Augustin-Zidek commented Apr 10, 2024

cryptogenomicon commented Apr 10, 2024

Augustin-Zidek commented Apr 10, 2024 • edited

cryptogenomicon commented Apr 10, 2024

Augustin-Zidek commented Apr 10, 2024 • edited

Augustin-Zidek commented Apr 19, 2024

npcarter commented Apr 19, 2024 via email

Augustin-Zidek commented Apr 22, 2024 • edited

npcarter commented Apr 22, 2024 via email

Augustin-Zidek commented Apr 10, 2024 •

edited

Augustin-Zidek commented Apr 10, 2024 •

edited

Augustin-Zidek commented Apr 22, 2024 •

edited