MMseqs search not finding exact and close-exact hits #842

Open
mcn3159 opened this issue May 3, 2024 · 1 comment

mcn3159 commented May 3, 2024

Expected Behavior

Searching proteins against a database with similar and exact proteins (from bacterial refseq proteome) should return hits with similar and exact matches.

Current Behavior

Running mmseqs search returns few to no hits, whereas easy-search returns many more (the expected number).

Steps to Reproduce (for bugs)

For mmseqs search:

  • Create query and target databases from query_fasta and target_fasta
  • Run mmseqs search with min-seq-id and coverage both set to 0.95, using coverage mode 0
  • Run mmseqs convertalis on the result

For mmseqs easy-search:

  • Run easy-search directly on the query and target fastas, with the same search parameters
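
For reference, the two routes above might be sketched as follows. This is only an illustration, not the exact commands from the report: the database names (queryDB, targetDB, resultDB), the output file names, the tmp directory, and the reading of "coverage" as -c 0.95 are all assumptions. The script only prints the command lines, so it can be inspected without MMseqs2 installed.

```python
import shlex

# Thresholds as read from the report (assumption: coverage is also 0.95).
SEARCH_OPTS = ["--min-seq-id", "0.95", "-c", "0.95", "--cov-mode", "0"]


def search_workflow(query_fasta, target_fasta, extra_opts=()):
    """Commands for the createdb -> search -> convertalis route.

    Database and output names are placeholders chosen for this sketch.
    """
    return [
        ["mmseqs", "createdb", query_fasta, "queryDB"],
        ["mmseqs", "createdb", target_fasta, "targetDB"],
        ["mmseqs", "search", "queryDB", "targetDB", "resultDB", "tmp",
         *SEARCH_OPTS, *extra_opts],
        ["mmseqs", "convertalis", "queryDB", "targetDB", "resultDB",
         "search_hits.m8"],
    ]


def easy_search_workflow(query_fasta, target_fasta):
    """Single command for the easy-search route, same thresholds."""
    return [
        ["mmseqs", "easy-search", query_fasta, target_fasta,
         "easy_hits.m8", "tmp", *SEARCH_OPTS],
    ]


commands = (search_workflow("query_subset.faa", "406_subset.faa")
            + easy_search_workflow("query_subset.faa", "406_subset.faa"))
for cmd in commands:
    # To actually run a step: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Comparing search_hits.m8 with easy_hits.m8 afterwards should show the difference described above.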

MMseqs Output (for bugs)

MMseqs search output: https://gist.github.com/mcn3159/9a5ed05852e2e83b8656d25f0333a8f3

Context

I am searching a fasta of known bacterial proteins against the bacterial refseq WP proteome. I noticed that only half of my original virulence proteins (out of ~8000) had hits against refseq. The refseq proteome is large, so I put together a minimal example in which the query and target databases share an exact match (as well as similar proteins, according to easy-search) that mmseqs search doesn't seem to find, but easy-search does.

I can provide the larger fastas if more examples to replicate are necessary.

The attached .zip file contains 2 fastas with 4 proteins each. One protein is an exact match (same WP_number), and 2 others (WP_000633131.1 and WP_000633136.1) are very similar to the protein with the exact match.

fastas_to_search.zip
query fasta = query_subset.faa
target_fasta = 406_subset.faa
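
A quick way to confirm, independently of MMseqs2, that the two fastas really share an identical sequence is a small script like the one below. It is a minimal sketch: the helper names read_fasta and exact_matches are made up for this example, and the toy records in the demo are invented placeholders, not the sequences from the attached files.

```python
def read_fasta(lines):
    """Parse FASTA lines into {identifier: sequence}.

    The identifier is the first whitespace-delimited token after '>'.
    """
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def exact_matches(query, target):
    """Map each query id to the target ids with an identical sequence."""
    by_sequence = {}
    for target_id, seq in target.items():
        by_sequence.setdefault(seq, []).append(target_id)
    return {query_id: by_sequence[seq]
            for query_id, seq in query.items() if seq in by_sequence}


# Toy records for illustration only (not taken from the attached fastas).
query = read_fasta([">q1 toy protein", "MKTAYIAK", "QR", ">q2", "MSSS"])
target = read_fasta([">t1", "MKTAYIAKQR", ">t2", "MAAA"])
print(exact_matches(query, target))
```

With the real files, something like read_fasta(open("query_subset.faa")) and read_fasta(open("406_subset.faa")) can be passed instead of the toy lists.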

Your Environment

Include as many relevant details about the environment you experienced the bug in.

  • Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 15.6f452
  • Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): conda

milot-mirdita (Member) commented

The trap is likely the sequence identity estimation (see https://github.com/soedinglab/MMseqs2/wiki#how-does-mmseqs2-compute-the-sequence-identity).

Adding -a or --alignment-mode 3 fixes the issue. easy-search is better at detecting when exact sequence identity is required, whereas search estimates the sequence identity by default and tries to detect when the exact value is needed.
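
As a rough illustration of the suggested fix, the search step could be built with either option as shown below. The database names, the tmp directory, the -c 0.95 value, and the helper name search_cmd are assumptions made for this sketch; only -a and --alignment-mode 3 come from the comment above. The script prints the command lines rather than running them.

```python
import shlex


def search_cmd(fix="backtrace"):
    """Build an mmseqs search command with one of the two suggested options.

    fix="backtrace"      appends -a
    fix="alignment-mode" appends --alignment-mode 3
    fix=None             leaves the command as in the original report
    """
    cmd = ["mmseqs", "search", "queryDB", "targetDB", "resultDB", "tmp",
           "--min-seq-id", "0.95", "-c", "0.95", "--cov-mode", "0"]
    if fix == "backtrace":
        cmd.append("-a")
    elif fix == "alignment-mode":
        cmd += ["--alignment-mode", "3"]
    elif fix is not None:
        raise ValueError(f"unknown fix: {fix!r}")
    return cmd


for option in ("backtrace", "alignment-mode"):
    # To actually run it: subprocess.run(search_cmd(option), check=True)
    print(shlex.join(search_cmd(option)))
```

Either variant would then be followed by the same mmseqs convertalis step as before.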
