Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difference in evalue when doijng search against swissprot_human and swissprot #827

Open
ahof1704 opened this issue Mar 25, 2024 · 2 comments

Comments

@ahof1704
Copy link

I would like to find the closest human protein sequence for any given query. Often the query sequences are short (10aa). To test if mmseq could help with that, I created a fasta file that contains:

  1. 10 protein sequences from the filtered swissprot dataset to include just human samples (aka, swissprot_human)
  2. 10 short sequences created through sampling one segment of 10aa from the first 10 human sequences

The resulting sequences are illustrated in this csv (I couldn't upload the csv, so I am sharing a link. I hope you can access it).

You will notice that the same query sequence has a different evalue between swissprot and swissprot_human, which I assume plays a role in the reduced number of matches against swissprot. How can I ensure the search results are consistent for the two datasets?

Thank you so much for the help!

@milot-mirdita
Copy link
Member

How did you do the searches? How did you create the swissprot_human subset?

I am surprised you are getting results at all for your test 2). The default parameters of MMseqs2 are set in such a way that you should not be able to find anything shorter than 11. The default k-mer size of 6 uses 4 non-informative spaced positions and we require a double consecutive k-mer match, meaning 11 should be the shortest.

@ahof1704
Copy link
Author

In my search, I am interested only in exact matches to known human proteins, regardless of the significance (which seems to be conditioned on the length of the sequence). For that, I have changed the significance threshold, k-mer size and spacer. This is the search command I am using:
mmseqs search /tmp/queryDB /root/mmseqs2_db/swissprot /tmp/resultDB tmp -a --max-seqs 1000 -s 7.5 --comp-bias-corr 0 --mask 0 --split-memory-limit 120G -e 10000 -k 7 --spaced-kmer-mode 0

To create the human subset of the swissprot, I used the following commands to download and filter the dataset:
mmseqs databases UniProtKB/Swiss-Prot /root/mmseqs2_db/swissprot tmp
mmseqs filtertaxseqdb /root/mmseqs2_db/swissprot /root/mmseqs2_db/swissprot_human --taxon-list 9606

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants