Difference in evalue when doijng search against swissprot_human and swissprot #827

ahof1704 · 2024-03-25T11:51:11Z

I would like to find the closest human protein sequence for any given query. Often the query sequences are short (10aa). To test if mmseq could help with that, I created a fasta file that contains:

10 protein sequences from the filtered swissprot dataset to include just human samples (aka, swissprot_human)
10 short sequences created through sampling one segment of 10aa from the first 10 human sequences

The resulting sequences are illustrated in this csv (I couldn't upload the csv, so I am sharing a link. I hope you can access it).

You will notice that the same query sequence has a different evalue between swissprot and swissprot_human, which I assume plays a role in the reduced number of matches against swissprot. How can I ensure the search results are consistent for the two datasets?

Thank you so much for the help!

milot-mirdita · 2024-03-27T05:39:28Z

How did you do the searches? How did you create the swissprot_human subset?

I am surprised you are getting results at all for your test 2). The default parameters of MMseqs2 are set in such a way that you should not be able to find anything shorter than 11. The default k-mer size of 6 uses 4 non-informative spaced positions and we require a double consecutive k-mer match, meaning 11 should be the shortest.

ahof1704 · 2024-03-27T13:54:13Z

In my search, I am interested only in exact matches to known human proteins, regardless of the significance (which seems to be conditioned on the length of the sequence). For that, I have changed the significance threshold, k-mer size and spacer. This is the search command I am using:
mmseqs search /tmp/queryDB /root/mmseqs2_db/swissprot /tmp/resultDB tmp -a --max-seqs 1000 -s 7.5 --comp-bias-corr 0 --mask 0 --split-memory-limit 120G -e 10000 -k 7 --spaced-kmer-mode 0

To create the human subset of the swissprot, I used the following commands to download and filter the dataset:
mmseqs databases UniProtKB/Swiss-Prot /root/mmseqs2_db/swissprot tmp
mmseqs filtertaxseqdb /root/mmseqs2_db/swissprot /root/mmseqs2_db/swissprot_human --taxon-list 9606

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Difference in evalue when doijng search against swissprot_human and swissprot #827

Difference in evalue when doijng search against swissprot_human and swissprot #827

ahof1704 commented Mar 25, 2024

milot-mirdita commented Mar 27, 2024

ahof1704 commented Mar 27, 2024

Difference in evalue when doijng search against swissprot_human and swissprot #827

Difference in evalue when doijng search against swissprot_human and swissprot #827

Comments

ahof1704 commented Mar 25, 2024

milot-mirdita commented Mar 27, 2024

ahof1704 commented Mar 27, 2024