You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I would like to find the closest human protein sequence for any given query. Often the query sequences are short (10aa). To test if mmseq could help with that, I created a fasta file that contains:
10 protein sequences from the filtered swissprot dataset to include just human samples (aka, swissprot_human)
10 short sequences created through sampling one segment of 10aa from the first 10 human sequences
The resulting sequences are illustrated in this csv (I couldn't upload the csv, so I am sharing a link. I hope you can access it).
You will notice that the same query sequence has a different evalue between swissprot and swissprot_human, which I assume plays a role in the reduced number of matches against swissprot. How can I ensure the search results are consistent for the two datasets?
Thank you so much for the help!
The text was updated successfully, but these errors were encountered:
How did you do the searches? How did you create the swissprot_human subset?
I am surprised you are getting results at all for your test 2). The default parameters of MMseqs2 are set in such a way that you should not be able to find anything shorter than 11. The default k-mer size of 6 uses 4 non-informative spaced positions and we require a double consecutive k-mer match, meaning 11 should be the shortest.
In my search, I am interested only in exact matches to known human proteins, regardless of the significance (which seems to be conditioned on the length of the sequence). For that, I have changed the significance threshold, k-mer size and spacer. This is the search command I am using: mmseqs search /tmp/queryDB /root/mmseqs2_db/swissprot /tmp/resultDB tmp -a --max-seqs 1000 -s 7.5 --comp-bias-corr 0 --mask 0 --split-memory-limit 120G -e 10000 -k 7 --spaced-kmer-mode 0
To create the human subset of the swissprot, I used the following commands to download and filter the dataset: mmseqs databases UniProtKB/Swiss-Prot /root/mmseqs2_db/swissprot tmp mmseqs filtertaxseqdb /root/mmseqs2_db/swissprot /root/mmseqs2_db/swissprot_human --taxon-list 9606
I would like to find the closest human protein sequence for any given query. Often the query sequences are short (10aa). To test if mmseq could help with that, I created a fasta file that contains:
The resulting sequences are illustrated in this csv (I couldn't upload the csv, so I am sharing a link. I hope you can access it).
You will notice that the same query sequence has a different evalue between swissprot and swissprot_human, which I assume plays a role in the reduced number of matches against swissprot. How can I ensure the search results are consistent for the two datasets?
Thank you so much for the help!
The text was updated successfully, but these errors were encountered: