unexpected behaviour - v 2.28 sintax chooses first sequence when not classifiable #563

Open
givdieri opened this issue May 17, 2024 · 4 comments

@givdieri

givdieri commented May 17, 2024

command:
$ vsearch --db UNITE10.fasta --sintax refs.fasta --tabbedout refs_sintaxonomy.tsv --sintax_cutoff 0.8 --sintax_random
vsearch v2.28.1_linux_x86_64, 251.2GB RAM, 96 cores

produces this output for the first 5 sequences (these are not closely related):

Laccaria amethystina ITS d:Fungi(1.00),p:Basidiomycota(0.79),c:Agaricomycetes(0.78),o:Boletales(0.73),f:Paxillaceae(0.73),g:Melanogaster(0.73),s:SH0000009.10FU(0.73)
Tomentella sublilacina ITS d:Fungi(1.00),p:Basidiomycota(0.74),c:Agaricomycetes(0.74),o:Boletales(0.48),f:Paxillaceae(0.48),g:Melanogaster(0.48),s:SH0000009.10FU(0.48)
Inocybe napipes ITS d:Fungi(1.00),p:Basidiomycota(0.90),c:Agaricomycetes(0.90),o:Boletales(0.77),f:Paxillaceae(0.77),g:Melanogaster(0.77),s:SH0000009.10FU(0.77)
Lactarius subdulcis ITS d:Fungi(1.00),p:Ascomycota(0.44),c:Agaricomycetes(0.37),o:Saccharomycetales(0.16),f:Paxillaceae(0.15),g:Melanogaster(0.15),s:SH0000009.10FU(0.15)
Russula ochroleuca ITS d:Fungi(1.00),p:Basidiomycota(0.46),c:Agaricomycetes(0.44),o:Boletales(0.33),f:Paxillaceae(0.33),g:Melanogaster(0.33),s:SH0000009.10FU(0.33)

These sequences are all mapping to Melanogaster (SH0000009.10FU).
Maybe they are not 'classifiable' and therefore map to the first close sequence in the DB file? (Melanogaster ref is the 9th sequence in the reference file of 159189 sequences).
Melanogaster is definitely not the closest match in the DB.

The previous version I used (v2.21) left these sequences unclassified.
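
To make the effect of `--sintax_cutoff 0.8` on the predictions above easier to see, here is a minimal Python sketch that parses a prediction string in the form shown in the output and keeps only the leading ranks that reach the cutoff. The function names `parse_prediction` and `apply_cutoff` are made up for this illustration, and truncating at the first rank below the cutoff is an assumption about how the cutoff is applied, not a description of vsearch internals.

```python
import re

# Matches one rank entry such as "p:Basidiomycota(0.79)".
RANK_RE = re.compile(r"([a-z]):([^,()]+)\(([0-9.]+)\)")


def parse_prediction(prediction):
    """Return a list of (rank, name, confidence) tuples."""
    return [(rank, name, float(conf))
            for rank, name, conf in RANK_RE.findall(prediction)]


def apply_cutoff(ranks, cutoff):
    """Keep ranks from the top down, stopping at the first one below the cutoff.

    Assumption: the cutoff truncates the lineage at the first failing rank.
    """
    kept = []
    for rank, name, conf in ranks:
        if conf < cutoff:
            break
        kept.append(f"{rank}:{name}")
    return ",".join(kept)


# Prediction strings copied from the output above.
laccaria = ("d:Fungi(1.00),p:Basidiomycota(0.79),c:Agaricomycetes(0.78),"
            "o:Boletales(0.73),f:Paxillaceae(0.73),g:Melanogaster(0.73),"
            "s:SH0000009.10FU(0.73)")
inocybe = ("d:Fungi(1.00),p:Basidiomycota(0.90),c:Agaricomycetes(0.90),"
           "o:Boletales(0.77),f:Paxillaceae(0.77),g:Melanogaster(0.77),"
           "s:SH0000009.10FU(0.77)")

print(apply_cutoff(parse_prediction(laccaria), 0.8))
print(apply_cutoff(parse_prediction(inocybe), 0.8))
```

Under that assumption, none of the five queries keeps the Melanogaster assignment at a cutoff of 0.8; the unexpected part is that Melanogaster is reported as the best candidate at all.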

@torognes
Owner

Thank you for reporting this problem. I'm looking into it now.

@torognes torognes self-assigned this May 21, 2024
@torognes
Owner

I am unable to reproduce the problem with the given information. Could you please provide the sequences of some of the query sequences in refs.fasta that you used, e.g. Laccaria amethystina ITS. Could you please also indicate exactly which database file you have used, as I cannot find the SH0000009.10FU sequence in the UNITE version 10 files available at https://unite.ut.ee/repository.php.

@givdieri
Author

Mr. Rognes,
Thanks for looking into it!

It is UNITE version 10, mutated so that the SH codes are in place of the species name.
I also added a few of my own reference sequences to the UNITE10.fasta file (those that did not map to any SH in a previous round of SINTAX classification). I've given them fake SH codes for easier downstream string manipulation (SH0*[1-9].09FU).

Attached are part of refs.fasta and the added sequences to UNITE10 that show the mutated formatting.

refs_added_UNITE10.txt
partial_refs.txt

@torognes
Owner

Sorry, but I am still unable to reproduce the results you get. Are you sure the input and database files are properly formatted? I've performed several tests with your sequences that all give reasonable results.

Could you try to make a tiny example (as small as possible) that still gives the wrong results, and present the exact files used and the exact command line?
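
One way to check the formatting question before building a minimal example is to scan the database headers for the `;tax=` annotation that SINTAX reads. The following is a rough Python sketch, not part of vsearch: `check_header` and `check_fasta` are hypothetical helper names, the record IDs in the sample are invented, and the rank pattern is deliberately lenient (any single lowercase letter followed by a colon).

```python
import re

# Taxonomy annotation: ";tax=" followed by everything up to the next ";".
TAX_RE = re.compile(r";tax=([^;]+)")
# One rank field, e.g. "g:Melanogaster" (lenient: any lowercase rank letter).
FIELD_RE = re.compile(r"^[a-z]:[^,]+$")


def check_header(header):
    """Return a problem description for one FASTA header, or None if it looks fine."""
    match = TAX_RE.search(header)
    if match is None:
        return "no ';tax=' annotation"
    for field in match.group(1).split(","):
        if not FIELD_RE.match(field):
            return f"malformed rank field: {field!r}"
    return None


def check_fasta(lines):
    """Return (line_number, problem) pairs for every suspicious header line."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            problem = check_header(line[1:].strip())
            if problem:
                problems.append((number, problem))
    return problems


# Invented sample records; only the lineage of the first one is taken from this issue.
sample = [
    ">ref1;tax=d:Fungi,p:Basidiomycota,c:Agaricomycetes,o:Boletales,"
    "f:Paxillaceae,g:Melanogaster,s:SH0000009.10FU;",
    "ACGT",
    ">ref2 no taxonomy here",
    "ACGT",
    ">ref3;tax=d:Fungi,Basidiomycota;",
    "ACGT",
]

for number, problem in check_fasta(sample):
    print(f"line {number}: {problem}")
```

To run it on a real file, pass an open file handle instead of the list, for example `check_fasta(open("UNITE10.fasta"))`. Any header it flags is a candidate for the tiny reproducing example.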
