creating custom db causes classification issues #257

Open
najoshi opened this issue Feb 23, 2023 · 3 comments

Comments

@najoshi

najoshi commented Feb 23, 2023

So I have created a custom database by taking the refseq proteins and adding proteins from a database called RUG2. When I run classification with the regular refseq database on one of my samples, I get about 18M classified reads. When I run the same sample with refseq plus RUG2, I only get about 11K classified reads. I don't understand why adding proteins to an existing database to create a new database results in so many fewer classifications. I'm happy to share any files you need to debug the issue. Any help would be highly appreciated.

@pmenzel
Member

pmenzel commented Feb 24, 2023

Some points you can check:

  • the taxonomy must also work out for the RUG2 database: does your FASTA file have proper headers with proper taxonomy IDs that are also contained in your names.dmp / nodes.dmp?
  • what happens when you make a kaiju index only of the RUG2 database and classify the reads
  • take one of the sequences from your DB and give it as input to kaiju -p to classify it; it should be found (obviously)
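
The first check in the list above can be sketched in Python. This is only a rough illustration and not part of Kaiju. It assumes each FASTA header carries a numeric NCBI taxon ID, optionally prefixed by another identifier and an underscore (for example `>1_1358`), and that nodes.dmp follows the standard NCBI layout with the taxon ID as the first `|`-separated field. The function names are hypothetical.

```python
def load_node_ids(lines):
    """Collect the taxon IDs from the first '|'-separated field of nodes.dmp lines."""
    ids = set()
    for line in lines:
        line = line.strip()
        if line:
            ids.add(line.split("|")[0].strip())
    return ids


def header_taxid(header):
    """Return the text after the last underscore of the first header token.

    Assumption: headers look like '>1358' or '>counter_1358'.
    """
    fields = header.lstrip(">").split()
    if not fields:
        return ""
    return fields[0].rsplit("_", 1)[-1]


def missing_taxids(fasta_lines, node_ids):
    """Count header taxon IDs that are non-numeric or absent from node_ids."""
    missing = {}
    for line in fasta_lines:
        if line.startswith(">"):
            taxid = header_taxid(line)
            if not taxid.isdigit() or taxid not in node_ids:
                missing[taxid] = missing.get(taxid, 0) + 1
    return missing
```

With real files this could be called as `missing_taxids(open("custom.faa"), load_node_ids(open("nodes.dmp")))`, where both file names are placeholders. An empty result would suggest every header maps to a known node.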

@najoshi
Author

najoshi commented Mar 1, 2023

So if I have some headers in my custom fasta file that do NOT have tax IDs that occur in nodes.dmp... will that cause problems?
When I run kaiju using only the RUG2 database, I get very few classifications.
When I get one of the RUG2 protein sequences and run it against my custom DB with kaiju -p, it DOES NOT classify it. So that's obviously a problem.
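
To put a number on "very few classifications", the Kaiju output can be tallied. The sketch below assumes Kaiju's default tab-separated output, where the first column is `C` (classified) or `U` (unclassified). The function names are hypothetical.

```python
from collections import Counter


def classification_counts(lines):
    """Tally 'C' and 'U' in the first tab-separated column of each output line."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in ("C", "U"):
            counts[fields[0]] += 1
    return counts


def classified_fraction(counts):
    """Return the share of reads marked 'C', or 0.0 when there are no reads."""
    total = counts["C"] + counts["U"]
    return counts["C"] / total if total else 0.0
```

Running this on the output of each database (refseq only, RUG2 only, combined) would make the drop directly comparable across runs.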

@najoshi
Author

najoshi commented Mar 1, 2023

Looks like if there is any header where the tax ID does not occur in nodes.dmp then it screws up the database. Once I took out the proteins that had tax IDs that don't occur in nodes.dmp (and proteins with X's in them), the database built properly and it seems to be classifying reads well.
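
A cleanup along these lines could be sketched as follows. This is a minimal Python illustration, not the script actually used in this thread. It assumes headers of the form `>1358` or `>counter_1358`, and it drops any record whose taxon ID is missing from a pre-loaded set of nodes.dmp IDs or whose sequence contains an `X`. Whether the `X` residues contributed to the problem is not established here. The function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif header is not None:
            seq.append(line.strip())
    if header is not None:
        yield header, "".join(seq)


def keep_record(header, seq, node_ids):
    """Keep a record only if its taxon ID is known and it has no 'X' residues."""
    fields = header[1:].split()
    if not fields:
        return False
    # Assumption: the taxon ID is the text after the last underscore.
    taxid = fields[0].rsplit("_", 1)[-1]
    return taxid in node_ids and "X" not in seq.upper()


def filter_records(lines, node_ids):
    """Return (kept_records, dropped_count) for the given FASTA lines."""
    kept, dropped = [], 0
    for header, seq in read_fasta(lines):
        if keep_record(header, seq, node_ids):
            kept.append((header, seq))
        else:
            dropped += 1
    return kept, dropped
```

The kept records can then be written back out as FASTA and used to rebuild the index. Reporting the dropped count makes it easy to see how much of the custom database was affected.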
