Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Does krakenuniq ever suppress (countable) kmers from reads? #146

Open
pfeiferd opened this issue Jul 6, 2023 · 5 comments
Open

Does krakenuniq ever suppress (countable) kmers from reads? #146

pfeiferd opened this issue Jul 6, 2023 · 5 comments

Comments

@pfeiferd
Copy link

pfeiferd commented Jul 6, 2023

Dear pioneers for much improved pathogen detection,

we have an example from nanopore (resulting in long reads > 1500 bps) where the kraken1-part of krakenuniq counts over 12000 kmers (k = 31) for a species (taxid 47466). Our own analysis revealed that these reads probably belong to just 26 unique kmers. Thus, the counted kmers are most likely "trash" in a sense that they don't indicate the presence of a corresponding pathogen. BUT: "47466" does not even occur in the corresponding krakenuniq report!

We assume there is a kind of unbuilt filter in krakenuniq to suppress such findings - or is it maybe a bug? Could you briefly explain under which conditions you keep or conversely filter out counted kmers (if so) in krakenuniq reports?
(We could potentially hand over the corresponding fastq file if you suspected a bug.)

Thanks and best regards,
Daniel

@salzberg
Copy link
Collaborator

salzberg commented Jul 6, 2023

hi @pfeiferd,
that simply should not happen: kraken1 and krakenuniq should have identical k-mer counts, as long as you used the same database. I looked it up and I see that taxid is Borrelia miyamotoi. There is no hidden or undocumented filter in krakenuniq.
Those are probably low-complexity k-mers, but even so the krakenuniq output should include them. Did you use the identical database file each time you ran the 2 programs? If you built one database using 'dust' and the other without it, that would explain it. Dust masks out low-complexity sequences and kraken-build uses it by default.

@pfeiferd
Copy link
Author

Thank you so much! I was away, but I will look into it this week and tell what's going on...

@pfeiferd
Copy link
Author

Dear Steven,

I can reproduce the beviour as described above (which is supposedly a bug according to your answer). We / I can provide you with the necessary files via a personal FTP server (the fastq file is about 121MB). We used a standard krakenuniq database ("kuniq_standard_plus_eupath_minus_kdb").

You could get the following files:

  • The fastq File (with nanopore reads > 1500 bps),
  • the krakenuniq report (with tax id 47466 missing),
  • the corresponding kraken1 output (with 1000s of hits regarding 47466),
  • our own counting for 47466 according to the kraken1 output and for some other taxids...

Let me know what you think. The FTP-Server credentials would have to be transferred via email / Skype or so.

Daniel

@salzberg
Copy link
Collaborator

I'm afraid I don't have time to debug this myself. I will ask a person in my lab about it, but it might take some time. It is possible that your reads are long enough that they got truncated by one program - although that isn't supposed to happen anyway, there's no read length limit. So I can't really guess, but we've never seen this behavior. If you send me email (rather than github) at salzberg@jhu.edu, I will see if someone can take a look.
All we really need, probably, is the fastq file. If it's not too big you can just email that to me. Presumably you just have to include some of the reads that match 47466.

@pfeiferd
Copy link
Author

Just sent a respective email... Thanks so much for looking into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants