Blatant miscount of kmers in krakenuniq report file #145

Open
pfeiferd opened this issue Jun 13, 2023 · 3 comments

@pfeiferd

Dear krakenuniq-team,

  1. Take the two reads below (at the end of this issue report) and put them in a fastq file (let's call the file "/mnt/covid/fastqs/error.fastq"). The file contains two reads which can be assigned to SARS-CoV-2.

  2. Run krakenuniq as follows (or correspondingly):

krakenuniq --exact --report-file kuout.csv --threads 8 -db /mnt/m2/kuniqdb/kuniq_standard_plus_eupath_minus_kdb /mnt/covid/fastqs/error.fastq

The kraken1 part of krakenuniq then produces the following (correct) classification output on the console:

C A01245:144:HMV7FDSX3:1:2217:5963:30452/1 2697049 151 2697049:101 0:20
C A01246:144:HMV7FDSX3:1:2217:5963:30452/2 2697049 148 0:14 2697049:104

So in total, there are 205 kmers that belong to 2697049.
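The total of 205 can be reproduced by summing the `taxID:count` pairs in the two console lines. The following is only a minimal sketch: the helper name `kmers_assigned_to` is hypothetical, and it assumes the lines are whitespace-separated with the hit list starting at the fifth field, as in the output shown above.

```python
def kmers_assigned_to(lines, taxid):
    """Sum the k-mer hit counts for `taxid` over Kraken-style classification lines."""
    wanted = str(taxid)
    total = 0
    for line in lines:
        fields = line.split()
        # Assumed layout: status, read ID, assigned taxID, read length,
        # followed by any number of "taxID:count" pairs.
        for pair in fields[4:]:
            hit_taxid, _, count = pair.partition(":")
            if hit_taxid == wanted:
                total += int(count)
    return total


# The two classification lines quoted in this report.
console_output = [
    "C A01245:144:HMV7FDSX3:1:2217:5963:30452/1 2697049 151 2697049:101 0:20",
    "C A01246:144:HMV7FDSX3:1:2217:5963:30452/2 2697049 148 0:14 2697049:104",
]

print(kmers_assigned_to(console_output, 2697049))  # 101 + 104 = 205
```

Note that this counts every k-mer hit, so a k-mer seen in both reads is counted twice.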

  3. BUT the report file derived by krakenuniq counts only 104 kmers. It seems to miss the kmers from the first read entirely. This is the corresponding content of the report file ("kuout.csv" from above):

% reads taxReads kmers dup cov taxID rank taxName
100 2 0 104 1.97 3.029e-09 1 no rank root
100 2 0 104 1.97 4.136e-07 10239 superkingdom Viruses
100 2 0 104 1.97 2.882e-06 2559587 clade Riboviria
100 2 0 104 1.97 3.825e-06 2732396 kingdom Orthornavirae
100 2 0 104 1.97 1.131e-05 2732408 phylum Pisuviricota
100 2 0 104 1.97 1.569e-05 2732506 class Pisoniviricetes
100 2 0 104 1.97 3.471e-05 76804 order Nidovirales
100 2 0 104 1.97 6.111e-05 2499399 suborder Cornidovirineae
100 2 0 104 1.97 6.111e-05 11118 family Coronaviridae
100 2 0 104 1.97 6.451e-05 2501931 subfamily Orthocoronavirinae
100 2 0 104 1.97 0.000212 694002 genus Betacoronavirus
100 2 0 104 1.97 0.001191 2509511 subgenus Sarbecovirus
100 2 0 104 1.97 0.001788 694009 species Severe acute respiratory syndrome-related coronavirus
100 2 2 104 1.97 0.003586 2697049 no rank Severe acute respiratory syndrome coronavirus 2

Given the HIGH RELEVANCE of the issue in terms of result quality, please respond to this issue asap and fix the potential bug...

Thanks and best regards,
Daniel


@A01245:144:HMV7FDSX3:1:2217:5963:30452/1
CAGCAACACAGTTGCTGATTCTCTTCCTGTTCCAAGCATAAACAGATGCAAATCTGGTGGCGTTAAAAACTTCACCAAAAGGGCACAAGTTTGTAATATTAGGAAATCTAACAATAGATTCTGTTGGTTGGTCTATAAAGTTAGAAGTGTG
+
FFFFFFFFFFFFFFFFF,:FFFFFFFFFFFFFFFF:FFFFFF:F::FF:FFF:F:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF::FFFF:FF,F,FFFFFFF:FFFFF:FFF,:FF,:,,,FFF,FFF,F::FF,FFFFFF,::,,,F
@a01246:144:HMV7FDSX3:1:2217:5963:30452/2
ACTTCTAACTTTATAGTCCAACCAACAGAATCTATTGTTAGATTTCCTAATATTACAAACTTGTGCCCTTTTGGTGAAGTTTTTAACGCCACCAGATTTGCATCTGTTTATGCTTGGAACAGGAAGAGAATCAGCAACTGTGTTGCTG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF::FFFFFF,FFFFFFF,FFFFFFFFFFFFFFFFFFFFF,FF:FFFFFFFFFFFFF,F,FFFFFFFF:FFFFFFFFF,FF

@alekseyzimin
Collaborator

Hello, thank you for your report. We looked at the issue, and I believe the software works properly. The output has two columns related to k-mer counts. The "kmers" column is the number of distinct k-mers. The following column, "dup", is the duplication ratio. The total number of classified k-mers is the product of these two columns: 104 * 1.97 = 204.88 ≈ 205.
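The arithmetic above can be applied to any row of the report. This is a rough sketch rather than part of krakenuniq: the function name `total_kmers` is hypothetical, the column order is taken from the header line quoted in this thread, and the row is split on whitespace because that is how it appears when pasted here (the real report file may be tab-delimited).

```python
def total_kmers(report_row):
    """Estimate total classified k-mers as distinct k-mers times duplication ratio."""
    # Assumed column order: %, reads, taxReads, kmers, dup, cov, taxID, then rank and name.
    # Only the first seven fields are split off, because rank and name may contain spaces.
    fields = report_row.split(None, 7)
    kmers = int(fields[3])   # number of distinct k-mers
    dup = float(fields[4])   # duplication ratio (rounded in the report)
    return kmers * dup


row = "100 2 2 104 1.97 0.003586 2697049 no rank Severe acute respiratory syndrome coronavirus 2"
print(round(total_kmers(row)))  # 104 * 1.97 = 204.88, which rounds to 205
```

Because "dup" is printed with two decimals, the product only recovers the total approximately and has to be rounded.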

@alekseyzimin
Collaborator

As an experiment, I tried duplicating the last read. Here is my output. The number of distinct k-mers did not change, but the duplication ratio went up:
% reads taxReads kmers dup cov taxID rank taxName
100 3 0 104 2.97 2.963e-07 1 no rank root
100 3 0 104 2.97 2.963e-07 10239 superkingdom Viruses
100 3 0 104 2.97 2.163e-06 2559587 clade Riboviria
100 3 0 104 2.97 2.661e-06 2732396 kingdom Orthornavirae
100 3 0 104 2.97 9.088e-06 2732408 phylum Pisuviricota
100 3 0 104 2.97 1.286e-05 2732506 class Pisoniviricetes
100 3 0 104 2.97 3.039e-05 76804 order Nidovirales
100 3 0 104 2.97 5.454e-05 2499399 suborder Cornidovirineae
100 3 0 104 2.97 5.454e-05 11118 family Coronaviridae
100 3 0 104 2.97 5.724e-05 2501931 subfamily Orthocoronavirinae
100 3 0 104 2.97 0.0002109 694002 genus Betacoronavirus
100 3 0 104 2.97 0.001188 2509511 subgenus Sarbecovirus
100 3 0 104 2.97 0.001784 694009 species Severe acute respiratory syndrome-related coronavirus
100 3 3 104 2.97 0.003583 2697049 no rank Severe acute respiratory syndrome coronavirus 2
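To illustrate why duplicating a read raises "dup" but leaves "kmers" unchanged, here is a toy sketch of distinct versus total k-mer counting. It is not krakenuniq's implementation: the function names, the use of canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement), and the tiny k are all assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an ACGT-only DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def canonical_kmers(seq, k):
    """Yield each k-mer of `seq` in canonical form (assumed convention)."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, reverse_complement(kmer))


def kmer_stats(reads, k):
    """Return (total k-mers, distinct k-mers, duplication ratio) for `reads`."""
    total = 0
    distinct = set()
    for read in reads:
        for kmer in canonical_kmers(read, k):
            total += 1
            distinct.add(kmer)
    return total, len(distinct), total / len(distinct)


read = "AAACCGT"                   # toy read, not taken from the issue
mate = reverse_complement(read)    # overlapping mate from the opposite strand

print(kmer_stats([read, mate], k=4))        # (8, 4, 2.0)
print(kmer_stats([read, mate, mate], k=4))  # (12, 4, 3.0)
```

Adding the third read increases the total and the ratio while the distinct count stays the same, which mirrors the change from 1.97 to 2.97 with 104 distinct k-mers in the tables above.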

@pfeiferd
Author

Thank you - great info. Sorry for my false report and the misunderstanding. Thanks as well for the quick answer.
