Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KrakenUniq (read_merger.pl) fails when using symlinks as input #140

Open
boulund opened this issue Apr 19, 2023 · 0 comments
Open

KrakenUniq (read_merger.pl) fails when using symlinks as input #140

boulund opened this issue Apr 19, 2023 · 0 comments

Comments

@boulund
Copy link
Contributor

boulund commented Apr 19, 2023

I recently discovered that krakenuniq fails when the (paired) input files are symlinks:

I'm running the latest version installed from conda

(krakenuniq) [boulund@c1hitachi10 kutest]$ krakenuniq --version
KrakenUniq version 1.0.3                                       
Copyright 2017-2018, Florian Breitwieser (fbreitwieser@jhu.edu)
Copyright 2013-2017, Derrick Wood (dwood@cs.jhu.edu) for Kraken

I have a test folder with some small input files:

(krakenuniq) [boulund@c1hitachi10 kutest]$ ll                
total 57M                                                                                                                      
-rw-rw-r--. 1 boulund boulund  28M Apr 19 11:03 test1_1.fq.gz                                                                  
-rw-rw-r--. 1 boulund boulund  29M Apr 19 11:03 test1_2.fq.gz                                                                  
lrwxrwxrwx. 1 boulund boulund   13 Apr 19 11:15 test2_1.fq.gz -> test1_1.fq.gz                                                 
lrwxrwxrwx. 1 boulund boulund   13 Apr 19 11:15 test2_2.fq.gz -> test1_2.fq.gz                                                                                            

When running with the test1 input files, everything works as expected:

(krakenuniq) [boulund@c1hitachi10 kutest]$ krakenuniq --db /ceph/db/krakenuniq/minikraken_20171013_4GB --threads 20 --output te
st.kraken --report-file test.kreport --preload-size 128G --paired test1_1.fq.gz test1_2.fq.gz                                  
/ceph/home/boulund/.conda/envs/krakenuniq/share/krakenuniq-1.0.3-1/libexec/classify -d /ceph/db/krakenuniq/minikraken_20171013_
4GB/database.kdb -i /ceph/db/krakenuniq/minikraken_20171013_4GB/database.idx -t 20 -o test.kraken -x 128G -r test.kreport -a /c
eph/db/krakenuniq/minikraken_20171013_4GB/taxDB -p 12                                                                          
 Database /ceph/db/krakenuniq/minikraken_20171013_4GB/database.kdb                                                             
Loaded database with 313174697 keys with k of 31 [val_len 4, key_len 8].                                                       
Reading taxonomy index from /ceph/db/krakenuniq/minikraken_20171013_4GB/taxDB. Done.                                           
Writing Kraken output to test.kraken                                                                                           
 Processed 300000 sequences (database chunk 1 of 1)                                                                            
300000 sequences (81.26 Mbp) processed in 17.045s (1056.0 Kseq/m, 286.06 Mbp/m).                                               
  13379 sequences classified (4.46%)                                                                                           
  286621 sequences unclassified (95.54%)                                                                                       
Writing report file to test.kreport  ..                                                                                        
Reading genome sizes from /ceph/db/krakenuniq/minikraken_20171013_4GB/database.kdb.counts ... done                             
Setting values in the taxonomy tree ... done                                                                                   
Printing classification report ...  done                                                                                       
Report finished in 0.038 seconds.                                                                                              

When running with the symlinks (test2) it fails:

(krakenuniq) [boulund@c1hitachi10 kutest]$ krakenuniq --db /ceph/db/krakenuniq/minikraken_20171013_4GB --threads 20 --output test.kraken --report-file test.kreport --preload-size 128G --paired test2_1.fq.gz test2_2.fq.gz                                  
Warning: Overwriting test.kreport.                                                                                             
/ceph/home/boulund/.conda/envs/krakenuniq/share/krakenuniq-1.0.3-1/libexec/classify -d /ceph/db/krakenuniq/minikraken_20171013_4GB/database.kdb -i /ceph/db/krakenuniq/minikraken_20171013_4GB/database.idx -t 20 -o test.kraken -x 128G -r test.kreport -a /ceph/db/krakenuniq/minikraken_20171013_4GB/taxDB -p 12                                                                          
 Database /ceph/db/krakenuniq/minikraken_20171013_4GB/database.kdb                                                             
Loaded database with 313174697 keys with k of 31 [val_len 4, key_len 8].                                                       
Reading taxonomy index from /ceph/db/krakenuniq/minikraken_20171013_4GB/taxDB. Done.                                           
Writing Kraken output to test.kraken                                                                                           
classify: malformed fasta file - expected header char > not found                                                              
 Processed 0 sequences                                                                                                         
classify: malformed fasta file - expected header char > not found                                                              
0 sequences (0.00 Mbp) processed in 4.851s (0.0 Kseq/m, 0.00 Mbp/m).                                                           
  0 sequences classified (-nan%)                                                                                               
  0 sequences unclassified (-nan%)                                                                                             
Writing report file to test.kreport  ..                                                                                        
Reading genome sizes from /ceph/db/krakenuniq/minikraken_20171013_4GB/database.kdb.counts ... done                             
Setting values in the taxonomy tree ... done                                                                                   
Printing classification report ... total number of reads is zero - not creating a report!                                      
Report finished in 0.025 seconds.                                                                                              

While it is executing, a random temporary file is created (by read_merger.pl) that contains the following:

[boulund@c1hitachi10 kutest]$ ll
total 57M
-rw-rw-r--. 1 boulund boulund 155 Apr 19 11:16 01F67829.merged.err
-rw-rw-r--. 1 boulund boulund   0 Apr 19 11:16 01F67829.merged.fa
-rw-rw-r--. 1 boulund boulund 28M Apr 19 11:03 test1_1.fq.gz
-rw-rw-r--. 1 boulund boulund 29M Apr 19 11:03 test1_2.fq.gz
lrwxrwxrwx. 1 boulund boulund  13 Apr 19 11:15 test2_1.fq.gz -> test1_1.fq.gz
lrwxrwxrwx. 1 boulund boulund  13 Apr 19 11:15 test2_2.fq.gz -> test1_2.fq.gz
-rw-rw-r--. 1 boulund boulund   0 Apr 19 11:16 test.kraken
-rw-rw-r--. 1 boulund boulund 375 Apr 19 11:16 test.kreport
[boulund@c1hitachi10 kutest]$ cat 01F67829.merged.err 
Unknown file format for test2_1.fq.gz at /ceph/home/boulund/.conda/envs/krakenuniq/share/krakenuniq-1.0.3-1/libexec/read_merger.pl line 73, <$fh1> line 1.

It appears read_merger.pl fails when trying to identify the file type of a symlink.

I'm working around this issue in my workflow by simply merging my paired read files into a fasta file and running krakenuniq on that fasta file instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant