Can't process gzipped fastq #35

Open
ohthetrees opened this issue May 17, 2018 · 12 comments
@ohthetrees

Hi, I'm just getting started with Nonpareil, thanks for your work.

I'm unable to process my gzipped fastq. If I first uncompress the file, it processes as expected. The error:

$ nonpareil -s ETNP_120m_R2.name.fastq.gz -t 4 -T kmer -f fastq -b ETNP_120m_R2.nonpareil.k
Nonpareil v3.301
Fatal error:
The file provided does not have the proper fastq format
 [      0.0] Fatal error: The file provided does not have the proper fastq format
@lmrodriguezr
Owner

Sorry for the loooong delay, I'm back now at tending to the issues.

I believe this is an issue with the kmer kernel, which doesn't allow gzipped input because of the random access function it uses (@gunturus please comment if I'm wrong).

Unfortunately, I don't think this can be easily resolved. I'll leave this issue open until I add a corresponding comment to the documentation, but you'll have to unzip the fastq file prior to using nonpareil.
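For readers landing here, the manual workaround described above (decompress first, then run Nonpareil on the plain FASTQ) could look roughly like the sketch below. The helper name `decompress_fastq` is made up for illustration and is not part of Nonpareil; the `nonpareil` flags in the usage comment are copied from the original report.

```shell
#!/usr/bin/env bash
# Sketch of the manual workaround: decompress, then run nonpareil on the
# plain FASTQ. `decompress_fastq` is a hypothetical helper, not part of
# Nonpareil.

# Decompress <file>.gz to <file> next to the original, keeping the
# compressed copy. Prints the path of the decompressed file.
decompress_fastq() {
  local gz="$1"
  local out="${gz%.gz}"
  # Refuse names without a .gz suffix, otherwise the redirect below would
  # truncate the input file.
  if [ "$out" = "$gz" ]; then
    echo "expected a .gz file: $gz" >&2
    return 1
  fi
  gunzip -c "$gz" > "$out" || return 1
  printf '%s\n' "$out"
}

# Usage (requires nonpareil on PATH):
#   fq=$(decompress_fastq ETNP_120m_R2.name.fastq.gz)
#   nonpareil -s "$fq" -t 4 -T kmer -f fastq -b ETNP_120m_R2.nonpareil.k
#   rm -f "$fq"
```

This keeps the compressed file untouched, at the cost of temporarily needing disk space for the decompressed copy.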

@lmrodriguezr lmrodriguezr self-assigned this Aug 28, 2019
@jfy133

jfy133 commented Nov 6, 2020

I'm starting to investigate nonpareil, and also had the same issue.

Gzipped input support would be very useful, because I have >100 sequencing files, all in the >1GB file-size range, so having to decompress each one first would be a bit nasty when trying to parallelise processing of all the files at once.

So I would like to give support to this, if a solution is feasible (even if there is an internal temporary decompression)!

@lmrodriguezr
Owner

@gunturus Do you have an update on this issue? I know you were looking into it. Thanks!

@jfy133

jfy133 commented Feb 25, 2021

@gunturus do you have any more news? I'm interested in potentially adding nonpareil to the nf-core/eager pipeline, but the lack of gzip support is unfortunately a deal breaker...

@gunturus
Collaborator

@jfy133 unfortunately gzip is not supported. @lmrodriguezr do you have any suggestions to provide gzip support? I have no idea.

@jfy133

jfy133 commented Jun 9, 2021

Do you think this is in any way on a roadmap @lmrodriguezr? Just to know if I should look for different solutions instead.

@VGalata

VGalata commented Aug 24, 2021

I would also like to add that having support for compressed FASTQ files would be good.

@lmrodriguezr
Owner

Hello. We're finally back at this issue, and it's top of the roadmap. An initial not-so-clean solution would be to unzip the files into a temporary directory, launch nonpareil, and then remove the directory. Would this work as a temporary solution? If yes, I can implement it as a bash wrapper so you could use it out of the box.

A more robust solution is to read directly from the zipped file, but this will take some heavy lifting because we will need to replace random file access with another method. It's also doable, but it'll take us a bit longer, so hopefully the first option works in the meantime?
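To make the first option concrete, here is a rough sketch of what such a wrapper might look like: unzip into a temporary directory, run the command, then remove the directory. This is not the maintainers' implementation; the function name `with_unzipped_fastq` and the convention of appending the temporary path as the last argument are assumptions made for illustration. It relies on `gunzip` and `mktemp` being available on the user's machine.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper sketch, not the maintainers' implementation.
# Decompresses a gzipped FASTQ into a temporary directory, runs a command
# on the decompressed copy, and removes the directory afterwards.

# Usage: with_unzipped_fastq <file.fastq.gz> <command> [args...]
# The path of the decompressed file is appended as the final argument,
# so an option that expects the input file (e.g. -s) should come last.
with_unzipped_fastq() {
  if [ "$#" -lt 2 ]; then
    echo "usage: with_unzipped_fastq <file.gz> <command> [args...]" >&2
    return 2
  fi
  local gz="$1"; shift
  local tmpdir fq rc
  tmpdir=$(mktemp -d) || return 1
  fq="$tmpdir/$(basename "${gz%.gz}")"
  if gunzip -c "$gz" > "$fq"; then
    "$@" "$fq"   # run the command on the decompressed copy
    rc=$?
  else
    rc=1
  fi
  rm -rf "$tmpdir"   # clean up whether or not the command succeeded
  return "$rc"
}

# Example (flags taken from the original report; requires nonpareil on PATH):
#   with_unzipped_fastq ETNP_120m_R2.name.fastq.gz \
#     nonpareil -t 4 -T kmer -f fastq -b ETNP_120m_R2.nonpareil.k -s
```

Because each call gets its own `mktemp -d` directory, several files can be processed in parallel without the temporary copies colliding, though each run still needs disk space for one decompressed file.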

@VGalata

VGalata commented Feb 10, 2022

Dear @lmrodriguezr,

Thank you very much for looking into this!

For our purposes, having the second option implemented would be better. We use nonpareil in a snakemake workflow where we want to move away from using unzipped FASTQ files, and we would like to avoid unnecessary unzipping if possible. And, as you say yourself, that would also be a more robust solution, and I think it would be worth waiting for.

@jfy133

jfy133 commented Feb 10, 2022

@lmrodriguezr we are in the same situation as @VGalata, as we would like to add it to a nextflow pipeline ;).

However, I think unzipping to a /tmp location with automatic cleanup afterwards might be an OK temporary workaround, as then at least we don't have to deal with the unzipping ourselves. On the other hand, this depends on the implementation, and whether you rely on an internal unzipping library within the bash script or on a tool already installed on a user's machine (which is much more flaky, unfortunately, as this is often frustratingly not very portable).

But depending on the time it takes for the more robust solution, I guess I would prefer to wait a bit longer so that the time investment goes into an 'inbuilt' solution.

@jfy133

jfy133 commented Mar 14, 2022

@lmrodriguezr just another thought... would it be easier to refactor input to allow stdin?

Then one could simply do zcat <fastq>.gz | nonpareil <additional params>

Just sayin', as that would also be fine with me as a way of accepting gzipped input in terms of usability.

@davidecarlson

Just wanted to chime in with more support for enabling compressed fastq files!
