Prevent parts of VCF being reparsed #1138
@phofl Thanks for the chat about this issue yesterday. I tried using a simple |
There has been some discussion at dask/dask#10654, although no suggestion there meets the requirement of not redoing the work AND having a sensible progress bar. I suggest that for now we use the fix with the confusing progress bar: although it doesn't fill up, it does at least show the amount left to do dropping. |
36716d6 to 34123db
I've looked at the other format parsing methods (bgen and plink), and both load the data into cluster memory instead of writing parts to disk as the VCF parser does. Therefore this issue doesn't apply to them - your workers have to last until you save the dataset. |
Looks like the tests are failing because sometimes we don't have a dask client object - will work around. |
34123db to 2bf5189
Can we close #1152 then? |
I think so - will comment there. |
I'm getting failures here due to missing coverage of one line (1250), but I can't see that I've changed anything that affects that line. |
@tomwhite The line that isn't covered here is the last line in:

    if region is None:
        variants = vcf
    else:
        variants = vcf(region)

I don't think I've changed anything to uncover this line - do you know why this might not be covered? |
@benjeffery I can't see what could have caused this. To debug you could print a stacktrace for the unaltered code to see which code path causes that line to run (when |
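The stacktrace suggestion above could look roughly like the sketch below. It is only an illustration: `FakeVCF`, `select_variants`, `caller_with_region` and the `trace_file` parameter are hypothetical names that do not come from sgkit. The only library call it relies on is `traceback.print_stack` from the standard library.

```python
import io
import traceback


class FakeVCF:
    """Hypothetical stand-in for a VCF reader: iterable, and callable with a region."""

    def __init__(self, records):
        self._records = records

    def __iter__(self):
        return iter(self._records)

    def __call__(self, region):
        # For this sketch a region is just a (start, stop) pair of positions.
        start, stop = region
        return [r for r in self._records if start <= r < stop]


def select_variants(vcf, region=None, trace_file=None):
    if region is None:
        variants = vcf
    else:
        # Temporary debugging aid: record which callers reach this branch.
        # With trace_file=None the stack is written to sys.stderr.
        traceback.print_stack(file=trace_file)
        variants = vcf(region)
    return list(variants)


def caller_with_region(trace_file):
    # Hypothetical caller; its name should show up in the printed stack.
    return select_variants(FakeVCF([1, 5, 9]), region=(0, 6), trace_file=trace_file)


trace = io.StringIO()
result = caller_with_region(trace)
print(trace.getvalue())
```

Running the test suite on the unaltered code with a print like this in place would show which test, if any, used to reach the branch in question.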
Fix sgkit-dev/sgkit-publication#35 by telling dask that it can release a future for part of a VCF parse when it is complete. This prevents re-parsing when a worker is restarted.
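As a rough illustration of the pattern described here, the sketch below consumes futures as they complete and drops its reference to each one straight away. It uses the standard library's `concurrent.futures` as a stand-in for dask, so it does not reproduce dask's scheduling or worker-restart behaviour, and it is not the code in this PR. `parse_part` and `run_parts` are hypothetical names. The printed count mirrors the "amount left to do dropping" style of progress mentioned earlier in the thread.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_part(part_id):
    # Hypothetical stand-in for parsing one part of a VCF and writing it to
    # disk; only a small summary is returned to the caller.
    return {"part": part_id, "variants": part_id * 10}


def run_parts(part_ids):
    summaries = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending = {pool.submit(parse_part, p) for p in part_ids}
        for future in as_completed(list(pending)):
            # Keep only the small summary, then drop our reference to the
            # finished future so nothing asks for that part again.
            summaries.append(future.result())
            pending.discard(future)
            print(f"{len(pending)} parts left to do")
    return sorted(summaries, key=lambda s: s["part"])


summaries = run_parts(range(4))
```

In dask.distributed the equivalent handle is a `Future`, which, as far as I know, can be released explicitly once its result is no longer needed. How that interacts with a progress bar is the open question discussed in dask/dask#10654.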