Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent parts of VCF being reparsed #1138

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

benjeffery
Copy link
Collaborator

Fix sgkit-dev/sgkit-publication#35 by telling dask that it can release a future for part of a VCF parse when it is complete. This prevents re-parsing when a worker is restarted.

@benjeffery
Copy link
Collaborator Author

@phofl Thanks for the chat about this issue yesterday. I tried using a simple client.gather but this still resulted in the futures being redone on worker failure. However, after poking around the docs I realised cancelling the future after success would prevent its re-execution. Let me know what you think of this approach. Thanks again!

@benjeffery
Copy link
Collaborator Author

benjeffery commented Dec 4, 2023

There has been some discussion at dask/dask#10654 although no suggestions that meet the requirement of not redoing the work AND having a sensible progress bar. I suggest for now we use the fix with the confusing progress bar, as although it doesn't fill up, it does at least show the amount to-do dropping.

@benjeffery
Copy link
Collaborator Author

I've looked at the other format parsing methods and they both (bgen and plink) load the data into cluster memory, instead of writing parts to disk as the VCF parser does. Therefore this issue doesn't apply - you workers have to last until you save the dataset.

@benjeffery
Copy link
Collaborator Author

Looks like failing tests as sometimes we don't have a dask client object - will work around.

@jeromekelleher
Copy link
Collaborator

I've looked at the other format parsing methods and they both (bgen and plink) load the data into cluster memory, instead of writing parts to disk as the VCF parser does. Therefore this issue doesn't apply - you workers have to last until you save the dataset.

Can we close #1152 then?

@benjeffery
Copy link
Collaborator Author

I think so - will comment there.

@benjeffery
Copy link
Collaborator Author

I'm getting failures here due to missing coverage of one line 1250. But I can't see that I've changed anything that impacted that line?

@benjeffery
Copy link
Collaborator Author

@tomwhite The line that isn't covered here is the last line in:

        if region is None:
            variants = vcf
        else:
            variants = vcf(region)

I don't think I've changed anything to uncover this line - do you know why this might not be covered?

@tomwhite
Copy link
Collaborator

@benjeffery I can't see what could have caused this. To debug you could print a stacktrace for the unaltered code to see which code path causes that line to run (when region is not None), which may give a clue to what's happening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Sgkit conversion for 1M samples taking too long
3 participants