Prevent parts of VCF being reparsed #1138

benjeffery · 2023-10-11T11:15:11Z

Fix sgkit-dev/sgkit-publication#35 by telling dask that it can release a future for part of a VCF parse when it is complete. This prevents re-parsing when a worker is restarted.

benjeffery · 2023-10-11T12:08:41Z

@phofl Thanks for the chat about this issue yesterday. I tried using a simple client.gather but this still resulted in the futures being redone on worker failure. However, after poking around the docs I realised cancelling the future after success would prevent its re-execution. Let me know what you think of this approach. Thanks again!
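For readers following along, here is a minimal sketch of the cancel-after-success pattern described above. It is not the code in this PR: `parse_part` and `parse_all` are hypothetical stand-ins for the VCF part parser, and the sketch assumes each part writes its own output to disk so the result held by the scheduler is no longer needed once the future finishes.

```python
import tempfile
from pathlib import Path
from typing import List

from distributed import Client, as_completed


def parse_part(part: int, out_dir: str) -> str:
    """Hypothetical stand-in for parsing one VCF part and writing it to disk."""
    path = Path(out_dir) / f"part-{part}.txt"
    path.write_text(f"parsed part {part}\n")
    return str(path)


def parse_all(client: Client, n_parts: int, out_dir: str) -> List[str]:
    """Submit one future per part and cancel each one once it has succeeded."""
    futures = [
        client.submit(parse_part, part, out_dir, pure=False)
        for part in range(n_parts)
    ]
    written = []
    for future in as_completed(futures):
        # result() re-raises any exception from the task, so failures surface here.
        written.append(future.result())
        # The output is already on disk, so tell the scheduler it no longer
        # needs to keep (or be able to recreate) this result.
        future.cancel()
    return sorted(written)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # In-process cluster with the dashboard disabled, just for the demo.
        with Client(processes=False, dashboard_address=None) as client:
            print(parse_all(client, 4, tmp))
```

Calling `future.result()` before `future.cancel()` keeps failures visible: a part that raised will still raise here rather than being silently dropped.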

benjeffery · 2023-12-04T15:43:09Z

There has been some discussion at dask/dask#10654, although no suggestion so far meets the requirement of not redoing the work AND having a sensible progress bar. I suggest that for now we use the fix with the confusing progress bar: although it doesn't fill up, it does at least show the amount of remaining work dropping.

benjeffery · 2023-12-06T13:54:32Z

I've looked at the other format parsing methods and they both (bgen and plink) load the data into cluster memory, instead of writing parts to disk as the VCF parser does. Therefore this issue doesn't apply - your workers have to last until you save the dataset.

benjeffery · 2023-12-06T14:06:48Z

Looks like the tests are failing because sometimes we don't have a dask client object - will work around.
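One possible shape for such a workaround, as a sketch only: `run_parts` is a hypothetical helper, not the code that ended up in this PR. It relies on `distributed.get_client()` raising `ValueError` when no client is active, and it also treats a missing `distributed` install as "no client".

```python
from typing import Any, Callable, List, Sequence


def run_parts(func: Callable[[Any], Any], parts: Sequence[Any]) -> List[Any]:
    """Run func over parts, using a dask client only when one is available."""
    try:
        from distributed import get_client

        client = get_client()  # raises ValueError when no client is active
    except (ImportError, ValueError):
        # No distributed client in this process: run the parts locally, in order.
        return [func(part) for part in parts]

    # A client exists: fan the parts out and collect results in input order.
    futures = client.map(func, list(parts), pure=False)
    return client.gather(futures)
```

With this shape, code paths (and tests) that never create a client still work, while runs on a cluster go through the scheduler.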

jeromekelleher · 2023-12-06T14:22:13Z

> I've looked at the other format parsing methods and they both (bgen and plink) load the data into cluster memory, instead of writing parts to disk as the VCF parser does. Therefore this issue doesn't apply - your workers have to last until you save the dataset.

Can we close #1152 then?

benjeffery · 2023-12-06T14:31:58Z

I think so - will comment there.

benjeffery · 2023-12-06T14:42:16Z

I'm getting failures here due to missing coverage of one line (1250), but I can't see that I've changed anything that affects that line.

benjeffery · 2023-12-07T12:42:39Z

@tomwhite The line that isn't covered here is the last line in:

        if region is None:
            variants = vcf
        else:
            variants = vcf(region)

I don't think I've changed anything to uncover this line - do you know why this might not be covered?

tomwhite · 2023-12-18T14:51:10Z

@benjeffery I can't see what could have caused this. To debug you could print a stacktrace for the unaltered code to see which code path causes that line to run (when region is not None), which may give a clue to what's happening.
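A minimal sketch of that debugging step, using only the standard library. `select_variants` is a hypothetical wrapper around the branch quoted above, and `vcf` is assumed to be callable with a region string; the `traceback.print_stack()` call is a temporary aid to be removed once the calling test is identified.

```python
import traceback


def select_variants(vcf, region=None):
    """Toy version of the quoted branch, with temporary tracing added."""
    if region is None:
        variants = vcf
    else:
        # Temporary debugging aid: print the call stack (to stderr) so the
        # test or code path that reaches this branch can be identified.
        traceback.print_stack()
        variants = vcf(region)
    return variants
```

Running the test suite on the unaltered branch with this in place shows which callers pass a non-`None` region, which can then be compared with the callers on this branch.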

benjeffery mentioned this pull request Oct 13, 2023

Skip parts of a VCF that have already been done #1132

Closed

jeromekelleher mentioned this pull request Dec 5, 2023

Parts of work redone when a dask worker dies #1154

Open

benjeffery force-pushed the fix-vcf-redo branch from 36716d6 to 34123db Compare December 6, 2023 13:47

Prevent parts of VCF being reparsed

2bf5189

benjeffery force-pushed the fix-vcf-redo branch from 34123db to 2bf5189 Compare December 6, 2023 14:16
