Sgkit conversion for 1M samples taking too long #35
There's a lot of this warning: …
I thought I had tried some of the more obvious things like tweaking the temp sizes, but maybe not.
Aha, I did in #36.
So there are three chunk lengths to set here. I explain them in https://github.com/pystatgen/sgkit/pull/1044#issuecomment-1458068866, the upshot of which is that …
Any suggestions for starting point values for 1M samples and 256G of RAM?
It's hard as it is VCF-content dependent, but you should find that halving …
Currently I have this:

```python
client = dask.distributed.Client(n_workers=12, threads_per_worker=1)
print(client)
# Tried this, doesn't work:
# with dask.diagnostics.ProgressBar():
# Need temp_chunk_length for 1M samples.
sgkit.io.vcf.vcf_to_zarr(infile, outfile, temp_chunk_length=100,
                         read_chunk_length=1000, tempdir="data/tmp/")
```

which seems to be fitting into the available RAM fairly comfortably. However, it's producing gajillions of temporary files (to the point where I can't even count them all in a reasonable time). Currently there are 457 "part" zarrs, and each of these has tens of thousands of tiny files. E.g.: …

Because each of these files is < 4K, we're using up a huge amount of temporary storage space. Any guidance here @tomwhite @benjeffery? I think we really need to make this a bit easier, as our key application is working on large datasets.
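For a rough sense of scale, a back-of-envelope sketch of why the file count explodes (the temporary store's exact layout and the sample-chunk width here are assumptions, not values from the thread; only the variant/sample counts are mentioned later in this issue):

```python
# Illustrative only: Zarr writes one file per chunk per array, so the
# temporary store's file count grows multiplicatively.
n_variants = 7_200_000         # from later in this thread
n_samples = 1_000_000
temp_chunk_length = 100        # variants per temporary chunk, as set above
sample_chunk_width = 10_000    # assumed; the temp sample chunking is not stated

variant_chunks = n_variants // temp_chunk_length   # 72,000
sample_chunks = n_samples // sample_chunk_width    # 100
print(f"~{variant_chunks * sample_chunks:,} files per 2-D array")  # ~7,200,000
```

With several 2-D arrays per part (genotypes, masks, etc.), tens of millions of sub-4K files follow directly from the multiplication, whatever the exact layout.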
Surely knowing the number of variants and number of samples will give a reasonable first approximation?
You can reduce the number of files by having a larger …
Latest update: having too small a read chunk length seems to make it very slow: https://github.com/pystatgen/sgkit/issues/1128
Latest 😢
I think that failed after about 12 hours. This is the code:

```python
sgkit.io.vcf.vcf_to_zarr(infile, outfile,
                         temp_chunk_length=500,
                         read_chunk_length=2000,
                         chunk_length=10_000,
                         tempdir="tmp/")
```

We must be doing something wrong here; this should be more robust...
I'm not sure if this is the problem you are hitting, but I would expect …

This is because … Also, …

Another thing that is useful when debugging is to break … Confusingly, …
After a discussion with @benjeffery, we tried this: …

which seemed to be working OK for a few days. However, the memory usage seems to keep creeping up, and it looks like an error happened about 2/3 of the way through the VCF reading phase and restarted the whole thing (I see a few "erred" blocks when I look at the graph). I guess I just have to try using fewer threads? I'm currently using 10 threads on a machine with 256G of RAM.
It would be good to know what the errors are that caused those tasks to fail, if you have the worker logs.
I should do - how do I get them?
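One way to do that (a minimal sketch, assuming the `dask.distributed.Client` object is still live): the client's `get_worker_logs` method pulls recent log lines from every worker, and `nanny=True` fetches the nanny logs, which record out-of-memory kills and restarts.

```python
import dask.distributed

client = dask.distributed.Client(n_workers=16, threads_per_worker=1)

# Recent log lines keyed by worker address; each value is a sequence of
# (level, message) tuples, newest first.
for addr, lines in client.get_worker_logs().items():
    print(addr)
    for level, message in lines:
        print(f"  [{level}] {message}")

# The nanny logs are where worker kills/restarts show up.
nanny_logs = client.get_worker_logs(nanny=True)
```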
Rather than using fewer threads, would it be possible to run with more resources on a cluster? A computation that takes days is more painful to debug. What is the size of the VCF in bytes, and number of variants? Is it synthetic? Regarding the memory usage creeping up, there is a recent Dask issue (hard to say if this is related): dask/distributed#8164. And regarding the computation being restarted, there's some recent discussion here: dask/distributed#8182 (again, not sure how relevant this is).
Thanks @tomwhite. I don't really have time to get this set up on a cluster, I'm afraid. I'm trying again with 8 threads and seeing if it gets through over the weekend. The VCF is a 1M-sample subset of the simulated French-Canadian dataset from this paper. It has 7.2M variants and the BCF file is 52G. This is a simulation of chromosome 21 - a big chromosome would be at least 10X larger.
For the ukb and GeL work I've been using 200-400 threads and it still takes hours, for 10% of the samples you have. With 8 threads I'd think it would take more than a weekend. Setting up the cluster shouldn't be too hard, using https://jobqueue.dask.org/en/latest/generated/dask_jobqueue.SGECluster.html#dask_jobqueue.SGECluster
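A minimal sketch of that setup (the queue name, memory, and walltime below are placeholders, not values from the thread):

```python
from dask.distributed import Client
from dask_jobqueue import SGECluster

# Placeholder resources - match these to the actual grid-engine queue.
cluster = SGECluster(
    queue="all.q",
    cores=1,            # one thread per worker, as in the local setup above
    memory="16GB",
    walltime="24:00:00",
)
cluster.scale(jobs=200)  # ask the scheduler for ~200 single-core workers
client = Client(cluster)
```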
It's 10% through after a few hours, and there's no sign of memory pressure, so this really should work. I think it's important to not just throw a cluster at this - these are more than adequate computational resources and we should be able to do the conversion. We'll put people off sgkit before they start if they can't convert their VCFs without a cluster.
Here's a screenshot of the Dask dashboard after running over the weekend. What seems to be happening is that the workers keep running tasks (each taking about 20 minutes), slowly increasing their memory usage until eventually getting killed. That would be fine, except the computation then ends up starting from scratch each time. The transition log seems to imply that tasks are getting done multiple times: …

So, it seems like there are a couple of problems here:

1. Worker memory usage grows steadily until the workers are killed.
2. Dask re-runs tasks that have already completed, so the computation effectively restarts.

Any ideas on why (2) is happening?
We had a chat about this, but I think it's worth recording it here. We've configured the workers to restart every hour with the `lifetime` option.
Hmm, trying to expire the workers didn't work. Here are some snippets from the log: …
This was the code:

```python
client = dask.distributed.Client(
    n_workers=16, threads_per_worker=1,
    lifetime="100m", lifetime_stagger="5m",
    lifetime_restart=True)
```
It does seem like the tasks never getting "done" here is the major problem?
What was progress like? Were any getting completed? Odd that "68 tasks could not be moved as there are no suitable workers to receive them. The worker will not be retired." I don't understand that. I'm running the worker with …
I don't think it got anywhere, it died on the first attempt to turn over the workers AFAICT.
Getting … Running:

```python
import sys
from cyvcf2 import VCF  # assumed import, per the cyvcf2/htslib leak discussion

def doit():
    # Open and immediately close the VCF; no records are read.
    vcf = VCF(sys.argv[1])
    vcf.close()
    del vcf

for i in range(500):
    doit()
```

results in growing memory usage. Tracking down the leak in cyvcf2/htslib.
Found it. See https://github.com/pystatgen/sgkit/pull/1129. @jeromekelleher try … @savitakartik I think this may explain why your larger chromosomes were failing.
On testing, memory growth is reduced, but not eliminated. Looks like there is an additional leak in cyvcf2.
That's what I'm observing as well. Currently half-way through converting the VCF chunks with no errors, but memory is growing. Any theories as to why Dask is repeating the already-done tasks? This seems at least as bad as the memory leak.
I thought that was due to workers getting killed and the (tiny) result object being lost. We might be able to get around that with … I have confirmed another leak in cyvcf2 in the …
And it's still failing in the same way :sadpanda: We just got a bit further this time.
The non-recovery bit is the killer here - memory leaks are bad, but Dask should let us recover from this. I wonder if it's worth inspecting the dask graph we're getting when we call compute here? Maybe dask is doing something a bit weird when aggregating all the tasks, and we should explicitly create a single root task instead? Or perhaps setting …
There's no harm in inspecting the graph, but this is embarrassingly parallel so should just be a bunch of distinct nodes. Another way to make progress (and separate the Dask problem from the memory leak problem) would be to adapt the code in https://github.com/pystatgen/sgkit/issues/1130 to run using a ProcessPoolExecutor.
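A sketch of that adaptation (`convert_part` is a hypothetical stand-in for the real per-part conversion in the linked issue's code):

```python
import concurrent.futures

def convert_part(part_index):
    """Hypothetical stand-in: parse one VCF region, write one zarr part."""
    ...

def convert_all(num_parts, max_workers=16):
    # Each part runs in a separate process. On Python 3.11+ you can also pass
    # max_tasks_per_child=1 so a leaky process is recycled after every task.
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(convert_part, i): i for i in range(num_parts)}
        for fut in concurrent.futures.as_completed(futures):
            fut.result()  # surfaces any worker exception immediately
```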
Would that be resilient to failures due to memory leaks @tomwhite? I guess we could use …
It wouldn't be resilient to memory leaks, so setting …
I think a pragmatic solution here might be to let dask redo the tasks, but have the task itself realise that the zarr chunk has already been written and to return immediately. This could be done by writing a "success" file just after the zarr chunk is written out, then checking for its existence before starting the parse. I'll code up a concept PR for this.
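A minimal sketch of that idea (the names here are hypothetical; the concept PR has the real integration):

```python
from pathlib import Path

def write_part_idempotently(part_dir, write_chunk):
    # part_dir: where this zarr part goes; write_chunk: callable doing the
    # actual VCF parse + write. Both are hypothetical stand-ins.
    part_dir = Path(part_dir)
    marker = part_dir / ".success"
    if marker.exists():
        # A rescheduled task finds the work already done and returns at once.
        return
    write_chunk(part_dir)
    # Written only after a complete part, so a crash mid-write leaves no
    # marker and the retried task legitimately redoes the chunk.
    marker.touch()
```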
Ok @jeromekelleher please try …
I'm ok with this as a way to move forward, but it seems that there is still a major bug lurking (possibly in Dask) that needs getting to the bottom of.
I finally got this to work 🎉 I thought initially that it had failed again, given the output: …

but it seems that the process actually exited successfully and we have an sgkit dataset of the right size. Definitely room for improvement here...
@benjeffery I wonder if the suggestions here would help with the side effects problem: dask/distributed#8164 (comment)
Matt Rocklin over at dask/dask#10654 suggests using …
I think we should close this issue because it's an sgkit problem, not a particular issue on this repo. Have we captured the required context in sgkit issues @benjeffery?
I've written it up at https://github.com/pystatgen/sgkit/issues/1154 as there was no issue on sgkit specifically for this.
Running the sgkit conversion for 1M samples is taking weeks, and it looks like it's spilling to disk a lot. It needs some tweaking so that it'll run on my server (256G of RAM).

Any pointers on what to do here @tomwhite, @benjeffery? Code is here: https://github.com/pystatgen/sgkit-publication/blob/c9fe141764abd5b016d39d09c90f6b99b0d30a3c/src/convert.py#L27