
VCF parsing memory usage #1168

Open

benjeffery opened this issue Jan 11, 2024 · 7 comments
@benjeffery
Collaborator

I'm attempting to parse some large VCFs. Initial attempts failed due to dask worker memory exhaustion. I'll detail the results of my investigation here.

@benjeffery
Collaborator Author

Instrumenting sgkit to log memory allocations in the VCF reader results in the following (with default chunk sizes):
[Screenshot: per-field memory allocation log, 2024-01-11]
(Forgive the screenshot; this code is running in a secure environment that doesn't allow copy-paste.)

These sum to 47 GB and the sgkit process is using 48 GB, so we have accounted for most of the RAM usage.
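For anyone wanting to reproduce this, here is a minimal sketch of this kind of instrumentation using Python's built-in tracemalloc. This is not the exact code used above (which I can't paste out), just an illustration of the approach:

```python
# Rough sketch of logging allocations with Python's built-in tracemalloc;
# not the exact instrumentation used above.
import tracemalloc

def log_top_allocations(limit: int = 10) -> None:
    """Print the largest current allocations, grouped by source line."""
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:limit]:
        print(stat)

tracemalloc.start()
# ... run the VCF parsing step under investigation here ...
log_top_allocations()
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 2**30:.2f} GiB, peak={peak / 2**30:.2f} GiB")
```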

I will now see if chunk size changes have the expected effect.

@benjeffery
Collaborator Author

As expected, chunk_width has no effect, as all the samples are read at once.

@benjeffery
Collaborator Author

Halving read_chunk_length has halved these numbers, although I think this could be misleading for peak RAM, as a chunk of temp_chunk_length has to be in memory for the Zarr append operation. I think it is worth getting an mprof plot of memory over time.
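For reference, such a plot can be captured with `mprof run script.py` followed by `mprof plot`, or programmatically with memory_profiler (the package behind the mprof CLI). A sketch, where the paths and chunk arguments are illustrative only:

```python
# Sketch: sample process memory over time during conversion, using
# memory_profiler (the library behind mprof). Paths and chunk
# arguments are illustrative only.
from memory_profiler import memory_usage
from sgkit.io.vcf import vcf_to_zarr

samples_mib = memory_usage(
    (vcf_to_zarr, ("input.vcf.gz", "output.zarr"), {"chunk_length": 2000}),
    interval=0.5,  # sample memory every 0.5 seconds
)
print(f"peak RSS: {max(samples_mib) / 1024:.1f} GiB")
```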

@jeromekelleher
Collaborator

CALL_AD and CALL_PL are pretty weighty: is there an argument for doing these super-heavy fields on their own in iterative passes? (OK, you read the VCF multiple times, but hey.)
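Something like the following, assuming the fields argument to vcf_to_zarr can be used to restrict what each pass reads (output paths are illustrative):

```python
# Sketch of splitting the conversion into passes so the heavy FORMAT
# fields (AD and PL) are handled on their own. Assumes the fields
# argument restricts what each pass reads; output paths are illustrative.
from sgkit.io.vcf import vcf_to_zarr

# Pass 1: everything except the heavy fields.
vcf_to_zarr(
    "input.vcf.gz",
    "main.zarr",
    fields=["INFO/*", "FORMAT/GT", "FORMAT/DP"],
)

# Pass 2: the heavy fields alone, re-reading the VCF.
vcf_to_zarr(
    "input.vcf.gz",
    "heavy.zarr",
    fields=["FORMAT/AD", "FORMAT/PL"],
)
```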

@benjeffery
Collaborator Author

benjeffery commented Jan 11, 2024

I think it depends on the RAM/CPU tradeoff of a small read_chunk_length, and whether that is enough while still having a larger temp_chunk_length. I'm trying to get some data on that now.
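Concretely, the kind of configuration under test looks like this, using the parameter names from this thread (the exact keywords may differ between sgkit versions):

```python
# Sketch of the tradeoff under test: small read chunks to bound RAM,
# with a larger temporary/output chunk length. Parameter names follow
# this thread; check your sgkit version for the exact keywords.
from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    "input.vcf.gz",
    "output.zarr",
    read_chunk_length=500,   # small in-memory read chunks
    temp_chunk_length=2000,  # larger chunks for the temporary store
    chunk_length=2000,       # final on-disk Zarr chunk size (variants)
)
```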

@hammer
Contributor

hammer commented Jan 11, 2024

When the investigation resolves, would you mind documenting how you instrumented sgkit to log memory allocations in the VCF reader? I’d like to do the same for PLINK.

@benjeffery
Collaborator Author

benjeffery commented Jan 16, 2024

Ok, I've done some further digging with mprof:

Here is vcf_to_zarr_sequential (so no dask involved) parsing a chunk of 10,000 variants:
[mprof plot: r2000l2000]
This is with read_chunk_length == chunk_length == 2000. As expected, most of the RAM is in read-chunk storage (the cyan part), with an additional RAM requirement when that chunk is flushed to disk by to_zarr (green and yellow).

Reducing the read chunk costs a little time but saves some RAM; here is the same parse with read_chunk_length == 500:
[mprof plot: r500l5000]

I'm not sure I understand this, but increasing chunk_length to 5000 doesn't add much memory usage:
[mprof plot: r500l2500]

I'll double-check some things here to make sure that change has propagated.

It looks like I'll be able to parse this large VCF, but I may need to do some FORMAT fields in a second pass.
