
VCF parsing memory usage #1168

Open

benjeffery opened this issue Jan 11, 2024 · 7 comments
@benjeffery
Collaborator

I'm attempting to parse some large VCFs. Initial attempts failed due to dask worker memory exhaustion. I'll detail the results of my investigation here.

@benjeffery
Collaborator Author

Instrumenting sgkit to log memory allocations in the VCF reader results in the following (with default chunk sizes):
[Screenshot: per-field memory allocation log, 2024-01-11]
(Forgive the screenshot; this code is running in a secure environment that doesn't allow copy-paste.)

These sum to 47 GB and the sgkit process is using 48 GB, so we have accounted for most of the RAM usage.
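For anyone wanting to reproduce this, here is a minimal sketch of this kind of instrumentation using Python's built-in tracemalloc. This is not the exact code used above (which I can't paste out), just an illustration of the approach:

```python
# Rough sketch of logging allocations with Python's built-in tracemalloc;
# not the exact instrumentation used above.
import tracemalloc

def log_top_allocations(limit: int = 10) -> None:
    """Print the largest current allocations, grouped by source line."""
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("lineno")[:limit]:
        print(stat)

tracemalloc.start()
# ... run the VCF parsing step under investigation here ...
log_top_allocations()
current, peak = tracemalloc.get_traced_memory()
print(f"current={current / 2**30:.2f} GiB, peak={peak / 2**30:.2f} GiB")
```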

I will now see if chunk size changes have the expected effect.

@benjeffery
Collaborator Author

As expected, chunk_width has no effect, as all the samples are read at once.

@benjeffery
Collaborator Author

Halving read_chunk_length has halved these numbers, although I think this could be misleading for peak RAM, as a chunk of temp_chunk_length has to be in memory for the Zarr append operation. I think it is worth getting an mprof plot of memory over time.
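For reference, such a plot can be captured with `mprof run script.py` followed by `mprof plot`, or programmatically with memory_profiler (the package behind the mprof CLI). A sketch, where the paths and chunk arguments are illustrative only:

```python
# Sketch: sample process memory over time during conversion, using
# memory_profiler (the library behind mprof). Paths and chunk
# arguments are illustrative only.
from memory_profiler import memory_usage
from sgkit.io.vcf import vcf_to_zarr

samples_mib = memory_usage(
    (vcf_to_zarr, ("input.vcf.gz", "output.zarr"), {"chunk_length": 2000}),
    interval=0.5,  # sample memory every 0.5 seconds
)
print(f"peak RSS: {max(samples_mib) / 1024:.1f} GiB")
```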

@jeromekelleher
Collaborator

CALL_AD and CALL_PL are pretty weighty: is there an argument for doing these super-heavy fields on their own in iterative passes? (OK, you read the VCF multiple times, but hey.)
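Something like the following, assuming the fields argument to vcf_to_zarr can be used to restrict what each pass reads (output paths are illustrative):

```python
# Sketch of splitting the conversion into passes so the heavy FORMAT
# fields (AD and PL) are handled on their own. Assumes the fields
# argument restricts what each pass reads; output paths are illustrative.
from sgkit.io.vcf import vcf_to_zarr

# Pass 1: everything except the heavy fields.
vcf_to_zarr(
    "input.vcf.gz",
    "main.zarr",
    fields=["INFO/*", "FORMAT/GT", "FORMAT/DP"],
)

# Pass 2: the heavy fields alone, re-reading the VCF.
vcf_to_zarr(
    "input.vcf.gz",
    "heavy.zarr",
    fields=["FORMAT/AD", "FORMAT/PL"],
)
```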

@benjeffery
Collaborator Author

benjeffery commented Jan 11, 2024

I think it depends on the RAM/CPU tradeoff of a small read_chunk_length, and whether that is enough while still having a larger temp_chunk_length. I'm trying to get some data on that now.
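Concretely, the kind of configuration under test looks like this, using the parameter names from this thread (the exact keywords may differ between sgkit versions):

```python
# Sketch of the tradeoff under test: small read chunks to bound RAM,
# with a larger temporary/output chunk length. Parameter names follow
# this thread; check your sgkit version for the exact keywords.
from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    "input.vcf.gz",
    "output.zarr",
    read_chunk_length=500,   # small in-memory read chunks
    temp_chunk_length=2000,  # larger chunks for the temporary store
    chunk_length=2000,       # final on-disk Zarr chunk size (variants)
)
```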

@hammer
Contributor

hammer commented Jan 11, 2024

When the investigation resolves, would you mind documenting how you instrumented sgkit to log memory allocations in the VCF reader? I’d like to do the same for PLINK.

@benjeffery
Collaborator Author

benjeffery commented Jan 16, 2024

Ok, I've done some further digging with mprof:

Here is vcf_to_zarr_sequential (so no dask involved) parsing a chunk of 10,000 variants:
[mprof plot: r2000l2000]
This is with read_chunk_length == chunk_length == 2000. As expected, most of the RAM is in read-chunk storage (the cyan part), with an additional RAM requirement when that chunk is flushed to disk by to_zarr (green and yellow).

Reducing the read chunk costs a little time but saves some RAM; here is the same parse with read_chunk_length == 500:
[mprof plot: r500l5000]

I'm not sure I understand this, but increasing chunk_length to 5000 doesn't add much memory usage:
[mprof plot: r500l2500]

I'll double-check some things here to make sure that change has propagated.

It looks like I'll be able to parse this large VCF, but I may need to do some FORMAT fields in a second pass.
