
Weirdly high memory usage with flox #346
Open
ivirshup opened this issue Mar 26, 2024 · 5 comments

@ivirshup

I'm getting unexpectedly high memory usage with flox. Here's what I've been doing:

import dask.distributed as dd
import dask.array as da
import numpy as np
import flox

cluster = dd.LocalCluster(n_workers=3)
client = dd.Client(cluster)

M, N = 1_000_000, 20_000

# ~160 GB of float64 in 100 chunks of 10,000 x 20,000 (~1.5 GiB each)
X = da.random.normal(size=(M, N), chunks=(10_000, N))
by = np.random.choice(5_000, size=M)

# group the M columns of X.T into 5,000 groups and sum within each
res, codes = flox.groupby_reduce(
    X.T,
    by,
    func="sum",
    fill_value=0,
    method="map-reduce",
    reindex=True,
)

res_comp = res.compute()

This always warns about memory usage and then fails on my dev machine, which has 64 GB of memory. However, I'm able to do plenty of other operations with an array this size (e.g. PCA, simple reductions). To me, a map-reduce tree reduction should be more than capable of handling an array this size.

Is this just me and my compute being odd, or do I have an incorrect expectation here?

cc: @ilan-gold

@dcherian
Collaborator

dcherian commented Mar 26, 2024

You're starting with 1.5 GiB chunks on X; I would reduce that to the ~200 MB range. The bottleneck here is usually numpy_groupies. You should see input_validation prominently in the dask flamegraph.
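
E.g. (a sketch; 1,000 x 20,000 float64 works out to 160 MB per chunk):

# ~160 MB chunks (1_000 * 20_000 * 8 bytes) instead of ~1.5 GiB
X = da.random.normal(size=(M, N), chunks=(1_000, N))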

So I would also try installing numbagg. It'll be a bit slow to compile, but it should be faster and make fewer memory copies.

@dcherian
Collaborator

Running this locally, I also spot a dask scheduling bug where it doesn't treat normal as a data-generating task and runs way too many of them initially before reducing that data. Can you open a dask issue, please?

@dcherian
Collaborator

Ah, I keep forgetting that numbagg only helps with NaN-skipping aggregations, so it won't really help here.
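
For the record, where it would help is a NaN-skipping reduction, selected via flox's engine keyword (a sketch):

# numbagg only accelerates NaN-skipping kernels, e.g. nansum
res, codes = flox.groupby_reduce(
    X.T,
    by,
    func="nansum",
    fill_value=0,
    method="map-reduce",
    reindex=True,
    engine="numbagg",
)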

I think this is a dask scheduling issue.

@ivirshup
Author

I also spot a dask scheduling bug where it doesn't treat normal as a data-generating task and runs way too many of them initially before reducing that data.

In my real-world use case, I hit this just loading data from a zarr store.
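
Roughly this shape of pipeline (a sketch; the store path here is a stand-in for my real data):

# hypothetical path; the real input is a large 2-D zarr array
X = da.from_zarr("data/X.zarr")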

I think this is a dask scheduling issue.

Me too, but I'm not sure why flox seems to be triggering it. In the dask issue I show that other tree reductions over this array (e.g. X.sum(axis=0)) seem fine.
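
For comparison (same array, plain tree reduction):

# a plain tree reduction over the same array completes without issue
X.sum(axis=0).compute()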

@dcherian
Collaborator

Your last comment is important context (the zarr bit in particular)! I would add that to the other issue.
