
Optimize split_every #279

Open
dcherian opened this issue Oct 20, 2023 · 4 comments

Comments


dcherian commented Oct 20, 2023

The split_every parameter controls how many blocks are combined at every combine stage.

If we know where the group labels are (i.e., the number of groups in each block, and the number of elements per group in each block), we can estimate the memory use of the intermediates for a given reduction, and optimize split_every to reduce the graph size.
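To make the trade-off concrete, here is a rough sketch (helper name and numbers are hypothetical, not flox's actual implementation) of how split_every controls the depth and size of a tree reduction:

```python
import math

def tree_reduce_stats(nblocks: int, split_every: int) -> tuple[int, int]:
    """Estimate (depth, combine-task count) of a tree reduction.

    At each stage, groups of up to `split_every` blocks are
    combined into one, until a single block remains.
    """
    depth = tasks = 0
    n = nblocks
    while n > 1:
        n = math.ceil(n / split_every)
        tasks += n
        depth += 1
    return depth, tasks

# A larger split_every gives a shallower tree and fewer combine
# tasks, but each combine task must hold more intermediates in
# memory at once -- hence the opportunity to tune it.
for k in (2, 8, 32):
    print(k, tree_reduce_stats(1000, k))
```

Given an estimate of the intermediate size per block, the peak memory of a combine task is roughly split_every × (intermediate size), which is the quantity one would bound when picking split_every.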

@mrocklin do you think this would be a decent win?

@tomwhite Is there a version of this in cubed?

@mrocklin

In the dataframe world, split_every is less about reducing graph size and more about reducing waiting on a single machine. It tends to be really important for clusters with many workers and datasets with many partitions. In practice, any moderate value is fine; 8–32 is probably typical.

The graph size thing here tends not to be a big deal. We're talking about at most n tasks.

I may not understand the situation in the xarray case though.

@dcherian

Interesting. Are there any network considerations? Do we prefer sending many small things over the network, or fewer bigger things? It seems like the latter, unless the workers end up waiting too long?

That National Water Model workload would be a good thing to experiment with. It has both many workers and reduces over many partitions. IIRC we can set split_every with the context manager?

@mrocklin

Mostly, workers don't like it when 100 other machines try to send them stuff all at the same time. It's better to have other folks accept batches, reduce them down, and then pass them on.

The negative side of this is that there are more sequential hops, but only log_k as many.
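In hypothetical numbers, for a 1024-partition reduction (a toy sketch, not dask's actual code):

```python
def stages(n: int, k: int) -> int:
    """Sequential combine stages (hops) when each task merges up to k inputs."""
    s = 0
    while n > 1:
        n = -(-n // k)  # ceiling division
        s += 1
    return s

# Flat reduction: one task must accept all 1024 inputs at once.
print(stages(1024, 1024))  # 1 hop, but a fan-in of 1024
# Tree reduction with split_every=8: fan-in capped at 8.
print(stages(1024, 8))     # 4 hops
```

So capping the fan-in at k multiplies the number of sequential hops only logarithmically, while bounding how many machines send to any one worker at once.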

And then obviously if your intermediates are large, then you won't want to send very many of them to one task.

@mrocklin

> That National Water Model workload would be a good thing to experiment with. It has both many workers and reduces over many partitions. IIRC we can set split_every with the context manager

I know a place where it's easy to experiment 🙂
