[Data] Heavy data skew after groupby #45303

Open
bveeramani opened this issue May 13, 2024 · 0 comments
Labels
bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks)

Comments

@bveeramani
Member

What happened + What you expected to happen

I called map_groups after a groupby and observed that most tasks finished immediately, while a few kept running for a very long time. The cause is heavy data skew across the output blocks after the groupby (see the reproduction script below).

Versions / Dependencies

0702f7c

Reproduction script

import ray

# Sort by the two key columns, then fetch references to the output blocks.
refs = (
    ray.data.read_parquet("data.parquet")
    .sort(["spam", "ham"])
    .get_internal_block_refs()
)
# Print the number of rows in each output block.
print([len(ray.get(ref)) for ref in refs])
# Blocks contain either ~500 or ~8500 rows
# [620, 8495, 647, 8623, 8570, 314, 8308, 8391, 443, 8579, 395, 8498, 8628, 378, 8522, 8510, 312, 8596, 355, 11]

data.parquet.zip
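
To put a number on the skew, here is a minimal sketch that summarizes the block sizes printed by the reproduction script. It is plain Python and does not need Ray. The helper name `summarize_block_skew` and the 4000-row cutoff between "small" and "large" blocks are assumptions made up for this example; they are not part of Ray Data.

```python
from statistics import mean, median

# Row counts per block, copied from the reproduction script's output.
BLOCK_ROWS = [
    620, 8495, 647, 8623, 8570, 314, 8308, 8391, 443, 8579,
    395, 8498, 8628, 378, 8522, 8510, 312, 8596, 355, 11,
]


def summarize_block_skew(block_rows, large_threshold=4000):
    """Summarize how unevenly rows are spread across blocks.

    `large_threshold` is an arbitrary cutoff (an assumption for this
    example) used to split blocks into "small" and "large".
    """
    total_rows = sum(block_rows)
    large_blocks = [n for n in block_rows if n >= large_threshold]
    avg_rows = mean(block_rows)
    return {
        "num_blocks": len(block_rows),
        "total_rows": total_rows,
        "min_rows": min(block_rows),
        "max_rows": max(block_rows),
        "mean_rows": avg_rows,
        "median_rows": median(block_rows),
        "num_large_blocks": len(large_blocks),
        # Fraction of all rows that live in the large blocks.
        "large_block_row_share": sum(large_blocks) / total_rows,
        # How much bigger the largest block is than an evenly sized one.
        "max_over_mean": max(block_rows) / avg_rows,
    }


if __name__ == "__main__":
    for name, value in summarize_block_skew(BLOCK_ROWS).items():
        print(f"{name}: {value}")
```

If each downstream task processes one block, the tasks assigned to the large blocks would be expected to run much longer than the rest, which matches the straggler behavior described above.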

Issue Severity

None

bveeramani added the bug, P1, and data labels and removed the triage label on May 13, 2024
bveeramani self-assigned this on May 13, 2024