aggregation operations not working with many groups #7164

Open
abhi211199 opened this issue Apr 10, 2024 · 1 comment
Labels
External Pull requests and issues from people who do not regularly contribute to modin question ❓ Questions about Modin

Comments

@abhi211199

Hi, I'm new to modin.pandas. I'm trying to perform several aggregation operations on groups that were split from a dataframe using df.groupby(). The aggregation works fine when I have a small number of groups, but it fails to finish when there are more than 1000 groups.
Here is an example that gives an idea of the code:

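import modin.pandas as pd

# Build 1999 groups (A = 1..1999) with 9 rows each (B = 1..9).
array1 = []
array2 = []
for i in range(1, 2000):
    for j in range(1, 10):
        array1.append(i)
        array2.append(j)

df = pd.DataFrame({'A': array1, 'B': array2})
groups = df.groupby('A')

def aggr(df):
    # Add a column holding the sum of B for this group.
    df['C'] = df['B'].sum()
    return df

# Iterate over every group and aggregate it; this loop is the slow part.
results = []
for key, group in groups:
    results.append(aggr(group))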

When executed, this runs on and on without finishing:
[Screenshot 2024-04-10 at 2 36 20 PM]

Plain pandas doesn't have this issue.

Please let me know if I'm missing something that is needed when handling a large number of groups. Thanks
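
For reference, the per-group sum that the loop above computes can also be written without iterating over the groups. The following is a minimal sketch in plain pandas using groupby(...).transform("sum"), on a small stand-in frame rather than the full data. Modin mirrors the pandas API, so the same call should work with import modin.pandas as pd, but whether it avoids the behavior reported here is an untested assumption.

```python
import pandas as pd  # assumption: swap in `import modin.pandas as pd` to try this under Modin

# Small stand-in for the frame in the report: 4 groups (A) of 3 rows each (B = 1..3).
df = pd.DataFrame({
    "A": [i for i in range(1, 5) for _ in range(3)],
    "B": [j for _ in range(1, 5) for j in range(1, 4)],
})

# transform("sum") returns one value per original row:
# the sum of B within that row's group, aligned to the original index.
df["C"] = df.groupby("A")["B"].transform("sum")

print(df.head())
```

This keeps the work in a single groupby call instead of one Python-level step per group, which is usually the more idiomatic form in pandas.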

@abhi211199 abhi211199 added question ❓ Questions about Modin Triage 🩹 Issues that need triage labels Apr 10, 2024
@anmyachev
Collaborator

Hi @abhi211199, thanks for the question!

What version of Modin are you using?
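
One way to collect the version information is sketched below, using only the standard library. The helper name installed_version is hypothetical, and the list of packages is an assumption about what might be relevant (Modin, pandas, and the Ray or Dask execution engines); adjust it to your setup.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Assumption: these are the packages of interest; the engine you use may differ.
for name in ("modin", "pandas", "ray", "dask"):
    print(name, installed_version(name))
```

If Modin is importable, print(modin.__version__) after import modin should report the same version.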

Spilling occurs when there is not enough memory to store all objects in distributed storage. So there are two possibilities: either you simply do not have enough RAM to process this amount of data, or Modin uses much more memory than it should in this case. Please provide this information so that we can better understand the problem.

Please note that spilled data in the temporary folder is not cleaned up automatically. It has to be deleted manually, so the temporary folder may have filled up with data from previous runs, which could have prevented your last run from completing.
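
To check whether leftover data is the problem, something like the following sketch can report the free space on the temporary folder's disk and the size of a given directory. It uses only the standard library. The helper dir_size_bytes is hypothetical, and the actual spill location depends on the execution engine and its configuration, so the use of the system temporary directory here is an assumption.

```python
import os
import shutil
import tempfile


def dir_size_bytes(path):
    """Total size of files under `path`, skipping entries that vanish or are unreadable."""
    total = 0
    for root, _dirs, files in os.walk(path, onerror=lambda err: None):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file was removed or is not accessible; ignore it
    return total


# Assumption: the engine spills somewhere under the system temporary directory.
tmp = tempfile.gettempdir()
usage = shutil.disk_usage(tmp)
print(f"{tmp}: {usage.free / 1e9:.1f} GB free of {usage.total / 1e9:.1f} GB")

# To measure a specific directory, pass its path, for example:
# print(dir_size_bytes("/path/to/spill/dir"))  # hypothetical path
```

Delete old data only after confirming that no other job is still using that directory.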

@anmyachev anmyachev added External Pull requests and issues from people who do not regularly contribute to modin and removed Triage 🩹 Issues that need triage labels Apr 12, 2024