
Pandarallel very slow after loading huge dataframe #256

Open
2 tasks done
lpuglia opened this issue Nov 11, 2023 · 2 comments

Comments


lpuglia commented Nov 11, 2023

General

  • Operating System: Ubuntu
  • Python version: 3.11
  • Pandas version: 2.0.3
  • Pandarallel version: main branch

Acknowledgement

  • My issue is NOT present when using pandas alone (without pandarallel)
  • If I am on Windows, I read the Troubleshooting page
    before writing a new bug report

Bug description

Initialization:

import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

def custom_function(row):
    return row['column1'] + row['column2']

df2 = pd.read_csv('small.csv')  # a few-MB file

the following code:

df2['column2'] = df2.parallel_apply(custom_function, axis=1)

takes about 5 seconds to run. But if I do:

import pandas as pd
from pandarallel import pandarallel
pandarallel.initialize(progress_bar=True)

def custom_function(row):
    return row['column1'] + row['column2']

df1 = pd.read_csv('huge.csv')   # hundreds-of-GB CSV file
df2 = pd.read_csv('small.csv')  # a few-MB file

the same code:

df2['column2'] = df2.parallel_apply(custom_function, axis=1)

takes about 1 minute to run.

Is this by design? Why would loading a huge dataframe impact the runtime on the second dataframe? Is there some hidden state that I'm not considering? Is there a way to avoid the issue?
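One hypothesis (my assumption; the thread does not confirm it) is that pandarallel starts its workers by forking the parent process, and fork cost grows with the parent's resident memory because the kernel must duplicate its page tables. A stdlib-only sketch for measuring that effect on your own machine, with a `bytearray` ballast standing in for the huge dataframe (no pandarallel involved; timings will vary):

```python
import multiprocessing as mp
import time


def _noop():
    pass


def time_fork(ctx):
    """Time one start/join of a trivial worker process under the given context."""
    t0 = time.perf_counter()
    p = ctx.Process(target=_noop)
    p.start()
    p.join()
    return time.perf_counter() - t0


if __name__ == "__main__":
    ctx = mp.get_context("fork")  # the default start method on Linux
    print(f"fork/join with small RSS:   {time_fork(ctx) * 1e3:.1f} ms")
    # ~256 MB of resident memory, a stand-in for the huge frame
    ballast = bytearray(256 * 1024 * 1024)
    print(f"fork/join with ~256 MB RSS: {time_fork(ctx) * 1e3:.1f} ms")
```

If the second timing is meaningfully larger on your box, that would support the fork-overhead theory; if not, the slowdown likely comes from elsewhere (e.g. swapping, as suggested below in the thread).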

nalepae (Owner) commented Jan 23, 2024

Pandaral·lel is looking for a maintainer!
If you are interested, please open a GitHub issue.

@shermansiu

Can you check your RAM usage with htop? It's possible that your computer slowed down because it was holding a huge dataframe in memory.
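For a programmatic alternative to htop, the stdlib `resource` module (POSIX-only) can report the process's peak resident set size. A small sketch; note the `ru_maxrss` units differ by platform:

```python
import resource  # POSIX-only; not available on Windows
import sys


def peak_rss_mib():
    """Peak resident set size of this process in MiB.

    Linux reports ru_maxrss in KiB; macOS reports it in bytes.
    """
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 if sys.platform.startswith("linux") else 1024 * 1024
    return raw / divisor


print(f"peak RSS so far: {peak_rss_mib():.1f} MiB")
```

Printing this right after `pd.read_csv('huge.csv')` and again before the `parallel_apply` call would show how much memory the huge frame is actually holding.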
