Parallel_apply gets stuck #263

Open
2 tasks done
zeinabsobhani opened this issue Jan 18, 2024 · 4 comments

Comments

zeinabsobhani commented Jan 18, 2024

General

  • Operating System: macOS
  • Python version: Python 3.11.6
  • Pandas version: 2.1.4
  • Pandarallel version: 1.6.5

Acknowledgement

  • My issue is NOT present when using pandas alone (without pandarallel)
  • If I am on Windows, I read the Troubleshooting page
    before writing a new bug report

Bug description

Observed behavior

I have two functions that I'm running with parallel_apply on my dataframe. Here are the functions:

  import re

  import tiktoken
  from pandarallel import pandarallel


  class Myclass:
      # Note: self.parallel and self.remove_tags are used below,
      # but their definitions were not included in this post.

      # *Method 1*
      @staticmethod
      def remove_newlines(txt):
          # Collapse runs of consecutive newlines into a single newline.
          txt = re.sub(r'[\n]+', '\n', txt)
          return txt

      def clean_text(self, txt):
          txt = self.remove_tags(txt)
          txt = self.remove_newlines(txt)
          return txt

      def clean_text_column(self, df_col):
          if self.parallel:
              pandarallel.initialize(progress_bar=True)
              df_col = df_col.parallel_apply(self.clean_text)
          else:
              df_col = df_col.apply(self.clean_text)
          return df_col

      # *Method 2*
      @staticmethod
      def get_tokenizer(model='cl100k_base'):
          return tiktoken.get_encoding(model)

      @staticmethod
      def get_tokens(text, tokenizer):
          # Return the number of tokens in the text.
          tokens = tokenizer.encode(
              text,
              disallowed_special=()
          )
          return len(tokens)

      def get_tokens_column(self, df_col):
          tokenizer = self.get_tokenizer()

          if self.parallel:
              pandarallel.initialize(progress_bar=True)
              df_col = df_col.parallel_apply(self.get_tokens, args=(tokenizer,))
          else:
              # there is an issue with pandarallel here.
              df_col = df_col.apply(self.get_tokens, args=(tokenizer,))
          return df_col

The first method runs ok with parallel_apply, but the second method gets stuck at 0% without raising any error.
[Image: Screenshot 2024-01-17 at 4 53 08 PM]
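
One variation that may be worth trying, under the unconfirmed assumption that the hang is related to handing the tokenizer object to the worker processes through args: let each worker build its own tokenizer on first use. This is only a sketch and not a confirmed fix. The names load_tokenizer and count_tokens are hypothetical, and the loader parameter exists only so the function can be exercised without tiktoken installed.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def load_tokenizer(model="cl100k_base"):
    """Build the tokenizer once per process, the first time it is needed."""
    import tiktoken  # imported lazily, so each process loads it for itself

    return tiktoken.get_encoding(model)


def count_tokens(text, model="cl100k_base", loader=load_tokenizer):
    """Return the number of tokens in text.

    The tokenizer is looked up inside the function instead of being
    passed in as an extra argument from the parent process.
    """
    tokenizer = loader(model)
    return len(tokenizer.encode(text, disallowed_special=()))
```

With this shape, the parallel branch would call df_col.parallel_apply(count_tokens) with no args, and the non-parallel branch would call df_col.apply(count_tokens).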

nalepae (Owner) commented Jan 23, 2024

Pandaral·lel is looking for a maintainer!
If you are interested, please open a GitHub issue.

tveinot commented Feb 15, 2024

I'm having the same issue with Python 3.9 on Windows 10 Pro, Pandarallel 1.6.5. In my case I have a very large spatial dataframe that is broken into chunks using numpy array_split. The process runs through each chunk and does its thing, but on the last chunk 2 of the 8 processes always stay frozen at 0%.

shermansiu commented

@zeinabsobhani Could you please include a sample dataframe to accompany your functions, along with all of the imports and, most importantly, the code where you are running parallel_apply? Providing a minimal code sample to reproduce the bug would be super useful.

@tveinot Do you still have this issue? If so, please open a separate issue and include a minimal but self-contained code snippet to reproduce the bug, along with a link to obtain the aforementioned large spatial dataframe. Opening a separate issue would help keep things organized, especially once we get into the technical weeds of trying to resolve both of your issues.
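
When narrowing a hang like this down to a minimal reproduction, one cheap check is whether the extra object passed through args survives serialization, since process pools generally send task arguments to workers in serialized form. The sketch below uses the standard library's pickle; whether pandarallel serializes arguments the same way is an assumption here, and pickle_report is a hypothetical helper name.

```python
import pickle
import time


def pickle_report(obj):
    """Try a pickle round trip of obj.

    Returns (ok, seconds, detail), where detail is the payload size in
    bytes on success or the repr of the exception on failure.
    """
    start = time.perf_counter()
    try:
        payload = pickle.dumps(obj)
        pickle.loads(payload)
    except Exception as exc:
        return False, time.perf_counter() - start, repr(exc)
    return True, time.perf_counter() - start, len(payload)
```

Calling pickle_report(tokenizer) before parallel_apply shows whether the round trip fails, is slow, or produces a large payload, any of which would be worth mentioning in the reproduction.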

tveinot commented Apr 27, 2024

@zeinabsobhani Sorry, this was a while ago for me, and I had forgotten that I mentioned my issue here thinking it might be relevant. Either a geometry was causing the memory to overrun (which is my personal belief), or rebuilding my environment and updating to QGIS 3.36, along with its Python components including geopandas, somehow resolved it.
