URLExtract() init really slow #129

Open
gilbd opened this issue May 19, 2022 · 0 comments

Hi, when I use URLExtract() inside a function to parse a DataFrame column, it runs really slowly.
Here is my code:

from urlextract import URLExtract

def extract_urls(lst):
    # A new extractor is created on every call, i.e. once per row
    extractor = URLExtract()
    count = 0
    for text in lst:
        urls_found = extractor.find_urls(text)
        if len(urls_found) > 0 and MY_URL in urls_found:
            count += len(urls_found)
    return count

df['col2'] = df['col1'].apply(extract_urls)

Most of the time goes into constructing URLExtract(): each new instance loads the TLD list and acquires the FileLocks again.
Could this object be made a singleton?
Another idea is to load the TLDs only once by making the TLD-loading object a singleton.
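
To illustrate the "create it only once" idea, here is a minimal sketch of a cached factory. It does not touch the library itself: `SlowExtractor` is a hypothetical stand-in for any object with an expensive constructor, and `get_extractor` and `count_urls` are names made up for this example.

```python
from functools import lru_cache


class SlowExtractor:
    """Hypothetical stand-in for an object with an expensive constructor."""

    instances_created = 0

    def __init__(self):
        # Imagine the TLD loading and file locking happening here.
        SlowExtractor.instances_created += 1

    def find_urls(self, text):
        # Toy matcher for the demo only, not real URL detection logic.
        return [word for word in text.split() if word.startswith("http")]


@lru_cache(maxsize=None)
def get_extractor():
    """Build the extractor on the first call, then return the same instance."""
    return SlowExtractor()


def count_urls(texts, wanted_url):
    extractor = get_extractor()  # cheap after the first call
    count = 0
    for text in texts:
        urls_found = extractor.find_urls(text)
        if wanted_url in urls_found:
            count += len(urls_found)
    return count
```

With the real library, the assumption is that returning `URLExtract()` from `get_extractor` would behave the same way, so the per-row function pays the construction cost only once. Whether a shared instance is safe across threads or processes is not something this sketch addresses.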
