
Getting wrong URL when there is dot before url #36

Open
warproxxx opened this issue Mar 3, 2019 · 8 comments

Comments

@warproxxx

For this text:
extractor.find_urls("My name is claim...https://t.co/SZlazvFzYx")

URL extractor returns:
['claim...https://t.co/SZlazvFzYx']

@lipoja
Owner

lipoja commented Mar 3, 2019

Thanks for reporting this issue. I see the problem, but it is not easy to solve. The main reason is that the string is not typographically correct: there should be a space after the dot (or after the three dots).

I will keep it in mind when I make changes. Honestly, I will try to solve other issues first and then this one.

Or, if you have some free time, you can send me a pull request with a fix. I would appreciate any help!
Thank you!

@warproxxx
Author

I have a rather inelegant solution that uses a regex to collapse runs of two or more consecutive dots. Would that be an acceptable fix? If so, I will send a pull request.
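The workaround described above might look like this minimal sketch (`collapse_dots` is a hypothetical helper name, not actual urlextract code):

```python
import re

def collapse_dots(text: str) -> str:
    """Collapse runs of two or more dots into a dot plus a space,
    so "claim...https://..." splits into separate tokens before
    extraction. Caveat: applied globally, this would also mangle
    any URL that legitimately contains consecutive dots."""
    return re.sub(r"\.{2,}", ". ", text)

print(collapse_dots("My name is claim...https://t.co/SZlazvFzYx"))
# -> My name is claim. https://t.co/SZlazvFzYx
```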

@lipoja
Owner

lipoja commented Mar 5, 2019

Well, it depends on where the regexp removes the dots. Globally removing every run of two or more dots is not a general solution and may give incorrect results.
You have to be sure that you are removing only dots which are not part of the URL.

@Larrax

Larrax commented Jun 24, 2019

I have run into many incorrectly extracted URLs because of this issue. What's more, the dot is not the only problem; the at sign, colon, plus, etc. cause it too. With the following input...

Visit us @www.example.com
Visit our website:www.example.com
Visit our website-www.example.com
Visit our website*www.example.com
Visit our website+www.example.com
Visit our website...www.example.com
Nonsense URL = '.example.com'

find_urls outputs this list...

@www.example.com
website:www.example.com
website-www.example.com
website*www.example.com
website+www.example.com
website...www.example.com
.example.com

And there might be more.

@nicolasassi

Any updates on that matter?

@lipoja
Owner

lipoja commented Jun 16, 2021

@nicolasassi Not right now.
I do not have a clean solution for it yet, partly due to the limited free time I can spend on urlextract.

However, if somebody has an idea of how to solve this issue properly, feel free to start a discussion or create a PR so we can discuss the actual code.


@nicolasassi What I am referring to is that I do not have a good way to determine where the domain starts when there is no whitespace; I cannot figure out any good general solution. But if you know what you are parsing, you can pre-process the text, or you can try set_stop_chars_left() and add stop characters that are specific to your use case.
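A minimal sketch of that suggestion, assuming urlextract is installed (`pip install urlextract`) and exposes `get_stop_chars_left()` / `set_stop_chars_left()`; the extra characters are just this thread's examples, not a recommended default:

```python
# Guarded import so the sketch degrades gracefully when the
# third-party urlextract package is not installed.
try:
    from urlextract import URLExtract
except ImportError:
    URLExtract = None

# Separators from Larrax's examples above; tune for your own input.
EXTRA_STOP_CHARS = {":", "*", "+", "@", "-"}

if URLExtract is not None:
    extractor = URLExtract()
    stop_left = extractor.get_stop_chars_left()
    extractor.set_stop_chars_left(stop_left | EXTRA_STOP_CHARS)
    print(extractor.find_urls("Visit our website:www.example.com"))
```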

Do you think it would help in your case?

@nicolasassi

nicolasassi commented Jun 16, 2021

@lipoja this is a tricky problem indeed... In my opinion, pre-processing the text or using set_stop_chars_left() kind of defeats the purpose of the lib. What do you think?

I'm going to put some thought into this problem... maybe we can find a way to fix it in the parser.

@lipoja
Owner

lipoja commented Jun 16, 2021

I look at this library as a tool that returns as many domains as it finds, even when they are "wrong", leaving some post-processing to the user.

We are trying to cover all general issues without limiting the number of returned results. Of course, the results should be correct where possible, but I would rather return a domain that contains other text (e.g. website+www.example.com) than return nothing at all. That way users can at least see what was found and tune their parser or do some filtering.
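A sketch of that filtering idea (`clean` is a hypothetical helper, not part of urlextract): trim everything before the first scheme or "www." marker in each candidate.

```python
def clean(candidate: str) -> str:
    """Trim leading text that got glued to a candidate URL
    (e.g. "website+www.example.com"). The marker list is
    deliberately minimal; candidates without a known marker
    are returned unchanged."""
    for marker in ("https://", "http://", "www."):
        idx = candidate.find(marker)
        if idx > 0:
            return candidate[idx:]
    return candidate

print(clean("website+www.example.com"))  # -> www.example.com
print(clean(".example.com"))             # unchanged; no known marker
```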

But I would really appreciate any help in any form (discussion on some ideas, PRs, .... ).

Thank you!
