
Getting wrong URL when there is dot before url #36

Open
warproxxx opened this issue Mar 3, 2019 · 8 comments

Comments

@warproxxx

For this text:
extractor.find_urls("My name is claim...https://t.co/SZlazvFzYx")

URL extractor returns:
['claim...https://t.co/SZlazvFzYx']

@lipoja
Owner

lipoja commented Mar 3, 2019

Thanks for reporting this issue. I see the problem, but it is not easy to solve. The main reason is that the string is not typographically correct: there should be a space after the dot (or after the three dots).

I will keep it in mind when I make changes. Honestly, I will try to solve other issues first and then this one.

Or, if you have some free time, you can send me a pull request with a fix. I would appreciate any help!
Thank you!

@warproxxx
Author

I have a rather inelegant solution that uses a regex to collapse runs of two or more consecutive dots. Would that be an acceptable fix? If so, I will send a pull request.
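The workaround described above might look like this minimal sketch (`collapse_dots` is a hypothetical helper name, not actual urlextract code):

```python
import re

def collapse_dots(text: str) -> str:
    """Collapse runs of two or more dots into a dot plus a space,
    so "claim...https://..." splits into separate tokens before
    extraction. Caveat: applied globally, this would also mangle
    any URL that legitimately contains consecutive dots."""
    return re.sub(r"\.{2,}", ". ", text)

print(collapse_dots("My name is claim...https://t.co/SZlazvFzYx"))
# -> My name is claim. https://t.co/SZlazvFzYx
```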

@lipoja
Owner

lipoja commented Mar 5, 2019

Well, it depends on where the regexp removes the dots. Globally removing every run of two or more dots is not a general solution and may give incorrect results.
You have to be sure that you are removing only dots which are not part of the URL.

@Larrax

Larrax commented Jun 24, 2019

I have run into many incorrectly extracted URLs because of this issue. What's more, the dot is not the only problem; the at sign, colon, plus, etc. cause it too. With the following input...

Visit us @www.example.com
Visit our website:www.example.com
Visit our website-www.example.com
Visit our website*www.example.com
Visit our website+www.example.com
Visit our website...www.example.com
Nonsense URL = '.example.com'

find_urls outputs this list...

@www.example.com
website:www.example.com
website-www.example.com
website*www.example.com
website+www.example.com
website...www.example.com
.example.com

And there might be more.

@nicolasassi

Any updates on that matter?

@lipoja
Owner

lipoja commented Jun 16, 2021

@nicolasassi Not right now.
I do not have a clean solution for it yet, partly due to the limited free time I can spend on urlextract.

However, if somebody has an idea of how to solve this issue properly, feel free to start a discussion or create a PR so we can discuss the actual code.


@nicolasassi What I am referring to is that I do not have a good way to determine where the domain starts when there is no whitespace; I cannot figure out any good general solution. But if you know what you are parsing, you can pre-process the text, or you can try set_stop_chars_left() and add stop characters that are specific to your use case.
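A minimal sketch of that suggestion, assuming urlextract is installed (`pip install urlextract`) and exposes `get_stop_chars_left()` / `set_stop_chars_left()`; the extra characters are just this thread's examples, not a recommended default:

```python
# Guarded import so the sketch degrades gracefully when the
# third-party urlextract package is not installed.
try:
    from urlextract import URLExtract
except ImportError:
    URLExtract = None

# Separators from Larrax's examples above; tune for your own input.
EXTRA_STOP_CHARS = {":", "*", "+", "@", "-"}

if URLExtract is not None:
    extractor = URLExtract()
    stop_left = extractor.get_stop_chars_left()
    extractor.set_stop_chars_left(stop_left | EXTRA_STOP_CHARS)
    print(extractor.find_urls("Visit our website:www.example.com"))
```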

Do you think it would help in your case?

@nicolasassi

nicolasassi commented Jun 16, 2021

@lipoja this is a tricky problem indeed... In my opinion, pre-processing the text or using set_stop_chars_left() kind of defeats the purpose of the lib. What do you think?

I'm going to put some thought into this problem... maybe we can find a way to fix it in the parser.

@lipoja
Owner

lipoja commented Jun 16, 2021

I look at this library as a tool that returns as many domains as it finds, even when they are "wrong", leaving some post-processing to the user.

We are trying to cover all general issues without limiting the number of returned results. Of course, the results should be correct where possible, but I would rather return a domain that contains other text (e.g. website+www.example.com) than return nothing at all. That way users can at least see what was found and tune their parser or do some filtering.
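A sketch of that filtering idea (`clean` is a hypothetical helper, not part of urlextract): trim everything before the first scheme or "www." marker in each candidate.

```python
def clean(candidate: str) -> str:
    """Trim leading text that got glued to a candidate URL
    (e.g. "website+www.example.com"). The marker list is
    deliberately minimal; candidates without a known marker
    are returned unchanged."""
    for marker in ("https://", "http://", "www."):
        idx = candidate.find(marker)
        if idx > 0:
            return candidate[idx:]
    return candidate

print(clean("website+www.example.com"))  # -> www.example.com
print(clean(".example.com"))             # unchanged; no known marker
```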

But I would really appreciate any help in any form (discussion on some ideas, PRs, .... ).

Thank you!
