Doesn't check for valid termination #13

ankitxjoshi · 2018-04-09T05:55:04Z

For the following input:

    from urlextract import URLExtract

    extractor = URLExtract()
    text = """
    http://httpbin.org/status/204, http://httpbin.org/status/204.
    """
    urls = extractor.find_urls(text)
    print(urls)

The generated output is:

    ['http://httpbin.org/status/204,', 'http://httpbin.org/status/204.']

The characters in the set [.,?!-] are not valid terminal symbols for a URL, so the extractor should check for them and leave them out of the result.
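
To make the expected result concrete, here is a minimal sketch of the post-processing being asked for. The helper `strip_trailing_punctuation` and the constant `TRAILING_PUNCTUATION` are hypothetical names used for illustration; they are not part of urlextract, and the sketch assumes the five characters listed above are never wanted at the end of a URL.

```python
# Hypothetical post-processing helper; not part of urlextract.
TRAILING_PUNCTUATION = ".,?!-"


def strip_trailing_punctuation(urls, chars=TRAILING_PUNCTUATION):
    """Return the URLs with any run of trailing `chars` removed."""
    return [url.rstrip(chars) for url in urls]


# The list reported in this issue.
found = ['http://httpbin.org/status/204,', 'http://httpbin.org/status/204.']
print(strip_trailing_punctuation(found))
# ['http://httpbin.org/status/204', 'http://httpbin.org/status/204']
```

Stripping after extraction is only a workaround: it cannot tell sentence punctuation apart from a character that really belongs to the URL.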


lipoja · 2018-04-09T06:11:17Z

Hi, thank you for this note. I will think about it.
My first thought was that those characters can appear at the end of a URL if the URL has a query part:
http://example.com/status?bracket=[
Or am I wrong?
Strictly, a valid URL should probably encode such characters with % notation, but this human-readable form of a URL can appear in any text.

So this means adding more logic and checking for these end characters only if the URL does not have a query part.
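
The rule described here could be sketched roughly as follows. This is only an illustration under stated assumptions: `trim_unless_query` and `END_CHARS` are hypothetical names, the query part is detected with the standard library's `urllib.parse.urlsplit`, and nothing below reflects how urlextract is actually implemented.

```python
from urllib.parse import urlsplit

# Characters treated as sentence punctuation (assumption taken from this issue).
END_CHARS = ".,?!-"


def trim_unless_query(url, chars=END_CHARS):
    """Strip trailing `chars` only when the URL has no query part."""
    if urlsplit(url).query:
        # A non-empty query may legitimately end in one of these characters.
        return url
    return url.rstrip(chars)


print(trim_unless_query("http://httpbin.org/status/204,"))
print(trim_unless_query("http://example.com/search?q=what?"))
```

With this rule the first URL loses its trailing comma, while the second is returned unchanged because it has a non-empty query.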

ankitxjoshi · 2018-04-09T06:14:20Z

Oh, the [] were just meant to enclose the characters. The invalid symbols are ".,?!-" (quotes not included). Sorry for the misunderstanding. This is based on my own research and could be wrong 😅

lipoja · 2018-04-09T06:16:18Z

OK, thanks. I will look into all the issues you reported.

lipoja · 2018-08-29T18:56:11Z

Hi @MacBox7,
sorry for such a long delay :(
Could you please help me with this issue? Especially with the part where it is defined that those symbols cannot be used as termination characters. I've read RFC 3986 and I don't think it is specified there. Maybe I missed something?

From the example above, I still cannot say whether the ',' or '.' characters should or should not be part of the URL.

Thank you!
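
For reference, the character classes in RFC 3986 can be written down and checked directly. The sketch below transcribes `unreserved` and `sub-delims` (sections 2.3 and 2.2) and the `pchar` rule (section 3.3), leaving out percent-encoded triplets for simplicity; the constant names are chosen here for illustration.

```python
import string

# Character classes from RFC 3986, sections 2.3 and 2.2.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
SUB_DELIMS = set("!$&'()*+,;=")

# pchar (section 3.3), ignoring percent-encoded triplets.
PCHAR = UNRESERVED | SUB_DELIMS | set(":@")

# Query and fragment (sections 3.4 and 3.5) additionally allow "/" and "?".
QUERY_CHARS = PCHAR | set("/?")

for ch in ".,?!-":
    print(repr(ch), "path:", ch in PCHAR, "query:", ch in QUERY_CHARS)
```

Under this grammar `.`, `,`, `!` and `-` are allowed in a path segment, and all five characters are allowed in a query, so the RFC itself does not forbid any of them as the last character of a URL. Treating them as terminators is therefore a heuristic about prose, not a rule of the URL syntax.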

karlicoss · 2019-02-26T22:11:47Z

In the meantime, one can hack around this (at least for commas) via

    from urlextract import URLExtract

    u = URLExtract()
    # Underscore-prefixed attributes are private by convention.
    u._stop_chars_right |= {','}
    u._stop_chars_left |= {','}

Perhaps a sensible default would be to treat unconventional special characters as forbidden in URLs, and to add a nicer constructor argument so that anyone who really wants them in URLs can configure that?
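
One possible shape for such a configurable default is sketched below. Everything here is hypothetical: `TrimmingExtractor` and its `forbidden_end_chars` argument do not exist in urlextract, and `naive_find_urls` is only a stand-in finder so that the sketch runs without the library installed.

```python
class TrimmingExtractor:
    """Hypothetical wrapper; `forbidden_end_chars` is an illustrative name."""

    def __init__(self, find_urls, forbidden_end_chars=".,?!-"):
        self._find_urls = find_urls
        self._forbidden_end_chars = forbidden_end_chars

    def find_urls(self, text):
        # Trim forbidden characters from the end of every URL found.
        return [url.rstrip(self._forbidden_end_chars)
                for url in self._find_urls(text)]


def naive_find_urls(text):
    """Stand-in finder: every whitespace-separated word starting with 'http'."""
    return [word for word in text.split() if word.startswith("http")]


text = "http://httpbin.org/status/204, http://httpbin.org/status/204."

strict = TrimmingExtractor(naive_find_urls)
lenient = TrimmingExtractor(naive_find_urls, forbidden_end_chars="")

print(strict.find_urls(text))
# ['http://httpbin.org/status/204', 'http://httpbin.org/status/204']
print(lenient.find_urls(text))
# ['http://httpbin.org/status/204,', 'http://httpbin.org/status/204.']
```

Passing an empty string keeps the current behaviour, so callers who really want those characters in their URLs can opt out of the stricter default.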

The URL regex pattern was introduced in commit e39e5ee. For extracting URLs from `Email` content, the pattern performs better than URLExtract because the syntax for links is well defined (.md syntax), whereas URLExtract has problems with termination; see lipoja/URLExtract#13.

ankitxjoshi mentioned this issue Apr 9, 2018

URLBear: use library to extract links coala/coala-bears#1342

Open

lipoja self-assigned this Apr 9, 2018

lipoja added this to the 0.9 milestone Apr 22, 2018

lipoja removed this from the 0.9 milestone Jan 24, 2019

karlicoss mentioned this issue Feb 26, 2019

Fix some issues: comma detection, nondeterministic behaviour #34

Closed

vezeli mentioned this issue Dec 1, 2019

Improving tests urlstechie/urlchecker-action#2

Closed

jayvdb mentioned this issue Apr 5, 2020

pypidb issues #68

Open

lipoja added the bug label Apr 11, 2020
