Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Doesn't checks for valid termination #13

Open
ankitxjoshi opened this issue Apr 9, 2018 · 5 comments
Open

Doesn't checks for valid termination #13

ankitxjoshi opened this issue Apr 9, 2018 · 5 comments
Assignees
Labels

Comments

@ankitxjoshi
Copy link

ankitxjoshi commented Apr 9, 2018

For the following input:

from urlextract import URLExtract

extractor = URLExtract()
text="""
http://httpbin.org/status/204, http://httpbin.org/status/204.
"""
urls = extractor.find_urls(text)
print(urls)

The output generated is:
['http://httpbin.org/status/204,', 'http://httpbin.org/status/204.']

The set [.,?!-] are not valid terminal symbols for the url and thus should be checked.

@lipoja
Copy link
Owner

lipoja commented Apr 9, 2018

Hi thank you for this note. I will think about it.
First what I thought is that those characters can be at the end of URL if the URL has query part.
http://example.com/status?bracket=[
Or am I wrong?
Maybe valid URL should be encoded with % notation but this human readable form of URL you can find in any text.

So it means add more logic and check for these end characters only if the URL does not have query part.

@ankitxjoshi
Copy link
Author

ankitxjoshi commented Apr 9, 2018

Oh [] were just meant to enclose the characters. The invalid symbols are ".,?!-" (Quotes not included). Sorry for the misunderstanding. This is as per my research done. Could be wrong 😅

@lipoja
Copy link
Owner

lipoja commented Apr 9, 2018

OK, thanks. I will look to all your reported issues.

@lipoja lipoja added this to the 0.9 milestone Apr 22, 2018
@lipoja
Copy link
Owner

lipoja commented Aug 29, 2018

Hi @MacBox7,
sorry for such a big delay :(
Could you please help me wit this issue. Especially with the part where it is defined that I can not use those symbols as termination characters? I've read the RFC3986 and I think it is not there specified. Maybe I missed something?

I think that I am still not able to say, from the example above if ',' or '.' characters should or should not be part of the URL.

Thank you!

@lipoja lipoja removed this from the 0.9 milestone Jan 24, 2019
@karlicoss
Copy link
Contributor

karlicoss commented Feb 26, 2019

In the meantime, one can hack (at least commas) via

    u = URLExtract()
    u._stop_chars_right |= {','}
    u._stop_chars_left  |= {','}

Perhaps sensible default is treating unconventional special characters as forbidden in url and adding a nicer constructor argument to allow to configure that if someone really wants them in URL?

vezeli added a commit to vezeli/maillinter that referenced this issue Sep 24, 2019
URL regex pattern is introduced in e39e5ee commit. For extracting URLs
from `Email` content, the pattern preforms better than the URLExtract
because the syntax for links is well defined (.md syntax) and URLExtract
has problems with termination,
see lipoja/URLExtract#13.
@jayvdb jayvdb mentioned this issue Apr 5, 2020
@lipoja lipoja added the bug label Apr 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants