Fixed Issue #43 #47

Closed
wants to merge 12 commits into from


gleb-shnshn

I have added a check for commas and a test for this case.

@lipoja
Owner

lipoja commented Oct 26, 2019

Please take a look at the patch. It probably broke a few things, because almost all tests failed.

I am also not confident about this patch. Please have a look at RFC 3986. If I read it correctly, a comma is a valid character in a hostname. I am still thinking about cases that might be correct but that we would filter out.

What is your opinion about this?

@gleb-shnshn
Author

Sorry, I hurried a little bit. I see why it failed, so I'll try to fix it soon.
I also see that my solution is not quite correct, because commas are used in the part of the URL after the host.
I mean the part after the /, as in http://www.sample.com/forum/read.php?13,35869 .
I haven't seen any other way a comma is used in a URL.

Hence, I think we need to split the URL into two parts: before the first slash and after it. The address is not valid if the first part contains commas.
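
The split proposed here could be sketched roughly as follows. This is only an illustration, not the code from the patch: the helper name `host_part_has_comma` is hypothetical, and the sketch assumes a scheme, if present, is separated by `://`. It does not treat a `user:password@` part specially.

```python
def host_part_has_comma(candidate: str) -> bool:
    """Return True if the part before the first path slash contains a comma."""
    # Drop a scheme such as "http://" so its slashes are not mistaken for the path.
    _, sep, rest = candidate.partition("://")
    if not sep:
        rest = candidate
    # Everything before the first remaining slash is treated as the "first part".
    before_slash, _, _ = rest.partition("/")
    return "," in before_slash


print(host_part_has_comma("http://www.sample.com/forum/read.php?13,35869"))  # False
print(host_part_has_comma("1,420.00.zip"))  # True
```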

@lipoja
Owner

lipoja commented Oct 26, 2019

I am not sure about commas in the host part of the URL either.
From the RFC:

host         = IP-literal / IPv4address / reg-name
reg-name     = *( unreserved / pct-encoded / sub-delims )
sub-delims   = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

So this problem is not as easy as it looks at first sight.
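
To make the grammar concrete, here is a small sketch that turns the `reg-name` rule into a regular expression. The `sub-delims` set is the one quoted above; `unreserved` (letters, digits, `-`, `.`, `_`, `~`) and `pct-encoded` (`%` plus two hex digits) come from the same RFC. The name `is_reg_name` is made up for this illustration and is not part of URLExtract.

```python
import re

# unreserved and pct-encoded follow RFC 3986; sub-delims is the rule quoted above.
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
REG_NAME = re.compile(rf"(?:[{UNRESERVED}{SUB_DELIMS}]|%[0-9A-Fa-f]{{2}})*")


def is_reg_name(text: str) -> bool:
    """Return True if text matches the reg-name production as a whole."""
    return REG_NAME.fullmatch(text) is not None


print(is_reg_name("sub,domain.example.com"))  # True: "," is a sub-delim
print(is_reg_name("example.com/path"))        # False: "/" is not allowed in reg-name
```

Note that this only checks the generic URI syntax; it says nothing about whether such a name could actually be registered or resolved.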

@gleb-shnshn
Author

I have never seen a URL with a comma in the host part, and browsers don't recognize a comma there as part of the URL. So even if it is possible, it is too rare to be worth handling in the extractor.

@lipoja
Owner

lipoja commented Oct 28, 2019

Should we also somehow consider inputs like this:

This is text with URL right after comma,subdomain.example.com

In this case it would be nice if we could return subdomain.example.com instead of just skipping it.

I went through the code to remind myself how it is extracted. I already do not support commas in the domain name, but subdomains are a little bit different. I agree with you that a comma should not be in the domain name, but what about subdomains? Should we treat them like the domain name and not allow commas?

Or what about ftp://login:pass,word@example.com ?
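
For this last input, the comma sits in the userinfo part rather than in the host. A quick way to see that is Python's standard `urllib.parse.urlsplit`; this shows the standard library's parsing only, not how URLExtract splits the string, and `comma_in_hostname` is a hypothetical helper for illustration.

```python
from urllib.parse import urlsplit


def comma_in_hostname(url: str) -> bool:
    """Return True only if the comma is in the hostname, ignoring userinfo."""
    hostname = urlsplit(url).hostname or ""
    return "," in hostname


parts = urlsplit("ftp://login:pass,word@example.com")
print(parts.username)  # login
print(parts.password)  # pass,word
print(parts.hostname)  # example.com
print(comma_in_hostname("ftp://login:pass,word@example.com"))  # False
```

A check that only looks for a comma before the first slash would reject this URL, while a check on the hostname alone would not.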

@gleb-shnshn
Author

Is there any example of a subdomain that contains a comma? I think 'ftp://login:pass,word@example.com' is valid input, and I'll try to patch it. But to fix 'This is text with URL right after comma,subdomain.example.com', the whole processing logic would have to be reconsidered, and I don't know how.

@lipoja
Owner

lipoja commented Mar 24, 2020

@gleb270 Hello, I am sorry for being such an unreachable maintainer. Could I ask you a favor? Would you be so kind as to resolve the conflicts? Thank you!

@lipoja lipoja added this to the 0.15.0 milestone Mar 24, 2020
@lipoja lipoja linked an issue Mar 24, 2020 that may be closed by this pull request
@@ -35,6 +35,9 @@ def test_extract_email_disabled(urlextract, text, expected):
("<email@address.net>",
['email@address.net']),

("email with comma ema,il@address.net",
[]),
Owner

I am not sure it should return an empty list.
I am thinking about returning something like: [il@address.net]

What is your opinion?

@@ -15,6 +15,12 @@

("Let's have text without URLs.",
[]),

("Comma in URL 1,420.00.zip",
[]),
Owner

Same here as in my comment on the email test.
I would prefer to get [420.00.zip] as the URL rather than an empty list.
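
The behaviour preferred in these review comments, keeping what follows the comma instead of returning an empty list, could look something like the sketch below. It is an assumption about one possible approach, not code from URLExtract or from this patch: the helper `trim_before_last_comma` is hypothetical, it only looks at the text before the first slash, and it does not handle a leading scheme.

```python
def trim_before_last_comma(candidate: str) -> str:
    """Keep only what follows the last comma in the part before the first slash."""
    host, slash, rest = candidate.partition("/")
    # rpartition returns the whole string as the last element when no comma exists.
    host = host.rpartition(",")[2]
    return host + slash + rest


print(trim_before_last_comma("1,420.00.zip"))                 # 420.00.zip
print(trim_before_last_comma("ema,il@address.net"))           # il@address.net
print(trim_before_last_comma("comma,subdomain.example.com"))  # subdomain.example.com
```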


# emails don't have schemes
if self.extract_email and not added_schema:
    return False
Owner

I would prefer not to throw away the whole domain.

@lipoja
Owner

lipoja commented Apr 8, 2020

Hi @gleb270, please have a look at my comments. Could we discuss this topic a little bit more? My point there is that I do not think filtering everything out is a good choice.

My view is to give users the ability to tune and tweak this library through settings. Therefore I have always aimed to extract more rather than less. The user can then process the extracted URLs and/or tune this library by setting stop characters to fit their needs.

@lipoja
Owner

lipoja commented Apr 8, 2020

Hi @jayvdb, since you are the heaviest contributor these days, I would like to know your opinion on this PR and the issue in general.

@lipoja lipoja removed this from the 0.15.0 milestone Apr 11, 2020
@jayvdb
Contributor

jayvdb commented Jun 20, 2020

This would make URLExtract unusable for my use case.
I expect URLExtract to give me more rather than less. I can post-process for validity based on the application's needs. URLExtract loses its value if I need to add my own extraction for potential hits that URLExtract omits.

@lipoja
Copy link
Owner

lipoja commented Jan 5, 2021

I am closing this PR since there is no progress, and right now it is not an ideal solution and may introduce issues.

@lipoja lipoja closed this Jan 5, 2021

Successfully merging this pull request may close these issues.

Filename extracted as URL
3 participants