Fixed Issue #43 #47
Please take a look at the patch. It probably broke a few things, because almost all tests failed, and I am not confident about this patch. Please have a look at RFC 3986. If I read it correctly, a comma is a valid character in a hostname. I am still thinking about cases that might be correct but that we would filter out. What is your opinion about this?
Sorry, I hurried a little bit. I see why it failed, so I'll try to fix it in time. Also, I think we need to split the URL into two parts, before the first slash and after it. The address is not valid if the first part contains commas.
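The split described in this comment can be sketched roughly as follows. This is only an illustration, not the code from the patch: the helper name `host_part_has_comma` is hypothetical, and it assumes the candidate is a plain string with an optional `scheme://` prefix.

```python
def host_part_has_comma(candidate: str) -> bool:
    """Return True if the part before the first path slash contains a comma."""
    # Strip an optional "scheme://" prefix so its slashes are not mistaken
    # for the start of the path. A "://" that appears after a slash belongs
    # to the path or query, so it is left alone.
    scheme, sep, rest = candidate.partition("://")
    if sep and "/" not in scheme:
        candidate = rest
    # Everything before the first remaining slash is treated as the host part.
    host_part, _, _ = candidate.partition("/")
    return "," in host_part


print(host_part_has_comma("exa,mple.com/path"))       # comma before the slash
print(host_part_has_comma("http://example.com/a,b"))  # comma only in the path
```

Note that such a simple split also flags a comma in the login part before `@`, because that part sits before the first slash as well.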
I am not sure about commas in the host part of a URL either.
So this problem is not as easy as it looks at first sight.
I haven't seen any URLs with commas in the host part before, and browsers don't recognize them as part of a URL, so even if they are possible, they appear too rarely to be worth handling in the extractor.
Should we somehow also consider inputs like this:
In this case it would be nice if we could return
I went through the code to remind myself how it is extracted, and it already does not support commas in the domain name. But subdomains are a little bit different. I agree with you that a comma should not be in the domain name, but what about subdomains? Should we treat them like the domain name and not allow commas? Or what about
Is there any example of subdomains that contain commas? I think 'ftp://login:pass,word@example.com' is a valid input, and I'll try to patch it, but to fix 'This is text with URL right after comma,subdomain.example.com' the whole processing logic would have to be reconsidered, and I don't know how.
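One way to tell a comma in the login part apart from a comma in the hostname is to let the standard library split the authority. The sketch below is an assumption about how this could be checked, not how URLExtract does it; the helper name `comma_in_hostname` is hypothetical, and it only works for inputs that carry a scheme followed by `//`.

```python
from urllib.parse import urlsplit


def comma_in_hostname(url: str) -> bool:
    """Return True only if the hostname itself contains a comma."""
    parts = urlsplit(url)
    # `hostname` excludes the "user:password@" prefix and any ":port" suffix.
    # It is None when the URL has no authority part.
    hostname = parts.hostname or ""
    return "," in hostname


print(comma_in_hostname("ftp://login:pass,word@example.com"))  # comma is in the password
print(comma_in_hostname("ftp://login:pass@exa,mple.com"))      # comma is in the hostname
```

The second example from the comment, a URL that directly follows a comma in running text, has no scheme and is not covered by this check.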
@gleb270 Hello, I am sorry for being such an unreachable maintainer. Could I ask you a favor? Would you be so kind as to resolve the conflicts? Thank you!
@@ -35,6 +35,9 @@ def test_extract_email_disabled(urlextract, text, expected):
    ("<email@address.net>",
     ['email@address.net']),

    ("email with comma ema,il@address.net",
     []),
I am not sure it should return an empty list.
I am thinking of returning something like ['il@address.net'].
What is your opinion?
@@ -15,6 +15,12 @@
    ("Let's have text without URLs.",
     []),

    ("Comma in URL 1,420.00.zip",
     []),
Same here as I commented on the email test.
I would prefer to get ['420.00.zip']
as the URL rather than an empty list.
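The behaviour the reviewer asks for, keeping what follows the comma instead of discarding the whole match, could look roughly like the sketch below. It is not taken from URLExtract; the helper name `trim_before_last_comma` is hypothetical, and it assumes a candidate without a scheme, as in the two test inputs quoted in the diffs.

```python
def trim_before_last_comma(candidate: str) -> str:
    """Drop everything up to and including the last comma in the host part."""
    # Only the part before the first slash is trimmed; commas in the path stay.
    host_part, slash, rest = candidate.partition("/")
    # rpartition returns the whole string as the last element when no comma exists.
    _, _, kept = host_part.rpartition(",")
    return kept + slash + rest


print(trim_before_last_comma("1,420.00.zip"))        # 420.00.zip
print(trim_before_last_comma("ema,il@address.net"))  # il@address.net
```

Whether the trimmed remainder is still a valid URL or email address would have to be checked separately.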
# emails don't have schemes
if self.extract_email and not added_schema:
    return False
I would prefer not to throw away the whole domain.
Hi @gleb270, please have a look at my comments. Could we discuss this topic a little bit more? What I mentioned there is that I do not think filtering everything out is a good choice. My point of view is to give users the ability to tune and tweak this library through settings. Therefore I have always aimed to extract more rather than less. The user can then process the extracted URLs and/or tune the library by setting stop characters to fit their needs.
Hi @jayvdb, since you are the heaviest contributor these days, I would like to know your opinion on this PR and the issue in general.
This would make URLExtract unusable for my use case.
I am closing this PR since there is no progress. Right now it is not an ideal solution and it may introduce issues.
I have added a check for commas and a test for this case.