Fixed Issue #43 #47

gleb-shnshn · 2019-10-26T11:04:38Z

I have added a check for commas and a test for this case.

lipoja · 2019-10-26T11:21:01Z

Please take a look at the patch. It probably broke a few things, because almost all the tests failed.

Also, I am not confident about this patch. Please have a look at RFC 3986. If I read it correctly, a comma is a valid character in a hostname. I am still thinking about cases that might be correct but that we would filter out.

What is your opinion about this?

gleb-shnshn · 2019-10-26T11:34:11Z

Sorry, I hurried a little bit. I understand why it failed, so I'll try to fix it soon.
I also realized that my solution is not quite correct, because commas are used in the part after the path separator.
I mean the part of the URL after the / , as in http://www.sample.com/forum/read.php?13,35869 .
I have not seen commas used anywhere else in a URL.

Hence, I think we need to split the URL into two parts: before the slash and after it. The address is not valid if the first part contains commas.
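
A minimal sketch of that split, assuming the candidate has already been isolated from the surrounding text. `host_part_has_comma` is a hypothetical helper for illustration, not code from the patch:

```python
def host_part_has_comma(candidate: str) -> bool:
    """Return True if the part before the first path slash contains a comma."""
    # Skip the scheme ("http://") if present, so that its slashes are not
    # mistaken for the path separator.
    _scheme, sep, rest = candidate.partition("://")
    if not sep:
        rest = candidate
    before_slash, _, _after_slash = rest.partition("/")
    return "," in before_slash
```

With this check, http://www.sample.com/forum/read.php?13,35869 passes because its comma sits after the slash, while 1,420.00.zip is flagged. Note that it also flags commas in credentials placed before the host.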

lipoja · 2019-10-26T16:52:20Z

I am not sure about commas in the host part of the URL either.
From the RFC:

host         = IP-literal / IPv4address / reg-name
reg-name     = *( unreserved / pct-encoded / sub-delims )
sub-delims   = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

So this problem is not as easy as it looks at first sight.
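
To make the grammar above concrete, here is a rough sketch that transcribes `reg-name` into a regular expression. The `unreserved` and `pct-encoded` rules are taken from RFC 3986; the names `REG_NAME` and `is_rfc3986_reg_name` are made up for this example and do not exist in URLExtract:

```python
import re

# unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
# pct-encoded = "%" HEXDIG HEXDIG
# sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
REG_NAME = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=]|%[0-9A-Fa-f]{2})*")


def is_rfc3986_reg_name(host: str) -> bool:
    """Return True if host matches the reg-name rule of RFC 3986."""
    return REG_NAME.fullmatch(host) is not None
```

Under this rule a host such as sub,domain.example.com is syntactically valid, which is why rejecting every comma could filter out input that the RFC permits.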

gleb-shnshn · 2019-10-27T10:04:22Z

I have never seen URLs with commas in the host part, and browsers don't recognize them as part of a URL. So even if they are possible, they are too rare to be worth handling in the extractor.

lipoja · 2019-10-28T08:29:04Z

Should we also somehow consider inputs like this:

This is text with URL right after comma,subdomain.example.com

In this case it would be nice if we could return subdomain.example.com and not just skip it.

I went through the code to remind myself how it is extracted. I already do not support commas in the domain name. But subdomains are a little bit different. I agree with you that a comma should not be in the domain name, but what about subdomains? Should we treat them like the domain name and not allow commas?

Or what about ftp://login:pass,word@example.com ?

gleb-shnshn · 2019-10-28T08:56:11Z

Is there any example of a subdomain that contains commas? I think 'ftp://login:pass,word@example.com' is valid input, and I'll try to patch it. But to fix 'This is text with URL right after comma,subdomain.example.com' the whole processing logic would have to be reconsidered, and I don't know how.

lipoja · 2020-03-24T21:11:01Z

@gleb270 Hello, I am sorry for being such an unreachable maintainer. Could I ask you a favor? Would you be so kind as to resolve the conflicts? Thank you!

lipoja · 2020-04-08T21:33:29Z

tests/unit/test_extract_email.py

@@ -35,6 +35,9 @@ def test_extract_email_disabled(urlextract, text, expected):
    ("<email@address.net>",
     ['email@address.net']),

+    ("email with comma ema,il@address.net",
+     []),


I am not sure it should return an empty list.
I am thinking about returning something like: [il@address.net]

What is your opinion?

lipoja · 2020-04-08T21:34:19Z

tests/unit/test_find_urls.py

@@ -15,6 +15,12 @@

    ("Let's have text without URLs.",
     []),
+
+    ("Comma in URL 1,420.00.zip",
+     []),


Same here as in my comment on the email test.
I would prefer to get [420.00.zip] as the URL rather than an empty list.
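
One way to picture that behaviour is to keep whatever follows the last comma in the host part instead of discarding the whole candidate. This is only a sketch for scheme-less candidates such as the ones in these tests; `trim_to_last_comma` is a hypothetical name, not a function in urlextract_core.py:

```python
def trim_to_last_comma(candidate: str) -> str:
    """Keep only what follows the last comma in the host part of a candidate."""
    # Assumes there is no scheme, so everything before the first "/" is the host.
    host, slash, rest = candidate.partition("/")
    # rpartition yields the text after the last comma, or the whole host
    # unchanged when it contains no comma.
    host = host.rpartition(",")[2]
    return host + slash + rest
```

Applied to the inputs from this thread, 1,420.00.zip becomes 420.00.zip, ema,il@address.net becomes il@address.net, and comma,subdomain.example.com becomes subdomain.example.com. Commas after the first slash are left alone.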

lipoja · 2020-04-08T21:38:02Z

urlextract/urlextract_core.py

+
+        # emails don't have schemes 
+        if self.extract_email and not added_schema:
+            return False


I would prefer not to throw away the whole domain.

lipoja · 2020-04-08T21:40:20Z

Hi @gleb270, please have a look at my comments. Could we discuss this topic a little bit more? My point there is that I do not think filtering everything out is a good choice.

My point of view is to give users the ability to tune and tweak this library through settings. Therefore I have always aimed to extract more rather than less. The user can then process the extracted URLs and/or tune the library by setting stop characters to fit their needs.

lipoja · 2020-04-08T21:43:31Z

Hi @jayvdb, since you are the heaviest contributor these days, I would like to know your opinion on this PR and on the issue in general.

jayvdb · 2020-06-20T23:51:02Z

This would make URLExtract unusable for my use case.
I expect URLExtract to give me more rather than less. I can post-process for validity based on the application's needs. URLExtract loses its value if I need to add my own extraction for potential hits that URLExtract omits.
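
As an illustration of that kind of post-processing, the sketch below filters a list of already extracted candidates on the application side. It uses only the standard library; `keep_plausible` is a hypothetical helper, and the rule it applies (drop a candidate only when its hostname contains a comma) is just one example of an application-specific policy:

```python
from urllib.parse import urlsplit


def keep_plausible(candidates):
    """Drop candidates whose hostname contains a comma; keep everything else."""
    kept = []
    for candidate in candidates:
        # urlsplit only recognises a network location after "//",
        # so prepend it for scheme-less input.
        parts = urlsplit(candidate if "://" in candidate else "//" + candidate)
        if "," not in (parts.hostname or ""):
            kept.append(candidate)
    return kept
```

Because the check looks at the hostname only, ftp://login:pass,word@example.com and http://www.sample.com/forum/read.php?13,35869 survive, while 1,420.00.zip is dropped. A different application could just as easily keep it.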

lipoja · 2021-01-05T08:50:06Z

I am closing this PR since there is no progress. Right now it is not an ideal solution, and it may introduce issues.

gleb-shnshn added 2 commits October 26, 2019 18:00

fix commas and other punctuation

e0ca25f

fix test

273e2b6

gleb-shnshn mentioned this pull request Oct 26, 2019

Filename extracted as URL #43

Open

gleb-shnshn and others added 4 commits October 26, 2019 20:36

changing re

305ab34

add email compatability

4b58bf5

delete debug lines

9aa0b4c

fix processing credentials

be0b5a2

gleb-shnshn added 5 commits October 28, 2019 16:09

fix processing of credentials

b8cd8b2

fix of fix

f1a4d93

restructure checking

369bb5b

fix ipv4 processing

3d2be2f

fix none errors

37ee6ad

lipoja added this to the 0.15.0 milestone Mar 24, 2020

lipoja linked an issue Mar 24, 2020 that may be closed by this pull request

Filename extracted as URL #43

Open

Merge branch 'master' into master

07e2cd1

lipoja requested changes Apr 8, 2020

lipoja removed this from the 0.15.0 milestone Apr 11, 2020

lipoja closed this Jan 5, 2021
