Filename extracted as URL #43

Larrax · 2019-06-24T07:52:16Z

From the following input (which is a legit archive filename):

PAYMENT EUR 1,420.00.zip

this URL is extracted using find_urls:

1,420.00.zip

voldmar · 2019-10-07T11:15:42Z

.zip is a valid TLD according to the Public Suffix List. But the comma should not be part of the domain, I guess.
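
To illustrate the point about the comma, here is a minimal sketch of a character check on a candidate domain. It is not the library's code and not the fix proposed in #47; the function name looks_like_hostname and the label rule (letters, digits and hyphens only, no leading or trailing hyphen) are assumptions made for this example.

```python
import re

# Assumed label rule: letters, digits and hyphens, 1-63 characters,
# with no hyphen at the start or end of the label.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def looks_like_hostname(candidate: str) -> bool:
    """Return True if every dot-separated label uses only hostname characters.

    Hypothetical helper for illustration; not part of any library.
    """
    labels = candidate.split(".")
    if len(labels) < 2:
        return False
    return all(_LABEL.fullmatch(label) is not None for label in labels)


print(looks_like_hostname("1,420.00.zip"))  # False: the label "1,420" contains a comma
print(looks_like_hostname("example.zip"))   # True
```

A check like this only rejects the candidate as a whole. It does not settle whether the text after the comma (420.00.zip) should still be reported, which is a separate design decision.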

gleb-shnshn · 2019-10-26T11:05:28Z

#47 fixed

lipoja · 2019-10-26T11:23:36Z

First of all, thank you for the time you spent working on the patch. However, I am not quite sure about the fix; please have a look at my comment: #47 (comment)

Let's discuss this topic a bit, maybe we can agree on something.
Thank you!

andreys42 · 2024-05-07T12:21:06Z

As a possible improvement: you could add a probability to the extraction algorithm. For example, to decrease the false positive rate, you could use the frequency of TLD usage as an attribute (weight), so that the probability of a detected domain with a rare TLD (for example Apple.Inc) would be lower than that of one using a popular TLD (Apple.com, Apple.net and so on).
Users could then choose the threshold for this probability value that best fits their purposes.
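
The weighting idea above can be sketched roughly as follows. This is only an illustration of the suggestion, not an existing feature: the weights in TLD_WEIGHT are made-up placeholder numbers rather than measured TLD frequencies, and the names tld_score and filter_by_threshold are hypothetical.

```python
# Placeholder weights for illustration only; these are NOT measured frequencies.
TLD_WEIGHT = {"com": 0.95, "net": 0.80, "zip": 0.10, "inc": 0.05}
DEFAULT_WEIGHT = 0.30  # assumed score for TLDs missing from the table


def tld_score(candidate: str) -> float:
    """Score a candidate by the weight of its last dot-separated label."""
    tld = candidate.rsplit(".", 1)[-1].lower()
    return TLD_WEIGHT.get(tld, DEFAULT_WEIGHT)


def filter_by_threshold(candidates, threshold=0.5):
    """Keep only candidates whose TLD score reaches the user-chosen threshold."""
    return [c for c in candidates if tld_score(c) >= threshold]


print(filter_by_threshold(["Apple.com", "Apple.Inc", "1,420.00.zip"]))
# ['Apple.com']
```

With real frequency data the threshold would trade recall for precision: a low value keeps rare TLDs such as .zip, a high value drops them along with filename-like false positives.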

lipoja linked a pull request Mar 24, 2020 that will close this issue

Fixed Issue #43 #47

Closed

jayvdb mentioned this issue Apr 5, 2020

pypidb issues #68

Open

lipoja added the low label Apr 11, 2020
