pypidb issues #68

jayvdb · 2020-04-05T01:28:00Z

Continuing from #63, these are the known issues (the list will grow).

As a general rule, the higher-priority issues are those where urlextract doesn't extract valuable URLs, or extracts truncated URLs. Returning extra junk around URLs, or extra URLs, is problematic, but I can trim or remove junk. I can't fix data I don't have.

HTML character references &foo; are cut at the semi-colon #62 (high priority bug)
VCS/Git remote URLs not found #67 (medium priority enhancement)
Getting wrong URL when there is dot before url #36 (medium priority bug)
Filename extracted as URL #43 (annoying)
Doesn't checks for valid termination #13 (bug, e.g. https://pypi.org/project/ebcdic/ )
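
Until #62 is fixed, one possible caller-side workaround is to decode character references before the text reaches the extractor, so that no `&name;` sequence is left inside a URL. A minimal sketch using only the standard library; `decode_references` is a hypothetical helper name, not part of urlextract or pypidb:

```python
import html


def decode_references(text: str) -> str:
    """Decode HTML character references such as '&amp;' before URL extraction."""
    return html.unescape(text)


sample = "see https://example.com/?a=1&amp;b=2 for details"
decoded = decode_references(sample)
```

This only makes sense when the input really is HTML-escaped; unescaping text that was never escaped can alter query strings that happen to look like references.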

Others I think are harder and may not be in urlextract scope:

Lots of annoying invalid .py domains have to be filtered out by DNS checking, such as setup.py, which is assumed to be https://setup.py, and likewise https://manifest.py, etc. This is a significant performance problem for the first few requests, as they are negative DNS lookups which need to get cached, and they also slow down urlextract. Other country-code TLDs occasionally coincide with file extensions too, such as https://manifest.in/ and http://readme.md/. This could be handled in dns_cache by seeding the DNS cache with known invalid entries. urlextract could help with domain name filtering.
Relative URLs (jayvdb/pypidb#38). This would be a huge enhancement to URLExtract, but requires adding a completely different extraction algorithm.
DoS/Maximum results #69
e.target is really common, appearing in <script> blocks (e.g. http://docs.red-dove.com/cfg/python.html, reached via https://pypi.org/project/config), but I am not sure it would be useful to exclude URLs found in script tags.
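
For the filename-like domains in the first item of this list, a cheap filter could reject candidates before any DNS lookup is attempted. The sketch below is only an illustration: `FILENAME_LIKE_HOSTS` and `is_filename_like` are hypothetical names, and the set contains just the examples mentioned in this issue rather than a complete list.

```python
from urllib.parse import urlsplit

# Assumed deny-list: file names whose extension is also a real TLD (.py, .in, .md).
FILENAME_LIKE_HOSTS = frozenset({"setup.py", "manifest.py", "manifest.in", "readme.md"})


def is_filename_like(candidate: str) -> bool:
    """Return True if the candidate's host is a known file name, not a real site."""
    # urlsplit only fills in the host when a '//' authority marker is present,
    # so add one for bare candidates such as 'setup.py'.
    if "://" not in candidate:
        candidate = "//" + candidate
    host = urlsplit(candidate).hostname or ""
    return host in FILENAME_LIKE_HOSTS


candidates = [
    "setup.py",
    "https://manifest.in/",
    "http://readme.md/",
    "https://pypi.org/project/ebcdic/",
]
kept = [c for c in candidates if not is_filename_like(c)]
```

The same set could equally be used to seed a DNS cache with negative entries, as suggested above; how that is done depends on the cache implementation.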

{{ in URL (pydevd-pycharm):

DEBUG    pypidb._pypi:_pypi.py:313 processing Webpage: https://ci.appveyor.com/project/fabioz/pydev-debugger
DEBUG    pypidb._pypi:_pypi.py:379 @@ ran <function _url_extractor_wrapper at 0x7f03e2f1b5e0> on text size 7901 for 8 urls !!
DEBUG    pypidb._pypi:_pypi.py:384 extracted ['account.name', 'https://help.appveyor.com/', 'https://js.stripe.com/v2/', 'https://status.appveyor.com/', 'https://www.appveyor.com/docs/', 'https://www.appveyor.com/docs/server/', 'https://www.appveyor.com/updates/', 'https://www.gravatar.com/avatar/{{Session.user().gravatarHash}}?d=https%3a%2f%2fci.appveyor.com%2fassets%2fimages%2fuser.png&s=40']
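
One way to screen out results like the gravatar URL in this log is to reject any candidate that still contains template delimiters. A minimal sketch; `has_template_placeholder` is a hypothetical helper, not an existing pypidb or urlextract function, and the sample list is a subset of the log output above:

```python
def has_template_placeholder(url: str) -> bool:
    """Return True if the URL still contains unrendered '{{ ... }}' template markup."""
    return "{{" in url or "}}" in url


extracted = [
    "account.name",
    "https://help.appveyor.com/",
    "https://www.gravatar.com/avatar/{{Session.user().gravatarHash}}"
    "?d=https%3a%2f%2fci.appveyor.com%2fassets%2fimages%2fuser.png&s=40",
]
usable = [u for u in extracted if not has_template_placeholder(u)]
```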

Backticks are not trimmed; this is related to "Doesn't checks for valid termination" #13:

'git://github.com/ingydotnet/package-py.git``'

so I use

_scm_url_cleaner.py:                repo = repo.strip("`")
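
Wrapped as a function, that cleanup might look like the sketch below. `clean_repo_url` is a hypothetical name; only the `strip` call comes from `_scm_url_cleaner.py`.

```python
def clean_repo_url(repo: str) -> str:
    """Remove backticks left around a URL by reStructuredText or Markdown markup."""
    return repo.strip("`")


cleaned = clean_repo_url("git://github.com/ingydotnet/package-py.git``")
```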

Related to lipoja/URLExtract#68

jayvdb mentioned this issue Apr 5, 2020

dns_cache #63

Open

jayvdb added a commit to jayvdb/pypidb that referenced this issue Apr 6, 2020

SCM picker: Add three URLExtract issue links

9d3b0f1

Related to lipoja/URLExtract#68

jayvdb added a commit to jayvdb/pypidb that referenced this issue Apr 7, 2020

SCM picker: Add three URLExtract issue links

18af2b3

Related to lipoja/URLExtract#68

jayvdb added a commit to jayvdb/pypidb that referenced this issue Apr 7, 2020

SCM picker: Add three URLExtract issue links

4754488

Related to lipoja/URLExtract#68
