pypidb issues #68

Open
jayvdb opened this issue Apr 5, 2020 · 0 comments

jayvdb commented Apr 5, 2020

Continuing from #63, these are the known issues (the list will grow).

As a general rule, the higher-priority issues are the ones where urlextract doesn't extract valuable URLs, or extracts truncated URLs. Returning extra junk around URLs, or extra URLs, is problematic, but I can trim or remove junk. I can't fix data I don't have.

Other issues, which I think are harder and may not be in urlextract's scope:

  • Lots of annoying invalid .py domains are only filtered out by DNS checking: file names such as setup.py and manifest.py are assumed to be https://setup.py, https://manifest.py, etc. This is a significant performance problem for the first few requests, because they are negative DNS lookups that need to get cached, and they slow down urlextract too. Other country-code TLDs also occasionally coincide with file extensions, giving https://manifest.in/ and http://readme.md/. This could be handled in dns_cache by seeding the DNS cache with known invalid entries, but urlextract could also help with domain name filtering.
  • Relative URLs: jayvdb/pypidb#38. This would be a huge enhancement to URLExtract, but requires adding a completely different extraction algorithm.
  • DOS/Maximum results: #69
  • e.target is really common, appearing in <script> blocks (for example on http://docs.red-dove.com/cfg/python.html, reached via https://pypi.org/project/config), but I am not sure it would be useful to exclude URLs found in script tags.
  • {{ in URL; seen with pydevd-pycharm:
    DEBUG    pypidb._pypi:_pypi.py:313 processing Webpage: https://ci.appveyor.com/project/fabioz/pydev-debugger
    DEBUG    pypidb._pypi:_pypi.py:379 @@ ran <function _url_extractor_wrapper at 0x7f03e2f1b5e0> on text size 7901 for 8 urls !!
    DEBUG    pypidb._pypi:_pypi.py:384 extracted ['account.name', 'https://help.appveyor.com/', 'https://js.stripe.com/v2/', 'https://status.appveyor.com/', 'https://www.appveyor.com/docs/', 'https://www.appveyor.com/docs/server/', 'https://www.appveyor.com/updates/', 'https://www.gravatar.com/avatar/{{Session.user().gravatarHash}}?d=https%3a%2f%2fci.appveyor.com%2fassets%2fimages%2fuser.png&s=40']
    
  • Backticks are not trimmed; related to "Doesn't checks for valid termination" #13:
    'git://github.com/ingydotnet/package-py.git``' 
    
    so I use
    _scm_url_cleaner.py:                repo = repo.strip("`")
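
As a rough illustration of the kind of post-filtering described in the list above, here is a minimal sketch. It is not pypidb's or urlextract's actual code: the function name `clean_extracted` and the `FILENAME_HOSTS` set are hypothetical, and the set only contains the examples mentioned in this issue. It strips stray backticks, drops URLs that still contain an unrendered `{{` template placeholder, and drops candidates whose host is really a file name.

```python
from urllib.parse import urlsplit

# Hypothetical seed list: file names that look like domains because their
# extension is also a country-code TLD (.py, .in, .md).
FILENAME_HOSTS = {"setup.py", "manifest.py", "manifest.in", "readme.md"}


def clean_extracted(urls):
    """Post-filter a list of extracted URL strings; returns a new list."""
    cleaned = []
    for url in urls:
        # Trim reST/Markdown backticks, e.g. 'package-py.git``'.
        url = url.strip("`")
        # Drop unrendered template URLs such as '.../avatar/{{...}}'.
        if "{{" in url:
            continue
        parts = urlsplit(url)
        if not parts.netloc:
            # Bare candidates like 'setup.py' have no '//' authority part,
            # so re-parse with one to get the host name.
            parts = urlsplit("//" + url)
        host = (parts.hostname or "").lower()
        # Skip file names mistaken for hosts without any DNS lookup.
        if host in FILENAME_HOSTS:
            continue
        cleaned.append(url)
    return cleaned
```

A filter like this avoids the DNS round trip for the known-bad names, but it cannot recover URLs that were never extracted or were truncated, which is why those remain the higher-priority problems.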
    
@jayvdb jayvdb mentioned this issue Apr 5, 2020
jayvdb added a commit to jayvdb/pypidb that referenced this issue Apr 6, 2020
jayvdb added a commit to jayvdb/pypidb that referenced this issue Apr 7, 2020
jayvdb added a commit to jayvdb/pypidb that referenced this issue Apr 7, 2020