HTML character references &foo; are cut at the semi-colon #62

Open
jayvdb opened this issue Feb 27, 2020 · 2 comments
jayvdb commented Feb 27, 2020

A URL containing an XML entity/HTML character reference, such as http://.../..?foo&bar;baz, will be cut at the semi-colon.
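One possible workaround, sketched under the assumption that the caller knows the input is HTML, is to decode character references before the text reaches the extractor. This uses only the standard library's html.unescape; the sample string and variable names are made up for illustration and are not part of this project.

```python
import html

# Hypothetical HTML-escaped input: the "&" in the query string is written as "&amp;".
raw = "See http://example.com/path?foo&amp;bar=baz for details"

# Decode character references first, so the URL no longer contains
# a semi-colon that an extractor could mistake for the end of the URL.
decoded = html.unescape(raw)

# Numeric references such as &#39; are decoded the same way.
apostrophe = html.unescape("it&#39;s")
```

After decoding, the text can be passed to the URL extractor as usual. This only helps when the input really is HTML or XML; decoding plain text this way could alter URLs that legitimately contain such sequences.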

@jayvdb jayvdb mentioned this issue Apr 5, 2020

jayvdb commented Apr 5, 2020

Another is &#39; in https://docs.red-dove.com/cfg/python.html, which results in a 404 at https://freeotp.github.io/&#39 (the ; is omitted, but it is the & which causes the 404, as it doesn't follow a ?)

@lipoja lipoja added this to the 0.15.0 milestone Apr 11, 2020

jayvdb commented Apr 13, 2020

It might be useful to have the caller tell the parser what type of text is being provided, such as html, xml, md, or rst. That would give the parser clues when it is trying to find the start and end of URLs, and tell it what decoding to perform on each URL.

Or have a hook in the class which is given the location of the hostname, so the hook can decide the start and end of the url which surrounds the hostname. Then I could override the URLExtract class several times to implement this hook for various doctypes. _complete_url almost does this, but it would need to be a public member of the API.
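A rough sketch of what such a hook could look like, to make the idea concrete. Everything here is hypothetical: BoundaryHook, HtmlBoundaryHook, url_span, decode, and extract are illustrative names and are not part of URLExtract, and the boundary rule (stop at whitespace, angle brackets, or quotes) is a simplifying assumption.

```python
import html


class BoundaryHook:
    """Hypothetical hook: given a hostname location, pick the URL boundaries."""

    # Assumed stop characters for plain text; a real parser would need more care.
    stop_chars = set(" \t\r\n<>\"'")

    def url_span(self, text, host_start, host_end):
        # Walk left from the hostname until a stop character is hit.
        start = host_start
        while start > 0 and text[start - 1] not in self.stop_chars:
            start -= 1
        # Walk right from the hostname until a stop character is hit.
        end = host_end
        while end < len(text) and text[end] not in self.stop_chars:
            end += 1
        return start, end

    def decode(self, url):
        # Plain text: no decoding.
        return url


class HtmlBoundaryHook(BoundaryHook):
    """Hypothetical override for HTML input: decode character references."""

    def decode(self, url):
        return html.unescape(url)


def extract(text, hostname, hook):
    # Stand-in for the parser: locate the hostname, then defer to the hook.
    host_start = text.index(hostname)
    start, end = hook.url_span(text, host_start, host_start + len(hostname))
    return hook.decode(text[start:end])


sample = "link: http://example.com/a?x=1&amp;y=2 end"
plain_result = extract(sample, "example.com", BoundaryHook())
html_result = extract(sample, "example.com", HtmlBoundaryHook())
```

With this shape, each doctype (html, xml, md, rst) would get its own subclass, and the semi-colon is no longer treated as a boundary unless a subclass says so.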

@lipoja lipoja modified the milestones: 1.0.0, 1.1.0 Jun 20, 2020
@lipoja lipoja removed this from the 1.1.0 milestone Oct 1, 2020