Improve playground proxy and/or playground document loader #813

Open
davidlehn opened this issue Aug 2, 2023 · 3 comments

Comments

@davidlehn
Member

The playground has a proxy to help load HTTP contexts from the HTTPS playground:
https://github.com/json-ld/json-ld.org/blob/main/playground/proxy.php

There are some issues with this:

  • The proxy could be abused. It should have more checks to make it only useful for the playground.
  • It might have security issues. (See Prevent potentially dangerous behaviour within proxy script #754)
  • It doesn't work well!
  • It currently follows redirects via curl but returns only the content, not all of the headers. This works in some cases and fails in others.
  • A current failure case is the HTTP schema.org URL, which redirects to HTTPS and then returns HTML with a Link header. The proxy does not return that header or any others. Even if it did, the link target is a URL relative to schema.org, and because the XHR doc loader rewrites the request URL to a proxy URL, the current code would resolve the link target relative to the playground instead. There are multiple problems here at different levels.
  • The current workaround for the schema.org issue above is to rewrite that particular HTTP URL to HTTPS, but other sites with similar issues would still fail.
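
To illustrate the link target problem, here is a minimal sketch of how the same relative target resolves differently depending on the base URL. The proxy endpoint, its `url` query parameter, and the example link target are assumptions for illustration, not taken from the actual proxy code or from schema.org.

```javascript
// Hypothetical proxy endpoint and query parameter name; the real ones may differ.
const PROXY_URL = 'https://json-ld.org/playground/proxy.php';

// Build the URL that is actually requested when going through the proxy.
function toProxyUrl(targetUrl) {
  return PROXY_URL + '?url=' + encodeURIComponent(targetUrl);
}

// Resolve a Link header target against a base URL.
function resolveLinkTarget(linkTarget, baseUrl) {
  return new URL(linkTarget, baseUrl).href;
}

const documentUrl = 'https://schema.org/';
// Illustrative relative link target (an assumption for this sketch).
const linkTarget = '/docs/jsonldcontext.jsonld';

// Resolving against the proxied request URL lands on the playground host.
const wrong = resolveLinkTarget(linkTarget, toProxyUrl(documentUrl));
// Resolving against the original document URL stays on schema.org.
const right = resolveLinkTarget(linkTarget, documentUrl);

console.log(wrong); // https://json-ld.org/docs/jsonldcontext.jsonld
console.log(right); // https://schema.org/docs/jsonldcontext.jsonld
```

A proxy-aware loader would need to keep the original document URL around and use it, not the proxy URL, as the base.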

Ideally the proxy would not be needed, but if the playground is to be HTTPS, then a workaround to load HTTP resources is needed.

I think the longer term fixes that are needed are:

  • Simplify proxy to only do a single request and return what it gets. Do not follow redirects, let the caller handle that.
  • Improve proxy to only handle content types the playground needs. At least via headers, but maybe content inspection too.
  • Add other proxy features to make it only useful for the playground.
  • Update XHR document loader with some of the node features to handle redirects if needed. Also consider either a native fetch doc loader, or ensure the node one works in a browser since it's indirectly based on fetch API now.
  • May need to make doc loaders proxy aware so link targets work.
  • Improve the special rewrite rule for schema.org to be more general in case it's needed for other situations.
  • Ensure cache headers get passed through and are used properly.
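
A rough sketch of what "let the caller handle redirects" could look like on the loader side. `requestOnce` is a hypothetical function that performs exactly one request (for example through a simplified proxy) and resolves to `{status, headers, body}` with lowercase header names. None of these names come from jsonld.js or the current proxy.

```javascript
// Follow redirects in the caller, one request at a time.
// `requestOnce(url)` is an assumed helper that never follows redirects itself.
async function loadWithRedirects(url, requestOnce, maxRedirects = 5) {
  let currentUrl = url;
  for (let i = 0; i <= maxRedirects; i++) {
    const res = await requestOnce(currentUrl);
    const location = res.headers.location;
    const isRedirect = res.status >= 300 && res.status < 400 && location;
    if (!isRedirect) {
      // documentUrl is the final URL, which is the right base for Link targets.
      return {
        documentUrl: currentUrl,
        status: res.status,
        headers: res.headers,
        body: res.body
      };
    }
    // Location may be relative, so resolve it against the URL just requested.
    currentUrl = new URL(location, currentUrl).href;
  }
  throw new Error('Too many redirects: ' + url);
}
```

Because every hop is visible to the caller, the headers of each response stay attached to the request that produced them, and the final `documentUrl` is known without the proxy having to report it.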

See also: #798

@davidlehn
Member Author

Note to future self:

  • To handle headers in PHP:
    • https://stackoverflow.com/a/41135574/151401
    • If FOLLOWLOCATION is used, the linked code as-is appends each header value to a per-header list only when that header is present in a response. Each header's list can then have a different length, so it is impossible to know which response produced which value. Look into avoiding all that by not using FOLLOWLOCATION.
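
If redirects did have to be followed inside the proxy, one way around the ambiguity would be to group header lines per response instead of per header name. The sketch below shows only the grouping logic, in JavaScript rather than PHP, and assumes the header callback delivers each response's status line, header lines, and a blank separator line in order. `groupHeaderLines` is a made-up name.

```javascript
// Group raw header lines into one entry per response.
// Assumes each response starts with a status line such as "HTTP/1.1 301 ...".
function groupHeaderLines(lines) {
  const responses = [];
  let current = null;
  for (const raw of lines) {
    const line = raw.trim();
    if (line.startsWith('HTTP/')) {
      // A status line starts a new response block.
      current = {statusLine: line, headers: {}};
      responses.push(current);
      continue;
    }
    const idx = line.indexOf(':');
    // Skip blank separator lines and anything seen before the first status line.
    if (current === null || idx === -1) continue;
    const name = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    (current.headers[name] = current.headers[name] || []).push(value);
  }
  return responses;
}
```

With this shape, the last entry holds the headers of the final response and earlier entries hold the redirect hops, so a Link header can be tied to the response that sent it.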

@gkellogg
Member

gkellogg commented Aug 2, 2023

I think you can probably white-list the proxy to only work with schema.org and that will handle 99% of cases involving redirection or HTTP headers. Of course, it's good practice to handle common requests locally to avoid web traffic, but that requires some mechanism to keep the local versions up to date. But, IIRC jsonld.js has a way to use a local cache of loaded contexts.
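
As a sketch of that kind of restriction, a host allowlist check could look roughly like the following. The function name and the allowlist contents are illustrative assumptions, and the real proxy is PHP, so this only shows the logic.

```javascript
// Hypothetical allowlist of hosts the proxy would agree to fetch.
const ALLOWED_HOSTS = new Set(['schema.org']);

// Return true only for http(s) URLs whose exact hostname is allowlisted.
function isAllowedTarget(rawUrl, allowedHosts = ALLOWED_HOSTS) {
  let url;
  try {
    url = new URL(rawUrl);
  } catch (e) {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    return false; // rejects file:, ftp:, data:, and so on
  }
  // Exact match on the parsed hostname, so "schema.org.evil.example" fails.
  return allowedHosts.has(url.hostname);
}
```

Matching on the parsed hostname rather than on a string prefix avoids lookalike hosts passing the check.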

@davidlehn
Member Author

The playground rewrites schema.org specifically to always be https, so the proxy isn't even used for that. (At least not for the common base domain URL use case).
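
For reference, a generalized version of that rewrite could be sketched as below. This is not the playground's actual code. The host set and function name are assumptions, and it only upgrades the scheme for listed hosts.

```javascript
// Hypothetical set of hosts known to serve the same content over HTTPS.
const HTTPS_UPGRADE_HOSTS = new Set(['schema.org']);

// Rewrite http: to https: for listed hosts; leave every other URL alone.
function upgradeToHttps(rawUrl, hosts = HTTPS_UPGRADE_HOSTS) {
  const url = new URL(rawUrl);
  if (url.protocol === 'http:' && hosts.has(url.hostname)) {
    url.protocol = 'https:';
  }
  // Note: href is the normalized form, so "http://schema.org" gains a trailing slash.
  return url.href;
}

console.log(upgradeToHttps('http://schema.org/Person')); // https://schema.org/Person
console.log(upgradeToHttps('http://example.org/a'));     // http://example.org/a
```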
