Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

do not pipe non-json data through json.dumps #910

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sgsaenger
Copy link

fixes #904 and possible similar errors by treating json data explicitely

@mgeier
Copy link
Contributor

mgeier commented Nov 13, 2018

I'm not sure if this makes sense here, but instead of mime_type == 'application/json' you might want to consider using this regular expression:

^application/(.*\+)?json$

https://github.com/jupyter/nbformat/blob/5c18151bab30c4b6c81da279812bf1434ec54271/nbformat/v4/nbformat.v4.schema.json#L407

@sgsaenger
Copy link
Author

I don't even know how i would end up with plain json-data in my notebooks, therefore I'd rather not add any complexity I can't judge myself.

@mgeier
Copy link
Contributor

mgeier commented Nov 20, 2018

AFAICT, Jupyter widgets use JSON to store their data in the notebook file. They use MIME types like application/vnd.jupyter.widget-view+json, which are matched by the regular expression mentioned above.

An example notebook: https://github.com/jupyter-widgets/ipywidgets/blob/master/docs/source/examples/Widget%20List.ipynb

I don't know if those output data should be ignored by the ExtractOutputPreprocessor, though.

If not, they should be added extract_output_types, too.

@sgsaenger
Copy link
Author

At least in this example, I'd wager it should explicitly not be extracted. (What can the user want with the model_id that seems to be for communication with the widget-handler?)
But my intention was just fixing #904, maybe @akhmerov and @minrk who handled the PR #847 that introduced the json-handling can have a look at this?

@akhmerov
Copy link
Member

I'm quite confused why a byte string ended up in the notebook data to be honest. I thought according to the nbformat everything is a JSON data type, so either a string or another JSON. In that case the fix would be to ensure that the data is read as a string instead.

The problem with the spec is that it's a bit ambiguous: JSON output may consist of a single string. That's why the logic I applied in determining the output type is to check whether it's not a string, and therefore must be JSON, or a string, in which case it may be a JSON and we have to check the mime type.

If for some reason we do expect that byte arrays may appear in cell outputs, I believe a better fix would be to include those in the check if isinstance(data, (text_type, bytes)): ....

In addition to ipywidgets storing their outputs as JSON, so does plotly (its mime type is application/vnd.plotly.v1+json) and ipython.display.JSON (jupyterlab even has a nice rendering for it).

@sgsaenger
Copy link
Author

sgsaenger commented Nov 21, 2018

I guess you're right about the specs. The file itself even says that it expects images to be enclosed in a string:

# data is b64-encoded as text (str, unicode),

Still, i'm not sold on your solution.
We know the mime type before we reach this part of the code. By the contract of this function, we trust the mime type (else we wouldn't be here. Note that you omitted including json in the list of accepted mime types). So we perform some specific actions on the data, based on the mime_type.
Previously, we didn't care much for the actual python types handled; if I understand you correctly, you also don't really care whether you have a string or non-string, only whether we should treat the object as json, else you wouldn't include the ´isinstance´-dance.
So where do you see a problem in just checking the mime type against ´application/json´?
The ´isinstance´ call ONLY comes into play if the data is jpg, png, pdf or svg (or anything else that may be included in the future), shouldn't this be independent from json-specific code-paths?

@akhmerov
Copy link
Member

Note that you omitted including json in the list of accepted mime types).

The list of mimetypes accepted by default is a separate issue from extracting the output correctly, and therefore I did not touch it. That traitlet isn't a complete list of mime types that should ever be treated by the extract output preprocessor, but merely what the extract output treats by default. Several preprocessors override that list. We therefore may not assume that we are never going to encounter other mimetypes. Specifically your PR would break nbconvert for those who want to extract widget information as pointed by @mgeier, or a plotly plot, which is also JSON.

We know the mime type before we reach this part of the code. By the contract of this function, we trust the mime type (else we wouldn't be here.

That is certainly true, but this bit of code is not about the meaning of the data (represented as mime type), but rather about the storage format of the data. Unfortunately right now the storage format is not explicitly specified in the metadata (see also #858). Roughly speaking, there are 3 options currently:

  • any JSON-like data is stored as JSON without double serialization
  • any other plaintext-ish mime type is stored as a string
  • anything else is base64 encoded

The second and third option would become strings at this point in the pipeline, therefore the logic I am applying is the following;

  • If we encounter any python object other than a string (something like a list/dict), we're dealing with JSON-encoded data (that's the reason for what you called isinstance-dance)
  • Otherwise if we are either dealing with either a single JSON string or the other two options. Storing a single JSON string is rather uncommon, but allowed by the spec, and therefore only in that case I do rely on the explicit mimetype.

That's my understanding of what is happening in the preprocessor, largely based on discussions with @minrk.

What I really do not understand is: why do we ever encounter a byte string! This seems like a bug: JSON doesn't have a way to specify the type of string data, svg may contain arbitrary unicode symbols, and nothing in the spec tells about byte strings. Instead we're base64 encoding anything that isn't unicode.

@sgsaenger
Copy link
Author

That traitlet isn't a complete list of mime types that should ever be treated by the extract output preprocessor, but merely what the extract output treats by default. Several preprocessors override that list.

Ah, totally didn't catch that usage, but fair point.

If we encounter any python object other than a string (something like a list/dict), we're dealing with JSON-encoded data (that's the reason for what you called isinstance-dance)

Ok, I now understand your assumptions. But still, is this the right place to (implicitly) introduce strong typing? I have two issues with this:

  • We're implicitly introducing strong type-dependencies on data we don't care about. If it is important that we more closely match some specs, shouldn't this be done explicitly, and possibly elsewhere?
  • Isn't there a way to do the json-conversion while being both more explicit in our intent and/or being more robust?
    We could e.g. check for json in mime_type to match the widget information, or wrap the dump in try and ignore possible exceptions when the data is not wanted.

@akhmerov
Copy link
Member

Some remarks:

  • The best way to avoid guessing would be WIP: output extractor: infer data encoding from output metadata #858.
  • For now if we let something that doesn't look like a string go further down the pipeline, the follow-up code will break. This would happen with this PR and a mime-type that we don't explicitly catch. I don't like strong typing just likewise, but here I don't see a good reason to avoid it. (Obviously there's something subtle happening with bytes, so I'm not claiming that the current code is correct.)
  • I don't know enough about mimetypes combined with how jupyter decides which data storage format to apply to a datatype. I am therefore not certain that a check 'json' in mimetype is safe. Maybe some json-related mimetypes would somehow be still string-serialized and we would end up double-serializing them? Maybe some mimetypes that don't contain JSON in their name are stored as a JSON data structure?

@MSeal
Copy link
Contributor

MSeal commented Apr 13, 2019

Related to this topic, I ended up tackling the application.json vs other data type paths in #985. It doesn't expand the scope of data transformation but fixed a bug where latex exports ended up in the data.decode path.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use of SVG generates TypeError: Object of type 'bytes' is not JSON serializable
4 participants