This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

question about parsing bytes in dag-json #302

Open
warpfork opened this issue Sep 24, 2020 · 7 comments

Comments

@warpfork
Contributor

I'm trying to implement a single-pass (no backtracking; no re-parsing or multipass hijinx) no-extraneous-allocations (I can do lookahead, but only if it's a fixed amount that's in no way data-dependent) parser for dag-json. I'm having difficulty reconciling this goal with how dag-json specifies the serialization of bytes.

The dag-json specs currently describe bytes as being serialized as:

{"/": { "bytes": String /* Multibase Base64 encoded binary */ }}

This is somewhat problematic. Critically, there's no definition of when a parser may decide to reject this data as invalid dag-json, versus when deviation from this pattern must result in parsing as maps.

Assuming that we need to validate each of:

  1. there's a map where the first key is "/"
  2. the value for that key contains a map
  3. the first key in this deeper map is "bytes"
  4. its value is a string
  5. there are no more elements in this map
  6. there are no more elements in the outer map either

... then that's six tokens the parser needs to buffer before it can decide for sure that this segment of serial data should be interpreted as bytes.

If we have to support pivoting our interpretation of this data into maps right up until the last minute (say, if the sixth token is another map key in the outer map), then we really do need to buffer this.

Buffering six tokens is problematic: it crosses a critical threshold, because we could hit the same buffering need again before we've cleared a buffer that's already started. This lookahead process must start each time a map-open token is encountered. By six tokens in, there's more than enough distance travelled that my parser might have encountered another token sequence which could itself require six more tokens of buffering to resolve. This can happen recursively. The result? We can't do this lookahead in a fixed amount of memory. (And frankly speaking, the parse logic just gets nasty.)

If we're allowed to reject data that "teased" (if you will) the parser into thinking there might be a bytes pattern coming up -- for example, serial data that continues with more map content at either step 5 or 6 -- then we don't have the potentially-unlimited-buffer problem in our parser implementation; if there's no need to be able to backtrack and replay anything under an alternate interpretation, there's no need to buffer, and thus no difficulty. However, the dag-json spec would need to clearly identify whether this is allowed -- and exactly at which steps it is allowed.

Defining exactly which steps are allowed to halt parsing and raise an error, versus which steps must cause backtracking and reinterpretation as simple maps, is extremely important, because there's a big difference between erroring on e.g. {"/":{"bytes":12 and erroring on every map whose first entry is "/" that doesn't turn into a CID or into bytes.


Sidetrack: this isn't actually a problem for the parsing of CIDs... which may be interesting to reflect on.

CIDs serialized in dag-json are documented to look like:

{"/": String /* Base58 encoded CIDv0 or Multibase Base32 encoded CIDv1 */}

... which bears some resemblance to how bytes are described, but has one extremely consequential difference: it takes fewer steps and fewer tokens to see the pattern through from beginning to end.

  1. there's a map where the first key is "/"
  2. the value for that key contains a string
  3. there are no more elements in this map

It's possible to process this pattern without potentially needing to re-engage the pattern recognition again before the previous recognition attempt has halted. You can probably see how: by the time you can first possibly encounter a new map-open token which would require starting the pattern recognition again, you'd also simultaneously have flunked any current pattern recognition, because that position must contain a string for the outer pattern recognition to pass.

go-ipld-prime's dag-json implementation actually supports this; the code is somewhat exciting.


tl;dr the dag-json spec document currently only describes the happy path for decoding and discovering encoded bytes. What it needs to do is describe the step-by-step logic which can actually perform this recognition, as well as what to do when the recognition derails, at each step of the process where it can derail.

In deciding how each possible branch should be handled -- backtracking and reinterpretation, or erroring -- it may be a good idea for the spec to consider what those choices imply in terms of data that becomes invalid dag-json as well as what those choices imply about the complexity (and efficiency) of dag-json parsers.

@rvagg
Member

rvagg commented Sep 25, 2020

I suppose for background for others viewing this and being a bit confused - I think it's safe to say that the current spec is from the perspective of a double-parse where we have parseDagJSON(parseJSON(rawBytes)) process that's common with JSON. But you're dealing with a parseDagJSON(rawBytes) in ipld-prime (and I'd like to be able to get to this in JS at some point too but it's pretty low priority).

I'm reading this:

If we're allowed to reject data that "teased" (if you will) the parser into thinking there might be a bytes pattern coming up

... as suggesting that you'd like to be much more restrictive in the forms dag-json can contain for bytes; specifically, rejecting as invalid dag-json (i.e. halting parsing and throwing some error) any case where the inner map, and perhaps even the outer map, contains other keys alongside what at first looks like the bytes pattern.

e.g.

  • {"/": { "bytes": String, "nope": 0 }}
  • {"/": { "bytes": String }, "gotcha": "nope" }

But not necessarily cases where the sorting puts things in a more convenient order:

  • {"/": { "/": "nope", "bytes": String }}

Is that right?

warpfork added a commit to ipld/go-ipld-prime that referenced this issue Oct 1, 2020
I haven't implemented the reader side because I'm not sure it's possible;
the specification is insufficiently clear.
I opened Issue ipld/specs#302 to track this.
@warpfork
Contributor Author

warpfork commented Oct 1, 2020

@rvagg also just pointed out to me that the spec states that bytes should be base64 encoded -- specifically, no other multibase is permitted -- and yet also requires that it have the base64 multibase indicator prefixed to it.

This seems redundant, and I question the utility of it. (It makes sense not to support various multibases: for one, almost no one would use them, and for two (much more importantly), doing so would add a form of baroqueness that would be hard to round-trip through the Data Model. But then, if the base isn't a parameter, why encode the extra byte?)

But also, if we are going to require that multibase indicator, the spec should clarify what the expected behavior is for other values encountered there.

It's currently very likely for someone to skim the spec and be caught off-guard by that almost-entirely-superfluous byte, I'm afraid.

@vmx
Member

vmx commented Oct 1, 2020

This seems redundant, and I question the utility of it.

I guess the idea is that we like self-described data over spec'ed data? Without the prefix it's Base64-encoded bytes; with the prefix byte it's a self-describing Multibase-encoded thing.

@rvagg
Member

rvagg commented Oct 1, 2020

with the byte it's a self-describing Multibase encoded thing

Except that we mandate a particular base, so it's only one of them. The main utility of this, aside from "we like it", would seem to me to be yet another way to make this object really, really specific, such that there's a whole list of things it must pass before we recognise it as a Bytes. But that brings us back to @warpfork's original question about whether we get to reject close-but-not-quite forms or have to fall back to parsing them as {String:Any}. If the latter, then the utility of the multibase prefix decreases a little.

The current JS impl does fallback decoding, btw; it's very forgiving and makes room for close-but-not-quite forms: https://github.com/ipld/js-dag-json/blob/591e16194e597ca81050a8bb549635441d2cae47/index.js#L31-L42

@jburnett

jburnett commented Oct 1, 2020

Could it be for the purpose of allowing others in the future? Or does the spec set it in stone forever?

@rvagg
Member

rvagg commented Oct 2, 2020

For a codec there really should be just one way, so making space for the future really shouldn't figure into it unless we can see ways that we might need wriggle room. I don't think this is one such case.

@vmx
Member

vmx commented Oct 5, 2020

For a codec there really should be just one way,

Agreed. And since we specifically want to enforce that, I think it makes sense to change the spec from "Multibase Base64" to just "Base64". We actually don't want to encourage anyone to use a different multibase.

warpfork added a commit to ipld/go-ipld-prime that referenced this issue Oct 20, 2020
I haven't implemented the reader side because I'm not sure it's possible;
the specification is insufficiently clear.
I opened Issue ipld/specs#302 to track this.