
Invariants for IPLD Codecs #328

Open
vmx opened this issue Nov 4, 2020 · 3 comments

Comments

@vmx
Member

vmx commented Nov 4, 2020

One of the ideas of IPLD is that it defines a Data Model that is independent of the underlying codec. You can leverage existing formats like CBOR, JSON or Protocol Buffers, or create your own special-purpose one. IPLD Codecs usually specify additional restrictions on top of those codecs in order to make them practical for content addressing: for example, restricting the types that can be used (so that there is only one way to encode each Data Model Kind) or fixing the sort order of map keys. That's the reason there are those DAG-* codecs.

So far there is no explicit invariant in the specs that those IPLD Codecs are a strict subset of the underlying codec (I always assumed that, but it has never been written down AFAIK). That is an important feature, as it means you can deserialize and serialize any data that was encoded by a DAG-* codec using implementations that comply with the original codec specs.

A concrete example is that the Protocol Buffers reference implementations can deserialize any DAG-PB data and serialize it again into a byte-identical copy of the original data. You wouldn't need to use any of the tooling provided by the IPLD project; you could just use existing parsers/implementations of the codec.
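To make that interop concrete, here is a minimal sketch in Python that walks the raw Protocol Buffers wire format by hand, the way any conforming parser would; the one-field message it decodes is made up for illustration and is not the full DAG-PB schema.

```python
# Minimal sketch: decode a length-delimited protobuf field by hand, the same
# way any conforming Protocol Buffers parser would. The example message is
# illustrative (a single `bytes` field with field number 1), not DAG-PB itself.

def read_varint(buf, pos):
    """Decode a base-128 varint starting at `pos`; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7

def decode_fields(buf):
    """Yield (field_number, wire_type, value) for every field in `buf`."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 2:                      # length-delimited (bytes/string/message)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 0:                    # varint
            value, pos = read_varint(buf, pos)
        else:
            raise NotImplementedError(f"wire type {wire_type}")
        yield field_number, wire_type, value

# Hand-built example message: field 1 (bytes) = b"hello"
encoded = bytes([0x0A, 0x05]) + b"hello"
print(list(decode_fields(encoded)))   # [(1, 2, b'hello')]
```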

This ensures that data produced by IPLD is interoperable even outside of IPLD itself.

Hence I propose stating this invariant explicitly in the specs; otherwise IPLD wouldn't live up to the idea of being independent of the underlying codec and interoperable with a wide variety of other systems.

@warpfork
Contributor

warpfork commented Nov 4, 2020

I think there's a good idea here, but we should also be cautious about overgeneralizing it.

Not all IPLD codecs have an "underlying" codec. For those that do: our specs will still be clearest if we specify what the IPLD codec does first; and specify what relationship this has to any other codecs in the wild second. And in all forms of interop: the practical details matter; and what various tools and libraries do in the wild can be just as interesting as what a spec or a reference implementation says, especially if those things diverge.

I think our codecs specs and documentation should be individually clear, for each codec, about any other widely-known systems they expect to be cross-compatible with, and how, and any conditions and limits there may be on that.

(And we need these detailed statements anyway: they're the other half of fully specifying any increased strictness an IPLD codec might have relative to a general understanding of that codec.)

Having the holistic goal of interop in mind for those codecs which do aim to have interoperability with existing systems is... good? But also tautological. If there's some phrasing of this that will help us write detailed interoperability reports per codec, I'm all for it; I'm just not sure what kind of statement that would be and what kind of explicitness it can really have that will be useful.

@vmx
Member Author

vmx commented Nov 5, 2020

I don't think I made my point clear enough. I don't want to impose additional constraints on codecs; I'd like to document the current state.

I think what I describe is how people currently understand, implement and use IPLD Codecs. I want to make sure our specs are precise and remove the chance for misinterpretation.

@warpfork would it help if more people commented (or emoji-reacted) here on whether what I describe matches their expectations about IPLD Codecs or not? I ask because I observe a disconnect between the IPLD Team and the outside world. And I obviously also only have a limited view of the outside world. So it might help to get people from various backgrounds to chime in here.

@vmx
Member Author

vmx commented Nov 6, 2020

Here's a quick update after talking to various folks (thanks everyone!).

Producing a byte-identical copy of the original data is not something we can enforce: spec-compliant implementations of the underlying codecs aren't guaranteed to be that strict. Examples:

  • @aschmahmann mentioned that Protocol Buffers messages might not always get serialized to the same bytes, even if the input data is the same
  • @ribasushi mentioned float encoding in CBOR. The CBOR spec recommends using the smallest representation possible, i.e. if you have a 64-bit IEEE-754 float that can be represented losslessly as a 32-bit IEEE-754 float, then use the 32-bit encoding (example of such a float). Implementations are free not to do that (we plan for DAG-CBOR to always require 64-bit floats, which is what the Go implementation already does). So two implementations could encode the same value differently while both being spec compliant (see the sketch below).
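As a quick sketch of that second point, using only Python's standard library: the value 0.5 has (at least) two valid CBOR encodings, and both decode to the same number.

```python
import struct

# The same number, 0.5, in two spec-compliant CBOR encodings:
#   0xFA + 4 bytes -> single-precision float (major type 7, additional info 26)
#   0xFB + 8 bytes -> double-precision float (major type 7, additional info 27)
single = b"\xfa" + struct.pack(">f", 0.5)   # fa 3f 00 00 00
double = b"\xfb" + struct.pack(">d", 0.5)   # fb 3f e0 00 00 00 00 00 00

# Both decode to the identical value, yet the bytes differ, so two
# spec-compliant CBOR encoders can disagree on the serialization.
assert struct.unpack(">f", single[1:])[0] == struct.unpack(">d", double[1:])[0] == 0.5
print(single.hex(), "!=", double.hex())
```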

Hence I propose a less strict version:

In case there is an underlying codec, the data produced by an IPLD Codec MUST be decodable by any spec compliant/reference implementation of the underlying codec.

"underlying codec" means a existing codec we apply additional constraints on. Examples are DAG-CBOR and DAG-JSON, where the underlying codecs would be CBOR and JSON.


We will likely get into SHOULD vs. MUST discussions. The reasons I'm in favour of "MUST":

So far, all cases I've seen where the condition above wasn't met were bugs. Of course there are bugs that lead to data that doesn't comply with a spec (and hence violates the MUST); that's the nature of bugs. I don't think those bugs should weaken the spec itself. The reason those bugs exist is that the spec isn't precise enough, and now is the chance to add this precision. I don't think it should be a "SHOULD" just because there could be bugs.

This doesn't mean that I want to break all the data that was produced due to those bugs. I think it's totally fair to create libraries that can deal with such data; they are free to do so, and even should if they care about backwards compatibility with pre-existing data. Though they MUST not produce such invalid data moving forward.

"producing data" is a bit fuzzy and I'd like that we operate with common sense here and not putting too much efforts on tighten the spec language. So if your library deals with invalid data, it might as well write it again. What I'm after is that we try to make those bugs less likely in the future.

This also (to me) does not mean that you always need to ensure that the data you write is valid. You just need to acknowledge that the data you produce might be invalid and might not work with other implementations. This would be similar to what Google does with Protocol Buffers. In Go you can produce invalid data (strings with arbitrary bytes), but you won't be able to read that data with the Python implementation. Or, if you store arbitrary bytes in strings in the Protocol Buffers JavaScript implementation (without doing any special additional work), they will be serialized differently from the ones the Go implementation would produce (even though the original input bytes were the same).
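A sketch of that last point, built only from hand-written wire-format bytes (the field number and payload are made up for illustration): a string field carrying bytes that aren't valid UTF-8 is exactly what a UTF-8-validating parser, such as the Python reference implementation, would reject.

```python
# Field 1 declared as `string`, but filled with bytes that are not valid UTF-8.
# A lenient serializer can emit this; a UTF-8-validating parser will reject it.
payload = b"\xff\xfe"                            # not valid UTF-8
wire = bytes([0x0A, len(payload)]) + payload     # tag: field 1, wire type 2 (length-delimited)

try:
    payload.decode("utf-8")    # the check a strict parser performs on string fields
except UnicodeDecodeError as err:
    print(f"invalid string field in {wire.hex()}: {err}")
```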
