This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

codec: pathed link #352

Open
mikeal opened this issue Jan 19, 2021 · 9 comments

Comments

@mikeal
Contributor

mikeal commented Jan 19, 2021

Wanted to open up a discussion about this particular idea.

We’ve had conversations for a while about how to represent a link as a (CID + Path) but haven’t agreed on anything stable yet.

One thought I had was to create a codec and simple block format for pathed links.

| multicodec | multihash | utf8(path) |

type PathedLink struct {
  link Link
  path String
} representation map

You could use the identity multicodec to inline the relevant data into a single CID and end up with a “pathed link.” Of course, the data model representation would not automatically traverse unless configured to do so but that’s ok, we need the data model to remain stable anyway. This would give us a link level indicator of how to traverse and we could instrument whatever special traversal logic we might need when and where we need it and are ready for it.

We also get a very compact representation since we’re able to shave some bytes in the block format.

@Stebalien
Contributor

I take it that's just everything concatenated? Works for me! Also note: utf8(/) is the multicodec for "paths" (:trollface:), so every part of this is multicodec prefixed.

Also related: multiformats/multiformats#55.

@mikeal
Contributor Author

mikeal commented Jan 19, 2021

> Also note: utf8(/) is the multicodec for "paths" (:trollface:), so every part of this is multicodec prefixed.

Oh that is awesome!

@rvagg
Member

rvagg commented Jan 20, 2021

For clarity, can you describe the sections of bytes that end up forming the final CID? I'm not quite clear on how you're getting to the end product. Is it just | multicodec | multihash | utf8(path) |, which would be backward incompatible with the current CID parsers? Or would it be | pathedlink-multicodec | identity | multicodec | multihash | utf8(path) |, that is, a CID+path wrapped up in a raw+identity CID? The latter is how I'm interpreting "You could use the identity multicodec to inline the relevant data into a single CID".

@mikeal
Contributor Author

mikeal commented Jan 20, 2021

> For clarity, can you describe the sections of bytes that end up forming the final CID?

Sure.

It’s also worth pointing out that the format is essentially just a CID without the preceding 1 (the CIDv1 version), followed by the path.

Here’s a fully inline pathed CID.

| CIDv1 | pathed-link-multicodec | identity-multicodec | identity length | link-codec-multicodec | link-hash-multicodec | link-hash-length | link-hash-bytes | utf8(path) |

I should also note that we’ll need to apply some rules to the path in order to ensure determinism (no leading or trailing slash).
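
A minimal Python sketch of the layout above, for illustration only. It assumes 0x2f (the "path" code mentioned earlier in the thread) as the pathed-link multicodec, which is not an agreed assignment, and uses dag-cbor (0x71) with sha2-256 (0x12) as an example link target. The function and constant names are hypothetical.

```python
import hashlib

# Assumption: 0x2f stands in for the pathed-link codec; nothing here is registered.
PATHED_LINK_CODEC = 0x2F
IDENTITY = 0x00   # identity multihash code
SHA2_256 = 0x12   # sha2-256 multihash code
DAG_CBOR = 0x71   # dag-cbor multicodec code


def varint(n: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_pathed_link(link_codec: int, hash_code: int, digest: bytes, path: str) -> bytes:
    """Build | CIDv1 | pathed-link codec | identity | length | link codec | link multihash | utf8(path) |."""
    # Determinism rule from the thread: no leading or trailing slash.
    if path.startswith("/") or path.endswith("/"):
        raise ValueError("path must not have a leading or trailing slash")
    # Block body: the target CID without its version, followed by the path.
    block = (
        varint(link_codec)
        + varint(hash_code)
        + varint(len(digest))
        + digest
        + path.encode("utf-8")
    )
    # Wrap the block in a CIDv1 whose multihash is the identity hash of the block.
    return varint(1) + varint(PATHED_LINK_CODEC) + varint(IDENTITY) + varint(len(block)) + block


digest = hashlib.sha256(b"example block").digest()
cid = encode_pathed_link(DAG_CBOR, SHA2_256, digest, "a/b")
print(len(cid))  # 42 bytes: 4 bytes of outer header + 38 bytes of block
```

Rejecting non-canonical paths rather than silently trimming them is one possible choice; the thread only states that the rule is needed.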

@Stebalien
Contributor

Wait, so you're not just concatenating a CID and a path? You're suggesting a new object type, stored as an "inline/identity" CID? I mean, that works, but it seems like just extending the CID format to allow tacking on a path would be cleaner.

@mikeal
Contributor Author

mikeal commented Jan 20, 2021

The goal here is to add this functionality in a generic way to IPLD (in other words, it should work for links to/from any existing block format) without actually breaking the IPLD Data Model (which extending the feature set of links would do).

This is “just a new block format” specifically for pathed links. That means it has a representation that conforms to the existing IPLD Data Model as it is today without any changes. Since it’s implemented as a block format but is intended to be a link itself, the sane thing to do is to embed it in an identity multihash.

It may seem a little hacky but it’s only 2 extra bytes of identity multihash overhead, which you actually gain back in the block format when compared to encoding the same data in CBOR.
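
The two-byte figure can be checked with a short sketch: the identity multihash adds one byte for its code (0x00) and one varint for the payload length, and that varint stays at a single byte while the payload is under 128 bytes. The helper names below are hypothetical, and the comparison with CBOR is not reproduced here.

```python
def varint_len(n: int) -> int:
    """Number of bytes an unsigned LEB128 varint needs for n."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def identity_overhead(payload_len: int) -> int:
    """Bytes added by the identity multihash header: hash code plus length varint."""
    return varint_len(0x00) + varint_len(payload_len)


# Payload for a link with a 32-byte digest: codec (1) + hash code (1) + length (1) + digest (32).
link_bytes = 1 + 1 + 1 + 32
print(identity_overhead(link_bytes + len("a/b")))  # 2
print(identity_overhead(link_bytes + 93))          # 3, once the payload reaches 128 bytes
```

Under these assumptions the overhead grows to three bytes only when the path exceeds 92 bytes.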

The important thing is that there is an identifier (multicodec) in any link that you can use to identify pathed links. This would allow any IPLD user to add pathed link support to their implementation and have it work across all codecs without changing or breaking the existing data model and it would still produce graphs that contain all the relevant linking information in just the Data Model representation.
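
As a sketch of how an implementation might use that identifier, the following Python function returns the link and path for a pathed link and `None` for any other CID. It assumes binary CIDs, the layout described earlier in the thread, and 0x2f as a placeholder codec value; all names are hypothetical.

```python
from typing import Optional, Tuple

PATHED_LINK_CODEC = 0x2F  # assumption: placeholder value, not an agreed assignment
IDENTITY = 0x00           # identity multihash code


def read_varint(buf: bytes, pos: int) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint at pos; return (value, next position)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def decode_pathed_link(cid: bytes) -> Optional[Tuple[int, int, bytes, str]]:
    """Return (link codec, hash code, digest, path), or None if cid is not a pathed link."""
    version, pos = read_varint(cid, 0)
    codec, pos = read_varint(cid, pos)
    mh_code, pos = read_varint(cid, pos)
    if (version, codec, mh_code) != (1, PATHED_LINK_CODEC, IDENTITY):
        return None  # an ordinary CID: existing handling applies unchanged
    block_len, pos = read_varint(cid, pos)
    block = cid[pos:]
    if len(block) != block_len:
        raise ValueError("identity payload length mismatch")
    # Inner block: | link codec | hash code | hash length | digest | utf8(path) |
    link_codec, p = read_varint(block, 0)
    hash_code, p = read_varint(block, p)
    hash_len, p = read_varint(block, p)
    digest = block[p:p + hash_len]
    if len(digest) != hash_len:
        raise ValueError("truncated digest")
    path = block[p + hash_len:].decode("utf-8")
    return link_codec, hash_code, digest, path
```

A system that does not know about pathed links never calls anything like this and simply sees an identity CID with an unfamiliar codec.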

In practice, I don’t think there’s much difference between this and “extending the CID format,” other than the fact that this is backward compatible with systems that don’t understand pathed links. If you imagine extending the format, you’d end up putting bytes somewhere that say “this is a pathed link,” which is effectively what we’re doing with the CID’s existing codec field. The only cost is the two bytes for the identity multihash, which we might have avoided had we gone a route that wasn’t backward compatible.

@Stebalien
Contributor

I guess... My concerns are:

  1. Unless handled "specially", these links will appear to be new blocks and would have to be handled at a higher layer (e.g., ADL). I have to wonder how this would interact with pathing, selectors, etc.

In terms of not breaking things, yeah, I get that. I'm just concerned about this feature having limited use if it lives outside the core data model.

@mikeal
Contributor Author

mikeal commented Jan 20, 2021

> In terms of not breaking things, yeah, I get that. I'm just concerned about this feature having limited use if it lives outside the core data model.

We sort of have to pick one of these. If it changes the core data model we break everything, including the existing codec definitions, so that ship has sailed.

That said, pretty much everything we’ve built w/ IPLD includes things beyond the data model. IPLD Schemas are the obvious example, and I’m curious to know if there’s a way that we could get pathed links into IPLD Schemas.

@rvagg
Member

rvagg commented Jan 21, 2021

We should enumerate some reasonable use-cases for this so we can figure out whether this proposal makes sense for them. It seems to me that there's going to be special-casing no matter how we implement such a thing. This approach has the benefit of reusing the "inline CID" pattern, which I think we've agreed needs to be baked into our stack. But there will also be an additional "is this a 0x2f + identity CID?" check at various points of the stack, which will break some abstractions.
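
For a sense of what that check might cost, here is a minimal Python sketch. It assumes binary CIDv1 bytes and that the pathed-link codec really is 0x2f; because the version (1), the codec (0x2f), and the identity hash code (0x00) are all below 0x80, each encodes as a one-byte varint and the check reduces to a three-byte prefix comparison. The names are hypothetical.

```python
# Assumption: version 1, codec 0x2f, identity multihash 0x00, each a single-byte varint.
PATHED_LINK_PREFIX = bytes([0x01, 0x2F, 0x00])


def is_pathed_link(cid_bytes: bytes) -> bool:
    """True if the binary CID starts with CIDv1 + 0x2f codec + identity multihash."""
    return cid_bytes.startswith(PATHED_LINK_PREFIX)


ordinary = bytes([0x01, 0x71, 0x12, 32]) + bytes(32)   # dag-cbor + sha2-256
inline = bytes([0x01, 0x2F, 0x00, 3]) + b"abc"         # 0x2f + identity
print(is_pathed_link(ordinary), is_pathed_link(inline))  # False True
```

If a code of 0x80 or above were assigned instead, the prefix would span more bytes and a varint-aware parse would be the safer choice.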
