Graphsync v1.1.0, encoded as CBOR #354

hannahhoward · 2021-01-22T01:11:05Z

Goals

Move graphsync to a CBOR based protocol instead of one based on protobuf

Implementation

Graphsync messages should be valid IPLD data structures encoded in DAG-CBOR. I have constructed an IPLD schema to represent what they should look like.

I've also promoted Metadata out of an extension to a core part of the response message as opposed to an extension -- this is a frequent design issue with v1.0.0.

For Discussion

This is a first stab and I have many questions:

I've assumed all structs encoded as maps, rather than tuples, with no field renames -- is this too inefficient?
I've defined the extensions as a map of {string:Any} -- last I checked Any isn't a thing except when referring to link references. At the same time, I want to specify that an extension MUST be IPLD data which is why I did not use {string:bytes}.
I prefixed everything with GraphSync... which makes things a bit verbose. But also, Priority as a type on its own for example feels pretty general. Where GraphSyncPriority makes it clear it's a priority from graphsync's point of view.

rvagg · 2021-01-22T03:24:11Z

block-layer/graphsync/graphsync.md

+Graphsync network messages are encoded in DAG-CBOR. They have the following schema
+
+```ipldsch
+type GraphSyncExtensions {string:Any}


string should be String in this position. I think Any should be fine for now if it can't be enumerated or properly pinned down, Any just hasn't fully filtered through our specs or stack because it's quite tricky. The original pb has bytes so might that be a safe choice for now? It could be turned into a union later or enumerated in some other way if needed. What types of things go into this?

sorry, just re-read your comment about why you didn't use bytes! ignore that bit then, but I'm still interested in what kind of data this might contain and if it could be enumerated? could it be limited to scalar values to rule out maps and lists perhaps? we have AnyScalar for that case if it works.

Confirm Any should be fine.

(We still have some things to wade through in #318, it seems, but I think it's something we'll have to reach a conclusion on sooner or later, and I don't think there's any way we won't have an Any by the end.)

It definitely isn't AnyScalar. We already use arrays and maps.

There are some other challenges for actually writing this in code-- the string represents the extension name, and the type is based on that, but we support user defined extensions, so it's not possible to know everything ahead of time.

There is a whole concept in cbor-gen world of Deferred -- where you don't deserialize till later, leaving in byte representation till you know what you want to deserialize to. I'm not sure this is the right approach for IPLD. But this will probably come up as well with https://github.com/ipld/go-ipld-adl-hamt -- I see there you've defined an Any type, but many HAMTs have a specific type at the bottom -- people may want to paramaterize their HAMTs and use code gen'd nodes for the day in the leaves. We should probably figure out how this might work.

Anyway though, just putting Any here at least allows me to define the schema

rvagg · 2021-01-22T03:41:45Z

Very nice. Is this going to be a breaking change, or an optional upgrade? I can't figure out if/how graphsync might do versioning but would it be a "can you talk v1.1.0? great, here's my DAG-CBOR messages"?

The map representation question might come down to how important shaving bytes off these messages is. It's nice having self-describing data, but it does introduce quite a bit of overhead to have the keys in there too.

If I take GraphSyncRequest and encode a simple form using this spec then it takes up 99 bytes. If I rename the keys down to single characters then it goes down to 62 bytes. As a tuple representation it's 48 bytes.

The problems with tuple representations:

They don't self-describe, you can't pluck a message off the wire and see what it might be doing (not a huge concern, just nice)
They are hard to change without some kind of versioning signal imposed from outside, you can add or remove items and version them implicitly by the list length, but it's hard (unadvised) to change types in the middle
You can't use optional elements (although I think we may end up allowing for the trailing element to be optional), which you could do in this current form if it made sense

So, not strong reasons to avoid tuple, but I think you're in the best position to suggest what might be best for graphsync.

warpfork · 2021-01-22T09:03:11Z

block-layer/graphsync/graphsync.md

+
+type GraphSyncMetadatum struct {
+  Link         &Any
+	BlockPresent bool


s/bool/Bool/

The rule is based on whether you're refering to a kind (which is a keyword) or a type (which are by convention TitleCase). (We're not quite the same as golang here, and don't have support them interchangably.)

type Foo int is a declaration using a kind keyword, so it's lowercase.

type Foo struct { Field Bool } is refering to a type, so it's using a name.

(There's an implicit prelude in a schema which defines type Bool bool, type String string, etc, so those are all available as a type name to make this less arduous.)

warpfork · 2021-01-22T09:03:37Z

block-layer/graphsync/graphsync.md

+  Cancel      bool                # whether this cancels a request
+  Update      bool                # whether this is an update to an in progress request


instance of type name should be TitleCase

warpfork · 2021-01-22T09:03:43Z

block-layer/graphsync/graphsync.md

+  Prefix  bytes # CID prefix (cid version, multicodec and multihash prefix (type + length)
+  Data    bytes


instance of type name should be TitleCase

warpfork · 2021-01-22T09:24:50Z

block-layer/graphsync/graphsync.md

+
+type GraphSyncRequest struct {
+  Id          GraphSyncRequestID                 # unique id set on the requester side
+  Root        &Any                # a CID for the root node in the query


This is kinda an interesting one. It's a link, alright, so this line as stated is certainly valid. But I'd like to think out loud about it for a moment.

While this is a link, it's not a link we'd ever really mean to traverse when processing the graphsync messages. It's something we expect the agent processing this message to pull out -- as a link, not immediately traversing it -- and then do some logic before it dives in to operations that load that link.

It was just "bytes" in the old protobuf, and that worked fine because of the above. And it seems we could readily still have it be Bytes in this revamp, too. In the old implementation with protobuf, the application logic did a step to reify those bytes into a CID. We could have that step be explicit in a new implementation too, if we wanted to.

This doesn't seem to make a huge difference either way. I guess the main discernible difference is that if someone ran a selector with a bunch of wildcards in it over the GraphSyncRequest itself, if this field is &Any, they might get a mouthful back, and if it's Bytes, the results wouldn't recurse as far. But does this matter? Not really, as far as I can imagine. (I don't know why someone would do that; and even if they did, selectors are full of mechanisms that can be used to control this.)

Alright, that's it for thinking out loud. This seems fine.

hah, that is an interesting point! probably doesn't matter here but worth keeping in mind that viewing this schema from a certain lens might lead to a misunderstanding about intent. &Any does provide some constraints on what the bytes can be, mainly the leading varints must make for a valid CID, but beyond that if you're not traversing then it really is bytes.

warpfork · 2021-01-22T09:32:57Z

We also typically give fields lowercase or mixedCase names.

(When you feed this into codegen in golang, it's going to generate accessor methods with appropriate export case.)

warpfork · 2021-01-22T09:36:51Z

Syntax nittery aside, this logically LGTM :D

hannahhoward · 2021-01-22T18:54:25Z

Re: lowercase names -- FYI, this is the opposite of cbor-gen, though I guess I could use the rename facility to handle this. Actually, usually cbor-gen uses tuple representation so naming is irrelevant. Unfortunately, when using map representation it does not allow renames. I have a PR to do this but it remains unmerged. (whyrusleeping/cbor-gen#46)

Sorry to keep referencing cbor-gen -- but if go-ipld-prime is to get merged through the filecoin ecosystem (which is my hope eventually) , these two will need to move toward compatibility.

hannahhoward · 2021-01-22T19:57:33Z

@rvagg yes the intent is to support both v1.0.0 and v1.1.0 for a time -- protocol negotiation is done at the libp2p level - the protocol name is already versioned, so we simply add a fallback if 1.1.0 doesn't work. We do this in Bitswap and several other protocols.

hannahhoward · 2021-01-22T22:53:18Z

@rvagg & @warpfork I believe I've addressed the comments:

Fixed capitalization
Settled on map in most cases except for tuple for metadata cause it's 2 fields and there are reasons to compress it maximally. put fields in lowercase while making abbreviated renames in uppercase as a size compromise. Generally, any time blocks are present the block size will likely dwarf the rest of the message
While i think this is mostly good to go, I'd like to actually do an implementation in go-graphsync before we merge it, to get some real world testing (though that may end up being a bit)
There's an incoming follow-on PR for some other features and implementation details I want to iron out coming shortly.

warpfork

This looks pretty darn reasonable to me!

rvagg

the schema works really well as a documentation tool here 👍

warpfork · 2021-02-22T17:36:03Z

What's still needed to advance this? At last read, both this and #355 (I assume that's the other follow-up PR you mentioned) look reasonable to me.

If this is pending an implementation, and we don't expect that to happen presently, should we... perhaps merge both these PRs, but then also relocate the file and give it a header to mark it explicitly as a future-oriented draft? (I'm happy to do the git-fu and commenting on that, if that's the path forward.)

warpfork · 2021-04-11T11:46:14Z

Ping for if we can merge this :)

hannahhoward added 2 commits January 21, 2021 17:01

feat(graphsync): define protocol as IPLD schema, encoded in CBOR

ffc2f86

feat(graphsync): include legacy protocol

7313610

rvagg reviewed Jan 22, 2021

View reviewed changes

warpfork reviewed Jan 22, 2021

View reviewed changes

feat(graphsync): cleanup schema

23eea5a

hannahhoward force-pushed the feat/graphsync-cbor branch from f00894a to 23eea5a Compare January 22, 2021 22:47

warpfork approved these changes Jan 22, 2021

View reviewed changes

rvagg approved these changes Jan 28, 2021

View reviewed changes

hannahhoward mentioned this pull request Jul 24, 2021

Enabling deduplication when retrieving overlapping selectors filecoin-project/go-data-transfer#224

Open

hannahhoward mentioned this pull request Sep 28, 2021

Replace protobuf messages with IPLD ipfs/go-graphsync#88

Open

hannahhoward mentioned this pull request Oct 23, 2021

Optimized selector traversal over graphsync ipfs/go-graphsync#253

Open

mvdan mentioned this pull request Jan 6, 2022

WIP: add the ipld plus dag-cbor protocol v1.1 ipfs/go-graphsync#323

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Graphsync v1.1.0, encoded as CBOR #354

Graphsync v1.1.0, encoded as CBOR #354

hannahhoward commented Jan 22, 2021

rvagg Jan 22, 2021

rvagg Jan 22, 2021

warpfork Jan 22, 2021

hannahhoward Jan 22, 2021

rvagg commented Jan 22, 2021

warpfork Jan 22, 2021

warpfork Jan 22, 2021 •

edited

warpfork Jan 22, 2021 •

edited

warpfork Jan 22, 2021

rvagg Jan 28, 2021

warpfork commented Jan 22, 2021

warpfork commented Jan 22, 2021

hannahhoward commented Jan 22, 2021

hannahhoward commented Jan 22, 2021

hannahhoward commented Jan 22, 2021

warpfork left a comment

rvagg left a comment

warpfork commented Feb 22, 2021

warpfork commented Apr 11, 2021

		Cancel bool # whether this cancels a request
		Update bool # whether this is an update to an in progress request

		Prefix bytes # CID prefix (cid version, multicodec and multihash prefix (type + length)
		Data bytes

Graphsync v1.1.0, encoded as CBOR #354

Are you sure you want to change the base?

Graphsync v1.1.0, encoded as CBOR #354

Conversation

hannahhoward commented Jan 22, 2021

Goals

Implementation

For Discussion

rvagg Jan 22, 2021

Choose a reason for hiding this comment

rvagg Jan 22, 2021

Choose a reason for hiding this comment

warpfork Jan 22, 2021

Choose a reason for hiding this comment

hannahhoward Jan 22, 2021

Choose a reason for hiding this comment

rvagg commented Jan 22, 2021

warpfork Jan 22, 2021

Choose a reason for hiding this comment

warpfork Jan 22, 2021 • edited

Choose a reason for hiding this comment

warpfork Jan 22, 2021 • edited

Choose a reason for hiding this comment

warpfork Jan 22, 2021

Choose a reason for hiding this comment

rvagg Jan 28, 2021

Choose a reason for hiding this comment

warpfork commented Jan 22, 2021

warpfork commented Jan 22, 2021

hannahhoward commented Jan 22, 2021

hannahhoward commented Jan 22, 2021

hannahhoward commented Jan 22, 2021

warpfork left a comment

Choose a reason for hiding this comment

rvagg left a comment

Choose a reason for hiding this comment

warpfork commented Feb 22, 2021

warpfork commented Apr 11, 2021

warpfork Jan 22, 2021 •

edited

warpfork Jan 22, 2021 •

edited