Skip to content
This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

Graphsync v1.1.0, encoded as CBOR #354

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

hannahhoward
Copy link
Collaborator

Goals

Move graphsync to a CBOR based protocol instead of one based on protobuf

Implementation

Graphsync messages should be valid IPLD data structures encoded in DAG-CBOR. I have constructed an IPLD schema to represent what they should look like.

I've also promoted Metadata out of an extension to a core part of the response message as opposed to an extension -- this is a frequent design issue with v1.0.0.

For Discussion

This is a first stab and I have many questions:

  1. I've assumed all structs encoded as maps, rather than tuples, with no field renames -- is this too inefficient?
  2. I've defined the extensions as a map of {string:Any} -- last I checked Any isn't a thing except when referring to link references. At the same time, I want to specify that an extension MUST be IPLD data which is why I did not use {string:bytes}.
  3. I prefixed everything with GraphSync... which makes things a bit verbose. But also, Priority as a type on its own for example feels pretty general. Where GraphSyncPriority makes it clear it's a priority from graphsync's point of view.

Graphsync network messages are encoded in DAG-CBOR. They have the following schema

```ipldsch
type GraphSyncExtensions {string:Any}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string should be String in this position. I think Any should be fine for now if it can't be enumerated or properly pinned down, Any just hasn't fully filtered through our specs or stack because it's quite tricky. The original pb has bytes so might that be a safe choice for now? It could be turned into a union later or enumerated in some other way if needed. What types of things go into this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, just re-read your comment about why you didn't use bytes! ignore that bit then, but I'm still interested in what kind of data this might contain and if it could be enumerated? could it be limited to scalar values to rule out maps and lists perhaps? we have AnyScalar for that case if it works.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirm Any should be fine.

(We still have some things to wade through in #318, it seems, but I think it's something we'll have to reach a conclusion on sooner or later, and I don't think there's any way we won't have an Any by the end.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely isn't AnyScalar. We already use arrays and maps.

There are some other challenges for actually writing this in code-- the string represents the extension name, and the type is based on that, but we support user defined extensions, so it's not possible to know everything ahead of time.

There is a whole concept in cbor-gen world of Deferred -- where you don't deserialize till later, leaving in byte representation till you know what you want to deserialize to. I'm not sure this is the right approach for IPLD. But this will probably come up as well with https://github.com/ipld/go-ipld-adl-hamt -- I see there you've defined an Any type, but many HAMTs have a specific type at the bottom -- people may want to paramaterize their HAMTs and use code gen'd nodes for the day in the leaves. We should probably figure out how this might work.

Anyway though, just putting Any here at least allows me to define the schema

@rvagg
Copy link
Member

rvagg commented Jan 22, 2021

Very nice. Is this going to be a breaking change, or an optional upgrade? I can't figure out if/how graphsync might do versioning but would it be a "can you talk v1.1.0? great, here's my DAG-CBOR messages"?

The map representation question might come down to how important shaving bytes off these messages is. It's nice having self-describing data, but it does introduce quite a bit of overhead to have the keys in there too.

If I take GraphSyncRequest and encode a simple form using this spec then it takes up 99 bytes. If I rename the keys down to single characters then it goes down to 62 bytes. As a tuple representation it's 48 bytes.

The problems with tuple representations:

  • They don't self-describe, you can't pluck a message off the wire and see what it might be doing (not a huge concern, just nice)
  • They are hard to change without some kind of versioning signal imposed from outside, you can add or remove items and version them implicitly by the list length, but it's hard (unadvised) to change types in the middle
  • You can't use optional elements (although I think we may end up allowing for the trailing element to be optional), which you could do in this current form if it made sense

So, not strong reasons to avoid tuple, but I think you're in the best position to suggest what might be best for graphsync.


type GraphSyncMetadatum struct {
Link &Any
BlockPresent bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/bool/Bool/

The rule is based on whether you're refering to a kind (which is a keyword) or a type (which are by convention TitleCase). (We're not quite the same as golang here, and don't have support them interchangably.)

type Foo int is a declaration using a kind keyword, so it's lowercase.

type Foo struct { Field Bool } is refering to a type, so it's using a name.

(There's an implicit prelude in a schema which defines type Bool bool, type String string, etc, so those are all available as a type name to make this less arduous.)

Comment on lines 128 to 129
Cancel bool # whether this cancels a request
Update bool # whether this is an update to an in progress request
Copy link
Contributor

@warpfork warpfork Jan 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 140 to 141
Prefix bytes # CID prefix (cid version, multicodec and multihash prefix (type + length)
Data bytes
Copy link
Contributor

@warpfork warpfork Jan 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


type GraphSyncRequest struct {
Id GraphSyncRequestID # unique id set on the requester side
Root &Any # a CID for the root node in the query
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kinda an interesting one. It's a link, alright, so this line as stated is certainly valid. But I'd like to think out loud about it for a moment.

While this is a link, it's not a link we'd ever really mean to traverse when processing the graphsync messages. It's something we expect the agent processing this message to pull out -- as a link, not immediately traversing it -- and then do some logic before it dives in to operations that load that link.

It was just "bytes" in the old protobuf, and that worked fine because of the above. And it seems we could readily still have it be Bytes in this revamp, too. In the old implementation with protobuf, the application logic did a step to reify those bytes into a CID. We could have that step be explicit in a new implementation too, if we wanted to.

This doesn't seem to make a huge difference either way. I guess the main discernible difference is that if someone ran a selector with a bunch of wildcards in it over the GraphSyncRequest itself, if this field is &Any, they might get a mouthful back, and if it's Bytes, the results wouldn't recurse as far. But does this matter? Not really, as far as I can imagine. (I don't know why someone would do that; and even if they did, selectors are full of mechanisms that can be used to control this.)

Alright, that's it for thinking out loud. This seems fine.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hah, that is an interesting point! probably doesn't matter here but worth keeping in mind that viewing this schema from a certain lens might lead to a misunderstanding about intent. &Any does provide some constraints on what the bytes can be, mainly the leading varints must make for a valid CID, but beyond that if you're not traversing then it really is bytes.

@warpfork
Copy link
Contributor

We also typically give fields lowercase or mixedCase names.

(When you feed this into codegen in golang, it's going to generate accessor methods with appropriate export case.)

@warpfork
Copy link
Contributor

Syntax nittery aside, this logically LGTM :D

@hannahhoward
Copy link
Collaborator Author

Re: lowercase names -- FYI, this is the opposite of cbor-gen, though I guess I could use the rename facility to handle this. Actually, usually cbor-gen uses tuple representation so naming is irrelevant. Unfortunately, when using map representation it does not allow renames. I have a PR to do this but it remains unmerged. (whyrusleeping/cbor-gen#46)

Sorry to keep referencing cbor-gen -- but if go-ipld-prime is to get merged through the filecoin ecosystem (which is my hope eventually) , these two will need to move toward compatibility.

@hannahhoward
Copy link
Collaborator Author

@rvagg yes the intent is to support both v1.0.0 and v1.1.0 for a time -- protocol negotiation is done at the libp2p level - the protocol name is already versioned, so we simply add a fallback if 1.1.0 doesn't work. We do this in Bitswap and several other protocols.

@hannahhoward
Copy link
Collaborator Author

@rvagg & @warpfork I believe I've addressed the comments:

  1. Fixed capitalization
  2. Settled on map in most cases except for tuple for metadata cause it's 2 fields and there are reasons to compress it maximally. put fields in lowercase while making abbreviated renames in uppercase as a size compromise. Generally, any time blocks are present the block size will likely dwarf the rest of the message
  3. While i think this is mostly good to go, I'd like to actually do an implementation in go-graphsync before we merge it, to get some real world testing (though that may end up being a bit)
  4. There's an incoming follow-on PR for some other features and implementation details I want to iron out coming shortly.

Copy link
Contributor

@warpfork warpfork left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty darn reasonable to me!

Copy link
Member

@rvagg rvagg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the schema works really well as a documentation tool here 👍

@warpfork
Copy link
Contributor

What's still needed to advance this? At last read, both this and #355 (I assume that's the other follow-up PR you mentioned) look reasonable to me.

If this is pending an implementation, and we don't expect that to happen presently, should we... perhaps merge both these PRs, but then also relocate the file and give it a header to mark it explicitly as a future-oriented draft? (I'm happy to do the git-fu and commenting on that, if that's the path forward.)

@warpfork
Copy link
Contributor

Ping for if we can merge this :)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants