Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPIP-431: Opt-in Extensible CAR Metadata on Trustless Gateway #431

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

bajtos
Copy link

@bajtos bajtos commented Aug 8, 2023

Define an optional enhancement of the CARv1 stream that allows a Gateway server to provide additional metadata about the CARv1 response. Introduce a new content type that allows the client and the server to signal or negotiate the inclusion of extra metadata.

The PR discussing a new multi-codec car-metadata: multiformats/multicodec#334

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>
Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>
@bajtos
Copy link
Author

bajtos commented Aug 8, 2023

@lidel @willscott Here is my proposal for allowing gateway clients to request the response to include a metadata block at the end.

This is my first IPIP. Please let me know what and how to improve, where to add more details, etc. Feel free to edit the text directly if you like (edits by maintainers are allowed).

@lidel lidel changed the title IPIP: CAR meta (content type parameter) IPIP-431: Opt-in Extensible CAR Metadata on Trustless Gateway Aug 8, 2023
Copy link
Member

@lidel lidel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for submitting this @bajtos.

I like the framing of this as an extensible opt-in CAR manifest. Quick thought:

  • Best case, it will solve multiple problems (retrieval attestation, interrupted streams) without inventing anything new (reusing CARv1, DAG-CBOR, CAR content type parameters from Trustless Gateway spec).
  • Worst case, will have niche utility, but will create a standard for the ecosystem on how random metadata can be passed between paid HTTP services, allowing CAR-aware clients to identify it and strip it out before storing in caches.

Made some editorial tweaks + quick first-pass feedback in comments inline.

Comment on lines 324 to 328
- `b3h` - Blake3 hash (checksum) of the CAR data (excluding the metadata block).
- `b3h_sig` - A signature over `<len><b3h><request>` using server's Ed2559 identity.
- `len` is encoded as `varint`,
- `b3h` is encoded as 32 bytes,
- The effective query as executed by the gateway. This query is the request url - path and query string arguments.
Copy link
Member

@lidel lidel Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These b3* fields are specific to SPARK retrieval attestation and should not be listed in the trustless gateway spec as a MUST. These may be mandatory for SPARK, but are optional for the rest of IPFS ecosystem.

Please move them to "User benefit" section of the IPIP document and explain how meta=eof enables SPARK use case by allowing for these custom signatures to be passed along with the data. It makes a good example of extensibility that does not require PL's permission.

ps. I know other services like dagHouse use different hash functions for getting "CAR CID", putting all bets on Blake3 feels like an unnecessary divergence.

Perhaps this could be made bit more future-proof and generic if blake3 is represented as Multihash wrapped in CIDv1+car codec (0x0202)? Just an idea, fine to ignore, given these are specific to SPARK.

Either way, this belongs to the "userland benefiting from metadata extensibility" story.

Suggested change
- `b3h` - Blake3 hash (checksum) of the CAR data (excluding the metadata block).
- `b3h_sig` - A signature over `<len><b3h><request>` using server's Ed2559 identity.
- `len` is encoded as `varint`,
- `b3h` is encoded as 32 bytes,
- The effective query as executed by the gateway. This query is the request url - path and query string arguments.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the Spark use case belongs in Userland section. However, the individual keys of the metadata section, and what the servers must do to implement them, feels like something that should be in the trustless gateway spec.

Keys like car_bytes, data_bytes, block_count will be used by Spark but also may be used by others, and the definition of a key (i.e. what does the server actually return as the value for each key) must be the same for each use case. E.g. if one use case sets data_bytes to be the total byte length of blocks and another use case sets it to be the total byte length of the CAR stream then trustless gateway implementers will need to implement different logic for each different use case.

What's more, for the Spark use case, we do not want gateway operators to know that they are serving a Spark request and not some other request. Since the Spark ones will be incentivised and other request may not be, servers may simply provide a good retrieval service to Spark clients and a poor service to other clients.

car_bytes, data_bytes, block_count seem generic enough. The troublesome one is then the Blake3Hash and signature.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps what we need to do is leave this IPIP to concern the metadata block being appended without any constraints on what can be included in it. Then in a separate place, we define a canonical way to include a key value object in the metadata block and how the server should implement certain useful keys such as car_bytes et al

Copy link
Member

@lidel lidel Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be ok to suggest a key name convention for generic things like car_bytes data_bytes and block_count in the section that described meta parameter, as long it is

  • scoped to JSON (perhaps list under explicit meta=eof[+json]?)
  • change requirement from MUST to SHOULD (convention, not a hard requirement)

I think it would be also ok to have a documented convention for passing a hash of the CAR stream (aka CAR CID) – maybe name it car_cid and use CIDv1 with 0x0202 codec – this convention is already used by .storage folks, no need to invent anything new.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be also ok to have a documented convention for passing a hash of the CAR stream (aka CAR CID) – maybe name it car_cid and use CIDv1 with 0x0202 codec – this convention is already used by .storage folks, no need to invent anything new.

+1 to document car_cid.

For SPARK, we specifically want a Blake3 hash so that we can use inclusion proofs. That's why we want to use a dedicated field b3checksum instead of a more generic car_cid.

Comment on lines 324 to 328
- `b3h` - Blake3 hash (checksum) of the CAR data (excluding the metadata block).
- `b3h_sig` - A signature over `<len><b3h><request>` using server's Ed2559 identity.
- `len` is encoded as `varint`,
- `b3h` is encoded as 32 bytes,
- The effective query as executed by the gateway. This query is the request url - path and query string arguments.
Copy link
Member

@lidel lidel Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Including content path and CAR export parameters feels generic enough to keep in the spec, but we should not mix content path with car and url parameters as it leads to bugs around things like percent-encoding especially where ? or / is involved (cc ipfs/gateway-conformance#115).

These should be three separate fields:

  • content_path - requested content path
  • dag_params - map with DAG params like dag-scope, entity-bytes from IPIP-402
  • car_params - map with CAR content type params like order and dups from IPIP-412

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidel what is the reasoning behind splitting up dag_params and car_params? Could we instead go for content_path and query_params to keep it simple and generic (allowing for other query params)?

Copy link
Member

@lidel lidel Sep 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that was a conscious design choice, to avoid mixing data selector with details of the transport format (not everything is in URL query params):

  • dag_params are about what user data was selected, not tied to any specific transport (could be applied to something other than CAR)
    • these are things that land in URL query
  • car_params are specific to CAR container format, they do not change the user data that was selected, only the way it is represented when sent as CAR
    • these are things that land in Accept/Content-Type headers
  • people read "query" and assume URL query :)

@patrickwoodhead that being said, if you want to simplify, this IPIP could go with a single dag-json map named response_params (to avoid confusion with URL query params, and account for the fact that server may ignore some of request params when producing a response)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using retrieval_params instead of response_params? In my mind, response parameters don't describe "what user data was selected".

What is our motivation in SPARK:
We want the gateway to describe what exactly the client is retrieving (CID, subpath, dag params, car params) and provide a signature over that.

If two SPARK checker clients submit a metadata block with the same retrieval parameters (CID, subpath, dag params, car params) then we want:

  • To be able to verify that the clients retrieved the DAG (sub)tree they were expected to retrieve.
  • Confidence that both clients made the same request to the gateway (semantically?) and were supposed to receive exactly the same response from the gateway.

content_path - requested content path

When the client requests GET /ipfs/bafy1234/cat.jpg, what is content_path?

  • /ipfs/bafy1234/cat.jpg
  • bafy1234/cat.jpg
  • /cat.jpg
  • something else?

We need the metadata block to describe both the CID requested (bafy1234) and the resource subpath if specified (/cat.jpg).

src/http-gateways/trustless-gateway.md Outdated Show resolved Hide resolved
IPIP/0431-gateway-car-trailer.md Outdated Show resolved Hide resolved
src/ipips/ipip-0431.md Show resolved Hide resolved
src/ipips/ipip-0431.md Outdated Show resolved Hide resolved
@rvagg
Copy link
Member

rvagg commented Aug 9, 2023

I don't think we're going to be able to do a new CID codec code for this (car-metadata), see discussion over in the multicodec thread. I don't really want to go into that here (there's been many keystrokes spilled over this question for a few years now). Unfortunately the limits of CARv1 keep on getting in our way and the focus on the blockstore use-case of CARv2 means it's not great as a transport format. But (aside from just doing CARv3 and fixing all the things) there's a few things we could consider:

  1. CAR with special last block, but not a special codec

If you supply ?meta then you're opting in to a special format. Clients that do this, like Lassie, would then get to assume that the last block likely has special meaning—we just need to attempt to pass it through a well-defined schema (see the one I proposed in the multicodec thread) and if it matches then strip it from the CARv1 payload and do <special thing> with the contents.

Unfortunately this option mixes up the metadata in the payload it's trying to describe, but at least we have mechanisms to signal the special nature of it.

  1. CAR with 0x00 end and metadata block following.

If we want to avoid the metadata being within the CARv1 payload. The ZeroLengthSectionAsEOF option that we have for go-car (I don't think we did it yet for js-car but it's straightforward) is already used inside Filecoin to delimit the end of a CAR body. We could make ?meta mean that you parse the CAR stream looking for the 0x00 || EOF to signal the end. Then any bytes beyond that can be whatever we need them to be—ideally a well-formed IPLD block conforming to a strict schema so we don't have opportunities for abuse.

One problem is that if you curl -o such a file with ?meta then you end up with a CAR that won't play nicely with most CAR tooling because ZeroLengthSectionAsEOF is a special library option (currently, we could add it to the go-car CLI).

  1. CARv2 with "metadata" index

CARv2 isn't really designed as a transport format, that's why we're using CARv1, but we may be able to squish it into a usable form to help with this problem. We have a "characteristics" field we can play with, and there's an index section we can also play with.

We can define a new index format very easily, and the CAR decoders will just complain if they can't read the index, which is fine for this purpose. We could define an index as a "trustless gateway metadata index"; which doesn't get the location data, but we can use it to put whatever we want into the trailer of the CARv2 stream—we could just encode well-defined IPLD block strictly conforming to a schema, to present this metadata and anything else we want.

The main problem is that CARv2 requires a "DataSize" in the header to tell us the length of the CARv1 payload, which we don't have up-front, and an "IndexOffset" to tell us where the index starts, which we don't have for the same reasons. We've used IndexOffset==0 to signal no-index to date. But we could use the characteristics field to deal with these problems. We just need to parse the CARv1 payload with the ZeroLengthSectionAsEOF option turned on, which we can use as the signal for the end of the CARv1 data. We could also make that characteristic bit tell us that if IndexOffset==0 but there are more bytes after the payload, to interpret it as an index. Then the CAR decoder would fine our new index format, containing our metadata.

We get to leave the CARv1 intact, in the same form that you would get it if you didn't turn meta parameter on, and we get a ~valid CARv2 format file that could be read with a standard CAR decoder .. with the caveat that there's a couple of things here that will probably make existing parsers bork at it until we update the code and ship new versions. Arguably an extension of CARv2 rather than hard break to the format. Current CAR tooling may have problems, but we could produce new versions that are fine with these things in a way that doesn't have to compromise on anything else.


As an aside, we could use any of these options to do our error signalling, which I'm pretty keen on having. A schema for this metadata block could be a union type of the metadata presented here or an error string, so we can log and possibly pass on that information up the chain (still likely having to do the bad-chunk ending thing for cache reasons).

@willscott
Copy link
Contributor

the keys / schema presented here, i think, should be considered an example. I would hope that it would be treated as an arbitrary key-value map of metadata objects, and that these key-values could be used to signal an error, could be used to signal an 'eof' signal, and/or could be used to provide additional check-sum attestation as described in the current text

@rvagg
Copy link
Member

rvagg commented Aug 10, 2023

I'm prototyping a form of option 1 above with Frisbii and Lassie, will let you know how it goes.

rvagg added a commit to ipld/frisbii that referenced this pull request Aug 10, 2023
rvagg added a commit to ipld/frisbii that referenced this pull request Aug 10, 2023
rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023
rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023
rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023
rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023
rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023
@rvagg
Copy link
Member

rvagg commented Aug 11, 2023

filecoin-project/lassie#378 and ipld/frisbii#15 demonstrate an approximation of the option 1 I presented above.

  • A meta=eof item in the Accept header is interpreted by the server as needing the trailing metadata to be sent, otherwise proceed as normal.
  • Server writes the CAR, then a NUL byte, then a dag-json (opted for dag-json because it's human readable in case someone inspects the output in confusion) metadata chunk that includes some information about the CAR stream. Response is sent with a Content-Type that includes meta=eof.
  • Client sees meta=eof in the Content-Type and reads the CAR using ZeroLengthSectionAsEOF; after that it slurps up the remaining bytes as dag-json into the metadata schema. It can then do <stuff> with it.

<stuff> that I've chosen to do at the moment is validate the car body according to the stats in the metadata and print out a "checksum" multihash that was provided by the server—this could be a signed checksum, or whatever it needs to be. There's also affordance for an error property in the metadata to signal that an error occurred and a message about it, so we get to see messages such as "block not found" now.

@rvagg
Copy link
Member

rvagg commented Aug 11, 2023

I'll acknowledge that it may be better to just go with the plain map approach, without a schema, as Will's suggested. That would even let us do novel things for specific situations like having Lassie tell Saturn about retrieval clients and their timings (currently can only use Server-Timings header which ends when the data starts). But there's a bit of a can of worms that we open up that I wondered if we could avoid by having a strong schema, at least for the first version. All of the things that http headers have to deal with - like what to do with duplicate keys, what limits we need to put on the sizes of things to avoid abuse, etc. Constraining within the bounds of dag-json, which itself is a bit strict, and having a schema, let's us be very clear about rules and avoid abuse.
But, maybe it's better to basically replicate the Trailers section but in the payload. 🤷

@lidel
Copy link
Member

lidel commented Aug 17, 2023

@rvagg prefixing metadata with NUL changes the scope of this IPIP, effectively moves metadata outside CARv1 serialization format that we have specs and compatible implementations, which makes it harder to argue reuse of the same content type as CAR.

Existing CARv1 implementations will error without explicit support for ZeroLengthSectionAsEOF, right?

This thing starts looking like a new content type, changing the scope of this IPIP to something similar to application/vnd.ipld.car-stream from "Alternatives" section.

Not saying it is bad, maybe a separate content type for streaming CARs is the right call here. It mitigates risks around mixing regular CAR responses with ones that include metadata trailer and causing issues on clients that don't support ZeroLengthSectionAsEOF, and poisoning HTTP caches in scenarios when a rogue Saturn L1 sends a fishy CAR response.

But i'm worried about duplicated effort across teams and project in light of CARv3, which (iiuc) also needs to happen some time in the next ~12-24 months and might have overlapping scope, solving similar problems.


@willscott @bajtos is this IPIP something we intend to expose on all gateways and support forever in the IPFS ecosystem, even when we have CARv3? Would this be intended for wrapping CARv2 and v3 too? Or is this just a stop-gap for Rhea/Boost internally until we have CARv3 with built-in metadata/eof support?

@bajtos bajtos marked this pull request as ready for review October 18, 2023 14:04
@bajtos bajtos requested a review from a team as a code owner October 18, 2023 14:04
@bajtos
Copy link
Author

bajtos commented Oct 18, 2023

Hello folks; thank you for your patience! Together with @patrickwoodhead, we incorporated your feedback and updated both the proposal and the spec.

We are ready for the next round of reviews. 🙏🏻

Comment on lines +372 to +375
"content_path": {
"description": "The url path in the request as executed by the gateway, e.g. `/ipfs/bafy1234/cat.jpg`. The query string MUST BE stripped from the path.",
"type": "string"
},
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion point leading back to #431 (comment):

How do we represent the information about what content was requested?

  • The CID
  • An optional path to a file inside UnixFS

Comment on lines +329 to +332
"data": {
"type": "object",
"description": "Properties of the response"
},
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion point:

In the current proposal, the top-level "data" object combines fields about "what was requested" (e.g. CAR & DAG params) with "what was returned" (e.g. CARv1 length in bytes).

I'd like to discuss an alternative: split data into two fields req and res. The first will describe what the client requested, the second will describe what the server returned.

Such division would allow us to shorten field names, e.g. data.car_params.dup can become req.dups.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting into req and res sgtm, improves clarity


When the parameter is not set or does not equal `eof+json`, the server SHOULD not add any extra blocks to the response, neither the 0x00 byte nor any metadata.

When `meta=eof+json`, the JSON object SHOULD conform to the following [JSON schema](https://json-schema.org/).
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion points:

  • In the current spec & IPIP, we are formatting metadata as JSON. Should we say DAG-JSON instead?
  • Do we want to serialise the metadata as a CAR block, prefixing the JSON data with varint | CID header?

Copy link
Member

@lidel lidel Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willscott @rvagg thoughts? Value added in DAG-JSON prefixed with own CID is that it allows client to detect truncation beyond 0x00 byte.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe clients can already easily detect truncation of the metadata block.

  • The block is a DAG-JSON object, it must start with { and end with a matching }.
  • If the block is truncated, it will not end with the matching } and the JSON parser will throw an error.

Comment on lines +191 to +193
TBD

Using one CID, request the CAR data using various combinations of content type parameters.
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging this TODO to show in the PR discussion.

- native truncation detection and standardized error handling and passing during streaming
- support for things like [Large Blocks](https://discuss.ipfs.tech/t/supporting-large-ipld-blocks/15093/)

TODO: link to some public artifact about CARv3
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging this TODO to show in the PR discussion.

Any suggestions for the artefacts I can link to?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aschmahmann do we have anything on GH?

@bajtos
Copy link
Author

bajtos commented Oct 18, 2023

@bajtos
Copy link
Author

bajtos commented Nov 6, 2023

@lidel @rvagg @willscott Ping 👋🏻 What's the best way to move this proposal forward?


- The metadata `sig` field SHOULD also be populated, returning a signature, using the server's Ed2559 identity, over the metadata properties object. This allows gateway clients to submit the metadata block as an attestation of retrieval that 3rd parties can verify.

### Compatibility
Copy link
Member

@lidel lidel Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bajtos Let's go extra mile here and elaborate what happens when CAR response with 0x00-prefixed suffix is parsed by existing CAR software.

My suggestion is to add some clear statement about expected interop, like "libraries and implementations SHOULD ignore the suffix after 0x00", otherwise we will create a bad UX/DX, where developer tries to debug things with existign tooling and the tooling errors.

I imagine we don't want things to fail due to 0x00 suffix, bare minimum being:

  • >80% of Amino DHT IPFS network (including IPFS Desktop and Brave) is Kubo
    • ipfs dag import should ignore suffix
  • reference CAR libraries ignore 0x00 by default
    • js-car (JS library used by things like custom Service Workers, Helia)
    • go-car v1 and v2 (GO libraries)
      • Caveat: I think @rvagg mentioned this may not be possible, because of Filecoin-specific logic present in the library?
  • CLI tools we recommend to developers, they will try to use these for debugging CAR responses with the suffix:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go extra mile here and elaborate what happens when CAR response with 0x00-prefixed suffix is parsed by existing CAR software.

It's a great idea to think about compatibility with existing & future tooling and clearly describe our thinking. 👍🏻

The most important aspect is avoiding the "0x00 insertion attack" vector. You can find more details in the section Zero-length-block insertion attacks (including the Filecoin-specific logic). I am cross-posting the mitigation I proposed:

Our proposal avoids this attack vector:

  • It does not change the current semantics of CARv1. Zero-length blocks remain invalid.
  • Instead, we treat the response body as a new container format combining the CARv1 file with additional data.
  • Clients must explicitly request this new container format. Existing clients not aware of the new metadata will not receive responses in the new format.

When developers use existing tooling, they will never receive a CAR file with the 0x00 suffix.

There are two major ways how a CAR with a 0x00 suffix can emerge:

  1. Somebody makes an HTTP request to a Trustless Gateway, explicitly asks to receive CAR with meta=eof+json, saves the response body to a .car file and forgets to extract the CAR payload from the container (remove the \x00{metadata} trailer).

  2. Somebody uses a tool that is aware of meta=eof+json. The tool opts into this new feature when requesting content from a Trustless Gateway, but does not extract the CAR payload from the container in the response body before returning the content back to the user.

I am arguing that (2) is a bug in the tooling, introduced by the change that modified Trustless Gateway requests to opt-into meta=eof+json, and therefore, the maintainers of that tool should fix that bug - make the tool adhere to spec.

Regarding (1): do you think this will happen frequently enough to justify the effort required to change all libraries you mentioned to start ignoring the 0x00 byte?

Maybe it's actually a good thing that the tooling reports an error because it tells the user they are using the new meta=eof+json feature incorrectly.

As an alternative to silently stripping the 0x00 suffix, the tooling can detect the situation where 0x00 is followed by a valid DAG-JSON object and report a more helpful error message to the user, advising them to either change the "accept" header in the request to the Trustless Gateway or else remove the 0x00 suffix (unpack CARv1 from the container format).

Thoughts?


go-car/cmd/car/inspect.go seems to always treat 0x00 as EOF, if I am reading the source code correctly:

https://github.com/ipld/go-car/blob/5c5d432d582564f88fd2124f2fce4f2f3e47a654/cmd/car/inspect.go#L26

	rd, err := carv2.NewReader(inStream, carv2.ZeroLengthSectionAsEOF(true))

js-car seems to always reject zero-length blocks:

https://github.com/ipld/js-car/blob/562c39266edda8422e471b7f83eadc8b7362ea0c/src/decoder.js#L94-L97

  let length = decodeVarint(await reader.upTo(8), reader)
  if (length === 0) {
    throw new Error('Invalid CAR section (zero length)')
  }

I guess I can test how existing tooling handles zero-length blocks and document this behaviour in the IPIP, so that we better understand the current landscape.

Comment on lines +368 to +371
"b3checksum": {
"description": "A Blake3 hash (checksum) of the CAR stream (excluding the 0x00 byte and the metadata block). The value should be serialized as a multihash with multibase prefix, preferably using Base58 encoding.",
"type": "string"
},
Copy link
Member

@lidel lidel Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bajtos What is the difference between car_cid and this field?

Hardcoding Blake3 in field name and description makes no sense if you use Multihash. It could use functions other than blake3 in the future.

To reduce future confusion, could this be renamed to car_checksum ? (and remove car_cid since it is redundant?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is your description of car_cid, see #431 (comment):

I think it would be also ok to have a documented convention for passing a hash of the CAR stream (aka CAR CID) – maybe name it car_cid and use CIDv1 with 0x0202 codec – this convention is already used by .storage folks, no need to invent anything new.

Regarding b3checksum:

For SPARK, we specifically need the Blake3 hash of the CAR stream, and we need gateways to always return this hash. In particular, clients cannot ask the server to use Blake3 for the CAR checksum because the server could use this information to detect SPARK clients vs. other clients and provide different quality of service.

I agree it's confusing to have both car_cid and b3checksum, but I don't see a better solution. Do you?

Comment on lines +356 to +359
"data_bytes": {
"description": "Total byte length of the flat file before it was encoded into a CAR file",
"type": "integer"
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bajtos what happens when returned CAR is for:

  • HAMT-sharded UnixFS directory?
  • a single file under some sub-path of HAMT-sharded UnixFS directory?

Is the semantic meaning here to be "raw bytes of all files, ignoring UnixFS directory metadata", or something else?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great questions! TBH, I don't know the answers. We don't need data_bytes for SPARK. I think this field was added based on the discussion in this proposal, but I could not find the specific comment requesting it.

I am proposing to remove data_field from the spec. We can introduce it later if there is a clear need. We will better understand the desired semantics at that point.

Comment on lines +337 to +339
"sig": {
"type": "string",
"description": "A signature, using the server's Ed2559 identity, over the `data` object serialized as JSON."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bajtos

  • HTTP Gateways have no concept of "sever ED25519" introduced here. How one verifies the signature without knowing the pubkey?
    • One way to avoid being prescriptive about key type or its location, is to have sig_key with CID-encoded public libp2p-key that can be used for signature verification.
      • The nice thing about this is that Gateway/client implementation will already have relevant code/library as we use these in IPNS and libp2p.
  • If you sign JSON, you want it to be deterministic variant like DAG-JSON, otherwise someone will run into bugs when they use less strict JSON library in different languages.
Suggested change
"sig": {
"type": "string",
"description": "A signature, using the server's Ed2559 identity, over the `data` object serialized as JSON."
"sig_pubkey": {
"type": "string",
"description": "A libp2p-key used for signing"
},
"sig": {
"type": "string",
"description": "A signature, using the `sig_pubkey`, over the `data` object serialized as DAG-JSON."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is our use case:

  1. An untrusted/permissionless client makes a retrieval request to the Storage Provider's booster-http address advertised in IPNI.
  2. The client submits the measurement to the SPARK orchestration layer.
  3. Later, SPARK's evaluation service wants to verify that the client contacted the SP.

To do so, we must not accept signatures from any identity, only the signature from the identity advertised by SP to IPNI.

I am arguing this is true for everybody else who wants to use the signature to verify that a metadata block submitted by an untrusted party was indeed produced by the expected Trustless Gateway.

Consider a simple attack vector: the attacker takes the metadata block produced by the origin gateway and replaces the signature with one created using the attacker's identity. Clients verifying the signature against the sig_pubkey field in the metadata will not notice the attack.

Now I can see how including sig_pubkey can simplify troubleshooting:

  • If sig_pubkey does not match the pubkey we expected, then we know the metadata block was signed by somebody else
  • If sig_pubkey matches but the signature does not, then we know the metadata block was modified from the original.
    Compare that with my proposal:
  • If the signature is not valid, then either the metadata block was tampered with or it was signed by a different identity.

IMO, this improvement is not worth the cost of increasing metadata block size and, thus, egress traffic for Trustless Gateways.

Do you have any other use case for the signature in your mind?

IMO, the clients making retrieval requests don't need this signature for validating the metadata block, as they can rely on guarantees provided by the underlying transport - HTTPS.

  • HTTP Gateways have no concept of "server ED25519" introduced here.

Good point. We don't require all Gateways to sign the metadata block, SPARK needs the signature only from Storage Providers' servers handling retrieval (booster-http).

Let's update the spec to explicitly mention the signature is an optional field.

How one verifies the signature without knowing the pubkey?

  • One way to avoid being prescriptive about key type or its location, is to have sig_key with CID-encoded public libp2p-key that can be used for signature verification.

As I wrote above, if you don't know the expected server identity, then the signature is not useful for you.

Having said that, I like the idea of adding more details about the identity/public key to the spec.

The proposed format CID-encoded public libp2p-key seems like a good candidate, although AFAICT, that's not the format advertised to IPNI. In IPNI, I see identities in the format that can be used in multiaddr's /p2p/{id} part:

12D3KooWAWHEbCQy22d45mKbKSewoB1xksDDhR7o5S4mDrSNKXNk
12D3KooWAy5kaLtHf5uS7PZVLjSYd8sGqJ6fn7bxMjqLLZ1uULp9
12D3KooWEiPRcfjXJVehty8okJGJpBZP8zM5UBoCK5yw2MXfx98x
12D3KooWFpv7LP1MUmjfQ8sAUXgJXG5FRMJLnqnJyR32fVboqspB
12D3KooWHKeaNCnYByQUMS2n5PAZ1KZ9xKXqsb4bhpxVJ6bBJg5V
12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f
12D3KooWSfsqUahHLCmiENT8oN4FkVtz5pSCxKtNEb7wrR1rrRjk
  • If you sign JSON, you want it to be deterministic variant like DAG-JSON, otherwise someone will run into bugs when they use less strict JSON library in different languages.

Makes sense; I'll update the spec to require the metadata to be a DAG-JSON.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 🏃 In Progress
Development

Successfully merging this pull request may close these issues.

None yet

6 participants