IPIP-431: Opt-in Extensible CAR Metadata on Trustless Gateway #431

bajtos · 2023-08-08T16:00:03Z

Define an optional enhancement of the CARv1 stream that allows a Gateway server to provide additional metadata about the CARv1 response. Introduce a new content type that allows the client and the server to signal or negotiate the inclusion of extra metadata.

~~The PR discussing a new multi-codec car-metadata: multiformats/multicodec#334~~

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>

bajtos · 2023-08-08T16:04:23Z

@lidel @willscott Here is my proposal for allowing gateway clients to request the response to include a metadata block at the end.

This is my first IPIP. Please let me know what and how to improve, where to add more details, etc. Feel free to edit the text directly if you like (edits by maintainers are allowed).

lidel

Thank you for submitting this @bajtos.

I like the framing of this as an extensible opt-in CAR manifest. Quick thought:

Best case, it will solve multiple problems (retrieval attestation, interrupted streams) without inventing anything new (reusing CARv1, DAG-CBOR, CAR content type parameters from Trustless Gateway spec).
Worst case, will have niche utility, but will create a standard for the ecosystem on how random metadata can be passed between paid HTTP services, allowing CAR-aware clients to identify it and strip it out before storing in caches.

Made some editorial tweaks + quick first-pass feedback in comments inline.

lidel · 2023-08-08T17:55:12Z

src/http-gateways/trustless-gateway.md

+- `b3h` - Blake3 hash (checksum) of the CAR data (excluding the metadata block).
+- `b3h_sig` - A signature over `<len><b3h><request>` using server's Ed25519 identity.
+  - `len` is encoded as `varint`,
+  - `b3h` is encoded as 32 bytes,
+  - The effective query as executed by the gateway. This query is the request url - path and query string arguments.


These b3* fields are specific to SPARK retrieval attestation and should not be listed in the trustless gateway spec as a MUST. These may be mandatory for SPARK, but are optional for the rest of IPFS ecosystem.

Please move them to "User benefit" section of the IPIP document and explain how meta=eof enables SPARK use case by allowing for these custom signatures to be passed along with the data. It makes a good example of extensibility that does not require PL's permission.

ps. I know other services like dagHouse use different hash functions for getting "CAR CID", putting all bets on Blake3 feels like an unnecessary divergence.

Perhaps this could be made a bit more future-proof and generic if blake3 is represented as Multihash wrapped in CIDv1+car codec (0x0202)? Just an idea, fine to ignore, given these are specific to SPARK.

Either way, this belongs to the "userland benefiting from metadata extensibility" story.

Suggested change

- `b3h` - Blake3 hash (checksum) of the CAR data (excluding the metadata block).

- `b3h_sig` - A signature over `<len><b3h><request>` using server's Ed25519 identity.

- `len` is encoded as `varint`,

- `b3h` is encoded as 32 bytes,

- The effective query as executed by the gateway. This query is the request url - path and query string arguments.

I agree that the Spark use case belongs in the Userland section. However, the individual keys of the metadata section, and what the servers must do to implement them, feel like something that should be in the trustless gateway spec.

Keys like car_bytes, data_bytes, block_count will be used by Spark but also may be used by others, and the definition of a key (i.e. what does the server actually return as the value for each key) must be the same for each use case. E.g. if one use case sets data_bytes to be the total byte length of blocks and another use case sets it to be the total byte length of the CAR stream then trustless gateway implementers will need to implement different logic for each different use case.

What's more, for the Spark use case, we do not want gateway operators to know that they are serving a Spark request and not some other request. Since the Spark requests will be incentivised and other requests may not be, servers could provide a good retrieval service to Spark clients and a poor service to other clients.

car_bytes, data_bytes, block_count seem generic enough. The troublesome one is then the Blake3Hash and signature.

Perhaps what we need to do is leave this IPIP to concern the metadata block being appended without any constraints on what can be included in it. Then, in a separate place, we define a canonical way to include a key-value object in the metadata block and how the server should implement certain useful keys such as car_bytes et al.

I think it would be ok to suggest a key name convention for generic things like car_bytes, data_bytes and block_count in the section that describes the meta parameter, as long as it is scoped to JSON (perhaps listed under an explicit meta=eof[+json]?) and the requirement changes from MUST to SHOULD (a convention, not a hard requirement).

I think it would be also ok to have a documented convention for passing a hash of the CAR stream (aka CAR CID) – maybe name it car_cid and use CIDv1 with 0x0202 codec – this convention is already used by .storage folks, no need to invent anything new.

+1 to document car_cid.

For SPARK, we specifically want a Blake3 hash so that we can use inclusion proofs. That's why we want to use a dedicated field b3checksum instead of a more generic car_cid.

lidel · 2023-08-08T18:00:26Z

src/http-gateways/trustless-gateway.md

+- `b3h` - Blake3 hash (checksum) of the CAR data (excluding the metadata block).
+- `b3h_sig` - A signature over `<len><b3h><request>` using server's Ed25519 identity.
+  - `len` is encoded as `varint`,
+  - `b3h` is encoded as 32 bytes,
+  - The effective query as executed by the gateway. This query is the request url - path and query string arguments.


Including content path and CAR export parameters feels generic enough to keep in the spec, but we should not mix content path with car and url parameters as it leads to bugs around things like percent-encoding especially where ? or / is involved (cc ipfs/gateway-conformance#115).

These should be three separate fields:

content_path - requested content path

dag_params - map with DAG params like dag-scope, entity-bytes from IPIP-402

car_params - map with CAR content type params like order and dups from IPIP-412

@lidel what is the reasoning behind splitting up dag_params and car_params? Could we instead go for content_path and query_params to keep it simple and generic (allowing for other query params)?

I believe that was a conscious design choice, to avoid mixing data selector with details of the transport format (not everything is in URL query params):

dag_params are about what user data was selected, not tied to any specific transport (could be applied to something other than CAR)

these are things that land in URL query

car_params are specific to CAR container format, they do not change the user data that was selected, only the way it is represented when sent as CAR

these are things that land in Accept/Content-Type headers

people read "query" and assume URL query :)
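The split described above can be sketched in a few lines. This is only an illustration: `describeRequest` is a hypothetical helper, the field names follow the three fields proposed in this thread, and the sketch assumes a single media type in the `Accept` header.

```javascript
// Hypothetical helper: describe a request using the three proposed fields.
function describeRequest(requestTarget, acceptHeader) {
  // A placeholder base lets URL parse a path-only request target.
  const url = new URL(requestTarget, 'http://gateway.invalid');

  // dag_params: data selectors that arrive in the URL query (IPIP-402).
  const dagParams = {};
  for (const name of ['dag-scope', 'entity-bytes']) {
    const value = url.searchParams.get(name);
    if (value !== null) dagParams[name] = value;
  }

  // car_params: CAR content type parameters from the Accept header (IPIP-412).
  // Simplification: assumes a single media type in the header.
  const carParams = {};
  for (const part of acceptHeader.split(';').slice(1)) {
    const [name, value] = part.trim().split('=');
    if (name === 'order' || name === 'dups') carParams[name] = value;
  }

  // The path keeps its percent-encoding, so an encoded `?` or `/`
  // is never confused with a query or path separator.
  return { content_path: url.pathname, dag_params: dagParams, car_params: carParams };
}
```

Keeping the three values apart means a percent-encoded `?` inside the path stays in `content_path` and never leaks into the parameter maps.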

@patrickwoodhead that being said, if you want to simplify, this IPIP could go with a single dag-json map named response_params (to avoid confusion with URL query params, and account for the fact that server may ignore some of request params when producing a response)

How about using retrieval_params instead of response_params? In my mind, response parameters don't describe "what user data was selected".

What is our motivation in SPARK:
We want the gateway to describe what exactly the client is retrieving (CID, subpath, dag params, car params) and provide a signature over that.

If two SPARK checker clients submit a metadata block with the same retrieval parameters (CID, subpath, dag params, car params) then we want:

To be able to verify that the clients retrieved the DAG (sub)tree they were expected to retrieve.

Confidence that both clients made the same request to the gateway (semantically?) and were supposed to receive exactly the same response from the gateway.

content_path - requested content path

When the client requests GET /ipfs/bafy1234/cat.jpg, what is content_path?

/ipfs/bafy1234/cat.jpg

bafy1234/cat.jpg

/cat.jpg

something else?

We need the metadata block to describe both the CID requested (bafy1234) and the resource subpath if specified (/cat.jpg).
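To make the options concrete, here is a small sketch that splits a content path into its parts. `splitContentPath` and the returned property names are hypothetical; the thread has not settled which of these parts `content_path` should carry.

```javascript
// Hypothetical helper: split a content path into namespace, root, and subpath.
function splitContentPath(contentPath) {
  const match = /^\/(ipfs|ipns)\/([^/]+)(\/.*)?$/.exec(contentPath);
  if (match === null) throw new Error(`Not a content path: ${contentPath}`);
  // The subpath is optional; default to an empty string when it is absent.
  const [, namespace, root, subpath = ''] = match;
  return { namespace, root, subpath };
}
```

For `/ipfs/bafy1234/cat.jpg` this yields the root `bafy1234` and the subpath `/cat.jpg`, so either representation can be derived from the full path.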

rvagg · 2023-08-09T09:11:02Z

I don't think we're going to be able to do a new CID codec code for this (car-metadata), see discussion over in the multicodec thread. I don't really want to go into that here (there's been many keystrokes spilled over this question for a few years now). Unfortunately the limits of CARv1 keep on getting in our way and the focus on the blockstore use-case of CARv2 means it's not great as a transport format. But (aside from just doing CARv3 and fixing all the things) there's a few things we could consider:

CAR with special last block, but not a special codec

If you supply ?meta then you're opting in to a special format. Clients that do this, like Lassie, would then get to assume that the last block likely has special meaning—we just need to attempt to pass it through a well-defined schema (see the one I proposed in the multicodec thread) and if it matches then strip it from the CARv1 payload and do <special thing> with the contents.

Unfortunately this option mixes up the metadata in the payload it's trying to describe, but at least we have mechanisms to signal the special nature of it.

CAR with 0x00 end and metadata block following.

If we want to avoid the metadata being within the CARv1 payload, we can build on the ZeroLengthSectionAsEOF option that we have for go-car (I don't think we did it yet for js-car but it's straightforward), which is already used inside Filecoin to delimit the end of a CAR body. We could make ?meta mean that you parse the CAR stream looking for the 0x00 || EOF to signal the end. Then any bytes beyond that can be whatever we need them to be, ideally a well-formed IPLD block conforming to a strict schema so we don't have opportunities for abuse.

One problem is that if you curl -o such a file with ?meta then you end up with a CAR that won't play nicely with most CAR tooling because ZeroLengthSectionAsEOF is a special library option (currently, we could add it to the go-car CLI).

CARv2 with "metadata" index

CARv2 isn't really designed as a transport format, that's why we're using CARv1, but we may be able to squish it into a usable form to help with this problem. We have a "characteristics" field we can play with, and there's an index section we can also play with.

We can define a new index format very easily, and the CAR decoders will just complain if they can't read the index, which is fine for this purpose. We could define an index as a "trustless gateway metadata index"; which doesn't get the location data, but we can use it to put whatever we want into the trailer of the CARv2 stream—we could just encode well-defined IPLD block strictly conforming to a schema, to present this metadata and anything else we want.

The main problem is that CARv2 requires a "DataSize" in the header to tell us the length of the CARv1 payload, which we don't have up-front, and an "IndexOffset" to tell us where the index starts, which we don't have for the same reasons. We've used IndexOffset==0 to signal no-index to date. But we could use the characteristics field to deal with these problems. We just need to parse the CARv1 payload with the ZeroLengthSectionAsEOF option turned on, which we can use as the signal for the end of the CARv1 data. We could also make that characteristic bit tell us that if IndexOffset==0 but there are more bytes after the payload, to interpret it as an index. Then the CAR decoder would find our new index format, containing our metadata.

We get to leave the CARv1 intact, in the same form that you would get it if you didn't turn meta parameter on, and we get a ~valid CARv2 format file that could be read with a standard CAR decoder .. with the caveat that there's a couple of things here that will probably make existing parsers bork at it until we update the code and ship new versions. Arguably an extension of CARv2 rather than hard break to the format. Current CAR tooling may have problems, but we could produce new versions that are fine with these things in a way that doesn't have to compromise on anything else.

As an aside, we could use any of these options to do our error signalling, which I'm pretty keen on having. A schema for this metadata block could be a union type of the metadata presented here or an error string, so we can log and possibly pass on that information up the chain (still likely having to do the bad-chunk ending thing for cache reasons).

willscott · 2023-08-09T09:16:38Z

the keys / schema presented here, i think, should be considered an example. I would hope that it would be treated as an arbitrary key-value map of metadata objects, and that these key-values could be used to signal an error, could be used to signal an 'eof' signal, and/or could be used to provide additional check-sum attestation as described in the current text

rvagg · 2023-08-10T07:43:56Z

I'm prototyping a form of option 1 above with Frisbii and Lassie, will let you know how it goes.

rvagg · 2023-08-11T07:48:36Z

filecoin-project/lassie#378 and ipld/frisbii#15 demonstrate an approximation of the option 1 I presented above.

A meta=eof item in the Accept header is interpreted by the server as needing the trailing metadata to be sent, otherwise proceed as normal.
Server writes the CAR, then a NUL byte, then a dag-json (opted for dag-json because it's human readable in case someone inspects the output in confusion) metadata chunk that includes some information about the CAR stream. Response is sent with a Content-Type that includes meta=eof.
Client sees meta=eof in the Content-Type and reads the CAR using ZeroLengthSectionAsEOF; after that it slurps up the remaining bytes as dag-json into the metadata schema. It can then do <stuff> with it.

<stuff> that I've chosen to do at the moment is validate the car body according to the stats in the metadata and print out a "checksum" multihash that was provided by the server—this could be a signed checksum, or whatever it needs to be. There's also affordance for an error property in the metadata to signal that an error occurred and a message about it, so we get to see messages such as "block not found" now.
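The client side of this flow can be sketched roughly as follows. This is not the Lassie or Frisbii code: `readVarint` and `splitCarWithMetadata` are hypothetical names, the sketch only walks the varint length prefixes (it does not decode the CAR header or the CIDs), and it treats the DAG-JSON trailer as plain JSON.

```javascript
// Decode an unsigned LEB128 varint at `offset`; returns [value, nextOffset].
function readVarint(bytes, offset) {
  let value = 0;
  let shift = 0;
  for (let i = offset; i < bytes.length; i++) {
    value += (bytes[i] & 0x7f) * 2 ** shift;
    if ((bytes[i] & 0x80) === 0) return [value, i + 1];
    shift += 7;
  }
  throw new Error('Truncated varint');
}

// Split a meta=eof response body into the CARv1 payload and the metadata.
function splitCarWithMetadata(body) {
  let offset = 0;
  let sections = 0; // the CAR header plus every block
  while (offset < body.length) {
    const [length, next] = readVarint(body, offset);
    if (length === 0) {
      // A zero-length section marks the end of the CARv1 payload.
      if (sections === 0) throw new Error('CAR header is missing');
      const car = body.subarray(0, offset);
      const metadata = JSON.parse(new TextDecoder().decode(body.subarray(next)));
      return { car, metadata, blockCount: sections - 1 };
    }
    if (next + length > body.length) throw new Error('Truncated CAR section');
    offset = next + length;
    sections++;
  }
  throw new Error('No 0x00 terminator found, metadata trailer is missing');
}
```

A client could then compare `car.length` and `blockCount` with whatever stats the server reported in the metadata, which is the kind of validation described above.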

rvagg · 2023-08-11T10:11:13Z

I'll acknowledge that it may be better to just go with the plain map approach, without a schema, as Will's suggested. That would even let us do novel things for specific situations like having Lassie tell Saturn about retrieval clients and their timings (currently can only use Server-Timings header which ends when the data starts). But there's a bit of a can of worms that we open up that I wondered if we could avoid by having a strong schema, at least for the first version. We would inherit all of the things that http headers have to deal with, like what to do with duplicate keys, what limits we need to put on the sizes of things to avoid abuse, etc. Constraining within the bounds of dag-json, which itself is a bit strict, and having a schema, lets us be very clear about rules and avoid abuse.
But, maybe it's better to basically replicate the Trailers section but in the payload. 🤷

lidel · 2023-08-17T22:08:50Z

@rvagg prefixing metadata with NUL changes the scope of this IPIP: it effectively moves metadata outside the CARv1 serialization format, for which we have specs and compatible implementations, and that makes it harder to argue for reuse of the same content type as CAR.

Existing CARv1 implementations will error without explicit support for ZeroLengthSectionAsEOF, right?

This thing starts looking like a new content type, changing the scope of this IPIP to something similar to application/vnd.ipld.car-stream from "Alternatives" section.

Not saying it is bad, maybe a separate content type for streaming CARs is the right call here. It mitigates risks around mixing regular CAR responses with ones that include metadata trailer and causing issues on clients that don't support ZeroLengthSectionAsEOF, and poisoning HTTP caches in scenarios when a rogue Saturn L1 sends a fishy CAR response.

But I'm worried about duplicated effort across teams and projects in light of CARv3, which (iiuc) also needs to happen some time in the next ~12-24 months and might have overlapping scope, solving similar problems.

@willscott @bajtos is this IPIP something we intend to expose on all gateways and support forever in the IPFS ecosystem, even when we have CARv3? Would this be intended for wrapping CARv2 and v3 too? Or is this just a stop-gap for Rhea/Boost internally until we have CARv3 with built-in metadata/eof support?

Co-authored-by: Miroslav Bajtoš <oss@bajtos.net>

meta=eof+json update

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>

bajtos · 2023-10-18T14:05:50Z

Hello folks; thank you for your patience! Together with @patrickwoodhead, we incorporated your feedback and updated both the proposal and the spec.

We are ready for the next round of reviews. 🙏🏻

bajtos · 2023-10-18T14:07:36Z

src/http-gateways/trustless-gateway.md

+    "content_path": {
+      "description": "The url path in the request as executed by the gateway, e.g. `/ipfs/bafy1234/cat.jpg`. The query string MUST BE stripped from the path.",
+      "type": "string"
+    },


Discussion point leading back to #431 (comment):

How do we represent the information about what content was requested?

The CID

An optional path to a file inside UnixFS

bajtos · 2023-10-18T14:10:28Z

src/http-gateways/trustless-gateway.md

+    "data": {
+      "type": "object",
+      "description": "Properties of the response"
+    },


Discussion point:

In the current proposal, the top-level "data" object combines fields about "what was requested" (e.g. CAR & DAG params) with "what was returned" (e.g. CARv1 length in bytes).

I'd like to discuss an alternative: split data into two fields req and res. The first will describe what the client requested, the second will describe what the server returned.

Such a division would allow us to shorten field names, e.g. data.car_params.dups can become req.dups.

Splitting into req and res sgtm, improves clarity

bajtos · 2023-10-18T14:15:35Z

src/http-gateways/trustless-gateway.md

+
+When the parameter is not set or does not equal `eof+json`, the server SHOULD NOT add any extra blocks to the response, neither the 0x00 byte nor any metadata.
+
+When `meta=eof+json`, the JSON object SHOULD conform to the following [JSON schema](https://json-schema.org/).


Discussion points:

In the current spec & IPIP, we are formatting metadata as JSON. Should we say DAG-JSON instead?

Do we want to serialise the metadata as a CAR block, prefixing the JSON data with varint | CID header?

@willscott @rvagg thoughts? The value added by DAG-JSON prefixed with its own CID is that it allows the client to detect truncation beyond the 0x00 byte.

I believe clients can already easily detect truncation of the metadata block.

The block is a DAG-JSON object, it must start with { and end with a matching }.

If the block is truncated, it will not end with the matching } and the JSON parser will throw an error.
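A minimal sketch of that check, assuming the trailer has already been decoded to a string; `parseMetadataTrailer` is a hypothetical name and plain `JSON.parse` stands in for a DAG-JSON decoder.

```javascript
// Hypothetical helper: parse the metadata trailer and report truncation.
function parseMetadataTrailer(text) {
  try {
    const value = JSON.parse(text);
    // The trailer must be an object, not a bare number, string, or array.
    if (value === null || typeof value !== 'object' || Array.isArray(value)) {
      return { ok: false, reason: 'metadata is not a JSON object' };
    }
    return { ok: true, value };
  } catch (err) {
    // A missing closing brace, or any cut inside the object, lands here.
    return { ok: false, reason: `metadata is truncated or malformed: ${err.message}` };
  }
}
```

Any cut inside the object, including one that removes only the final `}`, makes the parser throw, so the client can tell the trailer is incomplete without an extra length prefix or CID.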

src/http-gateways/trustless-gateway.md

bajtos · 2023-10-18T14:20:40Z

src/ipips/ipip-0431.md

+TBD
+
+Using one CID, request the CAR data using various combinations of content type parameters.


Flagging this TODO to show in the PR discussion.

bajtos · 2023-10-18T14:21:03Z

src/ipips/ipip-0431.md

+- native truncation detection and standardized error handling and passing during streaming
+- support for things like [Large Blocks](https://discuss.ipfs.tech/t/supporting-large-ipld-blocks/15093/)
+
+TODO: link to some public artifact about CARv3


Flagging this TODO to show in the PR discussion.

Any suggestions for the artefacts I can link to?

@aschmahmann do we have anything on GH?

bajtos · 2023-10-18T14:21:48Z

Previews of the current version:

bajtos · 2023-11-06T11:22:40Z

@lidel @rvagg @willscott Ping 👋🏻 What's the best way to move this proposal forward?

lidel · 2023-11-07T01:17:42Z

src/ipips/ipip-0431.md

+
+- The metadata `sig` field SHOULD also be populated, returning a signature, using the server's Ed25519 identity, over the metadata properties object. This allows gateway clients to submit the metadata block as an attestation of retrieval that 3rd parties can verify.
+
+### Compatibility


@bajtos Let's go the extra mile here and elaborate on what happens when a CAR response with a 0x00-prefixed suffix is parsed by existing CAR software.

My suggestion is to add a clear statement about expected interop, like "libraries and implementations SHOULD ignore the suffix after 0x00". Otherwise we will create a bad UX/DX, where a developer tries to debug things with existing tooling and the tooling errors.

I imagine we don't want things to fail due to 0x00 suffix, bare minimum being:

- more than 80% of Amino DHT IPFS network (including IPFS Desktop and Brave) is Kubo
  - ipfs dag import should ignore suffix
- reference CAR libraries ignore 0x00 by default
  - js-car (JS library used by things like custom Service Workers, Helia)
  - go-car v1 and v2 (Go libraries)
    - Caveat: I think @rvagg mentioned this may not be possible, because of Filecoin-specific logic present in the library?
- CLI tools we recommend to developers, they will try to use these for debugging CAR responses with the suffix:
  - ipfs-car (JS CLI)
  - car (CLI)
  - go-fixtureplate (CLI)

Let's go extra mile here and elaborate what happens when CAR response with 0x00-prefixed suffix is parsed by existing CAR software.

It's a great idea to think about compatibility with existing & future tooling and clearly describe our thinking. 👍🏻

The most important aspect is avoiding the "0x00 insertion attack" vector. You can find more details in the section Zero-length-block insertion attacks (including the Filecoin-specific logic). I am cross-posting the mitigation I proposed:

Our proposal avoids this attack vector:

It does not change the current semantics of CARv1. Zero-length blocks remain invalid.

Instead, we treat the response body as a new container format combining the CARv1 file with additional data.

Clients must explicitly request this new container format. Existing clients not aware of the new metadata will not receive responses in the new format.

When developers use existing tooling, they will never receive a CAR file with the 0x00 suffix.

There are two major ways how a CAR with a 0x00 suffix can emerge:

1. Somebody makes an HTTP request to a Trustless Gateway, explicitly asks to receive CAR with meta=eof+json, saves the response body to a .car file and forgets to extract the CAR payload from the container (remove the \x00{metadata} trailer).

2. Somebody uses a tool that is aware of meta=eof+json. The tool opts into this new feature when requesting content from a Trustless Gateway, but does not extract the CAR payload from the container in the response body before returning the content back to the user.

I am arguing that (2) is a bug in the tooling, introduced by the change that modified Trustless Gateway requests to opt-into meta=eof+json, and therefore, the maintainers of that tool should fix that bug - make the tool adhere to spec.

Regarding (1): do you think this will happen frequently enough to justify the effort required to change all libraries you mentioned to start ignoring the 0x00 byte?

Maybe it's actually a good thing that the tooling reports an error because it tells the user they are using the new meta=eof+json feature incorrectly.

As an alternative to silently stripping the 0x00 suffix, the tooling can detect the situation where 0x00 is followed by a valid DAG-JSON object and report a more helpful error message to the user, advising them to either change the "accept" header in the request to the Trustless Gateway or else remove the 0x00 suffix (unpack CARv1 from the container format).

Thoughts?

go-car/cmd/car/inspect.go seems to always treat 0x00 as EOF, if I am reading the source code correctly:

https://github.com/ipld/go-car/blob/5c5d432d582564f88fd2124f2fce4f2f3e47a654/cmd/car/inspect.go#L26

rd, err := carv2.NewReader(inStream, carv2.ZeroLengthSectionAsEOF(true))

js-car seems to always reject zero-length blocks:

https://github.com/ipld/js-car/blob/562c39266edda8422e471b7f83eadc8b7362ea0c/src/decoder.js#L94-L97

    let length = decodeVarint(await reader.upTo(8), reader)
    if (length === 0) {
      throw new Error('Invalid CAR section (zero length)')
    }

I guess I can test how existing tooling handles zero-length blocks and document this behaviour in the IPIP, so that we better understand the current landscape.

lidel · 2023-11-07T01:28:26Z

src/http-gateways/trustless-gateway.md

+    "b3checksum": {
+      "description": "A Blake3 hash (checksum) of the CAR stream (excluding the 0x00 byte and the metadata block). The value should be serialized as a multihash with multibase prefix, preferably using Base58 encoding.",
+      "type": "string"
+    },


@bajtos What is the difference between car_cid and this field?

Hardcoding Blake3 in field name and description makes no sense if you use Multihash. It could use functions other than blake3 in the future.

To reduce future confusion, could this be renamed to car_checksum ? (and remove car_cid since it is redundant?)

Here is your description of car_cid, see #431 (comment):

I think it would be also ok to have a documented convention for passing a hash of the CAR stream (aka CAR CID) – maybe name it car_cid and use CIDv1 with 0x0202 codec – this convention is already used by .storage folks, no need to invent anything new.

Regarding b3checksum:

For SPARK, we specifically need the Blake3 hash of the CAR stream, and we need gateways to always return this hash. In particular, clients cannot ask the server to use Blake3 for the CAR checksum because the server could use this information to detect SPARK clients vs. other clients and provide different quality of service.

I agree it's confusing to have both car_cid and b3checksum, but I don't see a better solution. Do you?

lidel · 2023-11-07T01:35:32Z

src/http-gateways/trustless-gateway.md

+    "data_bytes": {
+      "description": "Total byte length of the flat file before it was encoded into a CAR file",
+      "type": "integer"
+    },


@bajtos what happens when returned CAR is for:

HAMT-sharded UnixFS directory?

a single file under some sub-path of HAMT-sharded UnixFS directory?

Is the semantic meaning here to be "raw bytes of all files, ignoring UnixFS directory metadata", or something else?

Great questions! TBH, I don't know the answers. We don't need data_bytes for SPARK. I think this field was added based on the discussion in this proposal, but I could not find the specific comment requesting it.

I am proposing to remove data_bytes from the spec. We can introduce it later if there is a clear need. We will better understand the desired semantics at that point.

lidel · 2023-11-07T01:49:22Z

src/http-gateways/trustless-gateway.md

+    "sig": {
+      "type": "string",
+      "description": "A signature, using the server's Ed25519 identity, over the `data` object serialized as JSON."


@bajtos

HTTP Gateways have no concept of the "server Ed25519" identity introduced here. How does one verify the signature without knowing the pubkey?

One way to avoid being prescriptive about key type or its location, is to have sig_key with CID-encoded public libp2p-key that can be used for signature verification.

The nice thing about this is that Gateway/client implementation will already have relevant code/library as we use these in IPNS and libp2p.

If you sign JSON, you want it to be a deterministic variant like DAG-JSON, otherwise someone will run into bugs when they use a less strict JSON library in a different language.

Suggested change

-    "sig": {
-      "type": "string",
-      "description": "A signature, using the server's Ed25519 identity, over the `data` object serialized as JSON."
+    "sig_pubkey": {
+      "type": "string",
+      "description": "A libp2p-key used for signing"
+    },
+    "sig": {
+      "type": "string",
+      "description": "A signature, using the `sig_pubkey`, over the `data` object serialized as DAG-JSON."

Here is our use case:

An untrusted/permissionless client makes a retrieval request to the Storage Provider's booster-http address advertised in IPNI.

The client submits the measurement to the SPARK orchestration layer.

Later, SPARK's evaluation service wants to verify that the client contacted the SP.

To do so, we must not accept signatures from any identity, only the signature from the identity advertised by SP to IPNI.

I am arguing this is true for everybody else who wants to use the signature to verify that a metadata block submitted by an untrusted party was indeed produced by the expected Trustless Gateway.

Consider a simple attack vector: the attacker takes the metadata block produced by the origin gateway and replaces the signature with one created using the attacker's identity. Clients verifying the signature against the sig_pubkey field in the metadata will not notice the attack.

Now I can see how including sig_pubkey can simplify troubleshooting:

If sig_pubkey does not match the pubkey we expected, then we know the metadata block was signed by somebody else

If sig_pubkey matches but the signature does not, then we know the metadata block was modified from the original.
Compare that with my proposal:

If the signature is not valid, then either the metadata block was tampered with or it was signed by a different identity.

IMO, this improvement is not worth the cost of increasing metadata block size and, thus, egress traffic for Trustless Gateways.
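The difference between the two checks can be sketched as below. Everything here is hypothetical: `checkAttestation` is an illustrative name, `verify(publicKey, message, signature)` is an injected callback standing in for a real signature check, and `JSON.stringify` is only a placeholder for a canonical DAG-JSON encoding.

```javascript
// Hypothetical check: verify a metadata block against the identity we EXPECT
// (e.g. the one advertised to IPNI), never against a key carried in the block.
function checkAttestation(block, expectedPubkey, verify) {
  // An optional sig_pubkey only helps troubleshooting: it tells us early
  // that somebody else signed the block.
  if (block.sig_pubkey !== undefined && block.sig_pubkey !== expectedPubkey) {
    return 'signed by an unexpected identity';
  }
  // Placeholder serialization; a real implementation needs canonical DAG-JSON.
  const message = JSON.stringify(block.data);
  if (!verify(expectedPubkey, message, block.sig)) {
    return 'invalid signature: data was modified or signed by another identity';
  }
  return 'ok';
}
```

A block re-signed by an attacker is rejected in both variants, because the signature is always checked with the expected key; the embedded key only changes which error the verifier can report.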

Do you have any other use case for the signature in your mind?

IMO, the clients making retrieval requests don't need this signature for validating the metadata block, as they can rely on guarantees provided by the underlying transport - HTTPS.

HTTP Gateways have no concept of "server ED25519" introduced here.

Good point. We don't require all Gateways to sign the metadata block, SPARK needs the signature only from Storage Providers' servers handling retrieval (booster-http).

Let's update the spec to explicitly mention the signature is an optional field.

How one verifies the signature without knowing the pubkey?

One way to avoid being prescriptive about key type or its location, is to have sig_key with CID-encoded public libp2p-key that can be used for signature verification.

As I wrote above, if you don't know the expected server identity, then the signature is not useful for you.

Having said that, I like the idea of adding more details about the identity/public key to the spec.

The proposed format CID-encoded public libp2p-key seems like a good candidate, although AFAICT, that's not the format advertised to IPNI. In IPNI, I see identities in the format that can be used in multiaddr's /p2p/{id} part:

    12D3KooWAWHEbCQy22d45mKbKSewoB1xksDDhR7o5S4mDrSNKXNk
    12D3KooWAy5kaLtHf5uS7PZVLjSYd8sGqJ6fn7bxMjqLLZ1uULp9
    12D3KooWEiPRcfjXJVehty8okJGJpBZP8zM5UBoCK5yw2MXfx98x
    12D3KooWFpv7LP1MUmjfQ8sAUXgJXG5FRMJLnqnJyR32fVboqspB
    12D3KooWHKeaNCnYByQUMS2n5PAZ1KZ9xKXqsb4bhpxVJ6bBJg5V
    12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f
    12D3KooWSfsqUahHLCmiENT8oN4FkVtz5pSCxKtNEb7wrR1rrRjk

If you sign JSON, you want it to be a deterministic variant like DAG-JSON, otherwise someone will run into bugs when they use a less strict JSON library in a different language.

Makes sense; I'll update the spec to require the metadata to be a DAG-JSON.
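To illustrate the determinism concern, here is a small Python sketch. It only mimics two DAG-JSON rules (sorted map keys, no insignificant whitespace) using the standard `json` module; it is not a full DAG-JSON encoder (no bytes or link handling), and the `canonical_json_bytes` name and the placeholder keys are assumptions for the example.

```python
import json


def canonical_json_bytes(value) -> bytes:
    """Serialize with sorted keys and no extra whitespace (DAG-JSON-like)."""
    return json.dumps(
        value,
        sort_keys=True,          # stable key order regardless of insertion order
        separators=(",", ":"),   # no spaces after separators
        ensure_ascii=False,      # emit UTF-8 instead of \u escapes
        allow_nan=False,         # reject NaN/Infinity, which are not valid JSON
    ).encode("utf-8")


# Two logically equal objects built in a different key order (placeholder keys).
first = {"alpha": 1, "beta": [1, 2]}
second = {"beta": [1, 2], "alpha": 1}

# A naive dump yields different bytes, so a signature over one would not
# verify against the other.
naive_differs = json.dumps(first) != json.dumps(second)

# The canonical form yields identical bytes for both.
canonical_matches = canonical_json_bytes(first) == canonical_json_bytes(second)
```

With a canonical form, the signer and the verifier hash the same bytes even if their JSON libraries build maps in different orders.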

bajtos added 2 commits August 8, 2023 16:58

IPIP: CAR meta (content type parameter)

2eb7b9e

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>

add IPIP/0431-gateway-car-trailer.md

3814b6a

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>

bajtos mentioned this pull request Aug 8, 2023

Retrieval Attestation filecoin-project/boost#1597

Closed

9 tasks

Merge branch 'main' into gw-car-metadata-trailer

cb1d8b3

This was referenced Aug 8, 2023

support CAR meta in booster-http filecoin-project/boost#1610

Open

propose car-metadata multicodec multiformats/multicodec#334

Closed

lidel added 2 commits August 8, 2023 21:39

chore: editora fixes, prep for HTML preview

1c0fbaa

chore: enable HTML preview

a7e75d7

lidel changed the title ~~IPIP: CAR meta (content type parameter)~~ IPIP-431: Opt-in Extensible CAR Metadata on Trustless Gateway Aug 8, 2023

ipip-431: add upside to one of alternatives

68715c4

lidel requested changes Aug 8, 2023

rvagg added a commit to ipld/frisbii that referenced this pull request Aug 10, 2023

feat: add ?meta=eof for trailling metadata dag-json

70f9809

Ref: ipfs/specs#431

rvagg mentioned this pull request Aug 10, 2023

feat: add ?meta=eof for trailling metadata dag-json ipld/frisbii#15

Draft

rvagg added a commit to ipld/frisbii that referenced this pull request Aug 10, 2023

feat: add ?meta=eof for trailling metadata dag-json

27bae29

Ref: ipfs/specs#431

lidel reviewed Aug 10, 2023

src/http-gateways/trustless-gateway.md (outdated)

ipip-431: add CARv3 to Alternatives

ed86a0f

lidel reviewed Aug 10, 2023

src/ipips/ipip-0431.md

rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023

feat: meta=eof for IPIP-431; ask for and expect (but not require) fro…

c8cc1b0

…m http fetches Ref: ipfs/specs#431 Ref: ipld/frisbii#15

rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023

feat: meta=eof for IPIP-431; ask for and expect (but not require) fro…

0393d52

…m http fetches Ref: ipfs/specs#431 Ref: ipld/frisbii#15

rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023

feat: meta=eof for IPIP-431; ask for and expect (but not require) fro…

26954e0

…m http fetches Ref: ipfs/specs#431 Ref: ipld/frisbii#15

rvagg mentioned this pull request Aug 11, 2023

feat: meta=eof for IPIP-431; ask for and expect (but not require) from http fetches filecoin-project/lassie#378

Draft

rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023

feat: meta=eof for IPIP-431; ask for and expect (but not require) fro…

d20a604

…m http fetches Ref: ipfs/specs#431 Ref: ipld/frisbii#15

rvagg added a commit to filecoin-project/lassie that referenced this pull request Aug 11, 2023

feat: meta=eof for IPIP-431; ask for and expect (but not require) fro…

edff039

…m http fetches Ref: ipfs/specs#431 Ref: ipld/frisbii#15

patrickwoodhead and others added 10 commits October 4, 2023 11:39

metadata schema update including json schema usage

9170c29

Update content path and query params to retrieval params

65ffcfc

Co-authored-by: Miroslav Bajtoš <oss@bajtos.net>

fixes from Miros feedback

9d1b61f

json schema wrapper around top level object

93c3c28

dag and car params

eacf51a

more alternatvies from discussion added

62fb207

Merge pull request #1 from bajtos/review-1-fixes

72ed04c

meta=eof+json update

add Patrick as a co-editor

e1fc296

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>

formatting cleanup, remove duplicate schema, describe attack vectors

152f4a6

Signed-off-by: Miroslav Bajtoš <oss@bajtos.net>

Merge branch 'main' into gw-car-metadata-trailer

b6069bf

bajtos marked this pull request as ready for review October 18, 2023 14:04

bajtos requested a review from a team as a code owner October 18, 2023 14:04

bajtos commented Oct 18, 2023

src/http-gateways/trustless-gateway.md

bajtos commented Oct 18, 2023

bajtos requested review from lidel, patrickwoodhead and willscott October 18, 2023 14:21

lidel mentioned this pull request Nov 7, 2023

Escaping of quotes and <> in code blocks is broken ipfs/spec-generator#19

Closed

lidel reviewed Nov 7, 2023

lidel mentioned this pull request Apr 18, 2024

gateways: document Content-Location #471

Merged


		When the parameter is not set or does not equal `eof+json`, the server SHOULD NOT add any extra blocks to the response, neither the 0x00 byte nor any metadata.
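As a rough illustration of the rule above, a server could decide whether to append the trailer along these lines. This is a hedged Python sketch: it assumes the `meta` parameter arrives on the `application/vnd.ipld.car` media range of the `Accept` header, the function names are hypothetical, and the header parsing is deliberately naive (it does not handle commas or semicolons inside quoted values). The `?meta=eof` query form mentioned in the linked implementation commits is not covered.

```python
from typing import Optional

CAR_MEDIA_TYPE = "application/vnd.ipld.car"


def requested_meta(accept_header: str) -> Optional[str]:
    """Return the `meta` parameter of the CAR media range, or None if absent."""
    for media_range in accept_header.split(","):
        media_type, *params = (part.strip() for part in media_range.split(";"))
        if media_type.lower() != CAR_MEDIA_TYPE:
            continue
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "meta":
                return value.strip().strip('"')
    return None


def should_append_metadata(accept_header: str) -> bool:
    """True only for meta=eof+json; any other value means a plain CAR response."""
    return requested_meta(accept_header) == "eof+json"
```

Treating every value other than `eof+json` as "no metadata" keeps the response a plain CARv1 stream for clients that did not opt in.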

		When `meta=eof+json`, the JSON object SHOULD conform to the following [JSON schema](https://json-schema.org/).

		TBD

		Using one CID, request the CAR data using various combinations of content type parameters.


		- The metadata `sig` field SHOULD also be populated, returning a signature, using the server's Ed25519 identity, over the metadata properties object. This allows gateway clients to submit the metadata block as an attestation of retrieval that 3rd parties can verify.

		### Compatibility


