Is the String kind conceptually a series of bytes or characters? #363

tysonzero opened this issue Apr 20, 2021 · 15 comments

@tysonzero

It seems to me as though it should be a list of Unicode code points, and serialization into bytes should be specified in the respective codecs.

However, the section seems to talk a lot about byte-specific concerns.

@vmx
Member

vmx commented Apr 20, 2021

There are more details about strings in the data model document, in the String kind section. There has been a long discussion about it, and the conclusion was that strings SHOULD (a strong RFC-like SHOULD) be UTF-8 (=> Unicode characters).

@tysonzero
Author

UTF-8 is a specific binary encoding though, which seems out of scope for a conceptual serialization-agnostic data model.

In my case I was planning on storing it in a Text type that is encoding-agnostic; it's technically stored internally as UTF-16, but that's solely an implementation detail.

I will of course serialize it to UTF-8 when encoding it to dag-cbor, but that's because of the cbor/dag-cbor specs and has nothing to do with the ipld-data-model spec.

Should I be storing it as a ByteString instead and try and make sure it's in valid utf-8 form? In practice that's a lot more of a pain in the ass, but I want to follow the spec.

@vmx
Member

vmx commented Apr 20, 2021

UTF-8 is a specific binary encoding though, which seems out of scope for a conceptual serialization-agnostic data model.

Agreed. Sadly there isn't a sharp boundary between the conceptual level and the encoding. The intention is to say that strings should be something that can be converted to UTF-8 easily as this is what all major serialization formats I know of (like CBOR, Protocol Buffers, Ion etc) are using for their text representation. I think the current wording (with saying UTF-8) makes it easier for people who are not deeply into the text encoding business.

Should I be storing it as a ByteString instead and try and make sure it's in valid utf-8 form? In practice that's a lot more of a pain in the ass, but I want to follow the spec.

"storing" seems to refer to some intermediate step of your tooling. I don't think the spec does (or should) dictate which representation you use. If you follow the "SHOULD" about strings being Unicode, you can encode things internally in any way you want, it is only important that you can get valid UTF-8 in and out. Given you have UTF-16 has that property, you can keep them internally as UTF-16.

@warpfork
Contributor

warpfork commented Apr 20, 2021

Strings should be considered as sequences of 8-bit bytes.

UTF-8 is a subset of that definition, and so is UTF-16 :) so defining the Data Model as being a superset of these is quite helpful.

There are also situations where people have used arbitrary bytes as map keys, and since we define map keys as "String", this means it's important to support arbitrary sequences of 8-bit bytes where we talk about "String".

You can use whatever representation you like in a program's memory. If all bytes can be escaped into that internal representation, and round-trip, it's fine. I'd probably recommend choosing a type that's as close to raw bytes as possible, though, for sheer efficiency reasons -- flipping string encodings back and forth and back again is costly.
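
To make the round-trip point concrete, here's a minimal sketch (Rust, purely illustrative; the byte values are arbitrary):

```rust
fn main() {
    // Arbitrary bytes, including a non-UTF-8 tail.
    let arbitrary: Vec<u8> = vec![0x66, 0x6F, 0x6F, 0xC3, 0x21];

    // Keeping the data as raw bytes round-trips everything, trivially.
    let stored: Vec<u8> = arbitrary.clone();
    assert_eq!(stored, arbitrary);

    // Forcing it through a UTF-8-only string type first is lossy for this input:
    // the 0xC3 byte gets replaced and can never be recovered.
    let via_string = String::from_utf8_lossy(&arbitrary).into_owned();
    assert_ne!(via_string.as_bytes(), arbitrary.as_slice());
}
```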

@tysonzero
Author

The intention is to say that strings should be something that can be converted to UTF-8 easily as this is what all major serialization formats I know of (like CBOR, Protocol Buffers, Ion etc) are using for their text representation.

But if a protocol/codec for some reason wanted a non-UTF8 encoding, then that wouldn't be a problem if the abstraction layer was "unicode characters". However with the byte abstraction you're more or less locked into a single text encoding for all codecs indefinitely.

I think the current wording (with saying UTF-8) makes it easier for people who are not deeply into the text encoding business.

IMO something like "unicode string" or "sequence of unicode characters" would be a clearer abstraction and makes it clear that encoding is totally out of scope, and the responsibility of codecs.

Strings should be considered as sequences of 8-bit bytes.

UTF-8 is a subset of that definition, and so is UTF-16 :) so defining the Data Model as being a superset of these is quite helpful.

On the contrary I would actually say that it will make programs more buggy in practice. This thread is a perfect example:

One person tells me that my opaque text type is fine, as it can roundtrip UTF-8 just fine.

Another person tells me that UTF-16 being possible is a good thing; however, my program will completely mangle any incoming UTF-16 due to the first person's advice.

There are also situations where people have used arbitrary bytes as map keys, and since we define map keys as "String", this means it's important to support arbitrary sequences of 8-bit bytes where we talk about "String".

Doesn't this more or less contradict the following?

As such, there is an element of "lowest common denominator" to the IPLD Data Model in that it cannot support some advanced features (like non-string keys for Maps) because support for such a feature is not common enough among programming languages.

For example, as far as I'm aware, using arbitrary bytestrings as keys in JavaScript objects/dicts isn't possible.

If all bytes can be escaped into that internal representation, and round-trip, it's fine.

So would you recommend not using the built in string type in Python, JavaScript or Haskell? All three of those would not round-trip bytes successfully.

It's also worth noting that CBOR specifically requires text strings to be UTF-8, so if non-UTF-8 bytes get converted to CBOR via DAG-CBOR, they would be in violation of the CBOR spec.

@warpfork
Contributor

warpfork commented Apr 20, 2021

But if a protocol/codec for some reason wanted a non-UTF8 encoding, then that wouldn't be a problem if the abstraction layer was "unicode characters".

Hang on right there. That would be a problem, actually.

People often misunderstand this about unicode (and I say that without judgement, because I also misunderstood this about unicode until very recently), but: unicode does not guarantee all bytes are encodable. Unicode has significantly less expressiveness than the definition we get with "sequences of 8-bit bytes".

It is trivial to fit any of the unicode encodings inside "sequences of 8-bit bytes". The reverse is not true.

If this seems like an incredible claim, my favourite fixture to demonstrate it is the sequence which would be written in escaped hex as "\xC3\x21". This is a non-unicode sequence. The most reasonable interpretation of it is (in my opinion) "\xC3!", because the first byte cannot be interpreted as Unicode, but in a resynchronizing encoding like UTF-8, the subsequent bytes can still be interpreted (in this case, as an ascii-plane exclamation point). However, it is not Unicode. There is no unicode representation of this sequence.
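
A minimal sketch of that fixture (in Rust, but any language with a strict UTF-8 decoder shows the same thing):

```rust
fn main() {
    let bytes: &[u8] = b"\xC3\x21";

    // Strict decoding rejects it: 0xC3 opens a two-byte UTF-8 sequence,
    // but 0x21 ('!') is not a valid continuation byte.
    assert!(std::str::from_utf8(bytes).is_err());

    // A lossy decode resynchronizes on the '!' but has to substitute U+FFFD
    // for the bad byte -- the original 0xC3 is unrecoverable.
    assert_eq!(String::from_utf8_lossy(bytes), "\u{FFFD}!");
}
```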

@warpfork
Contributor

warpfork commented Apr 20, 2021

So would you recommend not using the built in string type in Python, JavaScript or Haskell? All three of those would not round-trip bytes successfully.

Yes.

It sucks, but yes.

We've had to discuss this a lot around Rust, too. And there, we actually have some very interesting illustrations available, because Rust, being extremely diligent, also ran into this issue when defining their filesystem APIs. Filesystems, it turns out, do not generally guarantee UTF-8. (We all act like they do. Many filesystems do. But some don't. And if you try to enforce this, You Will Find That Out when you have Problems.)

So, what was the Rust solution around filesystems?

Make a "raw" "string" type that doesn't enforce UTF-8. It's just a sequence of 8-bit bytes. And if you want Rust's other String type that is UTF-8, well... you use this: https://doc.rust-lang.org/std/path/struct.Path.html#method.to_string_lossy . Notice how the method name itself even says "lossy", there. Aye.

What Rust is doing there is actually the honest truth at the bottom of the world.

It's unfortunate that, yes, the "String" types in some languages' standard libraries do not make this easy. The answer is, as Rust did with Path, to make a String type for this purpose that contains sequences of 8-bit bytes. And make conversion methods -- which will be lossy; this is unavoidable -- to other string types as needed.
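
To sketch what that lossy conversion looks like with Rust's standard library (Unix-only here, since that's where paths are legally raw bytes; the filename is made up):

```rust
#[cfg(unix)]
fn main() {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    use std::path::Path;

    // A hypothetical filename whose bytes are not valid UTF-8.
    let raw: &[u8] = b"report-\xC3\x21.txt";
    let path = Path::new(OsStr::from_bytes(raw));

    // The lossless view keeps the raw bytes...
    assert_eq!(path.as_os_str().as_bytes(), raw);

    // ...while to_string_lossy() has to substitute U+FFFD for the bad byte.
    assert_eq!(&*path.to_string_lossy(), "report-\u{FFFD}!.txt");
}

#[cfg(not(unix))]
fn main() {}
```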

I feel like I should bend over backwards one more time here to say: yes, this sucks. Unfortunately, computers.

@tysonzero
Author

unicode does not guarantee all bytes are encodable

Do you mean decodable? UTF-8/UTF-16 encode characters into bytes, but they operate on unicode characters, which are unrelated to any specific byte encoding.

Unicode has significantly less expressiveness than the definition we get with "sequences of 8-bit bytes".

Comparing the expressiveness of unicode and bytestrings seems pretty weird, as they have the exact same cardinality, and it's trivial to write injective functions in either direction (e.g. base64). The real question is what you are trying to model, characters or bytes.

\xC3\x21

I think we're miscommunicating a bit here. I was talking about non-UTF-8 encoding of unicode strings (e.g. UTF-16). The unicode string \xC3\x21 is not representable in the same way that the boolean 7 isn't.

For a concrete example of what I mean by problems with different Unicode encodings, consider the following.

I type "foo" into my IPLD text editor, I then transfer it over to an independent IPLD viewing software.

If the abstraction is unicode characters, then as long as the codec defines the right encoding to use, there are no problems, as the two programs both have to agree on the same codec.

Now on the other hand if the abstraction is bytes, then the editor might choose to store "foo" in UTF-8 as "\x66\x6f\x6f", so if the viewer were to use UTF-16, then it would garble the rendering.
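
A small sketch of that scenario (Rust used purely for illustration):

```rust
fn main() {
    let s = "foo";

    // The UTF-8 serialization an editor writing DAG-CBOR would produce.
    let utf8: Vec<u8> = s.as_bytes().to_vec();
    assert_eq!(utf8, [0x66, 0x6F, 0x6F]);

    // The UTF-16LE serialization a viewer that guessed the wrong encoding would expect.
    let utf16le: Vec<u8> = s.encode_utf16().flat_map(|u| u.to_le_bytes()).collect();
    assert_eq!(utf16le, [0x66, 0x00, 0x6F, 0x00, 0x6F, 0x00]);

    // Same characters, different bytes: the two programs have to agree at the codec level.
    assert_ne!(utf8, utf16le);
}
```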

@tysonzero
Author

Given the existence of a true bytestring in the IPLD spec it seems to me as though that should be utilized instead of making text not a sequence of unicode characters.

I know that maps only accept text keys, but that seems to be a very intentional design decision ("...lowest common denominator...") that allowing arbitrary bytes directly fights against, since arbitrary byte keys are similarly non-universal (JavaScript, JSON).

If the priority is universality it seems like you should explicitly make sure maps have unicode character non-byte keys, and if flexibility is more important then it seems like you should just allow arbitrary IPLD values as keys, similar to CBOR.

I agree that it generally makes sense to interact with file paths as sequences of bytes, but that does not mean you need to use the string type to do so. For example in Haskell I would use the ByteString type instead of the Text type.

@tysonzero
Author

tysonzero commented Apr 20, 2021

One point I think is worth emphasizing quite heavily is that if strings in IPLD are conceptually arbitrary sequences of bytes, then DAG-CBOR either violates the CBOR spec, or does not support the full IPLD data model.

Invalid UTF-8 is not a valid CBOR string, and CBOR parsing programs are welcome to reject it.
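
As a minimal illustration (hand-rolled CBOR bytes, no library involved):

```rust
fn main() {
    // 0x62 = CBOR major type 3 (text string), length 2; the payload 0xC3 0x21 is not UTF-8.
    let item: [u8; 3] = [0x62, 0xC3, 0x21];

    // A spec-following decoder is allowed to refuse this item, because the payload
    // of a text string must be valid UTF-8 per RFC 8949.
    assert!(std::str::from_utf8(&item[1..]).is_err());
}
```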

@rvagg
Member

rvagg commented Apr 21, 2021

The state of play is this:

  • There exist, in the wild, applications using arbitrary bytes in String fields for DAG-CBOR. Thanks to Go for making this so easy. Ref Filecoin, which has baked this in in a couple of places.
  • There currently exists code which cannot properly read or round-trip these blocks because the bytes are not valid UTF-8. As of today you can't successfully parse all of the Filecoin DAG-CBOR blocks with our JavaScript codecs because of JavaScript's stricter handling of invalid UTF-8 in TextDecoder.

Therefore the recommendation goes something like this: if you value interop and less developer pain then just use valid UTF-8 in your DAG-CBOR String fields. It's not disallowed by our code (or by our specs) and it's a little hard to undo what's done. This is a SHOULD for the sake of interop and decreasing developer pain, but not a MUST because the ship has sailed.

We could be having the same argument about the map key sorting rules, which were baked in as they arose out of an RFC 7049 recommendation and whose sanity was not considered much at the time (and which recommendation has since been overridden by RFC 8949). It is what it is, and we have to work with what's live unless we want to make a DAG-CBOR2.

@tysonzero
Author

Thanks for the information. It is rather unfortunate that a decent amount of DAG-CBOR in the wild isn't valid CBOR.

I'll go with using a Text unicode string type that will always spit out valid utf-8, but will not successfully round trip invalid utf-8.

The CBOR library I use breaks on invalid UTF-8 CBOR, so it's difficult to go the other route anyway.

@rvagg
Member

rvagg commented Apr 21, 2021

Being able to successfully read Filecoin blocks was, until recently, on my priority list, and I had thought up some ways to get around it by digging right into the decoder, see #364 (comment). In the case of @ipld/dag-cbor we have full control over the decoder and could force it to decode in weird ways if we wanted to, and in this instance it's enough of a problem that it might be worth adding an opt-in hack like this to get access to the underlying data.

@tysonzero
Author

tysonzero commented Apr 21, 2021

That is an understandable need. The one downside is that it may encourage others to put non-UTF-8 data in their DAG-CBOR as well. Which will make generic CBOR libraries and services less useful for working with DAG-CBOR.

With regards to the map key ordering aspect, couldn't you specify that future usage should use the new ordering, but mention compatibility with old ordering systems? The new RFC 8949 spec mentions such a thing. That way it avoids proliferating obsolete CBOR.

On the documentation side what are your thoughts on something like:

Strings should be considered sequences of unicode characters, but they may be treated as sequences of mostly-utf-8 bytes for compatibility.

I think that should make it clear that using an opaque unicode type (that may or may not be utf-8) is fine as long as codecs are properly followed when encoding it. However this avoids closing the door on CBOR non-compliant DAG-CBOR that exists in the wild.

@vmx
Member

vmx commented Apr 21, 2021

I'll go with using a Text unicode string type that will always spit out valid utf-8, but will not successfully round trip invalid utf-8.

@tysonzero This is what I was about to suggest. The current specification is heavily influenced by the Go implementation and the problem that the libraries used for Protocol Buffers and CBOR don't enforce the production of spec-compliant data. As long as you use only spec-compliant encoders/decoders for Protocol Buffers/CBOR/JSON, you won't have this problem and will only get strings which are a sequence of valid Unicode characters.

This is also what other IPLD implementations (e.g. in JS or Rust) do, so they can use the native String types.

Given the existence of a true bytestring in the IPLD spec it seems to me as though that should be utilized instead of making text not a sequence of unicode characters.

Yes, exactly.

On the documentation side what are your thoughts on something like:

Strings should be considered sequences of unicode characters, but they may be treated as sequences of mostly-utf-8 bytes for compatibility.

I think that should make it clear that using an opaque unicode type (that may or may not be utf-8) is fine as long as codecs are properly followed when encoding it. However this avoids closing the door on CBOR non-compliant DAG-CBOR that exists in the wild.

That sounds good to me. I think the more we can emphasize that strings should really be a sequence of Unicode characters, and that arbitrary bytes happen only due to non-spec-compliant encoders/decoders, the better.
