Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Safer Repr? #324

Open
goffrie opened this issue Oct 13, 2023 · 6 comments
Open

Safer Repr? #324

goffrie opened this issue Oct 13, 2023 · 6 comments

Comments

@goffrie
Copy link

goffrie commented Oct 13, 2023

I had an idea to avoid transmuting Reprs around by directly making use of the enum niche optimization. Basically, store ([u8; 23], LastUtf8Char) in Inline, restrict LastUtf8Char to only actual valid last-bytes for the Inline representation, then have Repr be an enum over Inline / Heap / &'static str / etc as desired, and the compiler will choose the discriminants for those variants.

See this Godbolt link.

@NobodyXu
Copy link
Contributor

That indeed works, mem::size_of::<Repr>() gives me 24.

@Kijewski
Copy link
Contributor

Seems like a great idea. I guess we can remove a ton of transmutations and the generated byte code should basically stay the same. And if the compiler better understands what we are doing, maybe it can even do more optimizations.

But I guess it would be quite a lot of work to implement that.

@goffrie
Copy link
Author

goffrie commented Oct 14, 2023

Working on this in https://github.com/goffrie/compact_str/tree/safer-repr.

@ParkMyCar
Copy link
Owner

Hey Geoffry! I love the idea of reducing the amount of unsafe code, thanks for filing an issue and starting to work on it! A couple of thoughts:

  1. Will we be able to retain the branchless string accesses? I'm not sure if Rust is smart enough to do this optimization for an enum and this is quite important for performance of CompactString.
  2. I see you added a ThinHeap representation, it looks like this is to support the cases when the Heap variant isn't large enough?

Related to this, I've been toying with a new layout for Repr and I'm curious what your thoughts are. First we'd drop the support for inlining 24 bytes, it's nice to squeeze one more byte on the stack, but it requires extra work in the hot path of string accesses, and shrinks the size of the string we can store on the heap. Instead the layout I'm thinking of would be:

struct Repr {
  ptr: *const (),
  len: usize,
  cap: NonZeroUsize,
}

We could represent our three states with:

  • Inline: most significant bit of the last byte starts with a 1
  • Heap: msb of the last byte starts with a 0
  • Static: msb of cap is a 0, followed by all 1's, aka a heap variant with MAX capacity

Now a CompactString essentially has the same heap capacity as String, (isize::MAX - 1 vs isize::MAX), and getting the length of the inline variant only requires a bit mask as opposed to a subtraction and a comparison. We also retain the property of size_of::<CompactString>() == size_of::<Option<CompactString>>() by using NonZeroUsize and enforcing we never create a heap variant with a capacity of 0, which we actually already do.

This definitely does not materially reduce the amount of unsafe code like your proposal does, but I wanted to bring it up since they both change the layout of Repr.

Note: it might be possible to retain storing 24 bytes inline and make this refactor, the tricky part is retaining the size optimization for Option<CompactString>. When choosing a niche Rust tracks just a minimum and maximum used value, and then uses something outside of those ranges. This essentially allows us to create a Non<user_specified_value>Usize, example. But, this fact isn't at all stable, so I'd be hesitant to utilize it in CompactString.

@mcronce
Copy link
Contributor

mcronce commented Oct 16, 2023

I'd be curious to see how big the execution time difference is for accesses with that change. It's also worth noting that, right now (in master, not any released v0.7), not only does size_of::<CompactString>() == size_of::<Option<CompactString>>(), there are more niches available - e.g. cervine::Cow<'_, CompactString, str> and Option<cervine::Cow<'_, CompactString, str>> are also 24 bytes on 64b platforms

@goffrie
Copy link
Author

goffrie commented Oct 23, 2023

Will we be able to retain the branchless string accesses? I'm not sure if Rust is smart enough to do this optimization for an enum and this is quite important for performance of CompactString.

Unlikely. I am curious to see how much performance this will lose.

I see you added a ThinHeap representation, it looks like this is to support the cases when the Heap variant isn't large enough?

Yup. It is quite a bit of extra complexity though just to support relatively unlikely use cases.

Losing one byte of inline space to get rid of the representation is probably fine, but it does kind of get rid of the biggest unique point about this crate over other similar ones.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants