Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CompactOsString #244

Open
stepancheg opened this issue Jan 25, 2023 · 9 comments
Open

CompactOsString #244

stepancheg opened this issue Jan 25, 2023 · 9 comments

Comments

@stepancheg
Copy link

Add CompactOsString, which is inline OsString.

When working with filesystem, conversion to String is not always desirable, and using compact string would be helpful.

@NobodyXu
Copy link
Contributor

You may want to use other inline/small string crates instead for OsString.

CompactString assumes the encoding is utf-8 and use that knowledge to store 24 bytes of string inside the 24 bytes CompactString,

Without this utf-8 assumption, it will only able to store 16B on 64-bit target, just as other simple inline/small string out there.

@stepancheg
Copy link
Author

CompactOsString can do something like this: if file name is valid UTF-8, store it inline as UTF-8, otherwise, store it as bytes on Unix or as UTF-16 on Windows?

The use case is: read directory, store files names as CompactOsString. Some time later validate and convert to OsString. This will be cheap if both types are supported by the same crate using the same internal representation.

@NobodyXu
Copy link
Contributor

I think that's possible to do, but CompactOsString would then have to also store encoding information as a bitflag.
And this repo does not support utf-16 so that's going to take a lot of time to do.

With all these take into consideration, it seems that CompactOsString is going to be quite complicated.
Not impossible, just hard.

Even if @ParkMyCar decided to implement this, it would take quite some time before it's usable.

@stepancheg Why not simply use smallvec or something like that for this?
It's much simpler and can be used right now.

@stepancheg
Copy link
Author

stepancheg commented Jan 26, 2023

Why not simply use smallvec or something like that for this?

Because we already use CompactString. And because it would be good to have

  • low overhead of CompactString
  • cheap conversion between one and another

Anyway, that is just a feature request. I thought it would be appropriate for compact_str to have it. If not, that's fine.

@NobodyXu
Copy link
Contributor

Anyway, that is just a feature request. I thought it would be appropriate for compact_str to have it. If not, that's fine.

I'm not the maintainer of this crate and I'm just speaking from my personal experience and understanding of this crate.
It will be up to @ParkMyCar to decide whether to have this type.

Because we already use CompactString.

IMHO there's nothing wrong using another crate for OsStr/Path, for something entirely different from String.
Alternatively, if you are sure you only deal with utf-8 anyway, you can convert OsStr to str.

@ParkMyCar
Copy link
Owner

Hey @stepancheg! Thanks for filing an issue, and it's awesome to hear you're using compact_str :)

The idea of a CompactOsString is great, I think your suggestion of,

CompactOsString can do something like this: if file name is valid UTF-8, store it inline as UTF-8, otherwise, store it as bytes on Unix or as UTF-16 on Windows?

is pretty neat. It would add a slight bit of runtime overhead to determine if it's valid UTF-8, but ideally it would still be a net win since we don't need to heap allocate memory. I know simdjson has a very fast algorithm for UTF-8 validation (which they invented?), so we could also lean on that if necessary.

There is extra space in the discriminant of a CompactString so the state is represent-able, I also refactored the inner types of a CompactString recently to make them more basic "byte buffers" and pushed some the "string logic" a layer higher. Which naively would make implementing a CompactOsString easier. I can't commit to any timeline, but will definitely work on implementing it.

A CompactOsString leads me to also think about a CompactPath type, it seems like the two would naturally work well together. What are your thoughts?

@stepancheg
Copy link
Author

UTF-8 validation can be done simply by calling str::from_utf8. It may be not the best, but practically good enough. They have fast path for ASCII-only strings.

It can be further explicitly optimized knowing string data is well aligned like this:

union InlineStringData {
    bytes: [u8; 24],
    // Simplified: assuming 64 bit.
    words: [usize; 3],
}

impl InlineStringData {
    fn is_ascii(&self) -> bool {
        // This is cheap.
        self.words.iter().all(|&word| word & 0x80808080_80808080 == 0)
    }
    
    fn is_utf8(&self) -> bool {
        self.is_ascii() || std::str::from_utf8(&self.bytes).is_ok()
    }
}

Bonus point is, many os strings get eventually converted to regular strings anyway, and if we already validated it is UTF-8, such conversion would be free.

CompactPath makes sense too, but it is just a simple wrapper over CompactOsString.

@NobodyXu
Copy link
Contributor

@ParkMyCar If compact_str is going to support CompactOsString, maybe it is a good idea to have CompactUtf16String first and then have a CompactBytes which is for bytes, and all of them sharing the same layout?

IMO building CompactOsString will be simpler with these types implemented and it will be more of an incremental improvement instead of having one large PR.

@NobodyXu
Copy link
Contributor

P.S. If we are considering CompactOsString, then perhaps it is also a good idea to support CStr?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants