Very often if you have text, which this does, you can make huge savings by being intelligent with the text.
Rust intentionally provides the simplest possible growable string buffer String, which is literally (under the hood, you can't poke this legitimately) Vec<u8> plus the promise that this is UTF-8 text.
But you might find your needs better served by one (or several) of:
Box<str> -- you don't need capacity, so, don't store it => length == capacity
CompactString -- use the entire 24 bytes for SSO, up to 24 bytes of UTF-8 inline, obviously doesn't make sense if all or the vast majority of your strings are 25 bytes or longer
ColdString -- same idea but for 8 bytes, and also not storing capacity, this only makes sense over Box<str> if you have plenty of <= 8 byte strings
There's really an endless list of these optimizations. A few I've used (though not necessarily in rust):
Atoms: Each string can be referenced with a single u32 or even u16, and they're inherently deduplicated.
Bump allocator: your strings are &str, allocation is super fast with limited fragmentation.
Single pointer strings (this has a name, I can't think of it right now): you store the length inside the allocation instead of in each reference, so your strings are a single pointer.
Atoms: is this similar to interned strings?
The names of these things are hazy and inconsistent. In Java and C# an interned string is the same type as other strings. Others describe atoms as interned strings, some call them symbols. At my work we call the u16/u32 atoms and interned strings are the single pointer strings described above.
> Atoms: is this similar to interned strings?
Yes. It is exactly how they are described.
https://docs.rs/string_cache/latest/string_cache/struct.Atom...
> Represents a string that has been interned.
> There's really an endless list of these optimizations.
These aren't really optimizations. They are specialized implementations that introduce design and architectural tradeoffs.
For example, Rust's Atom represents a string that has been interned, and it's actually an implementation of a design pattern popular in the likes of Erlang/Elixir. This is essentially a specialized implementations of the old Flyweight design pattern, where managing N independent instances of an expensive read-only object is replaced with a singleton instance that's referenced through a key handle.
I would hardly call this an optimization. It actually represents a significant change to a system's architecture. You have to introduce a set of significant architectural constraints into your system to leverage a specific tradeoff. This isn't just a tweak that makes everything run magically leaner and faster.
> everything run magically leaner and faster
In my opinion, there's no magic in the software engineering. Everything (or almost everything) is a system that can be described, explained, modified and so on. Applications, libraries, operating systems, kernels, CPUs/RAM/GPU/NPU/xPU/whatever silicon there is, ALUs/etc, transistors, electricity, physics... That's nowhere near "magic". There's always some trade-offs, it's just that you may not be aware of them initially.
You might want to refresh your understanding of the word optimisation. Changing a system to be more effective/efficient is optimisation, how big that change is makes no difference.
> String, which is literally (under the hood, you can't poke this legitimately) Vec<u8>
`String::as_vec_mut` kinda implies that, since it gives you access to that underlying `Vec` which must then exist somewhere.
I looked it up: https://doc.rust-lang.org/std/string/struct.String.html#meth...
In case anyone else was wondering it, yes, it's "unsafe".
The thing they were gesturing at, correctly, is the naming. This is of course a convention and not a promise, but by convention Goose::as_crow would be a function that is cheap and gets you say &Crow instead of the &Goose you might have now, whereas Goose::to_donkey suggests that although we can have a Donkey instead of this Goose it's expensive to do that.
Commonly as... conversions are actually no-ops at runtime (the type changes but the data does not, no CPU instructions are emitted) whereas to... conversions might do quite a lot, especially if they bring into existence an actual thing at runtime -- maybe Goose::to_donkey actually needs to go allocate memory for a Donkey and destroy the Goose.
Yes it's unsafe because the Vec doesn't enforce the promise we made about this being UTF-8 text whereas String did, so now that promise is ours to keep and `unsafe` is how we signify that you the programmer took on the responsibility for safety here.
Yes, naming does play a role here, but the biggest hint is `as_vec_mut` returning a reference. For that to work a `Vec<u8>` needs to exist somewhere, and continue to exist after this function returns. For comparison, `to_` conversions generally just return the new data, so this reasoning doesn't apply to them.
What does Box<str> give you that &str doesn’t?
Ownership
CompactStr doesnt have any additional runtime overhead iirc right? So in theory you can drop it in everywhere even when you expect > 25 chars. Maybe an extra branch in the >25 char case?
SSO does have overhead. Firstly, on every access you have a branch. Secondly, and more severely, the "most general" umbrella type that all string methods are defined on is a string slice, and whereas conversion from `String` to `&str` is literally a no-op, SSO strings require work to be done to convert them to string slices. Furthermore, note that in the (surprisingly common) case where the string is zero-length, String already skips the allocation, same as an SSO string.
I wish Box and Option got specialized specialized shorthand syntax in Rust say `^`/ `? or something like that.
Box<T> used to be ~T early on in rust… (then it became a `box` keyword, before being removed entirely.) They got rid of it because they wanted to move more things into libraries and have a less opinionated compiler.
I think I agree though, especially with Option. Swift’s option syntax (and kotlin’s which is similar) is so much better, a simple question mark in the type. Options are important enough that dedicated syntax makes so much sense. Rust blew their chance here with ? meaning “maybe early return”, it would have been a lot more useful as an Option indicator.
I used to think this too, but nowadays I think this would have been a mistake. Result is actually the more important and fundamental type, not Option, and giving Option special syntax would cause people to want to favor it over Result even when it shouldn't be. If anything, I think that `?` should have been reserved as a suffix for function names (not types) that return Result, so that instead of `fn try_foo()` and `let x = try_foo();` we could have `fn foo?()` and `let x = foo?();`, and then the current `?` operator could just be spelled `.try`, akin to `.await`. (And then we maybe could go further and reserve `!` as a suffix for functions that can panic...)
The Try trait (representing the ? the operation) is super cool though! I wish it was marked stable so you could implement it for types without using the nightly compiler.
Note that both Option and Result implement that same trait.
Perhaps if try blocks ever become a thing... we can finally use it for our own types ;)
https://doc.rust-lang.org/std/ops/trait.Try.html
The modern Try trait (try_v2) is indeed wonderful and I hope to see it stabilized one day.
AIUI a key innovation is ControlFlow, reifying the Break/ Continue choice as a sum type in the type system. This is already stable and is a useful piece of vocabulary even without its contribution to understanding the Try trait.
Knowing that Bob's CircusPerformance trait and Sarah's SeaLion type both use ControlFlow to decide whether we should keep going or halt ASAP means you don't have to write fraught adaptor code because Bob thought obviously the boolean "true" means keep going while Sarah's understanding was that it's a signal about being finished, so "true" means stop.
For Try what ControlFlow did was unlock the difference between "Success / Failure" as encoded by Result::Ok and Result::Err and the "Halt / Carry on" distinction ControlFlow::Break and ControlFlow::Continue. Often we want to stop when there's an error, but sometimes we mean the exact opposite, carry on trying things until one of them succeeds.