I’d strongly caution against many of those “performance tricks.” Spawning an asynchronous task on a separate thread, often with a heap-allocated handle, solely to deallocate a local object is a dubious pattern — especially given how typical allocators behave under the hood.
I frequently encounter use-cases akin to the “Sharded Vec Writer” idea, and I agree it can be valuable. But if performance is a genuine requirement, the implementation needs to be very different. I once attempted to build a general-purpose trait for performing parallel in-place updates of a Vec<T>, and found it extremely difficult to express cleanly in Rust without degenerating into unsafe or brittle abstractions.
> especially given how typical allocators behave under the hood.
To say more about it: nearly any modern high performance allocator will maintain a local (private) cache of freed chunks.
This is useful, for example, if you're allocating and deallocating about the same amount of memory/chunk size over and over again since it means you can avoid entering the global part of the allocator (which generally requires locking, etc.).
If you make an allocation while the cache is empty, you have to go to the global allocator to refill your cache (usually with several chunks). Similarly, if you free and find your local cache is full, you will need to return some memory to the global allocator (usually you drain several chunks from your cache at once so that you don't hit this condition constantly).
If you are almost always allocating on one thread and deallocating on another, you end up increasing contention in the allocator as you will (likely) end up filling/draining from the global allocator far more often than if you kept it all on one CPU. Depending on your specific application, maybe this performance loss is inconsequential compared to the value of not having to actually call free on some critical path, but it's a choice you should think carefully about and profile for.
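For concreteness, the pattern being cautioned against is roughly this (finish_request and the Vec<u64> payload are made up for illustration):

use std::thread;

fn finish_request(big_vec: Vec<u64>) {
    // Ship the buffer to a throwaway thread purely so that its Drop (and the
    // underlying free) happens off the latency-sensitive path.
    thread::spawn(move || drop(big_vec));
    // ... continue with the critical-path work here ...
}

Whether that actually wins anything depends on exactly the cross-thread allocator behaviour described above, so it's worth benchmarking both ways.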
This is exactly how C++/WinRT works, because people praising reference counting as a GC algorithm often forget about possible stack overflows (if the destructor/Drop trait is badly written), or stop-the-world pauses when there is a domino effect of reference counts reaching zero in a graph or tree structure.
So in C++/WinRT, which is basically the current C++ projection for COM and WinRT components, the framework moves the objects into a background thread before deletion, so that those issues don't affect the performance of the main execution thread.
And given it is done by the same team, I would bet Rust/Windows-rs has the same optimization in place for COM/WinRT components.
Some allocators may even "hold" on to the freed (from another thread) memory, until the original thread deletes it (which is not the case here), or that thread dies, and then they go on "gc"-ing it.
I read this a few weeks ago, and was inspired to do some experiments of my own on compiler explorer, and found the following interesting things:
// This compiles down to a memmove() call:
let my_vec: Vec<_> = my_vec.into_iter().skip(n).collect();

// This results in significantly smaller machine code than `v.retain(f)`:
let v: Vec<_> = v.into_iter().filter(f).collect();
This was all with -C opt-level=2. I only looked at generated code size, didn't have time to benchmark any of these.
> Even if it were stable, it only works with slices of primitive types, so we’d have to lose our newtypes (SymbolId etc).
That's weird. I'd expect it to work with _any_ type, primitive or not, newtype or not, with a sufficiently simple memory layout, the rough equivalent of what C++ calls a "standard-layout type" or (formerly) a "POD".
I don't like magical stdlibs and I don't like user types being less powerful than built-in ones.
Clever workaround doing a no-op transformation of the whole vector though! Very nearly zero-cost.
> It would be possible to ensure that the proper Vec was restored for use-cases where that was important, however it would add extra complexity and might be enough to convince me that it’d be better to just use transmute.
Great example of Rust being built such that you have to deal with error returns and think about C++-style exception safety.
> The optimisation in the Rust standard library that allows reuse of the heap allocation will only actually work if the size and alignment of T and U are the same
Shouldn't it work when T and U are the same size and T has stricter alignment requirements than U but not exactly the same alignment? In this situation, any U would be properly aligned because T is even more aligned.
> I'd expect it to work with _any_ type, primitive or not, newtype or not, with a sufficiently simple memory layout, the rough equivalent of what C++ calls a "standard-layout type" or (formerly) a "POD".
This might be related in part to the fact that Rust chose to create specific AtomicU8/AtomicU16/etc. types instead of going for Atomic<T> like in C++. The reasoning for forgoing the latter is [0]:
> However the consensus was that having unsupported atomic types either fail at monomorphization time or fall back to lock-based implementations was undesirable.
That doesn't mean that one couldn't hypothetically try to write from_mut_slice<T> where T is a transparent newtype over one of the supported atomics, but I'm not sure whether that function signature is expressible at the moment. Maybe if/when safe transmutes land, since from_mut_slice is basically just doing a transmute?
[0]: https://github.com/Amanieu/rfcs/blob/more_atomic_types/text/...
> Shouldn't it work when T and U are the same size and T has stricter alignment requirements than U but not exactly the same alignment? In this situation, any U would be properly aligned because T is even more aligned.
I think this optimization does what you say? A quick skim of the source code [1] seems to show that the alignments don't have to exactly match:
//! # Layout constraints
//! <snip>
//! Alignments of `T` must be the same or larger than `U`. Since alignments are always a power
//! of two _larger_ implies _is a multiple of_.
And later:
const fn in_place_collectible<DEST, SRC>(
    step_merge: Option<NonZeroUsize>,
    step_expand: Option<NonZeroUsize>,
) -> bool {
    if const { SRC::IS_ZST || DEST::IS_ZST || mem::align_of::<SRC>() < mem::align_of::<DEST>() } {
        return false;
    }
    // Other code that deals with non-alignment conditions
}
[1]: https://github.com/rust-lang/rust/blob/c58a5da7d48ff3887afe4...
> I think this optimization does what you say?
Cool. Thanks for checking! I guess the article should be tweaked a bit: it states that the alignment has to match exactly.
> Great example of Rust being built such that you have to deal with error returns and think about C++-style exception safety.
Not really. Panics are supposed to be used in super exceptional situations, where the only course of action is to abort the whole unit of work you're doing and throw away all the resources. However you do have to be careful in critical code because things like integer overflow can also raise a panic.
> be careful in critical code because things like integer overflow can also raise a panic
So you can basically panic anywhere. I understand people have looked at no-panic markers (like C++ noexcept) but the proposals haven't gone anywhere. Consequently, you need to maintain the basic exception safety guarantee [1] at all times. In safe Rust, the compiler enforces this level of safety in most cases on its own, but there are situations in which you can temporarily violate program invariants and panic before being able to restore them. (A classic example is debiting from one bank account before crediting to another. If you panic in the middle, the money is lost.)
In unsafe Rust, you basically have the same burden of exception safety that C++ creates, except your job as an unsafe Rust programmer is harder than a C++ programmer's because Rust doesn't have a noexcept. Without noexcept, it's hard to reason about which calls can panic and which can't, so it's hard to make bulletproof cleanup paths.
Most Rust programmers don't think much about panics, so I assume most Rust programs are full of latent bugs of this sort. That's why I usually recommend panic=abort.
[1] https://en.wikipedia.org/wiki/Exception_safety#Classificatio...
If you want that bank code to be robust against panics, you need to use something like https://docs.rs/scopeguard/latest/scopeguard/
Rust's number types have functions like "wrapping_add" or "overflowing_add", which do not panic when overflowing and instead explicitly wrap around or return a result that must be checked.
You can easily write code that does not contain any possible panic points, if you want.
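A quick illustration of those non-panicking alternatives (plus checked_add, which returns an Option):

fn main() {
    let x: u8 = 250;

    // Explicit wraparound, never panics.
    assert_eq!(x.wrapping_add(10), 4);

    // Wrapped value plus a flag saying whether overflow happened.
    assert_eq!(x.overflowing_add(10), (4, true));

    // None on overflow, so the caller has to handle it.
    assert_eq!(x.checked_add(10), None);

    // By contrast, `x + 10` would panic in a debug build and wrap in a
    // default release build.
}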
I don't think it's quite as easy to guarantee panic freedom as you think.
For example: do logging frameworks guarantee no-panic behavior? People can add logging statements practically anywhere, especially in a large team that maintains a codebase over significant time. One innocuous-looking debug log added to a section of code that's temporarily violated invariants can end up putting the whole program into a state, post-panic, in which those invariants no longer hold.
A lot of experience tells us that this happens in practice in C++, Java, Python, and other exception-ful languages. Maybe it happens less in Rust, but I'd be shocked if this class of bugs were absent.
Note that I'm talking about both safe and unsafe code. A safe section of code that panics unexpectedly might preserve memory safety invariants but hork the higher-level logical invariants of your application. You can end up with security vulnerabilities this way too.
Imagine an attacker who can force a panic in a network service, aborting his request but not killing the server, such that the panic on his next request grants him some kind of access he shouldn't have had due to the panic leaving the program in a bad state.
I'm not seeing Rust people take this problem as seriously as I think is warranted.
> A safe section of code that panics unexpectedly might preserve memory safety invariants but hork the higher-level logical invariants of your application
The usual way of dealing with this is to use impl Drop to clean up properly. Resources are guaranteed to be dropped as expected on panic unwinds. E.g. the database transaction rolls back if dropped without committing.
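A minimal sketch of that pattern, with a made-up Transaction type; the rollback lives in Drop, so it runs whether the scope exits normally, early, or by unwinding:

struct Transaction {
    committed: bool,
}

impl Transaction {
    fn begin() -> Self {
        // ... issue BEGIN to the database ...
        Transaction { committed: false }
    }

    fn commit(mut self) {
        // ... issue COMMIT ...
        self.committed = true;
        // `self` is dropped here, but the flag stops Drop from rolling back.
    }
}

impl Drop for Transaction {
    fn drop(&mut self) {
        if !self.committed {
            // ... issue ROLLBACK ...
            // This runs on early returns and on panic unwinds alike.
        }
    }
}

The transaction guards in the common database crates behave roughly like this.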
> Imagine an attacker who can force a panic in a network service, aborting his request but not killing the server, such that the panic on his next request grants him some kind of access he shouldn't have had due to the panic leaving the program in a bad state.
You need to be more specific. Why would the web server be left in a bad state because of such panics (in safe Rust)? All the memory will be cleaned up, all the database transactions will be cleaned up, mutexes might get poisoned, but that's considered a bug and it'll just cause another panic the next time someone tries to lock the mutex.
> I don't think it's quite as easy to guarantee panic freedom as you think.
Sure, it's not trivial, but plenty of people who need this seem to do it.
https://crates.io/crates/no-panic
Huh. That's neat.
> If you panic in the middle, the money is lost.
Wouldn't a relational database help deal with this?
> Without noexcept, it's hard to reason about which calls can panic and which can't, so it's hard to make bulletproof cleanup paths.
Unsafe blocks usually contain very low level code, so you can understand what your code does very accurately. If the unsafe code calls a dependency which calls 150 other dependencies transitively, yeah, that's going to be pretty bad.
> Wouldn't a relational database help deal with this?
Sure. It's just a toy example. There are lots of real programs in which you temporarily violate invariants in the course of performing some task, then restore them after. Another example that comes to mind is search tree rotations.
> Unsafe blocks usually contain very low level code, so you can understand what your code does very accurately.
Perhaps at first, but code evolves.
> However you do have to be careful in critical code because things like integer overflow can also raise a panic.
This is incorrect. Only in debug builds does it raise a panic. In release Rust has to make the performance tradeoff that C++ does and defines signed integer math to wrap 2’s complement. Only in debug will signed overflow panic. Unsigned math never panics - it’s always going to overflow 2’s complement.
> Only in debug builds does it raise a panic.
Correctness in debug builds is important, isn't it?
That said, panic on integer overflow in debug builds is unfortunate behavior. Overflow should cause an abort, not a panic.
> make the performance tradeoff that C++ does and defines signed integer math to wrap 2’s complement
In C++, signed overflow is undefined behavior, not wraparound. This property is useful to the optimizer for things like inferring loop bounds. The optimizer has less flexibility in equivalent Rust code.
You can choose whether panics immediately abort, and you can also choose whether integer overflow panics in release builds.
Personally I would often choose both, overflow panics and also panics abort, so if we overflow we blow up immediately.
What's the rationale behind aborting and not panicking in debug? Unwinding and flushing buffers seems like a better default with debug binaries.
You can enable overflow panics in release builds, so if you're a library, you have to play it safe because you don't know how people will build your library.
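For reference, that release-build switch is a Cargo profile setting controlled by whoever builds the final binary, which is why a library can't assume either behaviour:

# Cargo.toml of the final binary, not of the library
[profile.release]
overflow-checks = true   # integer overflow panics even in release builds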
I don't like relying on (release-only) llvm optimizations for a number of reasons, but primarily a) they break between releases, more often than you'd think, b) they're part of the reason why debug builds of rust software are so much slower (at runtime) than release builds, c) they're much harder to verify (and very opaque).
For non-performance-sensitive code, sure, go ahead and rely on the rust compiler to compile away the allocation of a whole new vector of a different type to convert from T to AtomicT, but where the performance matters, for my money I would go with the transmute 100% of the time (assuming the underlying type was decorated with #[repr(transparent)], though it would be nice if we could statically assert that). It'll perform better in debug mode, it's obvious what you are doing, it's guaranteed not to break in a minor rustc update, and it'll work with &mut [T] instead of an owned Vec<T> (which is a big one).
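A sketch of that transmute-style alternative, reusing the article's SymbolId/AtomicSymbolId names and spelling the layout assumptions out as compile-time asserts, so they really are statically checked:

use std::mem::{align_of, size_of};
use std::slice;
use std::sync::atomic::AtomicU32;

#[repr(transparent)]
struct SymbolId(u32);

#[repr(transparent)]
struct AtomicSymbolId(AtomicU32);

// If the layouts ever diverge, compilation fails here instead of becoming UB below.
const _: () = assert!(size_of::<SymbolId>() == size_of::<AtomicSymbolId>());
const _: () = assert!(align_of::<SymbolId>() == align_of::<AtomicSymbolId>());

fn as_atomic(ids: &mut [SymbolId]) -> &mut [AtomicSymbolId] {
    let len = ids.len();
    let ptr = ids.as_mut_ptr() as *mut AtomicSymbolId;
    // SAFETY: both types are repr(transparent) wrappers with identical size and
    // alignment (checked above), AtomicU32 has the same in-memory representation
    // as u32, and the &mut borrow guarantees exclusive access for the lifetime
    // of the returned slice.
    unsafe { slice::from_raw_parts_mut(ptr, len) }
}

This works on a borrowed &mut [SymbolId], behaves the same in debug builds, and doesn't depend on the standard library's collect specialisation staying in place.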
The particular optimisation for non-copying Vec -> IntoIter -> Vec transform is actually hard coded in the standard library as a special case of collecting an Iterator into a Vec. It doesn't rely on the backend for this.
Though this optimisation is treated as an implementation detail [1].
[1]: https://doc.rust-lang.org/stable/std/vec/struct.Vec.html#imp...
Thanks for pointing that out, TIL!
> Now that we have a Vec with no non-static lifetimes, we can safely move it to another thread.
I liked most of the tricks but this one seems pointless. This is no different than transmute as accessing the borrower requires an assume_init which I believe is technically UB when called on an uninit. Unless the point is that you’re going to be working with Owned but want to just transmute the Vec safely.
Overall I like the into_iter/collect trick to avoid unsafe. It was also most of the article, just various ways to apply this trick in different scenarios. Very neat!
You misunderstood the purpose of that trick. The vector is not going to be accessed again, the idea is to move it to another thread so it can be dropped in parallel (never accessed).
Every one of these "performance tricks" is describing how to convince rust's borrow checker that you're allowed to do a thing. It's more like "performance permission slips".
You don't have to play this game - you can always write within unsafe { ... } like in plain old C or C++. But people do choose to play this game because it helps them to write code that is also correct, where "correct" has an old-school meaning of "actually doing what it is supposed to do and not doing what it's not supposed to".
That just makes it seem like there's no point in using this language in the first place.
Don't let perfect be the enemy of good.
Software is built on abstractions. If all your app code is written without unsafe and you have one low-level unsafe block to allow for something, you get the value of Rust for all your app logic, and you know the actual bug is in the unsafe code.
This is like saying there’s no point having unprivileged users if you’re going to install sudo anyway.
The point is to escalate capability only when you need it, and you think carefully about it when you do. This prevents accidental mistakes having catastrophic outcomes everywhere else.
I think sudo is a great example. It's not much more secure than just logging in as root. It doesn't really protect against malicious attackers in practice. And in practice it's more of an annoyance than a protection against accidental mistakes.
Unsafe isn’t a security feature per se. I think this is where a lot of the misunderstanding comes from.
It’s a speed bump that makes you pause to think, and tells reviewers to look extra closely. It also gives you a clear boundary to reason about: it must be impossible for safe callers to trigger UB in your unsafe code.
That's my point; I think after a while you instinctively repeat a command with sudo tacked on (see XKCD), and I wonder if I'm any safer from myself like that?
I'm doubtful that those boundaries that you mention really work that well. I imagine that in practice you can easily trigger faulty behaviours in unsafe code from within safe code. Practical type systems are barely powerful enough to let you inject a proof of validity into the unsafe call. Making a contract at the safe/unsafe boundary statically enforceable (I'm not doubting people do manage to do it in practice, but...) probably requires a mountain of unessential complexity and/or runtime checks and less than optimal algorithms & data structures.
You are of course welcome to imagine whatever you want, but why not just try it for yourself? What you postulate simply doesn't match the actual experience of programming Rust.
> That's my point; I think after a while you instinctly repeat a command with sudo tacked on (see XKCD), and I wonder if I'm any safer from myself like that?
We agree that this is a dangerous / security-defeating habit to develop.
If someone realizes they're developing a pattern of such commands, it might be worth considering if there's an alternative. Some configuration or other suid binary which, being more specialized or tailor-purpose, might be able to accomplish the same task with lower risk than a generalized sudo command.
This is often a difficult task.
Some orgs introduce additional hurdles to sudo/admin access (especially to e.g. production machines) in part to break such habits and encourage developing such alternatives.
> unsafe
There are usually safe alternatives.
If you use linters which require you to write safety documentation every time you break out an `unsafe { ... }` block, and require documentation of preconditions every time you write a new `unsafe fn`, and you have coworkers who will insist on a proper soliloquy of justification every time you touch either?
The difficult task won't be writing the safe alternative, it will be writing the unsafe one. And perhaps that difficulty will sometimes be justified, but it's not nearly so habit forming.
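Concretely, the lint setup being described looks something like this (Clippy's undocumented_unsafe_blocks plus rustc's unsafe_op_in_unsafe_fn):

// At the crate root: every unsafe block needs a `// SAFETY:` comment, and
// unsafe operations inside an unsafe fn still need their own block.
#![deny(clippy::undocumented_unsafe_blocks)]
#![deny(unsafe_op_in_unsafe_fn)]

/// Returns the first element without a bounds check.
///
/// # Safety
/// The caller must guarantee that `v` is non-empty.
unsafe fn first_unchecked(v: &[u32]) -> u32 {
    // SAFETY: the caller promises `v` is non-empty, per this function's contract.
    unsafe { *v.get_unchecked(0) }
}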
Because only lines marked with unsafe are suspicious, instead of every line of code.
Also the community culture matters: even though static analysis has existed for C since 1979, it is still something we need to force-feed to many developers in the C and C++ world.
This is an issue that you would face in any language with strong typing.
It only rears its head in Rust because Rust tries to give you both low-level control and strong types.
For example, in something like Go (which has a weaker type system than Rust), you wouldn't think twice about paying for the re-allocation in the buffer-reuse example.
Of course, in something like C or C++ you could do these things via simple pointer casts, but then you run the risk of invoking undefined behaviour.
In C I wouldn't use such a fluffy high-level approach in the first place. I wouldn't use contiguous unbounded vec-slices. And no, I wouldn't attempt trickery with overwriting input buffers. That's a bad inflexible approach that will bite at the next refactor. Instead, I would first make sure there's a way to cheaply allocate fixed-size buffers (like 4 K buffers or whatever) and stream into those. Memory should be used in an allocate/write-once/release fashion whenever possible. This approach leads to straightforward, efficient architecture and bug-free code. It's also much better for concurrency/parallelism.
> In C I wouldn't use such a fluffy high-level approach in the first place.
Sure, though that's because C has abstraction like Mars has a breathable atmosphere.
> This approach leads to straightforward, efficient architecture and bug-free code. It's also much better for concurrency/parallelism.
This claim is wild considering that Rust code is more bug-free than C code while being just as efficient, while keeping in mind that Rust makes parallelism so much easier than C that it stops being funny and starts being tragic.
I'm not even sure what it means for a language to "have" abstractions. Abstractions are created by competent software engineers, according to the requirements. A language can have features that make creating certain kinds of abstractions easier -- for example type-abstractions. I've stopped thinking that type abstractions are all that important. Somehow creating those always leads to decision paralysis and scope creep, and using those always leads to layers of bloat and less than straightforward programs.
& and &mut are pretty fundamental Rust abstractions.
I can tell there were plenty of libraries doing that in the 2000s, back when it was common to write enterprise software in C.
Plenty of abstraction is possible using TUs as modules and applying Abstract Data Type design, while following the Yourdon structured method with C.
> straightforward, efficient architecture and bug-free code
The grace with which C handles projects of high complexity disagrees.
You get a simple implementation only by ignoring edge cases or improvements that increase complexity.
The idea that a language can handle any complexity for you is an illusion. A language can automate a lot of the boring and repetitive small scale work. And it can have some of the things you would have otherwise coded yourself as built-ins. However you still have to deal with the complexity caused by buying into these built-ins. The larger a project gets the more likely the built-ins are to get in the way, and the more likely you are to rewrite these features or sub-systems yourself.
I'd say, more fully featured languages are most useful for the simpler side of projects (granted some of them can scale quite a way up with proficient use).
Now go research how some of the most complex, flexible, and efficient pieces of software are written.
> The idea that a language can handle any complexity for you is an illusion
I think this is wrong on its face. Otherwise we wouldn't see any correlation between the language used and the highest-complexity programs achieved in it.
As recently mentioned on HN it takes huge amounts of assembly to achieve anything at all, and to say that C doesn't handle any of the complexity you have to deal with when writing assembly to achieve the same result is absurd.
EDIT:
> Now go research how some of the most complex, flexible, and efficient pieces of software are written.
I'm quite aware. To say that the choice of say, C++ in the LLVM or Chromium codebase doesn't help deal with the complexities they operate over, and that C would do just as well at their scale... well, I don't think history bears that out.
No, C doesn't actually handle the _complexity_ of writing assembly. It abstracts and automates a lot of the repetitive work of doing register allocation etc -- sure. But these are very local issues -- I think it's fair to say that the complexity of a C program isn't really much lower than the equivalent program hand-coded in assembler.
I'm not sure that LLVM would be the first consideration for complex, flexible, efficient? It's quite certainly not fast, in particular linking isn't. I'm not sure about Chromium, it would be interesting to look at some of the more interesting components like V8, rendering engine, OS interfacing, the multimedia stack... and how they're actually written. I'd suspect the code isn't slinging shared_ptr's and unique_ptrs and lambdas and is keeping use of templates minimal.
I would have thought of the Linux kernel first and foremost. It's a truly massive architecture, built by a huge number of developers in a distributed fashion, with many intricate and highly optimized parts, impressive concurrency, scaling from very small machines to the biggest machines on the planet.
> I would have thought of the Linux kernel first and foremost. It's a truly massive architecture, built by a huge number of developers in a distributed fashion, with many intricate and highly optimized parts, impressive concurrency, scaling from very small machines to the biggest machines on the planet.
This is what the Linux kernel achieved, and when it started C was definitely the right choice for its (primary) implementation language. That doesn't mean C made the job easier, or that if Linux started today it would follow the same trajectory.
I would say Linux succeeds and has good abstractions due to strong discipline, in spite of what C provides it.
I've considered writing one. Why do you think what I'm describing here only applies to I/O (like syscalls)? And, more abstractly speaking -- isn't everything an I/O (like input/output) problem?
Because the data structures for a linker aren't dealing with byte buffers. I'm pretty sure you've got symbolic graph structures that you're manipulating. And as they mention, sometimes you have fairly large object collections that you're allocating and freeing, and fast 4k allocations don't help you with that kind of stuff. Consider that the link phase for something like Chrome can easily use 1 GiB of active memory and you see why your idea will probably fall flat on its face for what Wild is trying to accomplish in terms of being a super high performance, state-of-the-art linker.
No, without more context I don't immediately see how it would fall flat. If you're dealing with larger data, like gigabytes of data, it might be better to use chunks of 1 MB or something like that. My point is that stream processing is probably best done in chunks. Otherwise, you have to fit all the transformed data in RAM at once -- even when it's ephemeral. Wasting gigabytes without need isn't great.
Because you don’t know a priori whether any particular data isn’t needed anymore and linking is not a stream processing problem - you have to access any part of the “stream” multiple times randomly. Almost any part of the program can end up linking to another. So you do end up needing to keep it all in memory. And modern linkers need to do LTO which ends up needing to load all compilation units into memory at once to do that optimization.
But sure, if you’re that confident go for it. Writing a linker that can use a fraction of the memory, especially during LTO would be a ground breaking achievement.
From a skim, my impression was that there was a specific stream processing problem discussed but I probably have gotten that wrong, or a comment was edited. If I understand better now, in the OP, Rust streaming features are being used here to achieve essentially a typecast, and the streaming transformations (looking at each individual element) are supposed to be optimized away. And I restate that this is essentially solving a language-inflicted problem, not some interesting performance problem.
For sure, linking "graphs" generally have random connections so obviously I'm not saying that linking in general is a trivial stream processing problem.
Thinking about it though, linking graphs are _almost_ free of cyclic dependencies in practice (even when looking at entire chunks of code or whole object files as units), so there are some basic tricks we can use so we don't need to load all compilation units into RAM as I think you claimed we need to. I don't think it's necessary to load all the data in RAM.
The way I understand it, linking is essentially concatenating object files (or rather the sections contained in them) into a final executable. But while doing so, relocations must be applied, so object code must not be written before all symbols referenced by that code are known. This can be done in a granular fashion, based on fixed-size blocks. You collect all relocations in per-block lists (unsorted, think std::vector but it can be optimized). When a symbol or code location becomes known (i.e. looking at the library that has the required symbol), that information gets appended to the list. When a list is full (i.e. all relocations known) that block can be committed to the file while applying all fixups. Applying fixups is still random writes but only to fixed size blocks.
That way the problem is almost reduced to streaming fixed-size blocks in practice. Typically, all symbols that some object file depends on are resolved by the object files that were already added before. So most new chunks that we're looking at can immediately be fixed up and be streamed to the output file. (In fact some linkers are kind of strict about the order of objects that you pass on the command line.) What does grow more-or-less linearly though is the set of accumulated symbols (e.g., name + offset) as we add more and more object files to the output executable.
I don't know a lot about LTO but it seems just like an extension of the same problem. Balancing memory consumption and link times with output performance is partly the user's job. Most projects should probably just disable LTO even for release builds. But making LTO fast in a linker is probably all about using the right approach, doing things in the right order, grouping work to avoid reads and writes (both RAM and disk).
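Very roughly, the per-block bookkeeping described above could look like the sketch below (all names are made up; a real linker would use symbol indices and typed relocations rather than strings):

struct Relocation {
    offset: usize,  // where in the block the fixup gets patched
    symbol: String, // which symbol it is waiting for
}

struct Block {
    data: Vec<u8>,            // one fixed-size chunk of the output
    pending: Vec<Relocation>, // fixups still waiting on unresolved symbols
}

impl Block {
    // Called whenever a symbol's final address becomes known. Returns true once
    // the block has no outstanding fixups and can be flushed to the output file.
    fn resolve(&mut self, symbol: &str, address: u64) -> bool {
        let data = &mut self.data;
        self.pending.retain(|r| {
            if r.symbol == symbol {
                data[r.offset..r.offset + 8].copy_from_slice(&address.to_le_bytes());
                false // fixup applied, drop it
            } else {
                true // still waiting
            }
        });
        self.pending.is_empty()
    }
}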
> And I restate that this is essentially solving a language-inflicted problem, not some interesting performance problem.
Sort of yes, sort of no. Yes in that the main alternative (std::mem::transmute) is unsafe, so if you want to avoid dealing with that (e.g., "I personally like the challenge of doing things without unsafe if I can, especially if I can do so without loss of performance") then this can indeed be described as a language-inflicted problem. No in that this is arguably not an "inherent" language-inflicted problem, and there is work towards ways for the Rust compiler to guarantee that certain transmutes will work as expected in safe code, and I want to say that this would be one of those cases.
> in something like C or C++ you could do these things via simple pointer casts
No you don't. You explicitly start a new object lifetime at the address, either of the same type or a different type. There are standard mechanisms for this.
Developers that can't be bothered to do things correctly is why languages like Rust exist.
Accessing a Foo object through a pointer obtained from a plain reinterpret_cast to Bar* is UB in most cases (alignment aside, unless Bar is unsigned char, char, std::byte or a base class of Foo). It's obvious why: Foo and Bar may have constructors and destructors. You should use std::construct_at if you mean to create a new object there.
For implicit-lifetime types (IIRC, types with a trivial default constructor or that are aggregates, plus a trivial destructor), you can use memcpy, bit_cast and soon std::start_lifetime_as (to get a pointer) when it is implemented.
If I'm not mistaken, in C, the lifetime rules are more or less equivalent to implicitly using C++'s start_lifetime_as
Ironically, Rust doesn't need any of that, you literally can just cast to a different pointer type between arbitrary types and start using it without it inherently being UB (you know, as long as your access patterns are more generally valid).
That's because Rust doesn't have constructors/assignment operators, is it not? Because of that, all objects are trivially relocatable (sort of, because Rust allows destructors for these objects, I guess).
And strict aliasing is not a concern due to Rust's aliasing model, thus the combination of the two makes it safe to type-pun like that. But Rust's model has its downsides/is a tradeoff, so...
I don't particularly mind the C++ object model (since C++20), it makes sense after all: construct your object if it needs constructing, or materialize it through memcpy/bit_cast. std::start_lifetime_as should fix the last remaining usability issue with the model.
You're right that Rust doesn't have constructors or assignment operators, but its abstract machine also views memory as simply being a bag of bytes, rather than inherently typed like C/C++'s model.
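To make the earlier "just cast the pointer" point concrete, this sort of thing is fine in Rust as long as the access itself is valid; here the alignment of [u8; 4] is 1 and u8 has no invalid bit patterns:

fn main() {
    let x: u32 = 0xDEAD_BEEF;
    let p = &x as *const u32 as *const [u8; 4];
    // No object-lifetime ceremony needed: the bytes are just read through the
    // cast pointer. (The byte order follows the platform's endianness.)
    let bytes = unsafe { *p };
    println!("{:02x?}", bytes);
}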
When I read articles like this, I just relish how much Go, Zig, and Bun make my life so much easier in terms of solving performance issues with reasonable trade-offs.
It is more of a culture thing, most compiled languages have been fast enough for quite some time.
People using systems languages more often than not go down the rabbit hole of performance tuning, many times without a profiler, because it still isn't the amount of ms it is supposed to be.
In reality unless one is writing an OS component, rendering engine, some kind of real time constrained code, or server code for "Webscale", the performance is more than enough for 99% of the use cases, in any modern compiler.
...Except that Rust is thread-safe, so expressing your algorithm in terms that the borrow checker accepts makes safe parallelism possible, as shown in the example using Rayon to trivially parallelize an operation. This is the whole point of Rust, and to say that C and C++ fail at thread-safety would be the understatement of the century.
If you're writing C code that shares memory between threads without some sort of synchronization primitive, then as your doctor I'm afraid I'm going to have to ask you to sit down, because I have some very bad news for you.
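For anyone who hasn't seen it, the Rayon-style parallelism the article leans on is about this much code (made-up workload):

use rayon::prelude::*;

fn main() {
    let mut v: Vec<u64> = (0..1_000_000).collect();
    // Safe, data-race-free parallel update: the borrow checker and Rayon's API
    // guarantee the mutable chunks handed to worker threads don't alias.
    v.par_iter_mut().for_each(|x| *x = x.wrapping_mul(31).wrapping_add(7));
    println!("{}", v[42]);
}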
Yup -- yet another article only solving language level problems instead of teaching something about real constraints (i.e. hardware performance characteristics). Booooring. This kind of article is why I still haven't mustered the energy to get up to date with Rust. I'm still writing C (or C-in-C++) and having fun, most of the time feeling like I'm solving actual technical problems.
This was an article distilled from a talk at a Rust developers conference. Obviously it's going to make most sense to Rust devs, and will seem unnecessary to non-Rust devs.
> It’d be reasonable to think that this will have a runtime cost, however it doesn’t. The reason is that the Rust standard library has a nice optimisation in it that when we consume a Vec and collect the result into a new Vec, in many circumstances, the heap allocation of the original Vec can be reused. This applies in this case. But even with the heap allocation being reused, we’re still looping over all the elements to transform them right? Because the in-memory representation of an AtomicSymbolId is identical to that of a SymbolId, our loop becomes a no-op and is optimised away.
Those optimisations that this code relies on are literally undefined behaviour. The compiler doesn't guarantee it's gonna apply those optimisations. So your code might suddenly become super slow and you'll have to go digging in to see why. Is this undefined behaviour better than just having an unsafe block? I'm not so sure. The unsafe code will be easier to read and you won't need any comments or a blog to explain why we're doing voodoo stuff because the logic of the code will explain its intentions.
It cannot guarantee it for arbitrary iterators, but the map().collect() re-use is well known, and the machinery is there to do this, so while other implementations may not, rustc always will.
Basically, it is implementation-defined behavior. (If it were C/C++ it would be 'unspecified behavior' because rustc does not document exactly when it does this, but this is a very fine nitpick and not language Rust currently uses, though I'd argue it should.)
> So your code might suddenly become super slow and you'll have to go digging in to see why.
That's why wild has performance tests, to ensure that if a change breaks rustc's ability to optimize, it'll be noticed, and therefore fixed.
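One cheap way for such a test to notice if the reuse ever silently disappears is to compare the buffer's data pointer across the round trip; this held when I tried it, but as noted it's an implementation detail rather than a guarantee:

fn main() {
    let v: Vec<u32> = (0..1024).collect();
    let before = v.as_ptr() as usize;

    // Same element size and alignment on both sides, so std's in-place collect
    // specialisation can kick in.
    let v2: Vec<i32> = v.into_iter().map(|x| x as i32).collect();
    let after = v2.as_ptr() as usize;

    println!("allocation reused: {}", before == after);
}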
> That's why wild has performance tests, to ensure that if a change breaks rustc's ability to optimize, it'll be noticed, and therefore fixed.
But benchmarks won't tell us which optimisation suddenly stopped working. This looks so similar to the argument against UB to me. Something breaks, but you don't know what, where, and why.
I see. These optimisations might not be UB as understood in compiler lingo, but they are a kind of "undefined behaviour", as in anything could happen. And honestly the problems it might cause don't look that different from those caused by UB (in the compiler-lingo sense). Not to mention, using unsafe for writing optimised code will generate same-ish code in both debug and release mode, so DX will be better too.
As an example, parts of the C++ standard library (none of the core language I believe though) are covered by complexity requirements but implementations can still vary widely, e.g. std::sort needs to be linearithmic but someone could still implement a very slow version without it being UB (even if it was quadratic or something it still wouldn't be UB but wouldn't be standards conforming).
UB is really about the observable behavior of the abstract machine which is limited to the reads/writes to volatile data and I/O library calls [1]
I understand why Alexander Stepanov thought the complexity requirements were a good idea, but I am not convinced that in practice this delivers value. Worse, I don't see much sign C++ programmers care.
You mentioned particularly the C++ unstable sort std::sort. Famously, although C++11 finally guarantees O(n log n) worst-case complexity, the libc++ stdlib didn't conform. They'd shipped worst-case O(n squared) instead.
The bug report saying essentially "Hey, your sort is defective", was opened in 2014. By Orson Peters. It took until 2021 to fix it.
The optimization not getting applied doesn't mean that "anything could happen". Your code would just run slower. The result of this computation would still be correct and would match what you would expect to happen. This is the opposite of undefined behaviour, where the result is literally undefined, and, in particular, can be garbage.
And later: [0]: https://github.com/Amanieu/rfcs/blob/more_atomic_types/text/...[1]: https://github.com/rust-lang/rust/blob/c58a5da7d48ff3887afe4...
> I think this optimization does what you say?
Cool. Thanks for checking! I guess the article should be tweaked a bit --- it states that the alignment has to match exactly.
> Great example of Rust being built such that you have to deal with error returns and think about C++-style exception safety.
Not really. Panics are supposed to be used in super exceptional situations, where the only course of action is to abort the whole unit of work you're doing and throw away all the resources. However you do have to be careful in critical code because things like integer overflow can also raise a panic.
> be careful in critical code because things like integer overflow can also raise a panic
So you can basically panic anywhere. I understand people have looked at no-panic markers (like C++ noexcept) but the proposals haven't gone anywhere. Consequently, you need to maintain the basic exception safety guarantee [1] at all times. In safe Rust, the compiler enforces this level of safety in most cases on its own, but there are situations in which you can temporarily violate program invariants and panic before being able to restore them. (A classic example is debiting from one bank account before crediting to another. If you panic in the middle, the money is lost.)
If you want that bank code to be robust against panics, you need to use something like https://docs.rs/scopeguard/latest/scopeguard/
In unsafe Rust, you basically have the same burden of exception safety that C++ creates, except your job as an unsafe Rust programmer is harder than a C++ programmer's because Rust doesn't have a noexcept. Without noexcept, it's hard to reason about which calls can panic and which can't, so it's hard to make bulletproof cleanup paths.
Most Rust programmers don't think much about panics, so I assume most Rust programs are full of latent bugs of this sort. That's why I usually recommend panic=abort.
[1] https://en.wikipedia.org/wiki/Exception_safety#Classificatio...
Rust's number types have functions like "wrapping_add" or "overflowing_add", which do not panic when overflowing and instead explicitly wrap around or return a result that must be checked.
You can easily write code that does not contain any possible panic points, if you want.
I don't think it's quite as easy to guarantee panic freedom as you think.
For example: do logging frameworks guarantee no-panic behavior? People can add logging statements practically anywhere, especially in a large team that maintains a codebase over significant time. One innocuous-looking debug log added to a section of code that's temporarily violated invariants can end up putting the whole program into a state, post-panic, in which those invariants no longer hold.
A lot of experience tells us that this happens in practice in C++, Java, Python, and other excpeption-ful languages. Maybe it happens less in Rust, but I'd be shocked if this class of bugs were absent.
Note that I'm talking about both safe and unsafe code. A safe section of code that panics unexpectedly might preserve memory safety invariants but hork the higher-level logical invariants of your application. You can end up with security vulnerabilities this way too.
Imagine an attacker who can force a panic in a network service, aborting his request but not killing the server, such that the panic on his next request grants him some kind of access he shouldn't have had due to the panic leaving the program in a bad state.
I'm not seeing Rust people take this problem as seriously as I think is warranted.
> A safe section of code that panics unexpectedly might preserve memory safety invariants but hork the higher-level logical invariants of your application
The usual way of dealing with this is to use impl Drop to cleanup properly. Resources are guaranteed to be dropped as expected on panic unwinds. Eg the database transaction rolls back if dropped without committing.
> Imagine an attacker who can force a panic in a network service, aborting his request but not killing the server, such that the panic on his next request grants him some kind of access he shouldn't have had due to the panic leaving the program in a bad state.
You need to be more specific. Why would the web server be left in a bad state because of such panics (in safe rust). All the memory will be cleaned up, all the database transactions will be cleaned up, mutexes might get poisoned, but that's considered a bug and it'll just cause another panic the next time someone tries to lock the mutex.
Sure, it's not trivial, but plenty of people who need this seem to do it.
https://crates.io/crates/no-panic
Huh. That's neat.
> If you panic in the middle, the money is lost.
Wouldn't a relational database help deal with this?
> Without noexcept, it's hard to reason about which calls can panic and which can't, so it's hard to make bulletproof cleanup paths.
Unsafe blocks usually contain very low level code, so you can understand what your code does very accurately. If the unsafe code calls a dependency which calls 150 other dependencies transitively, yeah, that's going to be pretty bad.
> Wouldn't a relational database help deal with this?
Sure. It's just a toy example. There are lots of real programs in which you temporarily violate invariants in the course of performing some task, then restore them after. Another example that comes to mind is search tree rotations.
> Unsafe blocks usually contain very low level code, so you can understand what your code does very accurately.
Perhaps at first, but code evolves.
> However you do have to be careful in critical code because things like integer overflow can also raise a panic.
This is incorrect. Only in debug builds does it raise a panic. In release Rust has to make the performance tradeoff that C++ does and defines signed integer math to wrap 2’s complement. Only in debug will signed overflow panic. Unsigned math never panics - it’s always going to overflow 2’s complement.
> Only in debug builds does it raise a panic.
Correctness in debug builds is important, isn't it?
That said, panic on integer overflow in debug builds is unfortunate behavior. Overflow should cause an abort, not a panic.
> make the performance tradeoff that C++ does and defines signed integer math to wrap 2’s complement
In C++, signed overflow is undefined behavior, not wraparound. This property is useful to the optimizer for things like inferring loop bounds. The optimizer has less flexibility in equivalent Rust code.
You can choose whether panics immediately abort, and you can also choose whether integer overflow panics in releas builds.
Personally I would often choose both, overflow panics and also panics abort, so if we overflow we blow up immediately.
What's the rationale behind aborting and not panicking in debug? Unwinding and flushing buffers seems like a better default with debug binaries.
You can enable overflow panics in release build, so if you're a library, you have to play it safe because you don't know how people will build your library.
I don't like relying on (release-only) llvm optimizations for a number of reasons, but primarily a) they break between releases, more often than you'd think, b) they're part of the reason why debug builds of rust software are so much slower (at runtime) than release builds, c) they're much harder to verify (and very opaque).
For non-performance-sensitive code, sure, go ahead and rely on the rust compiler to compile away the allocation of a whole new vector of a different type to convert from T to AtomicT, but where the performance matters, for my money I would go with the transmute 100% of the time (assuming the underlying type was decorated with #[transparent], though it would be nice if we could statically assert that). It'll perform better in debug mode, it's obvious what you are doing, it's guaranteed not to break in a minor rustc update, and it'll work with &mut [T] instead of an owned Vec<T> (which is a big one).
The particular optimisation for non-copying Vec -> IntoIter -> Vec transform is actually hard coded in the standard library as a special case of collecting an Iterator into a Vec. It doesn't rely on the backend for this.
Though this optimisation is treated as an implementation detail [1].
[1]: https://doc.rust-lang.org/stable/std/vec/struct.Vec.html#imp...
Thanks for pointing that out, TIL!
> Now that we have a Vec with no non-static lifetimes, we can safely move it to another thread.
I liked most of the tricks but this one seems pointless. This is no different than transmute as accessing the borrower requires an assume_init which I believe is technically UB when called on an uninit. Unless the point is that you’re going to be working with Owned but want to just transmute the Vec safely.
Overall I like the into_iter/collect trick to avoid unsafe. It was also most of the article, just various ways to apply this trick in different scenarios. Very neat!
You misunderstood the purpose of that trick. The vector is not going to be accessed again, the idea is to move it to another thread so it can be dropped in parallel (never accessed).
Every one of these "performance tricks" is describing how to convince rust's borrow checker that you're allowed to do a thing. It's more like "performance permission slips".
You don't have to play this game - you can always write within unsafe { ... } like in plain old C or C++. But people do choose to play this game because it helps them to write code that is also correct, where "correct" has an old-school meaning of "actually doing what it is supposed to do and not doing what it's not supposed to".
That just makes it seem like there's no point in using this language in the first place.
Dont let perfect be the enemy of good.
Software is built on abstractions - if all your app code is written without unsafe and you have one low level unsafe block to allow for something, you get the value of rust for all your app logic and you know the actual bug is in the unsafe code
This is like saying there’s no point having unprivileged users if you’re going to install sudo anyway.
The point is to escalate capability only when you need it, and you think carefully about it when you do. This prevents accidental mistakes having catastrophic outcomes everywhere else.
I think sudo is a great example. It's not much more secure than just logging in at root. It doesn't really protect malicious attackers in practice. And it's more of an annoyance than it protects against accidental mistakes in practice.
Unsafe isn’t a security feature per se. I think this is where a lot of the misunderstanding comes from.
It’s a speed bump that makes you pause to think, and tells reviewers to look extra closely. It also gives you a clear boundary to reason about: it must be impossible for safe callers to trigger UB in your unsafe code.
That's my point; I think after a while you instinctly repeat a command with sudo tacked on (see XKCD), and I wonder if I'm any safer from myself like that?
I'm doubtful that those boundaries that you mention really work so great. I imagine that in practice you can easily trigger faulty behaviours in unsafe code from within safe code. Practical type systems are barely powerful enough to let you inject a proof of valid-state into the unsafe-call. Making a contract at the safe/unsafe boundary statically enforceable (I'm not doubting people do manage to do it in practice but...) probably requires a mountain of unessential complexity and/or runtime checks and less than optimal algorithms & data structures.
> That's my point; I think after a while you instinctly repeat a command with sudo tacked on (see XKCD), and I wonder if I'm any safer from myself like that?
We agree that this is a dangerous / security-defeating habit to develop.
If someone realizes they're developing a pattern of such commands, it might be worth considering if there's an alternative. Some configuration or other suid binary which, being more specialized or tailor-purpouse, might be able to accomplish the same task with lower risk than a generalized sudo command.
This is often a difficult task.
Some orgs introduce additional hurdles to sudo/admin access (especially to e.g. production machines) in part to break such habits and encourage developing such alternatives.
> unsafe
There are usually safe alternatives.
If you use linters which require you to write safety documentation every time you break out an `unsafe { ... }` block, and require documentation of preconditions every time you write a new `unsafe fn`, and you have coworkers who will insist on a proper soliloquy of justification every time you touch either?
The difficult task won't be writing the safe alternative, it will be writing the unsafe one. And perhaps that difficulty will sometimes be justified, but it's not nearly so habit forming.
You are of course welcome to imagine whatever you want, but why not just try it for yourself?
What you postulate simply doesn’t match the actual experience of programming Rust
Because only lines marked with unsafe are suspicious, instead of every line of code.
Also the community culture matters, even though static analysis exists for C since 1979, it is still something we need to force feed many developers on C and C++ world.
This is an issue that you would face in any language with strong typing. It only rears its head in Rust because Rust tries to give you both low-level control and strong types.
For example, in something like Go (which has a weaker type system than Rust), you wouldn't think twice about, paying for the re-allocation in buffer-reuse example.
Of course, in something like C or C++ you could do these things via simple pointer casts, but then you run the risk of violating some undefined behaviour.
In C I wouldn't use such a fluffy high-level approach in the first place. I wouldn't use contiguous unbounded vec-slices. And no, I wouldn't attempt trickery with overwriting input buffers. That's a bad inflexible approach that will bite at the next refactor. Instead, I would first make sure there's a way to cheaply allocate fixed size buffers (like 4 K buffers or whatever) and stream into those. Memory should be used in a allocate/write-once/release fashion whenever possible. This approach leads to straightforward, efficient architecture and bug-free code. It's also much better for concurrency/parallelism.
> In C I wouldn't use such a fluffy high-level approach in the first place.
Sure, though that's because C has abstraction like Mars has a breathable atmosphere.
> This approach leads to straightforward, efficient architecture and bug-free code. It's also much better for concurrency/parallelism.
This claim is wild considering that Rust code is more bug-free than C code while being just as efficient, and that Rust makes parallelism so much easier than C that it stops being funny and starts being tragic.
I'm not even sure what it means for a language to "have" abstractions. Abstractions are created by competent software engineers, according to the requirements. A language can have features that make creating certain kinds of abstractions easier -- for example type-abstractions. I've stopped thinking that type abstractions are all that important. Somehow creating those always leads to decision paralysis and scope creep, and using those always leads to layers of bloat and less than straightforward programs.
& and &mut are pretty fundamental Rust abstractions.
I can tell you there were plenty of libraries doing that in the 2000s, back when it was common to write enterprise software in C.
Plenty of abstraction is possible using TUs as modules, applying Abstract Data Type design, and following the Yourdon structured method with C.
> straightforward, efficient architecture and bug-free code
The grace with which C handles projects of high complexity disagrees.
You get a simple implementation only by ignoring edge cases or improvements that increase complexity.
The idea that a language can handle any complexity for you is an illusion. A language can automate a lot of the boring and repetitive small-scale work, and it can have some of the things you would otherwise have coded yourself as built-ins. However, you still have to deal with the complexity caused by buying into those built-ins. The larger a project gets, the more likely the built-ins are to get in the way, and the more likely you are to rewrite those features or sub-systems yourself.
I'd say more fully featured languages are most useful for the simpler side of projects (granted, some of them can scale quite a way up with proficient use).
Now go research how some of the most complex, flexible, and efficient pieces of software are written.
> The idea that a language can handle any complexity for you is an illusion
I think this is wrong on its face. Otherwise we wouldn't see any correlation between the language used and the most complex programs achieved in it, but we do.
As recently mentioned on HN, it takes huge amounts of assembly to achieve anything at all, and to say that C doesn't handle any of the complexity you would have to deal with when writing assembly to achieve the same result is absurd.
EDIT: > Now go research how some of the most complex, flexible, and efficient pieces of software are written.
I'm quite aware. To say that the choice of, say, C++ in the LLVM or Chromium codebases doesn't help deal with the complexities they operate over, and that C would do just as well at their scale... well, I don't think history bears that out.
No, C doesn't actually handle the _complexity_ of writing assembly. It abstracts and automates a lot of the repetitive work of doing register allocation etc -- sure. But these are very local issues -- I think it's fair to say that the complexity of a C program isn't really much lower than the equivalent program hand-coded in assembler.
I'm not sure LLVM would be my first example of complex, flexible, and efficient? It's quite certainly not fast; in particular, linking isn't. I'm not sure about Chromium; it would be interesting to look at some of the more interesting components like V8, the rendering engine, OS interfacing, the multimedia stack... and how they're actually written. I'd suspect the code isn't slinging shared_ptrs and unique_ptrs and lambdas, and is keeping use of templates minimal.
I would have thought of the Linux kernel first and foremost. It's a truly massive architecture, built by a huge number of developers in a distributed fashion, with many intricate and highly optimized parts, impressive concurrency, scaling from very small machines to the biggest machines on the planet.
> I would have thought of the Linux kernel first and foremost. It's a truly massive architecture, built by a huge number of developers in a distributed fashion, with many intricate and highly optimized parts, impressive concurrency, scaling from very small machines to the biggest machines on the planet.
This is what the Linux kernel achieved, and when it started C was definitely the right choice for its (primary) implementation language. That doesn't mean C made the job easier, or that if Linux started today it would follow the same trajectory.
I would say Linux succeeds and has good abstractions due to strong discipline, in spite of what C provides it.
Have you written a linker before? It sounds like you’re describing I/O but that isn’t the work being done here by the linker.
I've considered writing one. Why do you think what I'm describing here only applies to I/O (like syscalls)? And, more abstractly speaking -- isn't everything an I/O (like input/output) problem?
Because the data structures for a linker aren't dealing with byte buffers. I'm pretty sure you've got symbolic graph structures that you're manipulating. And as they mention, sometimes you have fairly large object collections that you're allocating and freeing, and fast 4 KiB allocations don't help you with that kind of stuff. Consider that the link phase for something like Chrome can easily use 1 GiB of active memory, and you see why your idea will probably fall flat on its face for what Wild is trying to accomplish in terms of being a super-high-performance, state-of-the-art linker.
No, without more context I don't immediately see how it would fall flat. If you're dealing with larger data, like gigabytes of data, it might be better to use chunks of 1 MB or something like that. My point is that stream processing is probably best done in chunks. Otherwise, you have to fit all the transformed data in RAM at once, even when it's ephemeral. Wasting gigabytes without need isn't great.
Because you don't know a priori whether any particular piece of data is still needed, and linking is not a stream-processing problem - you have to access any part of the "stream" multiple times, randomly. Almost any part of the program can end up linking to another. So you do end up needing to keep it all in memory. And modern linkers need to do LTO, which ends up needing to load all compilation units into memory at once to do that optimization.
But sure, if you're that confident, go for it. Writing a linker that can use a fraction of the memory, especially during LTO, would be a groundbreaking achievement.
From a skim, my impression was that there was a specific stream-processing problem discussed, but I may have gotten that wrong, or a comment was edited. If I understand it better now, in the OP, Rust streaming features are being used to achieve essentially a typecast, and the streaming transformations (looking at each individual element) are supposed to be optimized away. And I restate that this is essentially solving a language-inflicted problem, not some interesting performance problem.
For sure, linking "graphs" generally have random connections so obviously I'm not saying that linking in general is a trivial stream processing problem.
Thinking about it though, linking graphs are _almost_ free of cyclic dependencies in practice (even when looking at entire chunks of code or whole object files as units), so there are some basic tricks we can use to avoid loading all compilation units into RAM, as I think you claimed we need to.
The way I understand it, linking is essentially concatenating object files (or rather the sections contained in them) into a final executable. But while doing so, relocations must be applied, so object code must not be written before all symbols referenced by that code are known. This can be done in a granular fashion, based on fixed-size blocks. You collect all relocations in per-block lists (unsorted; think std::vector, but it can be optimized). When a symbol or code location becomes known (e.g. when looking at the library that has the required symbol), that information gets appended to the list. When a list is complete (i.e. all relocations are known), that block can be committed to the file while applying all fixups. Applying fixups is still random writes, but only within fixed-size blocks.
That way the problem is almost reduced to streaming fixed-size blocks in practice. Typically, all symbols that some object file depends on are resolved by the object files that were already added before, so most new chunks that we're looking at can immediately be fixed up and streamed to the output file. (In fact some linkers are kind of strict about the order of objects that you pass on the command line.) What does grow more-or-less linearly, though, is the set of accumulated symbols (e.g., name + offset) as we add more and more object files to the output executable.
I don't know a lot about LTO but it seems just like an extension of the same problem. Balancing memory consumption and link times with output performance is partly the user's job. Most projects should probably just disable LTO even for release builds. But making LTO fast in a linker is probably all about using the right approach, doing things in the right order, grouping work to avoid reads and writes (both RAM and disk).
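To make the per-block bookkeeping described a couple of paragraphs up a bit more concrete, here is a rough Rust sketch under the same assumptions (all names are invented; real relocations carry types and addends, and a real linker would stream committed blocks to the output file rather than just dropping them):

use std::collections::HashMap;

// One fixed-size output block plus the relocations that still have to be
// applied before it can be committed.
struct Block {
    data: Vec<u8>,                 // object code for this block
    pending: Vec<(usize, String)>, // (offset within block, symbol name)
}

#[derive(Default)]
struct StreamingLinker {
    symbols: HashMap<String, u64>, // resolved symbol -> final address
    blocks: Vec<Block>,            // blocks not yet committed
}

impl StreamingLinker {
    fn define_symbol(&mut self, name: &str, addr: u64) {
        self.symbols.insert(name.to_string(), addr);
        self.try_commit();
    }

    fn add_block(&mut self, data: Vec<u8>, relocs: Vec<(usize, String)>) {
        self.blocks.push(Block { data, pending: relocs });
        self.try_commit();
    }

    // Commit every block whose relocations are all resolvable, applying
    // the fixups as we go; only unresolved blocks stay in memory.
    fn try_commit(&mut self) {
        let symbols = &self.symbols;
        self.blocks.retain_mut(|block| {
            if block.pending.iter().all(|(_, sym)| symbols.contains_key(sym)) {
                for (offset, sym) in &block.pending {
                    let addr = symbols[sym].to_le_bytes();
                    block.data[*offset..*offset + 8].copy_from_slice(&addr);
                }
                false // fixed up: would be streamed to the output file here
            } else {
                true // keep until its symbols are known
            }
        });
    }
}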
> And I restate that this is essentially solving a language-inflicted problem, not some interesting performance problem.
Sort of yes, sort of no. Yes in that the main alternative (std::mem::transmute) is unsafe, so if you want to avoid dealing with that (e.g., "I personally like the challenge of doing things without unsafe if I can, especially if I can do so without loss of performance") then this can indeed be described as a language-inflicted problem. No in that this is arguably not an "inherent" language-inflicted problem, and there is work towards ways for the Rust compiler to guarantee that certain transmutes will work as expected in safe code, and I want to say that this would be one of those cases.
> in something like C or C++ you could do these things via simple pointer casts
No you don't. You explicitly start a new object lifetime at the address, either of the same type or a different type. There are standard mechanisms for this.
Developers that can't be bothered to do things correctly is why languages like Rust exist.
And that is safer... how?
For implicit-lifetime types (IIRC, types with trivial default constructors, or aggregates, plus trivial destructors), you can use memcpy, bit_cast, and soon std::start_lifetime_as (to get a pointer) once it is implemented.
If I'm not mistaken, in C the lifetime rules are more or less equivalent to implicitly using C++'s start_lifetime_as.
Ironically, Rust doesn't need any of that: you can literally just cast to a different pointer type between arbitrary types and start using it without that inherently being UB (as long as your access patterns are otherwise valid).
That's because Rust doesn't have constructors/assignment operators, is it not? Because of that, all objects are trivially relocatable (sort of, because Rust allows destructors for these objects, I guess).
And strict aliasing is not a concern due to Rust's aliasing model, so the combination of the two makes it valid to type-pun like that. But Rust's model has its downsides / is a tradeoff, so...
I don't particularly mind the C++ object model (since C++20); it makes sense after all: construct your object if it needs constructing, or materialize it through memcpy/bit_cast. std::start_lifetime_as should fix the last remaining usability issue with the model.
You're right that Rust doesn't have constructors or assignment operators, but its abstract machine also views memory as simply being a bag of bytes, rather than inherently typed like C/C++'s model.
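To illustrate that difference concretely, a minimal (hedged) sketch: in Rust, reinterpreting a slice of a #[repr(transparent)] newtype as a slice of its underlying type is just a pointer cast inside an unsafe block, with no lifetime-starting ceremony. The newtype here is a made-up stand-in loosely mirroring the article's:

#[repr(transparent)]
struct SymbolId(u32); // same layout as a plain u32

fn as_raw_ids(ids: &[SymbolId]) -> &[u32] {
    // SAFETY: SymbolId is #[repr(transparent)] over u32, so the element
    // layout is identical; pointer, length and lifetime are unchanged.
    unsafe { std::slice::from_raw_parts(ids.as_ptr() as *const u32, ids.len()) }
}

fn main() {
    let ids = [SymbolId(1), SymbolId(2), SymbolId(3)];
    let raw = as_raw_ids(&ids);
    assert_eq!(raw, &[1, 2, 3]);
}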
When I read articles like this, I just relish how much Go, Zig, and Bun make my life so much easier in terms of solving performance issues with reasonable trade-offs.
It is more of a culture thing; most compiled languages have been fast enough for quite some time.
People using systems languages more often than not go down the rabbit hole of performance tuning, many times without a profiler, because it still isn't the number of milliseconds it is supposed to be.
In reality, unless one is writing an OS component, a rendering engine, some kind of real-time-constrained code, or server code for "Webscale", the performance is more than enough for 99% of the use cases with any modern compiler.
...Except that Rust is thread-safe, so expressing your algorithm in terms that the borrow checker accepts makes safe parallelism possible, as shown in the example using Rayon to trivially parallelize an operation. This is the whole point of Rust, and to say that C and C++ fail at thread-safety would be the understatement of the century.
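For readers who haven't seen it, this is roughly the shape of the Rayon usage being referred to; it is an illustrative sketch rather than the article's code, and it assumes the rayon crate as a dependency:

use rayon::prelude::*;

fn main() {
    let mut values: Vec<u64> = (0..1_000_000).collect();

    // The borrow checker proves each element is mutated by exactly one
    // thread, so this parallel loop is data-race-free by construction.
    values.par_iter_mut().for_each(|v| *v = v.wrapping_mul(31) ^ 0x9e37);

    println!("first = {}, last = {}", values[0], values[values.len() - 1]);
}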
Memory-safe doesn't imply memory-efficient and reasonable. Wrapping everything in Arc is the opposite of simple & easy.
If you're writing C code that shares memory between threads without some sort of synchronization primitive, then as your doctor I'm afraid I'm going to have to ask you to sit down, because I have some very bad news for you.
I'm good, thanks.
Yup -- yet another article only solving language-level problems instead of teaching something about real constraints (i.e. hardware performance characteristics). Booooring. This kind of article is why I still haven't mustered the energy to get up to date with Rust. I'm still writing C (or C-in-C++) and having fun, most of the time feeling like I'm solving actual technical problems.
This was an article distilled from a talk at a Rust developers' conference. Obviously it's going to make the most sense to Rust devs, and will seem unnecessary to non-Rust devs.
The rayon thing is neat.
Just want to know some hacking tricks
> It’d be reasonable to think that this will have a runtime cost, however it doesn’t. The reason is that the Rust standard library has a nice optimisation in it that when we consume a Vec and collect the result into a new Vec, in many circumstances, the heap allocation of the original Vec can be reused. This applies in this case. But even with the heap allocation being reused, we’re still looping over all the elements to transform them, right? Because the in-memory representation of an AtomicSymbolId is identical to that of a SymbolId, our loop becomes a no-op and is optimised away.
Those optimisations that this code relies on are literally undefined behaviour. The compiler doesn't guarantee it's gonna apply those optimisations. So your code might suddenly become super slow and you'll have to go digging in to see why. Is this undefined behaviour better than just having an unsafe block? I'm not so sure. The unsafe code will be easier to read and you won't need any comments or a blog to explain why we're doing voodoo stuff because the logic of the code will explain its intentions.
> Those optimisations that this code relies on are literally undefined behaviour.
You cannot get undefined behavior in Rust without an unsafe block.
> The compiler doesn't guarantee it's gonna apply those optimisations.
This is a different concept than UB.
However, for the "heap allocation can be re-used", Rust does talk about this: https://doc.rust-lang.org/stable/std/vec/struct.Vec.html#imp...
It cannot guarantee it for arbitrary iterators, but the map().collect() re-use is well known, and the machinery is there to do this, so while other implementations may not, rustc always will.
Basically, it is implementation-defined behavior. (If it were C/C++ it would be 'unspecified behavior' because rustc does not document exactly when it does this, but this is a very fine nitpick and not language Rust currently uses, though I'd argue it should.)
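For reference, a hedged sketch of the map().collect() pattern under discussion; the newtypes are illustrative stand-ins for the article's SymbolId/AtomicSymbolId, whose exact definitions aren't shown here:

use std::sync::atomic::AtomicU32;

#[repr(transparent)]
struct SymbolId(u32);

#[repr(transparent)]
struct AtomicSymbolId(AtomicU32);

fn into_atomic(ids: Vec<SymbolId>) -> Vec<AtomicSymbolId> {
    // Both element types have the same size and alignment, so rustc's
    // in-place collect specialisation can reuse the original allocation,
    // and the per-element conversion compiles away. Neither effect is a
    // language-level guarantee, which is exactly the point being debated.
    ids.into_iter()
        .map(|SymbolId(n)| AtomicSymbolId(AtomicU32::new(n)))
        .collect()
}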
> So your code might suddenly become super slow and you'll have to go digging in to see why.
That's why wild has performance tests, to ensure that if a change breaks rustc's ability to optimize, it'll be noticed, and therefore fixed.
> That's why wild has performance tests, to ensure that if a change breaks rustc's ability to optimize, it'll be noticed, and therefore fixed.
But benchmarks won't tell us which optimisation suddenly stopped working. This looks so similar to the argument against UB to me. Something breaks, but you don't know what, where, and why.
It is true that it won't tell you, for sure. It's just that UB means something very specific when discussing language semantics.
I see. These optimisations might not be UB as understood in compiler lingo, but this is a kind of "undefined behaviour" in the sense that anything could happen. And honestly the problems it might cause don't look that different from those caused by UB (in the compiler-lingo sense). Not to mention, using unsafe for writing optimised code will generate same-ish code in both debug and release mode, so DX will be better too.
As an example, parts of the C++ standard library (though none of the core language, I believe) are covered by complexity requirements, but implementations can still vary widely. E.g. std::sort needs to be linearithmic, but someone could still implement a very slow version without it being UB (even if it were quadratic or something, it still wouldn't be UB, just not standards-conforming).
UB is really about the observable behavior of the abstract machine, which is limited to reads/writes of volatile data and I/O library calls. [1]
[1] http://open-std.org/jtc1/sc22/open/n2356/intro.html
Edit: to clarify the example
I understand why Alexander Stepanov thought the complexity requirements were a good idea, but I am not convinced that in practice this delivers value. Worse, I don't see much sign C++ programmers care.
You mentioned in particular the C++ unstable sort, std::sort. Famously, although C++11 finally guarantees O(n log n) worst-case complexity, the libc++ standard library didn't conform: it had shipped a worst case of O(n^2) instead.
The bug report saying essentially "Hey, your sort is defective" was opened in 2014 by Orson Peters. It took until 2021 to fix it.
The optimization not getting applied doesn't mean that "anything could happen". Your code would just run slower. The result of this computation would still be correct and would match what you would expect to happen. This is the opposite of undefined behaviour, where the result is literally undefined, and, in particular, can be garbage.
You're misusing the term "undefined behavior". You can certainly say that these kinds of performance optimizations aren't guaranteed.