I’d strongly caution against many of those “performance tricks.” Spawning an asynchronous task on a separate thread, often with a heap-allocated handle, solely to deallocate a local object is a dubious pattern — especially given how typical allocators behave under the hood.

I frequently encounter use-cases akin to the “Sharded Vec Writer” idea, and I agree it can be valuable. But if performance is a genuine requirement, the implementation needs to be very different. I once attempted to build a general-purpose trait for performing parallel in-place updates of a Vec<T>, and found it extremely difficult to express cleanly in Rust without degenerating into unsafe or brittle abstractions.
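For what it's worth, a limited version of this can now be expressed in safe Rust with scoped threads. This is a minimal sketch, not a general-purpose trait; `parallel_update` is a name I made up for illustration, and it only covers the simple disjoint-shards case:

```rust
use std::thread;

// Hypothetical helper: update a slice in place by splitting it into
// disjoint mutable shards, one per scoped thread. `chunks_mut` guarantees
// the shards never alias, so no `unsafe` is required.
fn parallel_update<T: Send, F>(data: &mut [T], shards: usize, f: F)
where
    F: Fn(&mut T) + Sync,
{
    // Ceiling division so every element lands in some shard.
    let chunk = ((data.len() + shards.max(1) - 1) / shards.max(1)).max(1);
    let f = &f;
    thread::scope(|s| {
        for shard in data.chunks_mut(chunk) {
            s.spawn(move || shard.iter_mut().for_each(f));
        }
    });
}

fn main() {
    let mut v: Vec<u64> = (0..8).collect();
    parallel_update(&mut v, 4, |x| *x *= 2);
    println!("{v:?}"); // [0, 2, 4, 6, 8, 10, 12, 14]
}
```

The hard part I ran into was generalizing beyond uniform chunks (e.g. shards chosen by a key function) without either `unsafe` or an awkward ownership dance.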

> especially given how typical allocators behave under the hood.

To say more about it: nearly any modern high-performance allocator maintains a per-thread (private) cache of freed chunks.

This is useful, for example, if you're repeatedly allocating and deallocating chunks of roughly the same size, since the thread can avoid entering the global part of the allocator (which generally requires locking, etc.).

If you make an allocation while the cache is empty, you have to go to the global allocator to refill your cache (usually with several chunks). Similarly, if you free and find your local cache is full, you will need to return some memory to the global allocator (usually you drain several chunks from your cache at once so that you don't hit this condition constantly).

If you are almost always allocating on one thread and deallocating on another, you end up increasing contention in the allocator, as you will (likely) end up filling/draining from the global allocator far more often than if you kept everything on just one thread. Depending on your specific application, maybe this performance loss is inconsequential compared to the value of not having to actually call free on some critical path, but it's a choice you should think carefully about and profile for.
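To make the pattern concrete, here is a hedged sketch of the "free on another thread" scheme being discussed; `drop_on_background_thread` is an illustrative name, not anything from a real framework. Every drop on the reaper side is a cross-thread free, which is exactly the case that churns the per-thread allocator caches described above:

```rust
use std::sync::mpsc;
use std::thread;

// Sketch: a hot thread allocates buffers and hands them to a reaper
// thread, which drops them off the hot path. Returns how many buffers
// the reaper freed.
fn drop_on_background_thread(n: usize) -> usize {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    // The reaper counts (and thereby drops) every buffer it receives.
    let reaper = thread::spawn(move || rx.iter().count());
    for _ in 0..n {
        let buf = vec![0u8; 4096]; // allocated on the hot thread...
        tx.send(buf).unwrap();     // ...freed on the reaper thread
    }
    drop(tx); // close the channel so the reaper's iterator ends
    reaper.join().unwrap()
}

fn main() {
    assert_eq!(drop_on_background_thread(1_000), 1_000);
}
```

Whether this wins or loses depends on the allocator and the workload, which is why profiling both variants matters.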

This is exactly how C++/WinRT works, because people praising reference counting as a GC algorithm often forget about possible stack overflows (if the destructor/Drop implementation is badly written), or stop-the-world-style pauses when a reference count reaching zero sets off a domino effect through a graph or tree structure.
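The stack-overflow hazard is easy to reproduce in Rust too, and the standard fix is to write `Drop` iteratively. A minimal illustration (this is my own sketch, not anything from C++/WinRT):

```rust
// A singly linked list whose *derived* (recursive) Drop would overflow
// the stack once the list is long enough. This manual Drop unlinks one
// node at a time, keeping stack usage constant.
struct Node {
    _val: u64,
    next: Option<Box<Node>>,
}

impl Drop for Node {
    fn drop(&mut self) {
        let mut cur = self.next.take();
        while let Some(mut node) = cur {
            cur = node.next.take(); // detach before the Box is dropped
        }
    }
}

fn build(len: usize) -> Option<Box<Node>> {
    let mut head = None;
    for i in 0..len as u64 {
        head = Some(Box::new(Node { _val: i, next: head }));
    }
    head
}

fn list_len(mut cur: &Option<Box<Node>>) -> usize {
    let mut n = 0;
    while let Some(node) = cur {
        n += 1;
        cur = &node.next;
    }
    n
}

fn main() {
    // With the derived Drop this depth would risk a stack overflow;
    // the iterative Drop handles it in constant stack space.
    let list = build(1_000_000);
    drop(list);
}
```

The pause problem (a long chain of frees on one drop) is not solved by this, of course, which is where moving the whole teardown to a background thread comes in.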

So in C++/WinRT, which is basically the current C++ projection for COM and WinRT components, the framework moves the objects into a background thread before deletion, so that those issues don't affect the performance of the main execution thread.

And given that it is maintained by the same team, I would bet the windows-rs Rust projection has the same optimization in place for COM/WinRT components.

Some allocators may even "hold" on to memory freed from another thread until the owning thread itself frees something (which is not the case here) or that thread dies, and only then go on "gc"-ing it.