Hacker News

> especially given how typical allocators behave under the hood.

To say more about it: nearly any modern high-performance allocator will maintain a local (per-thread or per-CPU) cache of freed chunks.

This is useful, for example, if you're allocating and deallocating chunks of about the same size over and over again, since it means you can avoid entering the global part of the allocator (which generally requires locking, etc.).

If you make an allocation while the cache is empty, you have to go to the global allocator to refill your cache (usually with several chunks). Similarly, if you free and find your local cache is full, you will need to return some memory to the global allocator (usually you drain several chunks from your cache at once so that you don't hit this condition constantly).
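To make the refill/drain behavior concrete, here is a toy single-threaded model of it. This is only a sketch: `GlobalPool`, `LocalCache`, `capacity`, and `batch` are names and numbers I made up for illustration, not the API or tuning of any real allocator, and the `trips` counter just stands in for "had to take the global lock".

```python
class GlobalPool:
    """Hypothetical stand-in for the shared, lock-protected part of an allocator."""

    def __init__(self):
        self.trips = 0         # number of times a local cache had to come here
        self.next_id = 0       # chunks are just integers in this model
        self.free_chunks = []

    def take(self, n):
        """Hand out n chunks in one trip (reusing returned chunks first)."""
        self.trips += 1
        out = []
        for _ in range(n):
            if self.free_chunks:
                out.append(self.free_chunks.pop())
            else:
                out.append(self.next_id)
                self.next_id += 1
        return out

    def give(self, chunks):
        """Accept a batch of chunks back in one trip."""
        self.trips += 1
        self.free_chunks.extend(chunks)


class LocalCache:
    """Hypothetical per-thread cache sitting in front of the global pool."""

    def __init__(self, pool, capacity=8, batch=4):
        self.pool = pool
        self.capacity = capacity   # assumed max chunks held locally
        self.batch = batch         # assumed chunks moved per refill/drain
        self.chunks = []

    def alloc(self):
        if not self.chunks:
            # Cache empty: refill with several chunks at once.
            self.chunks = self.pool.take(self.batch)
        return self.chunks.pop()

    def free(self, chunk):
        if len(self.chunks) >= self.capacity:
            # Cache full: drain several chunks at once so the next
            # few frees don't hit this path again.
            drained = self.chunks[:self.batch]
            del self.chunks[:self.batch]
            self.pool.give(drained)
        self.chunks.append(chunk)


pool = GlobalPool()
cache = LocalCache(pool)
for _ in range(1000):
    cache.free(cache.alloc())
print(pool.trips)  # prints 1: only the very first alloc needed a refill
```

In the steady alloc/free loop, everything after the first refill is served from the local cache, which is the whole point of having it.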

If you are almost always allocating on one thread and deallocating on another, you end up increasing contention in the allocator, as you will (likely) end up filling/draining from the global allocator far more often than if you kept it all on one thread. Depending on your specific application, maybe this performance loss is inconsequential compared to the value of not having to actually call free on some critical path, but it's a choice you should think carefully about and profile for.
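You can see the effect in a toy model that counts trips to the global allocator. Again, this is a sketch under made-up assumptions: `Pool`, `Cache`, `global_trips`, and the capacity/batch numbers are hypothetical, the "threads" are simulated in a single loop, and real allocators differ in the details. It only shows the shape of the problem, so profile your actual workload.

```python
class Pool:
    """Hypothetical shared pool; `trips` counts visits that would need the lock."""

    def __init__(self):
        self.trips = 0
        self.next_id = 0
        self.free_chunks = []

    def take(self, n):
        self.trips += 1
        out = []
        for _ in range(n):
            if self.free_chunks:
                out.append(self.free_chunks.pop())
            else:
                out.append(self.next_id)
                self.next_id += 1
        return out

    def give(self, chunks):
        self.trips += 1
        self.free_chunks.extend(chunks)


class Cache:
    """Hypothetical per-thread cache with batched refill and drain."""

    def __init__(self, pool, capacity=8, batch=4):
        self.pool, self.capacity, self.batch = pool, capacity, batch
        self.chunks = []

    def alloc(self):
        if not self.chunks:                      # empty: refill a batch
            self.chunks = self.pool.take(self.batch)
        return self.chunks.pop()

    def free(self, chunk):
        if len(self.chunks) >= self.capacity:    # full: drain a batch
            drained = self.chunks[:self.batch]
            del self.chunks[:self.batch]
            self.pool.give(drained)
        self.chunks.append(chunk)


def global_trips(cross_thread, n=1000):
    """Count global-pool trips for n alloc/free pairs.

    cross_thread=False: the same cache allocates and frees.
    cross_thread=True:  one cache only allocates, another only frees.
    """
    pool = Pool()
    allocating = Cache(pool)
    freeing = Cache(pool) if cross_thread else allocating
    for _ in range(n):
        freeing.free(allocating.alloc())
    return pool.trips


print(global_trips(cross_thread=False))  # prints 1
print(global_trips(cross_thread=True))   # prints 498
```

In the cross-thread case the allocating side's cache is always running dry and the freeing side's cache is always filling up, so both keep going back to the global pool; batching only amortizes that, it doesn't remove it.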