While I agree that shared-nothing beats the pants off shared-state performance-wise, surely the penalty you've outlined only applies to super short-lived connections?

For longer-lived connections the cache is going to thrash on an inevitable context switch anyway (either due to waiting on more I/O, or normal preemption). As long as the I/O for a given connection is processed on a given core, I don't know that there's actually such a huge benefit. A single thread pinned for the connection's entire lifecycle has its own problem: under load you get latency bottlenecks when two CPU-heavy requests end up contending for the same core, whereas work stealing would make use of the available compute.
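For what it's worth, the pinning itself is the easy part; on Linux it's a one-liner. A minimal sketch (Linux-only, picking a core from whatever set the process is currently allowed to run on):

```python
import os

# Pin the calling thread to a single core -- the "one pinned thread
# per connection lifecycle" model discussed above. Linux-only API.
core = min(os.sched_getaffinity(0))   # pick any currently-allowed core
os.sched_setaffinity(0, {core})       # restrict this thread to it

print(os.sched_getaffinity(0))        # now a single-core set
```

The hard part isn't this call, it's deciding which connection goes to which core so that pinning doesn't create the contention described above.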

The ultimate win would be if you could arrange for each core to get a dedicated NIC (or NIC queue), so the interrupts for that NIC arrive on the very core that's processing its packets. Otherwise the NIC's interrupt lands on some essentially random core, and you pay for a cross-core hand-off of the I/O data anyway.

TLDR: A truly shared-nothing approach is super complex to get right unless you have a single application and you allocate the work correctly. It's really hard to solve both generically and optimally for all possible combinations of request and processing patterns.