Are you talking about context switching every handful of cycles? This is going to be extremely inefficient even with store forwarding.
Sure, and so is calling a function every handful of cycles. That's a big part of why compilers inline.
Either you're context switching often enough that store forwarding helps, or you're not spending a lot of time context switching. Either way, I would expect that you aren't waiting on L1: you put the write into a queue and move on.
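To make the "put the write into a queue and move on" picture concrete, here is a toy Python model of a store queue with store-to-load forwarding. It is only a sketch: `StoreBuffer`, its `capacity`, and the counters are names made up for illustration, and real hardware has size, alignment, and ordering restrictions that this ignores.

```python
from collections import deque


class StoreBuffer:
    """Toy store queue with store-to-load forwarding (hypothetical, not a real CPU)."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.pending = deque()  # (address, value) pairs, oldest first
        self.memory = {}        # stands in for L1 and everything behind it
        self.forwarded = 0      # loads satisfied straight from the queue
        self.from_memory = 0    # loads that had to read memory

    def store(self, addr, value):
        # The store is queued and the "core" moves on immediately.
        # Only a full queue forces the oldest entry to drain first.
        if len(self.pending) == self.capacity:
            self.drain_one()
        self.pending.append((addr, value))

    def load(self, addr):
        # Search youngest to oldest so the most recent store to addr wins.
        for queued_addr, value in reversed(self.pending):
            if queued_addr == addr:
                self.forwarded += 1
                return value
        self.from_memory += 1
        return self.memory.get(addr, 0)

    def drain_one(self):
        # Retire the oldest queued store to memory, if there is one.
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value

    def drain_all(self):
        while self.pending:
            self.drain_one()
```

If you "save" a few registers with `store` and immediately "restore" them with `load`, every load is served from the queue. If you call `drain_all()` in between, which is the rough analogue of switching rarely, the same loads are served from memory instead, but then the switch cost is amortized over a lot more work anyway.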