Anybody with deep knowledge of current open-source RISC-V implementations here?
Do harts have store queue and load queue optimizations? Namely some kind of memory request fusion?
I ask because I am writing rv64 assembly, and since rv64 is a load/store architecture, I tend to pack together as many memory-ordered loads and stores as I can.
I suppose everything that isn't a toy implementation has a store queue.
Even the U54 Core Complex (later U54-MC) manual from August 2018 states in Section 3.4 "Stores are pipelined and commit on cycles where the data memory system is otherwise idle. Loads to addresses currently in the store pipeline result in a five-cycle penalty."
It probably inherited this from Rocket.
Huh, a load which happens to hit the store queue should be faster than usual, since it does not even need to reach the cache fabric, shouldn't it?
Nope. Very common. Making a FIFO also randomly content-addressable adds a lot of complexity, and only code too unoptimised to be worth caring about loads a value within half a dozen instructions of storing it. Just use it directly from the register you stored it from.
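To put rough numbers on that, here is a toy Python model of the rule quoted from the U54 manual. It is only a sketch: the five-cycle penalty is from the manual, but `WINDOW` (how many following operations a store stays in the store pipeline), the `penalty_cycles` helper, and the exact-address match are all made up for illustration and say nothing about the real pipeline.

```python
# Toy model only, not the real U54 pipeline.
PENALTY = 5  # cycles, from the quoted U54 manual text
WINDOW = 4   # ASSUMPTION: ops during which a store is still in the store pipeline


def penalty_cycles(ops):
    """Sum the extra cycles caused by loads that hit an in-flight store.

    ops is a list of (kind, address) tuples, where kind is "store",
    "load" or "alu" and address is None for "alu".
    """
    last_store = {}  # address -> index of the most recent store to it
    extra = 0
    for i, (kind, addr) in enumerate(ops):
        if kind == "store":
            last_store[addr] = i
        elif kind == "load":
            stored_at = last_store.get(addr)
            # ASSUMPTION: only an exact address match counts as a hit.
            if stored_at is not None and i - stored_at <= WINDOW:
                extra += PENALTY
    return extra


# Store a value, then immediately load it back from memory.
reload_soon = [("store", 0x100), ("load", 0x100), ("alu", None)]
# Store a value, then keep working from the register instead of reloading.
reuse_reg = [("store", 0x100), ("alu", None), ("alu", None)]
# Load from a different address right after the store.
other_addr = [("store", 0x100), ("load", 0x108)]
# Reload the same address, but only after the assumed window has passed.
late = [("store", 0x100)] + [("alu", None)] * 4 + [("load", 0x100)]

for name, seq in [("reload_soon", reload_soon), ("reuse_reg", reuse_reg),
                  ("other_addr", other_addr), ("late", late)]:
    print(name, penalty_cycles(seq))
```

In this model only the immediate reload pays the penalty. Reusing the register, loading from an unrelated address, or reloading once the store has drained all cost nothing extra.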
I'm pretty sure XiangShan has a store queue. I expect the other chips mentioned do too - as I understand it, it's a standard optimisation.