Yes. To serve 1k tokens/s, they must be fitting the entire model in on-chip SRAM and never reaching out to off-chip RAM. That means they're likely either chaining multiple wafer-scale chips together to serve this model, or they shrank the model to fit on a single one. It's uneconomical for many use cases, but for highly valuable tasks it could be worth it.
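
A rough back-of-envelope version of that argument, as a sketch only. It assumes single-stream decoding where every generated token has to read all the weights once, and it ignores batching, KV cache traffic, sparsity, and speculative decoding. The model size, bandwidth, and SRAM capacity below are made-up round numbers for illustration, not specs of any real chip or model.

```python
import math


def decode_tokens_per_sec(weight_bytes, bandwidth_bytes_per_sec):
    """Upper bound on tokens/s if each token requires one full pass over the weights."""
    return bandwidth_bytes_per_sec / weight_bytes


def required_bandwidth(weight_bytes, target_tokens_per_sec):
    """Memory bandwidth needed to hit a target decode rate under the same assumption."""
    return weight_bytes * target_tokens_per_sec


def chips_needed(weight_bytes, sram_bytes_per_chip):
    """How many chips it takes to hold all the weights in on-chip SRAM."""
    return math.ceil(weight_bytes / sram_bytes_per_chip)


# Hypothetical inputs (assumptions, not measured values).
weights = 70e9 * 2       # 70B parameters at 2 bytes each = 140 GB
offchip_bw = 3e12        # 3 TB/s, a round number for off-chip memory
sram_per_chip = 40e9     # 40 GB of SRAM per wafer-scale chip

print(decode_tokens_per_sec(weights, offchip_bw))  # about 21 tokens/s
print(required_bandwidth(weights, 1000) / 1e12)    # 140.0 TB/s for 1k tokens/s
print(chips_needed(weights, sram_per_chip))        # 4
```

With these numbers, off-chip memory tops out around 20 tokens/s per stream, while 1k tokens/s needs roughly 140 TB/s, which is only plausible from on-chip SRAM. The weights would also not fit on one chip, hence either several chips or a smaller model.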

This is one area Nvidia chips have not been able to serve: ultra-fast, ultra-high-value tasks. Hence the Groq acquisition.