Consider some of the scaling properties of frontier cloud LLMs:

1) routing: traffic can be routed to smaller, specialized, or quantized models (a toy routing sketch follows this list)

2) GPU throughput vs latency: both can be tuned against demand. What seems like lots of deep "thinking" might just be the same inference trickled out over fewer GPU resources for longer (see the toy model below).

3) caching: identical prompts and shared prompt prefixes can be served from a cache instead of being recomputed (a toy prefix cache is sketched below).
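
For (1), here's a minimal sketch of what such a router could look like. The model names, the `route` function, and the length/keyword heuristic are all invented for illustration; a real provider presumably routes on learned classifiers, account tier, and live load rather than anything this crude:

```python
# Hypothetical request router: send cheap/simple prompts to a smaller or
# quantized model, reserve the flagship model for harder prompts.
# Model ids and the heuristic below are illustrative assumptions,
# not any provider's actual policy.

SMALL_MODEL = "small-8b-int8"    # quantized, cheap, fast (hypothetical)
LARGE_MODEL = "frontier-400b"    # full-size flagship (hypothetical)

def route(prompt: str, needs_tools: bool = False) -> str:
    """Pick a model id for this request (toy heuristic)."""
    if needs_tools or len(prompt) > 4000:
        return LARGE_MODEL
    hard_markers = ("prove", "derive", "step by step", "refactor")
    if any(m in prompt.lower() for m in hard_markers):
        return LARGE_MODEL
    return SMALL_MODEL

if __name__ == "__main__":
    print(route("What's the capital of France?"))          # -> small-8b-int8
    print(route("Prove this identity step by step ..."))   # -> frontier-400b
```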
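For (2), a back-of-the-envelope model of the trade-off, with every number made up: if a GPU has a roughly fixed decode budget in tokens per second, splitting that budget across more concurrent requests makes each individual stream trickle out more slowly, even though the fleet is serving at least as many tokens overall.

```python
# Toy throughput-vs-latency model. The 10,000 tok/s budget and the batch
# sizes are invented numbers, just to show the shape of the trade-off.
# (In practice larger batches also *raise* aggregate throughput via better
# GPU utilization, which makes the incentive to batch even stronger.)

GPU_TOKENS_PER_SEC = 10_000  # assumed total decode budget of one GPU

def per_request_rate(batch_size: int) -> float:
    """Tokens per second seen by each individual streaming request."""
    return GPU_TOKENS_PER_SEC / batch_size

for batch in (1, 8, 64, 256):
    rate = per_request_rate(batch)
    print(f"batch={batch:>3}: each user sees ~{rate:,.0f} tok/s, "
          f"fleet still serves {GPU_TOKENS_PER_SEC:,} tok/s total")
```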
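For (3), one plausible form is prefix caching: long shared prefixes (system prompts, few-shot examples, earlier turns) are computed once and reused. The sketch below only caches an opaque placeholder to show the control flow; real serving stacks cache the transformer KV state, and the hashing key, eviction-free dict, and `get_prefix_state` helper are assumptions for illustration:

```python
# Toy prompt-prefix cache: identical prefixes are looked up instead of
# recomputed. The cached value here is just a stand-in string; a real
# system would cache the expensive prefill result (KV state) for the prefix.
import hashlib

_prefix_cache: dict[str, str] = {}

def _key(prefix: str) -> str:
    return hashlib.sha256(prefix.encode()).hexdigest()

def get_prefix_state(prefix: str) -> str:
    """Return cached work for a prompt prefix, computing it only on a miss."""
    k = _key(prefix)
    if k not in _prefix_cache:
        # Stand-in for the expensive prefill pass over the prefix tokens.
        _prefix_cache[k] = f"state-for-{k[:8]}"
    return _prefix_cache[k]

system = "You are a helpful assistant. " * 50   # long shared prefix
get_prefix_state(system)   # miss: pays the prefill cost once
get_prefix_state(system)   # hit: reuses the cached state
```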