This makes sense to me. Where I work, our AI team set up a couple of H100 cards and is hosting a newer model that uses around 80GB of VRAM. You can watch the GPU utilization on Grafana spike to like 80% for several seconds as it processes a single request. That was very surprising to me: this is $30k worth of hardware that can support only a couple of users, and maybe only one if you have an agent going. Now, maybe we're doing something wrong, but it's hard to imagine anyone making money hosting billions of dollars of these cards when you're making $20 a month per card. I guess it depends on how active your users are. Hard to imagine Anthropic is right side up here.
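To put rough numbers on it: the $30k card cost and $20/month price are from above, but the 3-year amortization and the resulting user counts are just assumptions I'm plugging in as a sketch, not measured figures.

```python
# Back-of-the-envelope: how many $20/month subscribers does one card need
# just to cover its own hardware cost? (Assumed numbers, not measurements.)

card_cost = 30_000          # rough H100-class hardware cost mentioned above, USD
amortization_months = 36    # assume a 3-year useful life
price_per_user = 20         # $/month subscription

hardware_cost_per_month = card_cost / amortization_months   # ~$833/month
users_to_break_even = hardware_cost_per_month / price_per_user

print(f"Hardware alone: ${hardware_cost_per_month:.0f}/month per card")
print(f"Subscribers needed per card just for hardware: {users_to_break_even:.0f}")
# ~42 subscribers per card before power, cooling, and staff -- so the math
# only works if each card can serve a lot more than a couple of users.
```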
But was that with batching? It makes a big difference: with LLM inference you can run many requests in parallel on the same card.
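For reference, a serving stack like vLLM does this batching for you. A minimal sketch below; the model name and sampling settings are placeholders, not the setup described above:

```python
# Batched LLM inference with vLLM: many prompts scheduled onto one GPU.
from vllm import LLM, SamplingParams

prompts = [f"Summarize request #{i} in one sentence." for i in range(64)]
params = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # any HF model id fits here

# generate() runs all 64 prompts through the engine together (continuous
# batching) instead of processing one request at a time.
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text[:80])
```

Whether the measurement above was one request at a time or dozens batched together changes the per-user economics by an order of magnitude or more.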