Even though tokens are getting cheaper, I think the real killer of "unlimited" LLM plans isn't token costs themselves; it's the shape of the usage curve that's unsustainable. These products see a Zipf-like distribution: thousands of casual users nibble a few hundred tokens a day while a tiny group of power automations devours tens of millions. Flat pricing works fine until one of those whales drops a repo-wide refactor or a 100 MB PDF into chat and instantly torpedoes the margin. Unless vendors turn those extreme loops into cheaper, purpose-built primitives (search, static analyzers, local quantized models, etc.), every "all-you-can-eat" AI subscription is just a slow-motion implosion waiting for its next whale.
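
To make that shape concrete, here's a toy simulation - the exponent and the per-user scale are made-up illustrative numbers, not real usage data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: 100k subscribers whose daily token usage follows a
# Zipf-like power law. Exponent and scale are illustrative guesses.
n_users = 100_000
usage = rng.zipf(a=1.3, size=n_users).astype(float) * 300  # tokens/day

usage.sort()
top_1pct = usage[-(n_users // 100):]

print(f"median user:   {np.median(usage):>14,.0f} tokens/day")
print(f"heaviest user: {usage[-1]:>14,.0f} tokens/day")
print(f"top 1% share:  {top_1pct.sum() / usage.sum():.0%} of all tokens")
```

Even in this toy version the heaviest users dwarf the median by orders of magnitude, which is exactly the curve a flat price can't survive.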

I actually think Anthropic's plan of capping usage per 5-hour period and per week is a good response to exactly this problem.

I'd prefer it just specify a number of tokens rather than vary with demand - I get that variability lets them be more generous during low-demand periods, but the opacity of it all sucks. I have 5-minute time-of-use pricing on my electricity and can look up the current rate on my phone in an instant - why not simply provide an API to look up the current "demand factor" for Claude (along with the rules for how that factor can change - min and max values, for example) and let it be fully transparent?
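
To be concrete, here's roughly what that could look like from the client side. Everything below is hypothetical - the endpoint URL, the response fields, and the bounds are invented for illustration; Anthropic exposes no such API today.

```python
import requests

# HYPOTHETICAL endpoint - invented for illustration, not a real Anthropic API.
# The idea: poll a public "demand factor" plus the published rules bounding it,
# the same way you'd look up a time-of-use electricity rate.
resp = requests.get("https://api.anthropic.com/v1/demand-factor", timeout=5)
resp.raise_for_status()
data = resp.json()

# Imagined response shape:
# {
#   "demand_factor": 1.4,           # current multiplier on base token cost
#   "min": 0.5, "max": 3.0,         # published bounds, so it's auditable
#   "update_interval_seconds": 300  # how often the factor can change
# }
factor = data["demand_factor"]
if factor <= 1.0:
    print(f"off-peak ({factor}x) - good time to kick off the big batch job")
else:
    print(f"peak pricing ({factor}x) - defer non-urgent work")
```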

Anthropic still relies on quotas (the 5-hour rolling cap, plus weekly caps coming at the end of the month), so the natural next step is dynamic per-token pricing. But even with transparent off-peak rates, only batch jobs will actually shift, and the history of electricity markets suggests variable pricing smooths peaks rather than eliminating them. The long-term fix stays the same: route 100 MB PDFs, repo refactors, and other whale jobs through retrieval or analysis pipelines, and keep the flagship chat model for real-time conversation.
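
A routing layer for that doesn't have to be clever. A minimal sketch - the thresholds and pipeline names (retrieval_pipeline, static_analysis_plus_small_model, flagship_chat) are invented placeholders, not any vendor's real API:

```python
# Sketch of whale-job routing. Thresholds and pipeline names are
# invented for illustration.

def route(request_tokens: int, attachment_bytes: int, is_repo_refactor: bool) -> str:
    if attachment_bytes > 10_000_000:  # huge PDFs -> chunk, index, retrieve
        return "retrieval_pipeline"
    if is_repo_refactor and request_tokens > 200_000:  # repo-wide edits
        return "static_analysis_plus_small_model"
    return "flagship_chat"  # real-time conversation stays on the big model

print(route(500, 0, False))              # -> flagship_chat
print(route(300_000, 0, True))           # -> static_analysis_plus_small_model
print(route(2_000, 100_000_000, False))  # -> retrieval_pipeline
```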