No. In some sense, the article comes to the right conclusion haha. But it's probably >100x off on its central premise about output tokens costing more than input.
Thanks for the correction (author here). I'll update the article - very fair point on the compute for input tokens, which I messed up. Tbh I'm pleased my napkin math was only 7x off the laws of physics :).
Even rerunning the math on my use cases with a much higher input token cost doesn't change much, though.
The choice of 32 parallel sequences is also arbitrary and significantly changes your conclusions. For example, if they run with 256 parallel sequences instead, that would make both prefill and decode 8x cheaper in your calculations.
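To make that scaling concrete, here's a minimal napkin-math sketch (my own assumptions, not the article's exact model): if the GPU's cost per second is roughly fixed and per-sequence decode speed holds steady as you add parallel sequences, the cost attributed to each generated token falls inversely with the batch size. The dollar and throughput numbers below are placeholders for illustration only.

```python
# Hypothetical numbers, chosen only to show the 1/batch-size scaling.
GPU_COST_PER_HOUR = 2.0        # $/hour for the GPU, made up
TOKENS_PER_SEQ_PER_SEC = 50    # per-sequence decode speed, made up

def cost_per_million_tokens(parallel_sequences: int) -> float:
    """Rough $ per 1M generated tokens if GPU cost is shared across the batch."""
    cost_per_sec = GPU_COST_PER_HOUR / 3600
    total_tokens_per_sec = parallel_sequences * TOKENS_PER_SEQ_PER_SEC
    return cost_per_sec / total_tokens_per_sec * 1_000_000

for batch in (32, 256):
    print(f"{batch} parallel sequences: "
          f"${cost_per_million_tokens(batch):.4f} per 1M tokens")

# With these assumptions, 256 parallel sequences comes out exactly 8x
# cheaper per token than 32 - the 8x factor mentioned above.
```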
The part about needing long context lengths for attention to become compute-bound is also quite misleading.
Anyone up for publishing their own guess range?
I’m pretty sure input tokens are cheap because they want to ingest the data for training later, no? They want huge contexts to slice up.
Afaik all the large providers flipped the default to contractually NOT train on your data. So no, training data context size is not a factor.