Since DeepSeek R1 is open weight, wouldn't it be better to check the napkin math empirically: measure how many realistic full LLM inferences a single H100 can run in a given time period, and calculate the token cost from that?
Without in-depth knowledge of the industry, the margin difference between input and output tokens looks very odd to me when comparing your napkin math to the R1 prices. That matters a lot, because any reasoning model explodes its reasoning tokens: you end up with many more output tokens for relatively few input tokens, and that heavily cuts into the profit from the high-margin ("essentially free") input tokens.
Unless I'm reading the article wrong.
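To make the point concrete, here's a rough sketch of the arithmetic. All prices and token counts below are made-up placeholders for illustration, not actual R1 pricing or H100 throughput figures: the claim is just that once output tokens dominate a request, the near-free input pricing stops mattering.

```python
# Illustrative sketch: if input tokens are priced near-free but output
# tokens are not, a reasoning model that emits many more output tokens
# per input token erodes the blended margin. All numbers are
# hypothetical placeholders, not real DeepSeek R1 prices.

def request_cost(input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Cost of one request in dollars, given per-million-token prices."""
    return (input_tokens / 1e6) * input_price_per_m \
         + (output_tokens / 1e6) * output_price_per_m

# Hypothetical pricing: input nearly free, output 10x more expensive.
IN_PRICE, OUT_PRICE = 0.10, 1.00  # $/M tokens

# Chat-style request: long prompt, short answer.
chat = request_cost(4000, 500, IN_PRICE, OUT_PRICE)

# Reasoning-style request: same prompt, but thousands of
# chain-of-thought tokens billed as output.
reasoning = request_cost(4000, 8000, IN_PRICE, OUT_PRICE)

print(f"chat:      ${chat:.6f}")       # output is ~56% of cost
print(f"reasoning: ${reasoning:.6f}")  # output is ~95% of cost
print(f"output share of reasoning cost: "
      f"{(8000 / 1e6) * OUT_PRICE / reasoning:.0%}")
```

With these toy numbers the reasoning request costs roughly 9x the chat request, and almost all of that is output tokens, which is exactly where the "essentially free input" framing breaks down.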
I am so glad someone else called this out. I was reading the napkin-math portions and struggling to see how the numbers really worked out, and I think you hit the nail on the head. The author assumes an 'essentially free' input token cost and extrapolates a business model that doesn't seem to connect directly to any claimed usefulness. The bias is stated clearly at the beginning of the article, where the author assumes 'given how useful the current models are...'. That is not a very scientific starting point, and I think it leads to reasoning errors in the business model he posits here.
There were some oddities with the numbers themselves as well, though I think those were all within rounding. Still, it would have been nice for the author to spell out where he rounded important numbers (~s don't tell me a whole lot).
TL;DR: I totally agree. There are napkin-math issues here that make this hard to take as a useful stress test of cost.