It's quite plausible to me that the difference is inference configuration. This could be done through configurable depth, the number of active MoE experts, the number of layers, etc. Even changes to beam decoding can substantially affect quality and cost.
Train one large model, then down-configure it for different pricing tiers.
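To make that concrete, here's a minimal sketch of what "one checkpoint, several tiers" could look like on the serving side. All of the names and numbers here (InferenceConfig, experts_per_token, the tier values) are hypothetical illustrations, not any vendor's actual API:

```python
from dataclasses import dataclass, replace

# Hypothetical serving config -- the knobs a provider could turn per tier
# without retraining. None of these names come from a real system.
@dataclass(frozen=True)
class InferenceConfig:
    num_layers: int         # transformer layers actually executed (early exit)
    experts_per_token: int  # MoE top-k routing width
    beam_width: int         # 1 = greedy decoding

# One trained checkpoint, several serving tiers derived by down-configuring it.
FULL = InferenceConfig(num_layers=96, experts_per_token=8, beam_width=4)

TIERS = {
    "premium":  FULL,
    "standard": replace(FULL, experts_per_token=4, beam_width=1),
    "budget":   replace(FULL, num_layers=48, experts_per_token=2, beam_width=1),
}

if __name__ == "__main__":
    for name, cfg in TIERS.items():
        # Rough per-token compute proxy: layers executed * active experts.
        cost = cfg.num_layers * cfg.experts_per_token
        print(f"{name:8s} layers={cfg.num_layers} k={cfg.experts_per_token} "
              f"beam={cfg.beam_width} relative_cost={cost}")
```

Under these made-up numbers, the budget tier runs at roughly an eighth of the premium tier's per-token compute from the same weights.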
I don't think that's plausible, because they also just launched a high-speed variant, which presumably has the inference optimizations and smaller batching, and costs about 10x as much.
Also, if you have inference optimizations, why not apply them to all models?