Very cool! I would look at the token counts reported for each of the calls, input and output separately. You can map those to API costs using per-input-token and per-output-token prices. Forge should be capturing those counts (or can, as a passthrough from llama.cpp).
At least, if I understand your economic benefit angle correctly.
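To make the mapping concrete, here's a rough sketch of the arithmetic. The prices are placeholders, not real rates, and the field names `prompt_tokens` / `completion_tokens` are an assumption borrowed from OpenAI-style usage objects; whatever Forge actually records may be named differently.

```python
from dataclasses import dataclass

# Placeholder prices in USD per 1M tokens (assumption: swap in the real
# rates of whichever hosted API you are comparing against).
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00


@dataclass
class Usage:
    # Hypothetical field names, modeled on OpenAI-style usage objects.
    prompt_tokens: int
    completion_tokens: int


def call_cost(usage, in_price=PRICE_PER_M_INPUT, out_price=PRICE_PER_M_OUTPUT):
    """Equivalent API cost in USD of one call, given its token counts."""
    return (usage.prompt_tokens / 1_000_000 * in_price
            + usage.completion_tokens / 1_000_000 * out_price)


def total_cost(usages, in_price=PRICE_PER_M_INPUT, out_price=PRICE_PER_M_OUTPUT):
    """Sum the equivalent API cost over every call in a run."""
    return sum(call_cost(u, in_price, out_price) for u in usages)


if __name__ == "__main__":
    # Made-up token counts for two calls, for illustration only.
    run = [Usage(1200, 350), Usage(800, 90)]
    print(f"equivalent API cost: ${total_cost(run):.6f}")
```

Summing that over a whole scenario run gives you the "what would this have cost on a paid API" number to compare against running locally.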
For scenarios to draw inspiration from, I'd look at those tagged "model_quality" or "advanced_reasoning".