Aren't these increases offset by the quality of the responses and by the reduction in iterations needed to fine-tune them?

Only for the range of tasks where 4.7 performs well but 4.6 performed suboptimally. If both models can one-shot the task without retries, then the number of iterations is already at the lower bound.

This also applies at the sub-task level. If both models need to read three files to figure out which one implements the function they need to modify, then the token tax is paid for all three files even though "not the right file" is presumably an easy conclusion to draw.

This is also related to the challenge of optimizing subagents. Presumably the outer, higher-capacity model performs better with everything in its context (up to limits), but dispatching a less-capable subagent for a sub-problem might be cheaper overall. Anthropic's input-token cost ratio between Opus and Haiku is 5:1, while Google's is 8:1 (Gemini Pro : Flash Lite) and OpenAI's is 12:1 (GPT 4.2 : 4.2 nano).
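As a back-of-the-envelope sketch of that trade-off: suppose the "read three files to find the right one" task from above is done either entirely by the big model, or by dispatching a cheap subagent per file that reports back a short summary. All the pricing below is hypothetical (chosen to match a 5:1 input-token ratio, not any provider's real rate card), and the token counts are illustrative.

```python
# Back-of-the-envelope: when is dispatching a cheap subagent a win?
# All rates and token counts below are hypothetical, for illustration only.

def call_cost(tokens_in: int, tokens_out: int,
              rate_in: float, rate_out: float) -> float:
    """Dollar cost of one model call; rates are $ per million tokens."""
    return (tokens_in * rate_in + tokens_out * rate_out) / 1e6

# Hypothetical pricing with a 5:1 input-token ratio (big : small).
BIG_IN, BIG_OUT = 15.0, 75.0
SMALL_IN, SMALL_OUT = 3.0, 15.0

# Option A: the outer model reads all three 20k-token files itself.
option_a = call_cost(3 * 20_000, 1_000, BIG_IN, BIG_OUT)

# Option B: one subagent per file, each returning a 200-token summary,
# then the outer model works from the three summaries.
option_b = (
    3 * call_cost(20_000, 200, SMALL_IN, SMALL_OUT)  # subagents read files
    + call_cost(3 * 200, 1_000, BIG_IN, BIG_OUT)     # outer model reads summaries
)

print(f"outer model reads everything: ${option_a:.3f}")  # → $0.975
print(f"subagents + summaries:        ${option_b:.3f}")  # → $0.273
```

Of course this only captures token cost: if the summaries lose the detail the outer model needed, the extra retries can eat the savings, which is exactly why the optimization is hard.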

So: likely, but only some of the time.