The problem is that the differences between flagship and local models are compounding heavily. An 4% different could be massive when you keep iterating on the same code base.

> The problem is that the differences between flagship and local models are compounding heavily

This depends a lot on how you work, and how much of the architectural thinking you do yourself.

People seem to lose sight of the fact that a flash model today is as powerful as a frontier model from a year ago. If you were happy with GPT 4.x, you should be ecstatic that equivalent power is now basically free...

I am one of those ecstatic folk :)