Counter-counter-argument: for LLMs, tokens are units of thinking. And token use is, on the margin, directly proportional to the cost of inference. So while the details of the harness, how you prompt the model, the nature of the code and docs you put in context, etc. all matter to the quality of output you get from an LLM coding tool, ultimately there's always a ceiling on how much you're willing to spend on solving a problem - say, no more than 30 minutes, or $10, on refactoring a target module or implementing a small feature - and that puts a limit on how much thinking the model can put into it.
Thing is, writing code that is secure and efficient and readable and simple all at once is in many cases fundamentally over that limit. It's possible, but you can't afford (or rationally just don't want) to spend as much on it as is required for superhuman quality on all these aspects. Also, most of the time you don't want to operate at the limit - you probably expected that feature to take 30 seconds and less than $1 to implement. So you choose both what the model optimizes for and how much.
Because of that, no matter how good the model, the harness and the prompting are, $10 spent on coding is still bound to leave behind some security vulnerabilities that a subsequent $10 spent on security review will find (especially with a model post-trained for that, at the expense of general performance).
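
To put rough numbers on the ceiling, here's a back-of-the-envelope sketch. The flat price of $10 per million tokens is purely a made-up assumption for illustration (real pricing varies by model and by input vs. output tokens), and `max_thinking_tokens` is a hypothetical helper, not anything from a real tool:

```python
def max_thinking_tokens(budget_usd: float, usd_per_million_tokens: float) -> int:
    """Upper bound on the tokens a budget buys at a flat per-token price."""
    if budget_usd < 0 or usd_per_million_tokens <= 0:
        raise ValueError("budget must be >= 0 and price must be > 0")
    return int(budget_usd / usd_per_million_tokens * 1_000_000)


# ASSUMPTION: a hypothetical flat price, not any real model's pricing.
ASSUMED_USD_PER_MILLION_TOKENS = 10.0

# The two budgets mentioned above: the $10 ceiling and the ~$1 you expected to pay.
for budget_usd in (10.0, 1.0):
    tokens = max_thinking_tokens(budget_usd, ASSUMED_USD_PER_MILLION_TOKENS)
    print(f"${budget_usd:.2f} buys at most {tokens:,} tokens of thinking")
```

Whatever the actual price is, the relationship is linear: a 10x smaller budget means a 10x smaller thinking allowance, and that allowance has to be split between whichever qualities you asked the model to optimize for.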