I appreciate the work done here.
Been having this feeling that things have got worse recently but didn't think it could be model related.
The most frustrating aspect recently (I have learned and accepted that Claude produces bad code and probably always did, mea culpa) is the non-compliance. Claude is racing away doing its own thing, fixing things i didn't ask, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.
The stuff about token consumption is also interesting. Minimax/Composer have this habit of extensive thinking and it is said to be their strength but it seems like that comes at a price of huge output token consumption. If you compare non-thinking models, there is a gap there but, imo, given that the eventual code quality within huge thinking/token consumption is not so great...it doesn't feel a huge gap.
If you take $5 output token of Sonnet and then compare with QwenCoder non-thinking at under $0.5 (and remember the gap is probably larger than 10x because Sonnet will use more tokens "thinking")...is the gap in code quality that large? Imo, not really.
Have been a subscriber since December 2024 but looking elsewhere now. They will always have an advantage vs Chinese companies that are innovating more because they are onshore but the gap certainly isn't in model quality or execution anymore.
> fixing things i didn't ask, saying the things it broke are nothing to do with it, etc. Quite unpleasant to work with.
maybe they tried to give it the characteristics of motivated junior developers
classic :D i did think when i wrote that maybe AGI is already here, definitely worked with enough devs like that