> I've seen Claude "monkey patching" a system so that it returns true to the tests.
I’ve watched GitHub Copilot do the same thing. I’ve also seen it double down on ridiculous things and just spew crash-laden messes. There seems to be a low ceiling on how “competent” it is, which makes sense.
In my own use of Copilot, I've found that Gemini gives me better results than ChatGPT and Claude, to the point where ChatGPT and Claude will flounder on a problem for hours of back-and-forth while Gemini will one-shot the same thing.