so much to unpack here and almost poetic that you say this
First, the model will write out that it “thought” and “double checked” its output
Second, this was in a fresh context window of the latest model (that isn’t fable b/c we can’t use it for reasons beyond this thread), and it was on its second highest thinking mode. I shouldn’t have to double check something that it claimed to have burned more tokens double checking
Outside of it costing me more money to fix what it claims to do, the main point of this article is that models are implementing things nearly end to end, and if we scale it up, it will only continue to do that. I intentionally chose an example that takes < 70 lines to implement in TS (btw, the language with the second-most data available to scrape and train on). I would assume a machine that can almost implement things end to end should be able to implement something that is 70 lines of code and has been documented for nearly 50 years.
My point is that time and time again, on the most trivial examples, under the best of conditions, and with unlimited amounts of money, they can’t do what is claimed
Outside of that, the follow-up comments that say, “oh you need to ask it to check its own work and be so involved in the process of it writing the code that you need to spot check it” go against everything the article states
The best analogy I have for this is Newspeak in 1984: it’s just vibes dictating vibes and trying to make people claim that the vibes are right. And if you try to validate the vibes, your vibes are just wrong because you don’t get the vibes. The claims that it made have no data backing them. And if there is data, it’s cherry picked. Please use your brain and stop outsourcing your ability to think to a machine that is incorrectly thinking on your behalf
Edit: Typos
I guess the code you write always works correctly when you submit the PR, and code reviews don't find real problems in it. It makes sense then to have the exact same bar when using LLMs