Interesting prediction. It sort of makes sense. I have noticed that LLMs are very good at solving problems whose solutions are easy to check[0]. It ends up being quite an advantage to work on such problems, because an LLM rarely one-shots a true solution through token generation alone. Usually the multi-shot is 'hidden' in the reasoning tokens, or, for my use cases, it's handled by the verification machine.

A formally verified system is easier for the model to check, and consequently easier for it to program against. I suppose the question is whether formal methods are sufficiently tractable that they actually help the LLM finish the job before it runs out of context.
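As a toy illustration of "easy to check" (my example, not from the article): a proof assistant gives a binary accept/reject signal, so a generate-and-check loop has an unambiguous success condition.

```lean
-- The checker (`lean`) either accepts this file or rejects it; there is
-- no "looks plausible" middle ground. A wrong proof term fails at once.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```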

Regardless, I often use coding assistants in that manner:

1. First, I use the assistant to come up with the success-condition program

2. Then I use the assistant to solve the original problem, instructing it to check its work against the success-condition program

3. Then I check the solution myself

It's not rocket science; it's the same approach we've always taken to problem-solving. But it is nice that modern tools can work this way: with it, I can usually run Opus or GPT-5.2 in unattended mode.
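Concretely, the loop looks something like the sketch below. All the names are made up for illustration: `check_solution.py` stands in for whatever success-condition program came out of step 1, and `solve_with_assistant` stands in for your actual assistant integration.

```python
import subprocess

MAX_ATTEMPTS = 5

def check(solution_path: str) -> bool:
    """Step 1's artifact: run the success-condition program (here a
    hypothetical script) and treat exit code 0 as a pass."""
    result = subprocess.run(
        ["python", "check_solution.py", solution_path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

def solve_with_assistant(feedback: str) -> str:
    """Step 2 placeholder: ask the assistant for a candidate solution,
    passing along the checker's complaints from the last attempt."""
    raise NotImplementedError("wire this up to your coding assistant")

def solve() -> str | None:
    feedback = ""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        candidate = solve_with_assistant(feedback)
        if check(candidate):
            return candidate  # step 3: a human still reviews this
        feedback = f"attempt {attempt} failed the checker"
    return None  # out of budget; escalate to a human
```

The point of the structure is that the model is never trusted on its own say-so: every candidate has to pass a program before it reaches a human.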

0: https://wiki.roshangeorge.dev/w/Blog/2025-12-11/LLMs_Excel_A...

The issue is that many problems aren't easy to verify, and LLMs also excel at producing garbage output that appears correct on the surface. There are fields of science where verification is a long and arduous process even for work produced by humans. Throwing LLMs at those problems only creates more output for a human to waste time verifying.

Yes, that is true. And for those problems, those who use LLMs will not get very far.

As for those who use LLMs to impersonate humans, which is itself a verification problem (verifying that a solution purported to be built by a human actually works), I have no doubt we will rapidly evolve norms that make us more resistant to them. The cost of fraud and anti-fraud is not zero, but I suspect it will be much less than we fear.

> 1. First, I use the assistant to come up with the success-condition program

> 2. Then I use the assistant to solve the original problem, instructing it to check its work against the success-condition program

This sounds a lot like Test-Driven Development. :)

It is the exact same flow. I think a lot of things in programming follow that pattern. Another one is identifying the commit that introduced a regression: write the test program, then git-bisect.
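For the bisect case, the success-condition program is just a script whose exit code marks a commit as good or bad. A minimal sketch, with the module and the regressed behavior invented for illustration:

```python
#!/usr/bin/env python3
# test_regression.py: exit 0 if this commit is good, 1 if it is bad.
# Then let git drive it across history:
#   git bisect start <bad-commit> <good-commit>
#   git bisect run python3 test_regression.py
import sys

from mylib import parse  # hypothetical module containing the regression

def main() -> int:
    # The behavior that regressed: parse("") used to return an empty list.
    return 0 if parse("") == [] else 1

if __name__ == "__main__":
    sys.exit(main())
```

`git bisect run` treats exit code 0 as good, 1-127 (except 125) as bad, and 125 as "skip this commit", so the same test program doubles as the success condition for the whole search.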