These models are so powerful.
It's totally possible to build entire software products in a fraction of the time it took before.
But, reading the comments here, the behavior from one point version to the next (not even a major version, mind you) seems wildly divergent.
It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.
But it also feels like you have to be diligent about adopting new models (even same family and just point version updates) because they operate totally differently regardless of your prompt and agent files.
Imagine managing a team of software developers where every month it was an entirely new team with radically different personalities, career experiences and guiding principles. It would be chaos.
I suspect that older models will be deprecated quickly and unexpectedly, or, worse yet, will be swapped out for versions with subtly different behavioral characteristics without notice. It'll be quicksand.
I had an interesting experience recently where I ran Opus 4.6 against a problem that o4-mini had previously convinced me wasn't tractable... and Opus 4.6 found me a great solution. https://github.com/simonw/sqlite-chronicle/issues/20
This inspired me to point the latest models at a bunch of my older projects, resulting in a flurry of fixes and unblocks.
I have a codebase (personal project) and every time there is a new Claude Opus model I get it to do a full code review. Never had any breakages in the last couple of model updates. Worried that one day it'll just generate a binary and delete all the code.
No version control?
I was being facetious. I mean that one day models might skip the middleman of code and compilation, and take your specs straight to an ultra-efficient binary.
Musk was saying that recently but I don't see it being efficient or worthwhile to do this. I could be proven brutally wrong, but code is language; executables aren't. There's also no real reason to bother with this when we have quick-compiling languages.
More realistically, I could see particular languages and frameworks proving out to be more well-designed and apt for AI code creation; for instance, I was always too lazy to use a strongly-typed language, preferring Ruby for the joy of writing in it (obsessing about types is for a particular kind of nerd that I've never wanted to be). But now with AI, everything's better with strong types in the loop, since reasoning about everything is arguably easier and the compiler provides stronger guarantees about what's happening. Similarly, we could see other linguistic constructs come to the forefront because of what they allow when the cost of implementation drops to zero.
You can map tokens to CPU instructions and train a model on that, that's what they do for input images I think.
I think the main limitation of the current models is not that CPU instructions can't be tokens (they can be, as with assembly), it's that the models are causal: they would need to generate a binary entirely from start to finish, sequentially.
If we've learned anything over the last 50 years of programming, it's that that's hard, which is why we invented programming languages. Why would it be simpler to generate the machine code directly? Sure, maybe an LLM-to-application path can exist, but my money is on there being a whole toolchain in the middle, and it will probably be the same old toolchain we're using currently: an OS, probably Linux.
Isn't it more common that stuff builds on the existing infra instead of a super duper revolution that doesn't use the previous tech stack? It's much easier to add onto rather than start from scratch.
Those CPU instructions still need to be making calls out to things, though. Hallucinated source code will reveal its flaws through linters, compiler errors, test suites. A hallucinated binary will not reveal its flaws until it segfaults.
Programs that pass linters, compile, and pass test suites can still segfault. A good test harness that tests the binary comprehensively can limit this. The model could be trained on patterns of efficient assembly it uses rather than source code.
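A black-box harness of the kind described above would exercise the binary purely through its observable behavior (exit code and output), never its source. A minimal sketch in Python, using the Python interpreter itself as a stand-in for a compiled binary since a sketch can't ship a real artifact:

```python
import subprocess
import sys

def check_binary(cmd, stdin_data, expected_stdout, expected_code=0):
    """Run an opaque binary as a black box and compare its observable
    behavior (exit code and stdout) against expectations."""
    result = subprocess.run(
        cmd, input=stdin_data, capture_output=True, text=True
    )
    return (result.returncode == expected_code
            and result.stdout == expected_stdout)

# The "binary" under test here is just the Python interpreter running a
# one-liner; a real harness would point at the generated executable.
ok = check_binary([sys.executable, "-c", "print(2 + 2)"], "", "4\n")
print(ok)  # → True
```

The point is that nothing in the harness depends on how the binary was produced, which is exactly what you'd need if there were no source code to lint or compile.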
From the project description here for your sqlite-chronicle project:
> Use triggers to track when rows in a SQLite table were updated or deleted
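For anyone curious what that trigger approach looks like in practice, here's a minimal sketch using Python's sqlite3. The table and column names (`items`, `_chronicle`) are invented for illustration; they are not sqlite-chronicle's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);

-- Audit table populated by triggers, never written to directly.
CREATE TABLE _chronicle (
    item_id INTEGER,
    action TEXT,
    changed_at TEXT DEFAULT (datetime('now'))
);

CREATE TRIGGER items_update AFTER UPDATE ON items
BEGIN
    INSERT INTO _chronicle (item_id, action) VALUES (old.id, 'updated');
END;

CREATE TRIGGER items_delete AFTER DELETE ON items
BEGIN
    INSERT INTO _chronicle (item_id, action) VALUES (old.id, 'deleted');
END;
""")

conn.execute("INSERT INTO items (name) VALUES ('widget')")
conn.execute("UPDATE items SET name = 'gadget' WHERE id = 1")
conn.execute("DELETE FROM items WHERE id = 1")
print(conn.execute("SELECT item_id, action FROM _chronicle").fetchall())
# → [(1, 'updated'), (1, 'deleted')]
```

The nice property is that the tracking happens inside SQLite itself, so any client that touches the table gets recorded, not just your own application code.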
Just a note in case it's interesting to anyone: the SQLite-compatible Turso database has CDC, a changes table! https://turso.tech/blog/introducing-change-data-capture-in-t...
This may seem obvious, but many people overlook it. The effect is especially clear when using an AI music model. For example, in Suno AI you can remaster an older AI-generated track with a newer model. I do this with all my songs whenever a new model is released. It makes it super easy to see the improvements that were made to the models over time.
I continue to get great value out of having claude and codex bound together in a loop: https://github.com/pjlsergeant/moarcode
They are one, the ring and the dark lord
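The shape of that writer/reviewer loop can be sketched in a few lines. Note that `call_writer` and `call_reviewer` here are invented stubs standing in for calls to two different models; this is not moarcode's actual API.

```python
def call_writer(task, feedback):
    # Stand-in stub for the drafting model (e.g. Claude). A real
    # implementation would send the task plus reviewer feedback.
    if feedback:
        return "def solve():\n    return 42"
    return "def solve():\n    pass"

def call_reviewer(code):
    # Stand-in stub for the reviewing model (e.g. Codex).
    # Returns a critique string, or None to approve.
    return "fill in the stub body" if "pass" in code else None

def bound_loop(task, max_rounds=3):
    """Bounce drafts between the two models until the reviewer approves."""
    feedback = None
    code = ""
    for _ in range(max_rounds):
        code = call_writer(task, feedback)
        feedback = call_reviewer(code)
        if feedback is None:
            return code
    return code

print(bound_loop("implement solve()"))
```

With real model calls behind the stubs, the reviewer's critiques feed back into the next draft, which is the whole trick: each model covers the other's blind spots.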
I keep giving the top Anthropic, Google and OpenAI models problems.
They come up with passable solutions and are good for getting juices flowing and giving you a start on a codebase, but they are far from building "entire software products" unless you really don't care about quality and attention to detail.
That is my experience too. I don't know what others are building, but the more novel the task, the worse these models perform.
Yeah, I keep maintaining a specific app I built with GPT 5.1 Codex Max using that exact model, because it continues to work for the requests I send it, while attempts with other models, even 5.2 or 5.3 Codex, seemed to produce odd results. If I were superstitious I'd say it's almost like the model that wrote the code likes working on that code better. Perhaps there's something about the structure it created, though, that it finds easier to understand…
> It feels like we are now able to manage incredibly smart engineers for a month at the price of a good sushi dinner.
In my experience it’s more like idiot savant engineers. Still remarkable.
It's like getting access to an amazing engineer, but you get a new individual engineer with each prompt, not one consistent mind.
Sushi dinner? What are you building with AI, a calculator?
I have long suspected that a large part of people's distaste for given models comes from their comfort with their daily driver.
Which I guess feeds back to prompting still being critical for getting the most out of a model (outside of subjective stylistic traits the models have in their outputs).
"These models are so powerful."
Careful.
Gemini simply, as of 3.0, isn't in the same class for work.
We'll see in a week or two if it really is any good.
Bravo to those who are willing to give up their time to test for Google to see if the model is really there.
(history says it won't be. Ant and OAI really are the only two in this race ATM).