It feels like Fable is slightly smarter but an overall worse tool, exactly because of this.
It's constantly turning what should be a 50 LOC patch from a single prompt into a 30-minute exploration that is totally not worth it. Often it's even wrong.
I trialed it on some rather simple stuff: backfilling a Redis dedupe cache after the hash function changed. Instead of running the new hash function on every DB value to expand the cache, it implemented an overly complex cache update that tried to guess the hash function version of each cached value and recalculate only the old hashes. I can imagine this making sense in some context, maybe? But not 30 minutes of token burn that I ended up replacing with a 10-line for loop.
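For anyone curious, the for loop was roughly this shape. This is a sketch rather than my actual code: `new_hash`, `backfill_dedupe_cache` and the sample values are made-up names, SHA-256 just stands in for whatever the new hash function is, and a plain Python set stands in for the Redis set so it runs anywhere.

```python
import hashlib
from typing import Callable, Iterable


def new_hash(value: str) -> str:
    """Hypothetical stand-in for the new hash function (SHA-256 here)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def backfill_dedupe_cache(
    values: Iterable[str], add_to_cache: Callable[[str], object]
) -> int:
    """Hash every DB value with the new function and add it to the cache.

    No attempt is made to detect which cached entries used the old hash;
    old entries are simply left in place and new ones are added alongside.
    Returns the number of values processed.
    """
    processed = 0
    for value in values:
        add_to_cache(new_hash(value))
        processed += 1
    return processed


# In-memory stand-ins for the DB rows and the Redis set (assumptions).
db_values = ["a@example.com", "b@example.com", "a@example.com"]
cache = set()
written = backfill_dedupe_cache(db_values, cache.add)
```

With redis-py you would pass something like `lambda h: r.sadd("dedupe", h)` instead of `cache.add`, where the key name `"dedupe"` is again just a placeholder. Adding to a set is idempotent, so rerunning the loop is harmless.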
I fear that this is generally bad news for programming. LLM tech is clearly running into a diminishing returns wall on intelligence but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved, except I guess people who sell the tokens and people who can afford these tokens to scan for 0-days.
> but a response to that is to just make them more relentless which is a pretty poor solution for everyone involved
I see two problems with LLMs & agents which won't be fixed, possibly ever.
1) They don't have causal models. All they can do is trial-and-error exploration, which works quite well for many problems. But many other problems require a causal model.
2) Prompts lack precision, and programming languages and machine models were invented to solve exactly this problem. English is great, but it is not a programming language.
I actually think internally they knew they hit diminishing returns a while ago.
They’ve been doing a lot of strategic introduction and manipulation in the run up to the IPO, and it’s worked in that regard.
The other day I was doing something that required CC to update like 15-20 files in exactly the same way (hoist a specific function out of the component body). Instead of just updating the files, it spun up multiple agents, one of which wrote a Perl script to hunt down all the files, do some regex, and replace all occurrences. And then instead of just running tsc to check for errors, it wrote a script to run tsc in each of the subagents and combine the results.
It was actually pretty maddening, as what should have taken a minute or two tops took like 10 because it went down this route.
I'm gonna try something much more complex later, but for simple things, it felt like driving a Corvette to the mailbox.