Right now they are fancy autocompletes. That is enormously useful for a language where 90% of the typing is boilerplate in desperate need of autocompletion.
Most of the “interesting” logic I write is nowhere close to autocompleted successfully and most of it needs to be thrown out. If you’re spending most of your days writing glue that translates one set of JSON documents or HTTP requests into another I’m sure they’re wildly useful.
I don't know which models you are using, but in my experience they have been way more than fancy autocomplete today. I have had thousand-line programs written and refined with just a few prompts. On the analysis and code review side, they have been even more impressive, finding issues and potential impacts of changes and describing the intent behind the code. I implore you to revisit good models like Gemini 2.5 Pro. To wit, there was an actual Linux kernel vulnerability in the SMB protocol stack discovered with an LLM a few days ago.
Even if we take the narrow use case of boilerplate glue code that transforms data from one place to another, that encompasses almost all programs people write, statistically. There was a running joke at Google "we are just moving protobufs." I would not call this "fancy autocomplete."
It comes back to the nature of the work; I've got a hobby project which is basically an emulator of CP/M, a system from the 70s, and there is a bug in it.
My emulator runs BBC Basic, Zork, Turbo Pascal, etc, etc, but when it is used to run a vintage C compiler from the 80s it gives the wrong results.
Can an LLM help me identify the source of this bug? No. Can I say "fix it"? No. In the past I said "Write a test-case for this CP/M BDOS function, in the same style as the existing tests" and it said "Nope" and hallucinated functions in my codebase which it tried to call.
Basically, if I use an LLM as an auto-completer it works slightly better than my Emacs setup already did, but anything more than that, for me, fails and, worse still, fails in a way that eats my time.
> Can an LLM help me identify the source of this bug? No. Can I say "fix it"? No. In the past I said "Write a test-case for this CP/M BDOS function, in the same style as the existing tests"
These are all things I've done successfully with ChatGPT o1 and o3 in a 7.5kloc Rust codebase.
I find the key is to include all information which may be necessary to solve the problem in the prompt. That simple.
I wrote a summary of my issue in a GitHub comment, and I guess I will try again:
https://github.com/skx/cpmulator/issues/234#issuecomment-291...
But I'm not optimistic; all previous attempts at "identify the bug", "fix the bug", "highlight the area where the bug occurs" just turned into time sinks and failures.
It seems like your problem may be related to asking it to analyze the whole emulator _and_ compiler to find the bug. I'd recommend working first to pare the bug down to a minimal test case which triggers the issue - the LLM can help with this task - and then feed the LLM the minimal test case along with the emulator source and a description of the bug and any state you can exfiltrate from the system as it experiences the issue.
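For what it's worth, the "pare it down" step can be partly mechanised. Below is a minimal sketch in Python of a greedy, delta-debugging-style reducer: it keeps deleting chunks of the input while the bug still reproduces. The names `shrink`, `still_fails`, `SOURCE` and `demo_fails` are all made up for illustration. In real use `still_fails` would have to write the candidate to disk, run the compiler under the emulator, and compare the output against a known-good result; here it is replaced by a trivial stand-in so the sketch runs on its own.

```python
from typing import Callable, List, Sequence


def shrink(lines: Sequence[str], still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedily drop chunks of `lines` for as long as `still_fails` stays True."""
    current = list(lines)
    if not still_fails(current):
        raise ValueError("the starting input does not reproduce the bug")

    chunk = max(len(current) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + chunk:]
            if candidate and still_fails(candidate):
                current = candidate  # the chunk was irrelevant, keep it removed
            else:
                i += chunk  # the chunk is needed, step over it
        chunk //= 2  # retry with smaller chunks; reaches 0 and ends the loop
    return current


# Hypothetical stand-in for "compile under the emulator and check the result":
# pretend the bug needs a long variable and a right shift on it.
def demo_fails(candidate: List[str]) -> bool:
    text = "\n".join(candidate)
    return "long a" in text and "a >> 1" in text


SOURCE = [
    "#include <stdio.h>",
    "int main() {",
    "long a = 7;",
    "int b = 3;",
    'printf("%d", b);',
    "a = a >> 1;",
    "return 0;",
    "}",
]

if __name__ == "__main__":
    print(shrink(SOURCE, demo_fails))
```

The reduced output will usually not be a valid program on its own, so a real predicate should also reject candidates that fail for unrelated reasons (for example, ones that no longer compile on a known-good system).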
Indeed, when running a vintage, closed-source binary under an emulator it's hard to see what it is trying to do, short of decompiling it and understanding it. Then I can use that knowledge to improve the emulation until it successfully runs.
I suggested in my initial comment I'd had essentially zero success in using LLMs for these kinds of tasks, and your initial reply was "I've done it, just give all the information in the prompt", and I guess here we are! LLMs clearly work for some people, and some tasks, but for these kinds of issues I'd say we're not ready and my attempts just waste my time, and give me a poor impression of the state of the art.
Even "Looking at this project which areas of the CP/M 2.2 BIOS or BDOS implementations look suspicious?", "Identify bugs in the current codebase?", "Improve test-coverage to 99% of the BIOS functionality" - prompts like these, which feel like they should cut the job in half because they don't relate to running specific binaries, also do nothing useful. Asking for test-coverage is an exercise in hallucination, and asking for omissions against the well-known CP/M "spec" results in noise. It's all rather disheartening.
> Indeed, when running a vintage, closed-source binary under an emulator it's hard to see what it is trying to do, short of decompiling it and understanding it.
Break it down. Tell the LLM you're having trouble figuring out what the compiler running under the emulator is doing to trigger the issue, what you've done already, and ask for its help using a debugger and other tools to inspect the system. When I did this o1 taught me some new LLDB tricks I'd never seen before. That helped me track down the cause of a particularly pernicious infinite recursion in the geometry processing code of a CAD kernel.
> Even "Looking at this project which areas of the CP/M 2.2 BIOS or BDOS implementations look suspicious?", "Identify bugs in the current codebase?", "Improve test-coverage to 99% of the BIOS functionality" - prompts like these, which feel like they should cut the job in half because they don't relate to running specific binaries, also do nothing useful.
These prompts seem very vague. I always include a full copy of the codebase I'm working on in the prompt, along with a full copy of whatever references are needed, and rarely ask it questions as general as "find all the bugs". That is quite open-ended and provides little context for it to work with. Asking it to "find all the buffer overflows" will yield better results, as it would with a human. The more specific you can get, the better your results will be. It's also a good idea to ask the LLM to help you make better prompts for the LLM.
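To make "include a full copy of the codebase" concrete, here is a rough sketch of how such a prompt could be assembled. It is only an illustration: `build_prompt`, the `=== path ===` separator and the default suffixes are my own choices, not something any particular model or tool requires.

```python
from pathlib import Path
from typing import Iterable


def build_prompt(root: Path, question: str, suffixes: Iterable[str] = (".go", ".md")) -> str:
    """Put one specific question first, then every matching file under `root`."""
    wanted = tuple(suffixes)
    parts = [question.strip(), ""]
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in wanted:
            # Label each file with its relative path so the model can refer
            # to real locations instead of guessing at them.
            parts.append(f"=== {path.relative_to(root).as_posix()} ===")
            parts.append(path.read_text(encoding="utf-8", errors="replace"))
    return "\n".join(parts)


if __name__ == "__main__":
    # Hypothetical usage: a narrow question rather than "find all the bugs".
    print(build_prompt(Path("."), "Find buffer overflows in the BDOS file I/O code.")[:500])
```

Reference material (a spec, a manual page) can be added the same way by saving it as a file under `root` with one of the listed suffixes.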
> Asking for test-coverage is an exercise in hallucination, and asking for omissions against the well-known CP/M "spec" results in noise.
In my experience hallucinations are a symptom of not including the necessary relevant information in the prompt. LLM memories, like human memories, are lossy and if you force it to recall something from memory you are much more likely to get a hallucination as a result. I have never experienced a hallucination from a reasoning model when prompted with a full codebase and all relevant references. It just reads the references and uses them.
It seems like you've chosen a particularly extreme example - a vintage, closed-source, binary under an emulator - didn't immediately succeed, and have written off the whole thing as a result.
A friend of mine only had an ancient compiled Java app as a reference. He uploaded the binary right in the prompt, and the LLM one-shotted a rewrite in JavaScript that worked first time. Sometimes it just takes a little creativity and willingness to experiment.
7.5 kloc is pretty tiny; it sounds like you may be able to get the entire thing into the context.
Lots of Rust libraries are relatively small since Cargo makes using many libraries in a single project relatively easy. I think that works in favor of both humans and LLMs. Treating the context window as an indication that splitting code up into smaller chunks might be a good idea is an interesting practice.
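A quick way to check the "does it fit in the context" question before pasting everything in is a back-of-the-envelope estimate. The sketch below assumes roughly four characters per token, which is only a crude rule of thumb and varies by tokenizer and by language; the function names, the `.rs` default and the `reserve_tokens` figure are likewise placeholders of mine, not numbers from any model's documentation.

```python
from pathlib import Path


def estimate_tokens(root: Path, suffix: str = ".rs", chars_per_token: float = 4.0) -> int:
    """Very rough token estimate for all files under `root` ending in `suffix`."""
    total_chars = sum(
        len(p.read_text(encoding="utf-8", errors="replace"))
        for p in root.rglob(f"*{suffix}")
        if p.is_file()
    )
    return int(total_chars / chars_per_token)


def fits_in_context(root: Path, context_tokens: int, reserve_tokens: int = 8000) -> bool:
    """True if the estimate leaves `reserve_tokens` free for the question and the reply."""
    return estimate_tokens(root) <= context_tokens - reserve_tokens


if __name__ == "__main__":
    # Hypothetical numbers: substitute the context size of the model in use.
    print(estimate_tokens(Path(".")), fits_in_context(Path("."), context_tokens=128_000))
```

If the estimate comes out well over the limit, that is the signal mentioned above that the code might be worth splitting into smaller pieces.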
I generally have to maintain the code I write, often by myself; thousands of lines of uninspired slop code is the last thing I need in my life.
Friction is the birthplace of evolution.
Some people go camping now and then to hunt their own food, feel connected to nature, and feel that friction. They just don't want it every day, just like they don't tend to generate the underlying uninspired assembly themselves. FWIW, if your premise is that the code they generate is necessarily unmaintainable compared to an average CS college graduate human baseline, I'd argue against that premise.