> Can an LLM help me identify the source of this bug? No. Can I say "fix it"? No. In the past I said "Write a test-case for this CP/M BDOS function, in the same style as the existing tests"

These are all things I've done successfully with ChatGPT o1 and o3 in a 7.5kloc Rust codebase.

I find the key is to include in the prompt all the information that may be necessary to solve the problem. That simple.
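To make that concrete, here's the kind of throwaway script I mean for packing a codebase into a single prompt. This is a minimal sketch in Python; the file extension, the framing text, and the output path are placeholders to adapt, not anything specific to the projects discussed here:

    # prompt_pack.py - concatenate a small codebase into one prompt file.
    import pathlib

    root = pathlib.Path(".")  # assumes you run this from the repo root
    parts = ["Full source of my project follows; my question is at the end.\n"]
    for path in sorted(root.rglob("*.rs")):  # adjust the extension(s) to taste
        parts.append(f"\n--- {path} ---\n")
        parts.append(path.read_text(encoding="utf-8", errors="replace"))
    parts.append("\nQuestion: <describe the bug: expected vs. actual behaviour>\n")
    pathlib.Path("prompt.txt").write_text("".join(parts), encoding="utf-8")

Paste the result into the chat along with any reference documents, then put your actual question at the end.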

I wrote a summary of my issue in a GitHub comment, and I guess I'll try again:

https://github.com/skx/cpmulator/issues/234#issuecomment-291...

But I'm not optimistic; all previous attempts at "identify the bug", "fix the bug", or "highlight the area where the bug occurs" have just turned into time sinks and failures.

It seems like your problem may be related to asking it to analyze the whole emulator _and_ compiler to find the bug. I'd recommend first paring the bug down to a minimal test case that triggers the issue - the LLM can help with this task - and then feeding the LLM that minimal test case along with the emulator source, a description of the bug, and any state you can exfiltrate from the system as it experiences the issue.
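Here's the shape I'd give the shrinking loop, as a purely illustrative Python sketch: the cpmulator invocation, the file names, and the "non-zero exit or hang means failure" check are all assumptions you'd replace with your actual reproducer.

    # shrink.py - greedy reducer: repeatedly drop chunks of the failing input
    # and keep the smaller version whenever the bug still reproduces.
    import pathlib
    import subprocess

    def still_fails(data: bytes) -> bool:
        pathlib.Path("candidate.src").write_bytes(data)
        try:
            # Assumption: this command reproduces the bug; swap in your real
            # invocation and whatever check actually signals the failure.
            proc = subprocess.run(
                ["cpmulator", "compiler.com", "candidate.src"],
                capture_output=True, timeout=30)
        except subprocess.TimeoutExpired:
            return True  # treat a hang as "still failing"
        return proc.returncode != 0

    data = pathlib.Path("failing-input.src").read_bytes()
    chunk = max(1, len(data) // 2)
    while chunk >= 1:
        i = 0
        while i < len(data):
            trial = data[:i] + data[i + chunk:]
            if trial and still_fails(trial):
                data = trial  # removal kept the failure; keep the smaller input
            else:
                i += chunk  # that chunk was needed; leave it and move on
        chunk //= 2
    pathlib.Path("minimal.src").write_bytes(data)

Once the input is minimal, the prompt writes itself: minimal input, emulator source, bug description, captured state - everything in one place.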

Indeed, when running a vintage, closed-source binary under an emulator it's hard to see what it is trying to do, short of decompiling and understanding it. Then I can use that knowledge to improve the emulation until the binary runs successfully.

I said in my initial comment that I'd had essentially zero success using LLMs for these kinds of tasks; your reply was "I've done it, just give all the information in the prompt", and I guess here we are! LLMs clearly work for some people and some tasks, but for these kinds of issues I'd say we're not ready: my attempts just waste my time and leave me with a poor impression of the state of the art.

Even "Looking at this project which areas of the CP/M 2.2 BIOS or BDOS implementations look sucpicious?", "Identify bugs in the current codebase?", "Improve test-coverage to 99% of the BIOS functionality" - prompts like these feel like they should cut the job in half, because they don't relate to running specific binaries also do nothing useful. Asking for test-coverage is an exercise in hallucination, and asking for omissions against the well-known CP/M "spec" results in noise. It's all rather disheartening.

> Indeed, when running a vintage, closed-source binary under an emulator it's hard to see what it is trying to do, short of decompiling and understanding it.

Break it down. Tell the LLM you're having trouble figuring out what the compiler running under the emulator is doing to trigger the issue, explain what you've already tried, and ask for its help using a debugger and other tools to inspect the system. When I did this, o1 taught me some LLDB tricks I'd never seen before, which helped me track down the cause of a particularly pernicious infinite recursion in the geometry-processing code of a CAD kernel.
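For the infinite-recursion case specifically, one trick that pairs well with the debugger: dump a deep backtrace to a file (lldb's `bt` will happily print thousands of frames) and find the repeating frame sequence programmatically. A rough Python sketch - the regex assumes lldb's usual frame format, where the function name follows a backtick, so adjust it for your debugger's output:

    # cycle.py - find the shortest immediately-repeating run of frames
    # in a backtrace saved to trace.txt (e.g. from lldb's `bt`).
    import re

    frames = []
    for line in open("trace.txt", encoding="utf-8"):
        m = re.search(r"`([\w:~<>]+)", line)  # function name after the backtick
        if m:
            frames.append(m.group(1))

    def find_cycle(frames):
        # Return the shortest block of frames that repeats back-to-back.
        for width in range(1, len(frames) // 2 + 1):
            for start in range(len(frames) - 2 * width + 1):
                if frames[start:start + width] == \
                        frames[start + width:start + 2 * width]:
                    return frames[start:start + width]
        return None

    cycle = find_cycle(frames)
    print(" -> ".join(cycle) if cycle else "no repeating cycle found")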

> Even "Looking at this project which areas of the CP/M 2.2 BIOS or BDOS implementations look sucpicious?", "Identify bugs in the current codebase?", "Improve test-coverage to 99% of the BIOS functionality" - prompts like these feel like they should cut the job in half, because they don't relate to running specific binaries also do nothing useful.

These prompts seem very vague. I always include a full copy of the codebase I'm working on in the prompt, along with a full copy of whatever references are needed, and I rarely ask questions as general as "find all the bugs" - that's quite open-ended and gives the model little context to work with. Asking it to "find all the buffer overflows" will yield better results, as it would with a human. The more specific you can get, the better your results will be. It's also a good idea to ask the LLM to help you write better prompts for the LLM.

> Asking for test coverage is an exercise in hallucination, and asking for omissions against the well-known CP/M "spec" results in noise.

In my experience hallucinations are a symptom of not including the necessary relevant information in the prompt. LLM memories, like human memories, are lossy, and if you force the model to recall something from memory you are much more likely to get a hallucination as a result. I have never experienced a hallucination from a reasoning model when prompted with a full codebase and all relevant references. It just reads the references and uses them.

It seems like you've chosen a particularly extreme example - a vintage, closed-source binary under an emulator - didn't immediately succeed, and have written off the whole thing as a result.

A friend of mine had only an ancient compiled Java app as a reference; he uploaded the binary right in the prompt, and the LLM one-shotted a rewrite in JavaScript that worked the first time. Sometimes it just takes a little creativity and willingness to experiment.

7.5 kloc is pretty tiny; it sounds like you may be able to fit the entire thing into the context window.
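A quick back-of-the-envelope check (the ~4 characters per token figure is a common rule of thumb, not an exact count - the real number depends on the model's tokenizer):

    # tokens.py - rough estimate of how much context a codebase will consume.
    import pathlib

    chars = sum(len(p.read_text(encoding="utf-8", errors="replace"))
                for p in pathlib.Path(".").rglob("*.rs"))
    print(f"{chars} chars, roughly {chars // 4} tokens")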

Lots of Rust libraries are relatively small, since Cargo makes it easy to use many libraries in a single project. I think that works in favor of both humans and LLMs. Treating the context window as a hint that splitting code into smaller chunks might be a good idea is an interesting practice.