Indeed running a vintage, closed-source, binary under an emulator it's hard to see what it is trying to do, short of decompiling it, and understanding it. Then I can use that knowledge to improve the emulation until it successfully runs.
I suggested in my initial comment I'd had essentially zero success in using LLMs for these kind of tasks, and your initial reply was "I've done it, just give all the information in the prompt", and I guess here we are! LLMs clearly work for some people, and some tasks, but for these kind of issues I'd say we're not ready and my attempts just waste my time, and give me a poor impression of the state of the art.
Even "Looking at this project which areas of the CP/M 2.2 BIOS or BDOS implementations look sucpicious?", "Identify bugs in the current codebase?", "Improve test-coverage to 99% of the BIOS functionality" - prompts like these feel like they should cut the job in half, because they don't relate to running specific binaries also do nothing useful. Asking for test-coverage is an exercise in hallucination, and asking for omissions against the well-known CP/M "spec" results in noise. It's all rather disheartening.
> Indeed running a vintage, closed-source, binary under an emulator it's hard to see what it is trying to do, short of decompiling it, and understanding it.
Break it down. Tell the LLM you're having trouble figuring out what the compiler running under the emulator is doing to trigger the issue, what you've done already, and ask for it's help using a debugger and other tools to inspect the system. When I did this o1 taught me some new LLDB tricks I'd never seen before. That helped me track down the cause of a particularly pernicious infinite recursion in the geometry processing code of a CAD kernel.
> Even "Looking at this project which areas of the CP/M 2.2 BIOS or BDOS implementations look sucpicious?", "Identify bugs in the current codebase?", "Improve test-coverage to 99% of the BIOS functionality" - prompts like these feel like they should cut the job in half, because they don't relate to running specific binaries also do nothing useful.
These prompts seem very vague. I always include a full copy of the codebase I'm working on in the prompt, along with a full copy of whatever references are needed, and rarely ask it questions as general as "find all the bugs". That is quite open ended and provides little context for it to work with. Asking it to "find all the buffer overflows" will yield better results. As it would with a human. The more specific you can get the better your results will be. It's also a good idea to ask the LLM to help you make better prompts for the LLM.
> Asking for test-coverage is an exercise in hallucination, and asking for omissions against the well-known CP/M "spec" results in noise.
In my experience hallucinations are a symptom of not including the necessary relevant information in the prompt. LLM memories, like human memories, are lossy and if you force it to recall something from memory you are much more likely to get a hallucination as a result. I have never experienced a hallucination from a reasoning model when prompted with a full codebase and all relevant references. It just reads the references and uses them.
It seems like you've chosen a particularly extreme example - a vintage, closed-source, binary under an emulator - didn't immediately succeed, and have written off the whole thing as a result.
A friend of mine only had an ancient compiled java app as a reference, he uploaded the binary right in the prompt, and the LLM one-shotted a rewrite in javascript that worked first time. Sometimes it just takes a little creativity and willingness to experiment.