> My stance has been pretty rigid for some time: LLMs hallucinate, so they aren’t reliable building blocks. If you can’t rely on the translation step, you can’t treat it as a serious abstraction layer because it provides no stable guarantees about the underlying system.

This is technically true. But unimportant. When I write code in a higher level language and it gets compiled to machine code, ultimately I am testing statically generated code for correctness. I don’t care what type of weird tricks the compiler did for optimizations.

How is that any different from when someone is testing LLM-generated C code? I’m still testing C code that isn’t going to magically be changed by the LLM without my intervention, any more than my C code is going to be changed without my recompiling it.

On this latest project I was on, the Python code generated by Codex was “correct” on the happy path. But there were subtle bugs in the distributed locking mechanics and some other concurrency controls I specified. Ironically, those were both caught by throwing the code into ChatGPT in thinking mode.

No one is using an LLM to compute whether a number is even or odd at runtime.

Because for all high-level languages, errors surface at the level of the language itself. You do not write programs in Go and then verify them in opcodes with a disassembler. Syntax errors and runtime errors reference the Go files and symbols, not CPU registers.

The same thing happens in JavaScript. I debug it using a JavaScript debugger, not with gdb. Even when using a bash script, you don’t debug it by going into the programs’ source code, you just consult the man pages.

When using an LLM, I would expect not to have to go and verify that the code is actually semantically correct.

If it works with all of your human-written or even generated test cases, why do I care if it decided to use a while loop or a for loop?
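As a rough illustration of that point, here is a minimal Python sketch. The function names `sum_evens_for` and `sum_evens_while`, the `CASES` table, and the `check` helper are all hypothetical and made up for this example; they are not from any project mentioned in this thread. The same behavioral test cases are run against two implementations that differ only in loop style.

```python
def sum_evens_for(numbers):
    """Sum the even values using a for loop."""
    total = 0
    for n in numbers:
        if n % 2 == 0:
            total += n
    return total


def sum_evens_while(numbers):
    """Sum the even values using a while loop and an explicit index."""
    total = 0
    i = 0
    while i < len(numbers):
        if numbers[i] % 2 == 0:
            total += numbers[i]
        i += 1
    return total


# Hypothetical test table: each entry is (input, expected result).
CASES = [
    ([], 0),
    ([1, 3, 5], 0),
    ([2, 4, 6], 12),
    ([-2, 7, 10], 8),
]


def check(impl):
    """Return True if the implementation passes every case in CASES."""
    return all(impl(list(values)) == expected for values, expected in CASES)
```

Both `check(sum_evens_for)` and `check(sum_evens_while)` pass, because the tests only observe behavior, not which loop construct was chosen.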

Like I said above, I do know to watch out for implementations that “Work on my Machine” but don’t work at scale or under concurrency. But I have had to check for the same issues when I delegate work to more junior developers.
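A sketch of the kind of thing worth checking, in Python. Everything here is an assumption for illustration (the `UnsafeCounter` and `SafeCounter` classes and the two `run_*` helpers are hypothetical, and this is an in-process lock, not the distributed locking from the project mentioned earlier). The unguarded read-modify-write passes a single-threaded happy-path test, but only the locked version is guaranteed to keep every update when several threads run it.

```python
import threading


class UnsafeCounter:
    """Read-modify-write without a lock: fine single-threaded, racy under threads."""

    def __init__(self):
        self.value = 0

    def increment(self):
        current = self.value      # read
        self.value = current + 1  # write; another thread may have written in between


class SafeCounter:
    """Same interface, but the read-modify-write is guarded by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self.value += 1


def run_single_threaded(counter, n):
    """Happy-path check: call increment n times from one thread."""
    for _ in range(n):
        counter.increment()
    return counter.value


def run_concurrently(counter, threads, per_thread):
    """Call increment from several threads and return the final value."""
    def worker():
        for _ in range(per_thread):
            counter.increment()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter.value
```

`run_single_threaded(UnsafeCounter(), 1000)` returns 1000, so a happy-path test passes. `run_concurrently(SafeCounter(), 8, 1000)` returns 8000. With `UnsafeCounter` under threads, updates may or may not be lost depending on the interpreter and timing, which is exactly why this class of bug slips past “works on my machine” testing.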

This is not meant to be an insult toward you. But since I haven’t done front-end development for well over a decade, a front-end developer might as well be a “human LLM” to me. I’m going to give you the business requirements and constraints and you are going to come back with a website. I am just going to check it meets the business requirements and not tell you the how. I’m definitely not going to look at the code.

I just had a web project I had to modify for a new project. I used Codex and didn’t look at a line of code. Yeah, I know JavaScript. But I have no idea whether the code from the initial developer, who worked on another project I led, or the Codex changes were idiomatic. I know the developer and Codex met my functional requirements.

> I don’t care what type of weird tricks the compiler did for optimizations.

you might not, but plenty of others do. on the jvm for example, anyone building a performance sensitive application has to care about what the compiler emits + how the jit behaves. simple things like accidental boxing, megamorphic call preventing inlining, etc. have massive effects.

i've spent many hours benchmarking, inspecting in jitwatch, etc.

And 95%+ of developers aren’t writing performance-sensitive code. In my career, most bottlenecks I’ve seen are because of bad database design, network latency, or other infrastructure-related issues, or in the cloud days, startup latency for anything serverless.

Yes, I know every millisecond a company like Google can shave off is multiplied by billions of transactions a day and can save real money on infrastructure. But even at a second-tier company like Salesforce, it probably doesn’t matter.

it all matters. if more people took pride in their craft and understood the behavior of their tools, modern software wouldn’t be so horrid

To a first approximation, no one gets paid to write bespoke hand-crafted software. We get paid to make the company more money or save the company more money than the fully allocated cost to employ us to make computers do things. I take “pride” in the fact that software and implementations I designed meet the requirements of the people that paid me to write them - whether that be by a combination of my work and my delegated work to humans or LLMs.

Over the past decade, part of my job has been to design systems, talk to “stakeholders” and delegate some work and do some myself. I’m neither a web developer nor a mobile developer.

I don’t look at a line of code for those types of implementations. I do make sure they work. From my perspective, those that I delegated to might as well be “human LLMs”.

Which is a good example of how managed runtimes are already not deterministic and how hard it is to reproduce scenarios.

I agree. In my original comment, I went out of my way to say “C” in my hypothetical argument.

But even with C, it’s still not completely deterministic, with out-of-order execution, branch prediction, cache hits vs. misses, etc. Didn’t exactly this cause some of the worst processor-level security issues we have seen in years?