I use LLMs (like claude-code and codex-cli) the same way accountants use calculators. Without one, you waste all your focus on adding numbers; with one, you just enter values and check if the result makes sense. Programming feels the same—without LLMs, I’m stuck on both big problems (architecture, performance) and small ones (variable names). With LLMs, I type what I want and get code back. I still think about whether it works long-term, but I don’t need to handle every little algorithm detail myself.

Of course there are going to be discussions about what counts as real programming (just as I'm sure there were discussions about what counted as "real" accounting when the calculator arrived).

The moment we stop treating LLMs like people and see them as big calculators, it all clicks.

The issue with your analogy is that calculators do not hallucinate. They do not make mistakes. An accountant is able to fully offload the mental overhead of arithmetic because the calculator is reliable.

> The issue with your analogy is that calculators do not hallucinate. They do not make mistakes. An accountant is able to fully offload the mental overhead of arithmetic because the calculator is reliable.

If you've ever done any modeling or serious accounting, you'll find you feel more like a DBA than a person punching a calculator. You ask questions and then figure out how to get the answers you want by "querying" Excel cells. Often the querying isn't even in quotes.

To me, the analogy of the parent is quite apt.

But the database doesn't hallucinate data; it always does exactly what you ask it to do and gives you reliable numbers, unless you ask it to do a random operation.

I agree databases don't hallucinate, but somehow most databases still end up full of garbage.

Whenever people are doing the data entry, you shouldn't trust your data. It's not the same as LLM hallucinations, but it's not entirely different either.

I really don't understand the hallucination problem now in 2025. If you know what you're doing, know what you need from the LLM, and can describe it well enough that it would be hard to screw up, LLMs are incredibly useful. They can nearly one-shot an entire (edited here) skeleton architecture that I only need to nudge into the right place before adding what I want on top of it. Yes, I run into code from LLMs that I have to tweak, but it has been incredibly helpful for me. I haven't had hallucination problems in a couple of years now...

> I really don't understand the hallucination problem now in 2025

Perhaps this OpenAI paper would be interesting then (published September 4th):

https://arxiv.org/pdf/2509.04664

Hallucination is still absolutely an issue, and it doesn’t go away by reframing it as user error, saying the user didn’t know what they were doing, didn’t know what they needed to get from the LLM, or couldn’t describe it well enough.

That is why you check your results. If you know what the end outcome should be, it doesn't matter if it hallucinates. Even when it does, it has probably already done 90% of the work, which leaves you less to finish yourself.

This only works for classes of problems where checking the answer is easier than doing the calculation. Things like making a visualization, writing simple functions, etc. For those, it’s definitely easier to use an LLM.
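To make that concrete, here's a minimal sketch in Python (the dedupe function is a stand-in for whatever the LLM generated, not anything from this thread; the property checks are the cheap part):

    # Checking the answer can be far cheaper than producing it: treat the
    # LLM-written function as a black box and assert easy-to-state properties.
    def dedupe_keep_order(items):          # stand-in for LLM-generated code
        seen, out = set(), []
        for x in items:
            if x not in seen:
                seen.add(x)
                out.append(x)
        return out

    def check(items):
        result = dedupe_keep_order(items)
        assert len(result) == len(set(items))         # no duplicates survive
        assert set(result) == set(items)              # nothing lost or invented
        first_seen = [items.index(x) for x in result]
        assert first_seen == sorted(first_seen)       # original order preserved
        return result

    print(check([3, 1, 3, 2, 1]))   # [3, 1, 2]

Reading and running those asserts takes a fraction of the effort of writing the function, which is exactly the class of problem where this works.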

But a lot of software isn't like that. You can introduce subtle bugs along the way, so verifying is at least as hard as writing it in the first place. Likely harder, since writing code is easier than reading it for most people.

Exactly, thank you.

I recognize that an accountant’s job is more than just running a bunch of calculations and getting a result. But part of the job is doing that, and it would be a real PITA if their calculator was stochastic. I would definitely not call it a productivity enhancer.

If my calculator sometimes returned incorrect results I would throw it out. And I say this as an MLE who builds neural nets.

You still make mistakes. Just because you did it yourself doesn't mean it's error-free. The more complex the question, the more error-prone the work.

Thankfully, the more complex the question, the more likely it is that there's more than one way to derive the answer, and you can use that to cross-check.
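A toy example of what I mean (sketch only; the functions are made up, and the point is the independent second derivation, not the arithmetic):

    # Derive the same answer two independent ways; agreement is the check.
    def sum_of_squares_loop(n):
        return sum(i * i for i in range(1, n + 1))

    def sum_of_squares_formula(n):
        return n * (n + 1) * (2 * n + 1) // 6

    n = 10_000
    a, b = sum_of_squares_loop(n), sum_of_squares_formula(n)
    assert a == b, f"derivations disagree: {a} != {b}"
    print(a)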

Replace calculator with the modern equivalent: Excel.

It does make mistakes and is not reliable[0]. The user still needs to have a "feel" for the data.

(to be pedantic "Excel" doesn't make mistakes, people trusting its defaults do)

[0] https://timharford.com/2021/05/cautionary-tales-wrong-tools-...

With an LLM there is no learning curve though (or a minimal one at best). No expert can prevent an LLM from hallucinating, even (and especially) the people building them.

> (to be pedantic "Excel" doesn't make mistakes, people trusting its defaults do)

So what is your point? An expert who has mastered Excel doesn't have to check that Excel calculated things correctly; they just need to check that they gave Excel the right inputs and formulas. That is not true for an LLM: you do have to check that it actually did what you asked, regardless of how good you are at prompting.

The only thing I trust an LLM to do correctly is translation; they are very reliable at that. Other than that, I always verify.

"Just" check that every cell in the million-row xlsx file is correct.

See the issue here?

Excel has no proper built-in validation or test suite; I'm not sure about third-party ones. The last time I checked, some years back, there was about one, and it didn't do much.

All it takes is one person accidentally or unknowingly entering static data on top of a few formulas in the middle, and nobody will catch it. Or Excel "helps" by changing the SEPT1 gene to "September 1, 2025"[0] - that case got so bad they had to RENAME the gene to make Excel behave. "Just" doing it properly didn't work at scale.

The point I'm trying to get at here is that neither tool is perfect, and both require validation afterwards. With agentic coding we can verify the results - we have the tools for it, and the agent can run them automatically.

In this case Excel is even worse, because one human error can escalate massively and there is no simple way to verify the output; Excel has no unit-test equivalents or validators.

[0] https://www.progress.org.uk/human-genes-renamed-as-microsoft...
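And that's the contrast with code: once the data is out of Excel, a check like that is short to write. A rough sketch (the file name, column name, and date patterns below are made up for illustration):

    # Flag values in a gene-symbol column that look like they were silently
    # coerced into dates (the SEPT1 -> "1-Sep" class of corruption).
    import csv
    import re

    DATE_LIKE = re.compile(r"^\d{1,2}-[A-Za-z]{3}$|^[A-Za-z]{3,9} \d{1,2},? \d{4}$")

    with open("export.csv", newline="") as fh:             # hypothetical export
        for row_number, row in enumerate(csv.DictReader(fh), start=2):
            value = row["gene_symbol"]                      # hypothetical column
            if DATE_LIKE.match(value):
                print(f"row {row_number}: {value!r} looks date-coerced")

The hard part isn't writing the check; it's that Excel gives you nowhere to hang it and no habit of running it.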

You are describing a garbage in, garbage out problem. However, LLMs introduce a new type of issue, the “valid data in, garbage out” problem. The existence of the former doesn’t make the latter less of an issue.

“Just” checking a million rows is trivial depending on the types of checks you’re running. In any case, you would never want a check which yields false positives and false negatives, since that defeats the entire purpose of the check.

[deleted]

It depends on how much you want the LLM to do. I personally work at the function level and can easily verify whether it works with a look and a few tests.

That's why you tell claude code to write tests and use them, use linting tools, etc. And then you test the code yourself. If you're still concerned, /clear and then tell claude code that some other idiot wrote the code and it needs to tear it apart and critique it.

Hallucination is not an intractable problem; the stochastic nature of hallucinations makes them easy to catch with the same tools. I feel like hallucinations have become a cop-out, an excuse for people who don't want to learn how to use these new tools anyway.
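Concretely, the kind of loop I mean (a sketch; mymodule and slugify are stand-ins for whatever the agent just wrote, not real names from this thread):

    # test_slug.py -- run with pytest. A hallucinated import or method fails
    # at collection time; wrong logic fails at the asserts. Either way it is
    # caught without re-reading every line by hand.
    from mymodule import slugify   # hypothetical LLM-written function

    def test_basic():
        assert slugify("Hello, World!") == "hello-world"

    def test_idempotent():
        assert slugify(slugify("Already A Slug")) == slugify("Already A Slug")

    def test_no_spaces_left():
        assert " " not in slugify("lots   of   spaces")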

> you now have to not only review and double-check shitty AI code, but also hallucinated AI tests too

Gee thanks for all that extra productivity, AI overlords.

Maybe they should replace AI programmers with AI instead?

I said to make the chatbot do it, not to do all the reviewing yourself. You can do manual reviews once it makes something that works. In the meantime, you can be working on something else entirely.

> In the meantime, you can be working on something else entirely.

Like fixing useless and/or broken tests written by an LLM?

(Thank you, AI overlords, for freeing me from the pesky algorithmic and coding tedia so I can instead focus on fixing the mountains of technical debt you added!)

I'm assuming, based on the granularity, that you're referring to autocomplete, and surely that already doesn't feel like dial-up.