the details are key here. there is plenty of automatable financial work, sure, but also when it comes to reporting finances/costs (formally or informally) and having a real human being be accountable for them, you REALLY need to trust that nothing is hallucinated.

Any idea how they ensure this doesn't happen? As in, how can a user verify that the model did not touch any of the numbers and that it only built pipelines for them?

what I've been telling my CFO who wants to get AI involved in things is that for a lot of accounting and finance work "trust but verify" doesn't work, because verifying is often the same process as doing the work.

> Any idea how they ensure this doesnt happen?

Build a deterministic query set and automate it for monthly or daily reporting reconciliation.
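To make "deterministic" concrete: same query, same inputs, same answer on every run, with a hard assertion instead of a judgment call. A minimal sketch (the `ledger` table and its columns are made up for illustration):

```python
import sqlite3

# Hypothetical schema: a 'ledger' table with (entry_date, gl_code, amount).
def daily_recon(conn, day):
    """Sum postings per GL code for one day and tie them to the control total.

    Deterministic: no model in the loop, so a mismatch is a real recon break,
    not a hallucination to second-guess.
    """
    per_code = dict(conn.execute(
        "SELECT gl_code, SUM(amount) FROM ledger "
        "WHERE entry_date = ? GROUP BY gl_code",
        (day,),
    ).fetchall())
    control = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM ledger WHERE entry_date = ?",
        (day,),
    ).fetchone()[0]
    assert abs(sum(per_code.values()) - control) < 1e-9, "recon break"
    return per_code
```

Run it on a schedule and alert on the assertion; nothing about the numbers ever passes through an LLM.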

Leave AI out of it.

The "real humans" doing the tasks being replaced are overworked kids less than 2yrs out of college, working at 3am on an average of 4hrs of sleep. If the AI makes their jobs take half as much time, I bet they're a lot more likely to catch errors (and live longer).

at risk of sounding facetious, how exactly do you catch an error in a sum without performing the sum yourself?

How do you verify that all the tariffs are properly allocated to the correct GL code without going through the invoices and checking each tariff on the list? How do you make sure none were accidentally assigned to other GL codes? All you have is PDFs; you don't know what the AI did or didn't do with the info in them, and there aren't many ways to catch its errors without doing the work yourself.
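To put that point in code: the only real check on AI-allocated totals is recomputing them from the line items yourself, which is exactly the original work. A sketch (field names are made up):

```python
# Hypothetical inputs: 'line_items' extracted from invoices, 'ai_totals' the
# AI's claimed per-GL-code sums.
def verify_allocation(line_items, ai_totals):
    """Recompute per-GL-code sums from raw line items and diff against the AI.

    Note that the verify step re-performs the summation: checking the sum
    means doing the sum.
    """
    recomputed = {}
    for item in line_items:
        recomputed[item["gl_code"]] = (
            recomputed.get(item["gl_code"], 0.0) + item["amount"]
        )
    breaks = {
        code: (ai_totals.get(code), recomputed.get(code))
        for code in set(ai_totals) | set(recomputed)
        if abs(ai_totals.get(code, 0.0) - recomputed.get(code, 0.0)) > 0.005
    }
    return breaks  # empty dict means the totals tie out
```

And this still assumes the line items themselves were extracted from the PDFs correctly, which is its own verification problem.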

If anything, it's going to add a step to these "kids'" work, where they have to use the AI to do the work and then redo 90% of it anyway just to verify the output, and then the AI gets the credit.

Or the overworked people are going to use AI and not verify it, which means not catching any errors or hallucinations, which apparently is fine because someone claims it's a solved problem for the black box of infinite possibility and inconsistent output.

It's like self-driving cars. You might want to keep accepting human error rates until the software is proven overwhelmingly near-perfect, but others might want to switch once a system reliably beats most humans by a large factor, then work to mitigate and improve on the common errors it does have.

When management signs off on work (SOX requires CEOs and CFOs to personally certify the accuracy of financial reports), they do not personally "verify that all the tariffs are properly allocated to the correct GL code" or check nearly any other hard number. The world works on human-level best effort and management of that risk. I'm sure additional checks will be developed to categorize that risk, but the entire field of finance is about analyzing and pricing in risk, so I think it'll work just fine.

To be honest, I am having a hard time remembering the last time an LLM hallucinated in our pipelines. Makes mistakes, sure, but doesn't make things up. For a daily recon process this is a solved problem imo.

I see it hallucinate quite often in development, but mostly small details that are automatically corrected by lint processes. Large-scale hallucination seems better guarded against, though I suspect that's because its latitude is constrained by context and by harnesses (lint, type systems, and fine-tuned tool flows in coding models) that control for divergence. But I would classify mistakes like getting variable names, package names, or signatures wrong as hallucinations.

Curious! Could you elaborate a little bit on your pipeline? We are currently looking to solve this for our internal processes, where we deal with lots of external financial information containing masses of numbers: annual reports, bank statements, balance sheets, etc.

Not who you’re replying to, but I can give some thoughts.

For anything math, it’s much more reliable to give agents tools. So if you want to verify that your real estate offer is in the 90–95th percentile of offerings in the past three months, don’t give Claude that data and ask it to calculate. Offload to a tool that can query Postgres.
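A sketch of what that tool side can look like. The model only decides to call the tool and passes the offer amount; the math happens in deterministic code against the database (table and column names here are invented, and I'm using sqlite in place of Postgres to keep the example self-contained):

```python
import sqlite3

# Hypothetical 'offers' table with (offer_date, amount).
def offer_percentile(conn, offer_amount, months=3):
    """Return the fraction of recent offers this offer meets or beats.

    The LLM never sees the raw amounts or does the division; it just gets
    the answer back as a tool result.
    """
    amounts = [row[0] for row in conn.execute(
        "SELECT amount FROM offers WHERE offer_date >= date('now', ?)",
        (f"-{months} months",),
    )]
    if not amounts:
        return None
    at_or_below = sum(1 for a in amounts if a <= offer_amount)
    return at_or_below / len(amounts)
```

In a real Postgres setup you'd likely push the whole computation into SQL (e.g. with `percentile_cont`) rather than pull rows into Python.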

Similar with things needing data from an external source of truth. For example, what payers (insurance companies) reimburse for a specific CPT code (medical procedure) can change at any time and may be different between today and when the service was provided two months ago. Have a tool that farms out the calculation, which itself uses a database or whatever to pull the rate data.
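The as-of-date lookup is the part that's easy to get wrong if you let a model answer from memory, so it's worth spelling out. A sketch, assuming a hypothetical `fee_schedule` table keyed by payer, CPT code, and effective date:

```python
import sqlite3

# Hypothetical 'fee_schedule' table: (payer, cpt_code, effective_date, rate).
def rate_as_of(conn, payer, cpt_code, service_date):
    """Return the reimbursement rate in force on the date of service.

    Picks the latest rate whose effective date is on or before the service
    date -- not today's rate, which may have changed since.
    """
    row = conn.execute(
        "SELECT rate FROM fee_schedule "
        "WHERE payer = ? AND cpt_code = ? AND effective_date <= ? "
        "ORDER BY effective_date DESC LIMIT 1",
        (payer, cpt_code, service_date),
    ).fetchone()
    return row[0] if row else None
```

The tool stays the single source of truth for rates; the model just orchestrates when to call it.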

The LLM can orchestrate and figure out what needs to be done, like a human would, but anything else is either scary (math) or expensive (using context to constantly pull documentation).