> We’re releasing ten ready-to-run agent templates for the most time-consuming work in financial services

The templates are: pitch builder, meeting preparer, earnings reviewer, model builder, market researcher, valuation reviewer, general ledger reconciler, month-end closer, statement auditor, and KYC (Know Your Customer) screener.

Seems pretty scattershot. Reminds me of the GPT Store.

The details are key here. There is plenty of automatable financial work, sure, but when it comes to reporting finances/costs (formally or informally) and having a real human being accountable for them, you REALLY need to trust that nothing is hallucinated.

Any idea how they ensure this doesn't happen? As in, how can a user verify that the model did not touch any of the numbers and that it only built pipelines for them?

What I've been telling my CFO, who wants to get AI involved in things, is that for a lot of accounting and finance work "Trust but verify" doesn't work, because verifying is often the same process as doing the work.

> Any idea how they ensure this doesn't happen?

Build a deterministic query set and automate it for monthly or daily reporting reconciliation.
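
A minimal sketch of what I mean, assuming a hypothetical Postgres schema (a `gl_entries` control account plus an `ap_invoices` subledger) queried with psycopg2. Schedule it daily; it either ties out or raises:

```python
# Sketch only: table and account names are made up. The point is that
# the check is a fixed query, so the result is deterministic.
import psycopg2

GL_SQL = """
    SELECT COALESCE(SUM(amount), 0) FROM gl_entries
    WHERE account = %(account)s AND period = %(period)s
"""
SUB_SQL = """
    SELECT COALESCE(SUM(amount), 0) FROM ap_invoices
    WHERE gl_account = %(account)s AND period = %(period)s
"""

def reconcile(conn, account: str, period: str) -> None:
    # Compare the GL control account total to the subledger total.
    with conn.cursor() as cur:
        cur.execute(GL_SQL, {"account": account, "period": period})
        gl_total = cur.fetchone()[0]
        cur.execute(SUB_SQL, {"account": account, "period": period})
        sub_total = cur.fetchone()[0]
    if gl_total != sub_total:
        raise ValueError(f"{account} {period}: GL {gl_total} != subledger {sub_total}")

with psycopg2.connect("dbname=finance") as conn:
    reconcile(conn, account="2100-AP", period="2024-06")
```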

Leave AI out of it.

The "real humans" doing the tasks being replaced are overworked kids less than 2yrs out of college on an average of 4hrs of sleep at working at 3am. If the AI makes their jobs take half as much time I bet they're a lot more likely to catch errors (and live longer).

At risk of sounding facetious: how exactly do you catch an error in a sum without performing the sum yourself?

How do you verify that all the tariffs are properly allocated to the correct GL code without going through the invoices and checking each tariff on the list? How do you make sure none were accidentally assigned to other GL codes? All you have is PDFs; you don't know what the AI did or didn't do with the information in them, and there are not many ways to catch its errors without doing the work yourself.
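
To make that concrete, here is roughly what "verifying" the tariff coding looks like (`invoice_lines` and `gl_postings` are hypothetical stand-ins for the data extracted from the PDFs and for the AI's output). Re-deriving the expected code for every line is exactly the original allocation work:

```python
# Hypothetical structures: invoice_lines extracted from the PDFs,
# gl_postings produced by the AI. "Verifying" means re-deriving the
# correct GL code for every single line, i.e. redoing the allocation.
TARIFF_GL = "6410-TARIFF"  # assumed GL code for tariff charges

def expected_gl(line: dict) -> str:
    # The human coding rule the AI was supposed to apply.
    return TARIFF_GL if line["charge_type"] == "tariff" else line["default_gl"]

def verify(invoice_lines: list[dict], gl_postings: dict) -> list[str]:
    errors = []
    for line in invoice_lines:
        posted = gl_postings.get(line["line_id"])
        if posted != expected_gl(line):
            errors.append(
                f"line {line['line_id']}: posted to {posted}, expected {expected_gl(line)}"
            )
    return errors
```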

If anything, it's going to add a step to these "kids'" jobs: they'll have to use the AI to do the work and then redo 90% of it just to verify the output, and the AI will get the credit anyway.

Or the overworked people are going to use the AI and not verify it, which means not catching any errors or hallucinations; apparently that's fine, because someone claims it's a solved problem for the black box of infinite possibility and inconsistent output.

It's like self-driving cars. You might want to keep accepting human error rates until we prove overwhelmingly that the software is near-perfect, but others might want to switch to a system once it reliably beats most humans by a large factor, then work to mitigate its common errors and improve from there.

When management signs off on work (SOX requires CEOs and CFOs to personally certify the accuracy of financial reports), they do not personally 'verify that all the tariffs are properly allocated to the correct GL code' or check nearly any other hard number. The world runs on human-level best effort and management of the resulting risk. I'm sure additional checks will be developed to categorize that risk, but the entire field of finance is about analyzing and pricing risk, so I think it'll work just fine.

To be honest, I am having a hard time remembering the last time an LLM hallucinated in our pipelines. They make mistakes, sure, but they don't make things up. For a daily recon process this is a solved problem, IMO.

I see it hallucinate quite often in development, though mostly by getting small details wrong that are automatically corrected by lint processes. Large-scale hallucination seems better guarded against, but I suspect that's because latitude is constrained by context and by harnesses like linters, type systems, and fine-tuned tool flows in coding models that control for divergence. I would still classify mistakes like wrong variable names, package names, or signatures as hallucinations.

Curious! Could you elaborate a little on your pipeline? We are currently looking to solve this for our internal processes, where we have to deal with lots of external financial information containing masses of numbers: annual reports, bank statements, balance sheets, etc.

Not who you're replying to, but I can give some thoughts.

For anything involving math, it's much more reliable to give agents tools. So if you want to verify that your real estate offer is in the 90–95th percentile of offers from the past three months, don't give Claude the data and ask it to calculate; offload to a tool that can query Postgres.
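
Something like this, for instance (the `offers` table is a made-up example; Postgres's `percent_rank` hypothetical-set aggregate does the actual math):

```python
# Sketch of a tool the agent calls instead of doing arithmetic in-context.
# The offers table is hypothetical; the database computes the percentile.
import psycopg2

SQL = """
    SELECT percent_rank(%(offer)s) WITHIN GROUP (ORDER BY price)
    FROM offers
    WHERE listed_at >= now() - interval '3 months'
"""

def offer_percentile(conn, offer: float) -> float:
    """Exposed to the agent as a tool; returns e.g. 0.93 for the 93rd percentile."""
    with conn.cursor() as cur:
        cur.execute(SQL, {"offer": offer})
        return float(cur.fetchone()[0])
```

The agent only decides when to call the tool and relays the result; the number itself never comes out of sampling.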

It's similar for anything that needs data from an external source of truth. For example, what payers (insurance companies) reimburse for a specific CPT code (medical procedure) can change at any time and may differ between today and when the service was provided two months ago. Have a tool that farms out the calculation, which itself uses a database or whatever to pull the rate data.
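
Same pattern, sketched with a made-up `payer_rates` table: the rate is looked up as of the date of service, from the source of truth, never recalled from the model's weights:

```python
# Sketch: rates are effective-dated, so the lookup is keyed to the date
# of service, not "today". The payer_rates table is hypothetical.
from datetime import date
import psycopg2

SQL = """
    SELECT rate FROM payer_rates
    WHERE payer = %(payer)s
      AND cpt_code = %(cpt)s
      AND effective_from <= %(dos)s
    ORDER BY effective_from DESC
    LIMIT 1
"""

def reimbursement_rate(conn, payer: str, cpt: str, dos: date):
    """Tool the agent calls: the rate in effect on the date of service."""
    with conn.cursor() as cur:
        cur.execute(SQL, {"payer": payer, "cpt": cpt, "dos": dos})
        row = cur.fetchone()
        if row is None:
            raise LookupError(f"no rate on file for {payer}/{cpt} as of {dos}")
        return row[0]  # Decimal straight from the database
```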

The LLM can orchestrate and figure out what needs to be done, like a human would, but anything beyond that is either scary (math) or expensive (burning context to constantly pull documentation).

I'll be honest, I thought the first few items on your list of time-consuming work were sarcasm.

A recent episode of Matt Levine's podcast (Money Stuff) covered this: apparently investment bankers spend a huge amount of time preparing pitch decks for companies that don't want them. And Claude, it turns out, is quite good at making a pitch deck that no one but your boss wants or cares about.

I feel like there’s a metaphor in there... maybe I’ll ask Claude about it.

Much like a lot of internal daily status report stuff: the BS generator is actually a great fit when the task is producing BS output nobody uses or deeply cares about in the first place.

> Much like a lot of internal daily status report stuff

Everyone wants in on my auto-generated daily Excel reports, but nobody ever opens them. Just being on the list makes you someone.

It reads differently to me: some examples to run with and build your own. It covers cases from the investment side and then the obvious ones from an accounting perspective. It would be highly surprising if any of these were used in production without modification. I'm sure that will happen, but the intent, as I read it, is to take these and adapt them to your own process.

I find all of these .md files released by the labs to be AI-generated slop. The only exception might be the /simplify command.

"Claude, build me 50 skills an Account Analyst would find useful, then run them through the agent at maxxxx thinking and ship the top 10 of them"

My money's on that.

It still surprises me how effective the /simplify skill is.

I’ve also had some great results with a /reflect skill that asks the agent to look at the work in the broader context of the project. But those are the only two skills I use regularly that aren’t specific to our company, codebase, or tools.

No surprise there. Of course the skill files are not human written.

The AI is an expert in both following and generating prompts.

Why do you think it is an expert at generating prompts? It has no more insight into how it works internally than anyone else.

Do you really think a random person off the street knows more about how LLMs work internally than the latest frontier model (that has been trained on that material)?

No, but a random person off the street also isn't making skills for LLMs.

I think LLMs are trained on the millions of vibe-written LLM blog posts that are more superstition than fact. There is a lot of snake oil out there that is treated as fact. If someone claims that an LLM is better than humans at something, I always want to see the rigorous evaluations that quantify it, not "but they're trained on everything!"