The vast majority of people in business and science are using spreadsheets for complex algorithmic work they weren't really designed for, and when you actually bother auditing the sheets you find a metric fuckton of errors, mistakes which are not at all obvious without manually checking each and every cell and cell relation, peering through parentheses, following references. It's a nightmare to troubleshoot.
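To make the "following references" pain concrete, here's a minimal sketch of what a formula audit has to do: extract every cell reference from every formula and build the dependency graph by hand. The mini-sheet and the regex are illustrative assumptions, not any real spreadsheet engine; real formulas (ranges, sheet-qualified refs, named ranges) are far messier.

```python
import re
from collections import defaultdict

# Hypothetical mini-sheet: cell address -> literal value or formula string.
sheet = {
    "A1": "100",
    "A2": "250",
    "B1": "=A1*0.2",
    "B2": "=A2*0.2",
    "C1": "=B1+B2",
}

# Naive cell-reference pattern; real Excel refs ($A$1, Sheet2!A1, A1:B9) need more.
CELL_REF = re.compile(r"\b[A-Z]+[0-9]+\b")

def dependencies(sheet):
    """Map each formula cell to the set of cells it references."""
    deps = defaultdict(set)
    for cell, content in sheet.items():
        if content.startswith("="):
            deps[cell].update(CELL_REF.findall(content))
    return deps

for cell, refs in sorted(dependencies(sheet).items()):
    print(cell, "<-", sorted(refs))
```

Even in this toy, verifying C1 means chasing B1 and B2 back to A1 and A2; multiply that by thousands of cells and the auditing problem is obvious.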
LLMs specialize in making up plausible things with a minimum of human effort, but their downside is that they're very good at making up plausible things which are covertly erroneous. It's a nightmare to troubleshoot.
There is already an abject inability to provision the labor to verify Excel reasoning when it's composed by humans.
I'm dead certain that Claude will be able to produce plausibly correct spreadsheets. How important is accuracy to you? How life-critical is the end result? What are your odds, with the current auditing workflow?
Okay! Now! Half of the users just got laid off because management thinks Claude is Good Enough. How about now?
I'd say the vast majority of Excel users in business are working off of a CSV sent from their database/ERP team or exported from a self-serve analytics tool and using pivot tables to do the heavy lifting, where it's nearly impossible to get something wrong. Investment banks and trading desks are different, and usually have an in-house IT team building custom extensions into Excel or training staff to use bespoke software. That's still a very small minority of Excel users.
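For what it's worth, the "CSV plus pivot table" workflow described above is hard to get wrong precisely because the aggregation is mechanical. A one-dimensional pivot is just group-and-sum; here's a stdlib sketch of that computation, with a made-up ERP export as the data:

```python
import csv
import io
from collections import defaultdict

# Hypothetical CSV export from a database/ERP team.
raw = """region,product,revenue
North,Widget,1200
North,Gadget,800
South,Widget,400
South,Gadget,600
"""

def pivot_sum(rows, index, value):
    """Group rows by the `index` column and sum the `value` column,
    i.e. what a one-dimensional spreadsheet pivot table computes."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[index]] += float(row[value])
    return dict(totals)

rows = list(csv.DictReader(io.StringIO(raw)))
print(pivot_sum(rows, "region", "revenue"))
```

There's no hand-written cell formula anywhere in that pipeline, which is why the failure modes are so different from the bespoke-model case.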
LLMs are getting quite good at reviewing the results and implementations, though.
Not really. They're only as good as their context, and they do miss and forget important things. It doesn't matter how often, because when they do, they'll tell you with 100% confidence, using every synonym of "sure", that they caught it all. That's the issue.
I am very confident that these tools are better than the median programmer at code review now. They are certainly much more diligent. An actually useful standard to compare them to is human review, and for technical problems, they definitely pass it. That said, they’re still not great at giving design feedback.
But GPT-5 Pro, and to a certain extent GPT-5 Codex, can spot complex bugs like race conditions, or subtly incorrect logic like memory misuse in C, remarkably well. It is a shame GPT-5 Pro is locked behind a $200/month subscription, which means most people do not understand just how good the frontier models are at this type of task now.