Having wrangled many spreadsheets personally, and worked with CFOs who use them to run small-ish businesses, all the way up to one of the top 3 brokerage houses worldwide using them to model complex fixed income instruments... this is a disaster waiting to happen.

Spreadsheet UI is already a nightmare. Formula editing and relationship visualization just aren't there. Mistakes are rampant in spreadsheets, even my own carefully curated ones.

Claude is not going to improve this. It is going to make it far, far worse with subtle and not so subtle hallucinations happening left and right.

The key is really this - all LLMs that I know of rely on entropy and randomness to emulate human creativity. This works pretty well for pretty pictures and creating fan fiction or emulating someone's voice.

It is not a basis for getting correct spreadsheets that show what you want to show. I don't want my spreadsheet correctness to start from a random seed. I want it to spring from first principles.

My first job out of uni was building a spreadsheet infrastructure-as-code version control system, after a Windows update made an eight-year-old spreadsheet go haywire and lose $10m in an afternoon.

Spreadsheets are already a disaster.

> Spreadsheets are already a disaster.

Yeah, that's what OP said. Now add a bunch of random hallucinations hidden inside formulas inside cells.

If they really have a good spreadsheet solution they've either fixed the spreadsheet UI issues or the LLM hallucination issues or both. My guess is neither.

It's interesting that you mention disaster; there is at least one annual conference dedicated to "spreadsheet risk management".[1]

[1] https://eusprig.org/

Compared to what? Granted, Excel incidents are probably underreported and might produce "silent" consequential losses. But compared to that, for enterprise or custom software in general we have pretty scary estimates of the damages, like Y2K (between 300-600bn) and the UK Post Office Horizon scandal (~1bn).

Excel spreadsheets ARE custom software, with custom requirements, calculations, and algorithms. They're just not typically written by programmers, have no version control or rollback abilities, are not audited, are not debuggable, and are typically not run through QA or QC.

I'll add to this: if you work on a software project to port an Excel spreadsheet to real software that has all those properties, and the spreadsheet is sophisticated enough to warrant the process, the creators won't be able to remember enough details about how they created it to tell you the requirements necessary to produce the software. You may do all the calculations right, and because they've always had a rounding error that they worked around somewhere else, your software will show that calculations that have driven business decisions for decades were always wrong, and the business will insist that the new software is wrong instead of owning the mistake. It's never pretty, and it always governs something extremely important.

Now, if we could give that Excel file to an LLM and it created a design document that explains everything it does, that would be a great use of an LLM.

Thing is, they are also the common workaround for savvy office workers who don't want to wait for the IT department (if it exists), or some outsourced consultancy, to finally deliver something that only does half the job they need.

So far no one has managed to deliver an alternative to spreadsheets that fixes this issue; it doesn't matter that we can do much better in Python, Java, C#, whatever, if it is always over budget and only covers half of the work.

I know; I have taken part in such a project, and it ran over budget because there was always that one little workflow that was super easy to do in Excel, and they would refuse to adopt the tool if it didn't cover that workflow as well.

Exactly. And Claude and other code assistants are more of the same, allowing non-programmers[1] to write code for their needs. And that's a good thing overall.

[1] well, people that don't consider themselves programmers.

Agreed. The tradition has been continued by workflow engines, low code tools, platforms like Salesforce and lately AI-builders. The issue is generally not that these are bad, but because they don't _feel_ like software development everyone is comfortable skipping steps of the development process.

To be fair, I've seen shops which actually apply good engineering practices to Excel sheets too. Just definitely not a majority...

Sometimes it isn't that folks are comfortable skipping steps; rather, the steps aren't even available.

As it happens, in the LLM age I have recently had to deal with such tools, and oh boy, Smalltalk-based image development in the 1990s with Smalltalk/V is so much better in regards to engineering practices than those "modern" tools.

I cannot test code; if I want to back up to some version control system, I have to manually export/import a gigantic JSON file that represents the low-code workflow logic; there are no proper debugging tools, and so many other things I could rant about.

But I guess this is the future: AI-agent-based workflow engines calling into SaaS products, deployed in a MACH architecture. Great buzzword bingo, right?

If I could teach managers one lesson, it would be this one.

I know you probably can't share the details, but if you can I (and I'm sure all of us) would love to hear them

In my opinion the biggest use case for spreadsheets with LLMs is to ask them to build Python scripts to do whatever manipulations you want on the data. Once people learn to do this, workplace productivity will increase greatly. I have been using LLMs for years now to write Python scripts that automate various repeatable tasks. Want a PDF of this data overlaid on that file? Create a Python script with an LLM. Want the data exported out of this to be formatted and tallied? Create a script for that.
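
To make that concrete, here's a minimal sketch of the kind of script an LLM can generate for the "formatted and tallied" case. The file and column names are made up for illustration; pandas (plus openpyxl for .xlsx files) is assumed.

```python
# Hypothetical example of an LLM-generated tally script.
# File and column names are placeholders, not from any real workbook.
import pandas as pd

df = pd.read_excel("export.xlsx")  # requires openpyxl installed

# Tally amounts per region and product, largest first
summary = (
    df.groupby(["region", "product"], as_index=False)["amount"]
      .sum()
      .sort_values("amount", ascending=False)
)

summary.to_excel("export_summary.xlsx", index=False)
print(summary.to_string(index=False))
```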

How will people without Python knowledge know that the script is 100% correct? You can say "Well, they shouldn't use it for mission-critical stuff" or "Yeah, that's not a use case; it could be useful for qualitative analysis", etc., but you can bet they will use it for everything. People use ChatGPT as a search engine and a therapist, which tells us enough.

If you have a mechanism that can prove arbitrary program correctness with 100% accuracy you’re sitting on something more valuable than LLMs.

So, a human-powered LLM user?

For sure, I've never seen a human write a bug or make a mistake in programming

That's why we created LLMs for that.

Yesterday I had to pass a bunch of data to finance, as the person who used to do it had left the company. They wanted me to basically group by a few columns, so instead of spending an hour on this in Excel, I created 3 rows of fake data and gave it to the LLM, which created a Python script that I ran against the real dataset. After manual verification of the results, it could be submitted to finance.
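
A sketch of what that fake-data check can look like (the columns and values here are invented): the expected output of three rows is small enough to verify by hand before pointing the same script at the real file.

```python
# Hypothetical fake-data check: verify the LLM's group-by logic on
# three invented rows before trusting it with the real dataset.
import pandas as pd

fake = pd.DataFrame({
    "dept":   ["A", "A", "B"],
    "month":  ["Jan", "Jan", "Feb"],
    "amount": [100.0, 50.0, 25.0],
})

grouped = fake.groupby(["dept", "month"], as_index=False)["amount"].sum()

# Small enough to compute by hand: A/Jan -> 150.0, B/Feb -> 25.0
expected = pd.DataFrame({
    "dept":   ["A", "B"],
    "month":  ["Jan", "Feb"],
    "amount": [150.0, 25.0],
})
pd.testing.assert_frame_equal(
    grouped.sort_values(["dept", "month"]).reset_index(drop=True),
    expected.sort_values(["dept", "month"]).reset_index(drop=True),
)
print("fake-data check passed")
```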

Yeah, I am not a programmer, just more tech-literate than most, as I have always been fascinated by tech. I think people are missing the forest for the trees when it comes to LLMs. I have been using them to create simple bash, bat, and Python scripts, which I would not have been able to put together before, even with weeks of googling. I say that because I used to do that unsuccessfully, but my success rate is through the roof with LLMs.

Now I just ask an LLM to create the scripts and explain all the steps. If it is a complex script, I also ask it to add logging, so that I can feed the log back to the LLM and explain what is going wrong, which allows for much faster fixes. In the early days the LLM and I would go around in circles till I hit the token limits and had to start from scratch again.
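
The logging pattern described here is simple to ask for. A minimal sketch (the processing step is a stand-in for whatever the script actually does):

```python
# Sketch of the "add logging so I can feed the log back" pattern.
# The work inside process() is a placeholder.
import logging

logging.basicConfig(
    filename="script_run.log",
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)

def process(rows):
    log.info("processing %d rows", len(rows))
    for i, row in enumerate(rows):
        try:
            ...  # the LLM-generated transformation goes here
        except Exception:
            # A full traceback in the log gives the LLM the context
            # it needs to suggest a targeted fix
            log.exception("failed on row %d: %r", i, row)
            raise

process([{"id": 1}, {"id": 2}])
```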

Learn Python; the subscription for that knowledge won't be jacked up to $2000/month when the VC money dries up.

That’s exactly how it should be done if accuracy is important.

Congrats? But you are not likely a typical user.

Just learn Python. What are you, a child?

Basic python knowledge should be a requirement for any office job.

LLMs are a ridiculous way of spending trillions automating what can be done with good old reliable scripting. We haven't automated shit yet.

Yeah, it's like that commercial for OpenAI (or was it Gemini?) where the guy says he lets the tool work on his complex financial spreadsheets, goes for a walk with the dog, gets back, and it is done with "like 98% accuracy". I cannot imagine what the 2% margin of error looks like for a company that moves around hundreds of billions of dollars...

I don't think tools like Claude are there yet, but I already trust GPT-5 Pro to be more diligent about catching bugs in software than me, even when I am trying to be very careful. I expect even just using these tools to help review existing Excel spreadsheets could lead to a significant boost in quality if software is any guide (and Excel spreadsheets seem even worse than software when it comes to errors).

That said, Claude is still quite behind GPT-5 in its ability to review code, and so I'm not sure how much to expect from Sonnet 4.5 in this new domain. OpenAI could probably do better.

> That said, Claude is still quite behind GPT-5 in its ability to review code, and so I'm not sure how much to expect from Sonnet 4.5 in this new domain. OpenAI could probably do better.

It’s always interesting to see others' opinions, as it’s still so variable and “vibe” based. Personally, for my use, the idea that any GPT-5 model is superior to Claude just doesn’t resonate, and I use both regularly for similar tasks.

I also find the subjective nature of these models interesting, but in this case the difference in my experiences between Sonnet 4.5 and GPT-5 Codex, and especially GPT-5 Pro, for code review is pretty stark. GPT-5 is consistently much better at hard logic problems, which code review often involves.

I have had GPT-5 point out dozens of complex bugs to me. Often in these cases I will try to see if other models can spot the same problems, and Gemini has occasionally but the Claude models never have (using Opus 4, 4.1, and Sonnet 4.5). These are bugs like complex race conditions or deadlocks that involve complex interactions between different parts of the codebase. GPT-5 and Gemini can spot these types of bugs with a decent accuracy, while I’ve never had Claude point out a bug like this.

If you haven’t tried it, I would try the codex /review feature and compare its results to asking Sonnet to do a review. For me, the difference is very clear for code review. For actual coding tasks, both models are much more varied, but for code review I’ve never had an instance where Claude pointed out a serious bug that GPT-5 missed. And I use these tools for code review all the time.

I've noticed something similar. I've been working on some concurrency libraries for Elixir, and Claude constantly gets things wrong, but GPT-5 can recognize the techniques I'm using and the tradeoffs.

Try the TypeScript codex CLI with the gpt-5-codex model with reasoning always set to high, or GPT-5 Pro with max reasoning. Both are currently undeniably better than Claude Opus 4.1 or Sonnet 4.5 (max reasoning or otherwise) for all code-related tasks. Much slower but more reliable and more intelligent.

I've been a Claude Code fanboy for many months but OpenAI simply won this leg of the race, for now.

Same. I switched from Sonnet 4 to Codex when it came out. Went back to try Sonnet 4.5, and it really hates to work for longer than like 5 minutes at a time.

Codex, meanwhile, seems to be smarter and plugs away at a massive todo list for like 2 hours.

Having AI create the spreadsheet you want is totally possible, just like generating bash scripts works well. But to get good results, there needs to be some documentation describing all the hidden relationships and nasty workarounds first.

Don't try to make LLMs generate results or numbers, that's bound to fail in any case. But they're okay to generate a starting point for automations (like Excel sheets with lots of formulas and macros), given they get access to the same context we have in our heads.
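
For the "starting point for automations" case, a hedged sketch of what that might look like with openpyxl (the sheet layout here is invented): the point is that the LLM writes formulas for the spreadsheet to evaluate, rather than emitting numbers itself.

```python
# Sketch: a formula-bearing starting point an LLM could generate.
# Columns and filenames are placeholders.
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["item", "qty", "unit_price", "total"])
ws.append(["widget", 4, 2.50, "=B2*C2"])
ws.append(["gadget", 2, 9.99, "=B3*C3"])
ws["D4"] = "=SUM(D2:D3)"  # grand total as a formula, not a hard-coded number
wb.save("starting_point.xlsx")
```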

I like this take. There seems to be an over-focus on 'one-shot' results, but I've found that even the free tools are a significant productivity booster when you focus on generating smaller pieces of code that you can verify. Maybe I'm behind the power curve since I'm not leveraging the full capability of the advanced LLMs, but if the argument is that disaster is right around the corner due to potential hallucinations, I think we should consider that you still have to check your work for mission-critical systems anyway. That said, I don't really build mission-critical systems; I just work in Aerospace Engineering and like building small time-saving scripts/macros for other engineers to use. For this use, even free LLMs have been huge for me. Maybe I'm in a very small minority, but I do use Excel & Python nearly every day.

I tend to agree that dropping the tool as it is into untrained hands is going to be catastrophic.

I’ve had similar professional experiences as you and have been experimenting with Claude Code. I’ve found I really need to know what I’m doing and the detail in order to make effective (safe) use out of it. And that’s been a learning curve.

The one area I hope/think it’s closest to (given comments above) is potentially as a “checker” or validator.

But even then I’d consider the extent to which it leaks data, steers me the wrong way, or misses something.

The other case may be mocking up a simple financial model for a test / to bounce ideas around. But without very detailed manual review (as a mitigating check), I wouldn’t trust it.

So yeah… that’s the experience of someone who maybe bridges these worlds somewhat… And I think many out there see the tough (detailed) road ahead, while these companies are racing to monetize.

IMO people tend to over-trust both AI and Excel. Maybe this will recalibrate that after it leads to a catastrophic business failure or two.

You would hope so. But how many companies have actually changed their IT policy of outsourcing everything to Tata Consultancy Services (or similar), where a sweaty office in Mumbai full of people who don't give a shit runs critical infrastructure?

Jaguar Land Rover had production stopped for over a month, I think, with a 100+ million impact on their business (including a trail of smaller suppliers put near bankruptcy). I'd bet Tata are still there, and embedded even further, in 5 years.

If AI provides some day-to-day running cost reduction that looks good on quarterly financial statements it will be fully embraced, despite the odd "act of god".

To be clear, Tata owns JLR.

Indeed, that slipped my mind. However, the Marks and Spencer hack was also their fault. Just searching on it now, it seems there is a ray of hope. Although I have a feeling the response won't be a well-trained onshore/internal IT department; it will be another offshore outsourcing jaunt, but with better compensation for incompetent staff on the outsourcer's side.

"Marks & Spencer Cuts Ties With Tata Consultancy Services Amid £300m Cyber Attack Fallout" (ibtimes.co.uk)

My take is more optimistic. This could be an off ramp to stop putting critical business workflows in spreadsheets. If people start to learn that general purpose programming languages are actually easier than Excel (and with LLMs, there is no barrier), then maybe more robust workflows and automation will be the norm.

I think the world would be a lot better off if Excel weren't in it. For example, I work at a business with 50K+ employees where project management is done in a hellish spreadsheet that literally one guy in Australia understands. Data entry errors can be anywhere and are incomprehensible. 3 or 4 versions are floating around to support old projects. A CRUD app with a web front end would solve it all. Yet it persists because Excel is erroneously seen as accessible, whereas Rails, Django, or literally anything else is witchcraft.

There was never a barrier to automating your office work with Python unless you are a moron.

Who fooled the world into thinking that scripting some known workflow of yours is fucking rocket science? It should be a requirement to even enter the office building.

> all LLMs that I know of rely on entropy and randomness to emulate human creativity

Those are tuneable parameters. Turn down the temperature and top_p if you don't want the creativity.
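
For instance, a hedged sketch using the OpenAI Python SDK (the model name is a placeholder, and even greedy decoding isn't guaranteed to be bit-identical across runs on hosted APIs):

```python
# Sketch: dialing sampling randomness down via the OpenAI SDK.
# Model name is a placeholder; temperature 0 makes decoding
# (near-)greedy but hosted APIs still don't promise determinism.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    temperature=0,        # always take (roughly) the top token
    top_p=1,              # no nucleus truncation
    seed=42,              # best-effort reproducibility where supported
    messages=[{"role": "user", "content": "Total column B of this CSV: ..."}],
)
print(response.choices[0].message.content)
```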

> Claude is not going to improve this.

We can measure models vs humans and figure this out.

To your own point, humans already make "rampant" mistakes. With models, we can scale inference time compute to catch and eliminate mistakes, for example: run 6x independent validators using different methodologies.

One-shot financial models are a bad idea, but properly designed systems can probably match or beat humans pretty quickly.
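
A toy sketch of the validator idea; the checks here are stubs, whereas in a real system each would be an independent LLM pass using a different methodology (recompute, cross-foot, range-check):

```python
# Toy consensus harness: a result is accepted only if enough
# independent validators agree. These stubs just illustrate the shape.
from typing import Callable

def accepted(total: float, validators: list[Callable[[float], bool]],
             required: int) -> bool:
    votes = sum(1 for check in validators if check(total))
    return votes >= required

validators = [
    lambda t: t >= 0,            # sanity: revenue can't be negative
    lambda t: t < 1e9,           # plausibility bound for this business
    lambda t: round(t, 2) == t,  # currency precision check
]

print(accepted(1234.56, validators, required=3))  # True
print(accepted(-5.0, validators, required=3))     # False
```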

> Turn down the temperature and top_p if you don't want the creativity.

This also reduces accuracy in real terms. The randomness is used to jump out of local minima.

That's at training time, not inference time. And temp/top_p aren't used to escape local minima; methods like SGD batch sampling, Adam, dropout, LR decay, and other techniques do that.

Ahh okay, so you really can't escape the indeterminacy?

You can zero out temperature and get determinism at inference time. Which is separate from training time where you need forms of randomness to learn.

The point is that the quoted claim, "all LLMs that I know of rely on entropy and randomness to emulate human creativity", describes a runtime parameter you can tweak down to zero, not a fundamental property of the technology.

Right, but my point is that even if you turn the temperature all the way down, you're not guaranteed to get an accurate or truthful result, even though you may get a mostly repeatable, deterministic result; there is still some indeterminacy.

> Those are tuneable parameters. Turn down the temperature and top_p if you don't want the creativity.

Ah yes, we'll tell Mary from Payroll she can just tune them parameters if there is more than "like 2%" error in her spreadsheets.

No one said it was a user setting. The person building the spreadsheet agent system would tune the hyper-parameters with a series of eval sets.

Technically it’s deterministic. It just might not be correct :)

Is this just a feeling you have, or is it downstream of actual use cases you've applied AI to and observed and measured reliability on?

Not the parent poster, but this is pretty much the foundation of LLMs. They are by their nature probabilistic, not deterministic. This is precisely what the parent is referring to.

All processes in reality, everywhere, are probabilistic. The entire reason "engineering" is not the same as theoretical mathematics is about managing these probabilities to an acceptable level for the task you're trying to perform. You are getting a "probabilistic" output from a human too. Human beings are not guaranteeing theoretically optimal Excel output when they send their boss Final_Final_v2.xlsx. You are using your mental model of their capabilities to inform how much you trust the result.

Building a process to get a similar confidence in LLM output is part of the game.

I have to disagree. There are many areas where things are extremely deterministic, regulated financial services being one of them. As one example of zillions, look at something like bond math. All of it is very well defined, all the way down to what calendar model you will use (30/360 or what have you), rounding, etc. It's all extremely well defined, specifically so you can get apples-to-apples comparisons in the marketplace.
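
To show how pinned down this is, here is a simplified sketch of the US 30/360 day count; the convention specifies the day-adjustment rules exactly, so two correct implementations must agree (this version omits some edge cases, e.g. end-of-February handling):

```python
# Simplified US 30/360 (Bond Basis) day count fraction.
# Real-world conventions add more special cases than shown here.
from datetime import date

def day_count_30_360(start: date, end: date) -> float:
    d1 = min(start.day, 30)        # treat the 31st as the 30th
    d2 = end.day
    if d2 == 31 and d1 == 30:      # same adjustment on the end date
        d2 = 30
    days = (360 * (end.year - start.year)
            + 30 * (end.month - start.month)
            + (d2 - d1))
    return days / 360.0

# A semiannual coupon period is exactly half a year, by construction
print(day_count_30_360(date(2024, 1, 15), date(2024, 7, 15)))  # 0.5
print(day_count_30_360(date(2024, 1, 31), date(2024, 7, 31)))  # 0.5
```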

The same applies to my checkbook, and many other areas of either calculating actuals or where future state is well defined by a model.

That said, there can be a statistical aspect to any spreadsheet model. Obviously. But not all spreadsheets are statistical, and therein lies the rub. If an LLM wants to hallucinate a 9,000-day yearly calendar because it confuses our notion of a year with one of the outer planets, that falls well within probability, but not within determinism following well-defined rules.

The other side of the issue is LLMs trained on the Internet. What are the chances that Claude or whatever is going to make a change based on a widely prevalent but incorrect spreadsheet it found on some random corner of the Internet? Do I want Claude breaking my well-honed spreadsheet because Floyd in Nebraska counted sheep wrong in a spreadsheet he uploaded and forgot about 5 years ago, and Claude found it relevant?

Yup. It becomes clearer to me when I think about the existing validators. Can these be improved? For sure.

It’s when people leap to the multi-year endgame, and, in their effort to monetise, build overconfidence in the product, that I see the inherent conflict.

It’s going to be a slog… the detailed implementations. And if anyone is a bit more realistic about managing expectations I think Anthropic is doing it a little better.

> All processes in reality, everywhere, are probabilistic.

If we want to go in philosophy then sure, you're correct, but this not what we're saying.

For example, an LLM is capable (and it's highly plausible for it to do so) of creating a reference to a non-existent source. Humans generally don't do that when their goal is clear and aligned (hence deterministic).

> Building a process to get a similar confidence in LLM output is part of the game.

Which is precisely my point. LLMs are supposed to be better than humans. We're (currently) shoehorning the technology.

> Humans generally don't do that when their goal is clear and aligned (hence deterministic).

Look at the language you're using here. Humans "generally" make fewer of these kinds of errors. "Generally". That is literally an assessment of likelihood. It is completely possible for me to hire someone so stupid that they create a reference to a non-existent source. It's completely possible for my high-IQ genius employee who is correct 99.99% of the time to have an off-day and accidentally fat-finger something. It happens. Perhaps it happens at 1/100th of the rate that an LLM would do it. But that is simply an input to the model of the process or system I'm trying to build that I need to account for.

When humans make mistakes repeatedly in their job they get fired.

Not OP, but having used LLMs in professional settings (programming, editing, writing technical specifications), OP is correct.

Without extensive prompting and injecting my own knowledge and experience, LLMs generate absolutely unusable garbage (on average). Anyone who disagrees very likely is not someone who would produce good quality work by themselves (on average). That's not a clever quip; that's a very sad reality. SO MANY people cannot be bothered to learn anything if they can help it.

The triad of LLM dependencies, in my view: initiation of tasks, experience-based feedback, and a consequence sink. Models can do none of these; they all connect to the outer context, which sits with the user, not the model.

You know what? This is also not unlike hiring a human: they need the hiring party to tell them what to do, give feedback, and assume the outcomes.

It's all about context, which is non-fungible and distributed; it's related not to intelligence itself but to what we need intelligence for.

> Anyone who disagrees very likely is not someone who would produce good quality work by themselves (on average).

So for those producing slop and not knowing any better (or not caring), AI just improved the speed at which they work! Sounds like a great investment for them!

For many, mastering any given craft might not be the goal, but rather just pushing stuff out the door and paying bills. A case of mismatched incentives, one might say.

I would completely disagree. I use LLMs daily for coding. They are quite far from AGI and it does not appear they are replacing Senior or Staff Engineers any time soon. But they are incredible machines that are perfectly capable of performing some economically valuable tasks in a fraction of the time it would have taken a human. If you deny this your head is in the sand.

Capable, yeah, but not reliable; that's my point. They can one-shot fantastic code, or they can one-shot the code I then have to review and pull my hair out over for a week because it's such crap (and the person who pushed it is my boss, for example, so I can't just tell him to try again).

That's not consistent.

You can ask your boss to submit PRs using Codex's "try 5 variations of the same task and select the one you like most" feature, though.

Surely at that point they could write the code themselves faster than they can review 5 PRs.

Producing more slop for someone else to work through is not the solution you think it is.

Why do you frame the options as "one shot... or... one shot"?

Because lazy people will use it like that, and we are all inherently lazy

It's not much better with planning either. The amount of time I spend planning, clarifying requirements, and hand-holding implementation details always offsets any potential savings.

Have you never used one to hunt down an obscure bug and found the answer quicker than you likely would have yourself?

Actually, yeah, a couple of times, but that was a rubber-ducky approach; the AI said something utterly stupid, but while trying to explain things, I figured it out. I don't think an LLM has solved any difficult problem for me before. However, I think I'm likely an outlier because I do solve most issues myself anyways.

> Mistakes are rampant in spreadsheets

To me, the case for LLMs is strongest not because LLMs are so unusually accurate and awesome, but because if human performance were put on trial in aggregate, it would be found wanting.

Humans already do a mediocre job of spreadsheets, so I don't think it is a given that Claude will make more mistakes than humans do.

But isn't this only fine as long as someone who knows what they are doing has oversight and can fix issues when they arise and Claude gets stuck?

Once we all forget how to write SUM(A:A), will we just invent a new kind of spreadsheet once Claude gets stuck?

Or, in other words: what's the end game here? LLMs clearly cannot be left alone to do anything properly, so what's the end game of making people not learn anything anymore?

Well, the end game with AI is AGI, of course. But realistically the best-case scenario with LLMs is having fewer people with the required knowledge, leveraging LLMs to massively enhance productivity.

We’re already there to some degree. It is hard to put a number on my productivity gain, but as a small business owner with a growing software company it’s clear to me already that I can reduce developer hiring going forward.

When I read the skeptics I just have to conclude that they’re either poor at context building and/or work on messy, inconsistent and poorly documented projects.

My sense is that many weaker developers who can't learn these tools simply won't compete in the new environment. Those who can build well-designed and documented projects with deep context easy for LLMs to digest will thrive.

I assume all of this applies to spreadsheets.

Why isn't there a single study that would back up your observations? The only study with a representative experimental design that I know about is the METR study and it showed the opposite. Every study citing significant productivity improvements that I've seen is either:

- relying on self-assessments from developers about how much time they think they saved, or

- using useless metrics like lines of code produced or PRs opened, or

- timing developers on toy programming assignments like implementing a basic HTTP server that aren't representative of the real world.

Why is it that any time I ask people to provide examples of high quality software projects that were predominantly LLM-generated (with video evidence to document the process and allow us to judge the velocity), nobody ever answers the call? Would you like to change that?

My sense is that weaker developers and especially weaker leaders are easily impressed and fascinated by substandard results :)

Everything Claude does is reviewed by me; nothing enters the code base that doesn't meet the standard we've always kept. Perhaps I'm substandard and weak, but my software is stable, my customers are happy, and I'm delivering value to them quicker than I was previously.

I don’t know how you could effectively study such a thing, that avenue seems like a dead end. The truth will become obvious in time.

Okay, and now you give those mediocre humans a tool that is both great and terrible. The problem is, unless they know their way around very well, they won't know which is which.

Since my company uses Excel a lot, and I know the basics but don't want to become an expert, I use LLMs for intermediate questions: too hard to answer with the few formulas I know, but not too hard for a short solution path.

I have great success and definitely like what I can get with the Excel/LLM combo. But if my colleagues used it the same way, they would not get my good results, which is not their fault; they are not IT but specialists, e.g. for logistics. The best use of LLMs is when you could already do the job without them, but it saves you time to ask them and then check whether the result is actually acceptable.

Sometimes I abandon the LLM session, because sometimes, and it's not always easy to predict, fixing the broken result would take more effort than just doing it the old way myself.

A big problem is that the LLMs are so darn confident and always present a result. For example, I point one at a problem, it "thinks", and then it gives me new code and very confidently summarizes what the problem was and assures me that it has now fixed it for sure. Only when I actually try it, the result has gotten worse than before. At that point I never try to get back to a working solution by continuing to "talk" to the AI; I just delete that session and take another, non-AI approach.

But non-experts, and people who are very busy and just want to get some result to forward to someone waiting for it as quickly as possible, will be tempted to accept the nice-looking and confidently presented "solution" as-is. And you may not find a problem until half a year later, when somebody notices that prepayments, pro forma bills, and the final invoices don't quite match in hard-to-follow ways.

Not that these things don't happen already, but adding a tool with erratic results might increase problems, depending on the actual implementation of the process. Which most likely won't be well thought out; many will just cram in the new tool and think it works when it doesn't implode right away, and the first results, produced when people still pay a lot of attention and are careful, all look good.

I am in awe of the accomplishments of this new tool, but it is way overhyped IMHO, still far too unpolished and random. Forcing all kinds of processes and people to use it is not a good match, I think.

This is a great point. LLMs make good developers better, but they make bad developers even worse. LLMs multiply instead of add value. So if you're a good developer, who is careful, pays attention, watches out for trouble, and is constantly reviewing and steering, the LLM is multiplying by a positive number and will make you better. However, if you're a mediocre/bad developer, who is not careful, who lacks attention to detail, and just barely gets things to compile / run, then the LLM is multiplying by a negative number and will make your output even worse.

You can do it Cursor-style.

Or you could, you know, read the article before commenting to see the limited scope of this integration?

Anyway, Google has already integrated Gemini into Sheets, and recently added direct spreadsheet editing capability, so your comment was disproven before you even wrote it.

> The key is really this - all LLMs that I know of rely on entropy and randomness to emulate human creativity. This works pretty well for pretty pictures and creating fan fiction or emulating someone's voice.

I think you need to turn down the temperature a little bit. This could be a beneficial change.