> after using it for months you get a ‘feel’ for what kind of mistakes it makes

Sure, go ahead and bet your entire operation on your intuition of how a non-deterministic, constantly changing black box of software "behaves". Don't see how that could backfire.

What, you don't trust the vibes? Are you some sort of luddite?

Anyway, if a point-release upgrade of a SOTA model breaks things for you, you're probably just holding it wrong.

why yes, yes I am. ;-)

Not betting my entire operation. If the only thing stopping a bad 'deploy' command from destroying your entire operation is that you don't trust the agent to run it, then you have worse problems than too much trust in agents.
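To make that concrete, here's a minimal sketch of the idea that safety should come from policy, not trust. The function name, patterns, and confirmation flag are all made up for illustration; a real setup would use sandboxing or scoped credentials, not just regexes.

```python
import re

# Hypothetical denylist: destructive commands are blocked by policy
# regardless of how much you "trust" whoever (or whatever) issued them.
DESTRUCTIVE = [
    r"\brm\s+-rf\b",
    r"\bdeploy\b.*--prod\b",
    r"\bdrop\s+table\b",
]

def allow_command(cmd: str, confirmed: bool = False) -> bool:
    """Allow a command only if it's non-destructive, or explicitly human-confirmed."""
    if any(re.search(p, cmd, re.IGNORECASE) for p in DESTRUCTIVE):
        return confirmed  # destructive commands need an explicit human OK
    return True
```

The same gate applies to a junior engineer's shell session or an agent's tool call; the point is that "I trust them" is never the only thing standing between a command and production.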

I similarly use my 'intuition' (i.e. evidence-based previous experience) to decide which people on my team get access to which services.

I'm not saying intuition has no place in decision making, but I do take issue with saying it applies equally to human colleagues and autonomous agents. It would be just as unreliable if people on your team displayed random regressions in their capabilities on a month-to-month basis.

> bet your entire operation

What straw man is doing that?

Reports of people losing data and other resources due to unintended actions from autonomous agents come out practically every week. I don't think it's dishonest to say that could have catastrophic impact on the product/service they're developing.

Looking at the Reddit forums: enough people to make for interesting posts.

So like every software? Why do you think there are so many security scanners and whatnot out there?

There are millions of lines of code running on a typical box. Unless you're in embedded, you have no real idea what you're running.

...No, it's not at all "like every software".

This seems like another instance of a problem I see so, so often with LLMs: people observe the fact that LLMs are fundamentally nondeterministic, in ways that are not possible to truly predict or learn in any long-term way... and they equate that, mistakenly, with the fact that humans, other software, what have you, sometimes make mistakes. In ways that are generally understandable, predictable, and remediable.

Just because I don't know what's in every piece of software I'm running doesn't mean it's all equally unreliable, nor that it's unreliable in the same way that LLM output is.

That's like saying just because the weather forecast sometimes gets it wrong, meteorologists are complete bullshit and there's no use in looking at the forecast at all.

> ...No, it's not at all "like every software"

Yes, it is; through the lens the person above offered, that is.

In practice, all we ever get to deal with is empirical/statistical, and the person above was making an argument where they singled out LLMs for being statistical. You may reject my taking issue with this on principled grounds, because regular programs are just structured logic, but they cease to be just that once you actually run them. Real hardware runs them. Even fully verified, machine-checked, correctly designed/specified software, only interacting with other such software, can enter an inconsistent state through no fault of its own. Theory stops being theory once you put it into practice. And the vast majority of programs fail the aforementioned criteria to begin with.

> people observe the fact that LLMs are fundamentally nondeterministic

LLMs are not "non-deterministic", let alone fundamentally so. If I launch a model locally, pin the seed, and ask the exact same question 10x, I'll get the same answer every single time down to the very byte. Provided you select your hardware and inference engine correctly, the output remains reproducible even across different machines. They're not even stateful! You literally send along the entire state (context window) every single time.
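The pinned-seed point can be illustrated with a toy sampler (obviously not a real LLM; the vocabulary and function here are invented for illustration): once the seed and the full context are fixed, the "sampled" output is identical on every run.

```python
import random

# Toy stand-in for token sampling; a tiny made-up vocabulary.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def sample_tokens(context: str, seed: int, n: int = 10) -> list[str]:
    # Seed the RNG from the pinned seed plus the entire context.
    # Same seed + same context -> byte-identical output, every time.
    rng = random.Random(f"{seed}:{context}")
    return [rng.choice(VOCAB) for _ in range(n)]
```

A real inference stack adds floating-point and kernel-ordering wrinkles across hardware, but the principle is the same: the randomness is injected, not intrinsic.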

Now obviously, you might instead mean a more "practical" version of this, their general semantic unpredictability. But even then, every now and then I do ask the "same" question to LLMs, and they keep giving essentially the "same" response. They're pretty darn semantically stable.

> In ways that are generally understandable, predictable, and remediable.

You could say the same thing about the issue in the OP. It's a very easy-to-understand issue that behaves super predictably, and will (imo) be remediated just fine by the various service providers.

Now think of all the bugs that are hard or impossible to reproduce, which people just end up working around. The never-ending list of vulnerabilities and vulnerability categories. The inexplicable errors that arise from real-world hardware issues. Yes, LLMs are statistical in nature, not artisanally hardwired. But in the end, they're operated in the same empirical way, along the same lines of concern, and with surprisingly similar outcomes and consequences at times.

You're not going to understand the millions (or really, tens or hundreds of millions) of lines of code running on a typical machine. You'll never be able to exhaustively predict their behavior (especially how they interact with terabytes of data or more over time) and the defects contained within. You'll never remediate those defects fully. Hell, even for classes of problems where such a thing would be possible to achieve structurally, people are resisting the change.

If they want to take issue with LLMs, plainly gesturing at their statistical nature is just not particularly convincing. Not in a categorical, drop-the-mic way, at least, that's for sure.

> That's like saying just because the weather forecast sometimes gets it wrong, meteorologists are complete bullshit and there's no use in looking at the forecast at all.

Are you really not seeing that GP is saying exactly this about LLMs?

What you want for this to be practical is verification and low enough error rate. Same as in any human-driven development process.
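A sketch of that verification idea (the function names and retry count are made up): never accept any single generation on trust; gate acceptance on an independent check, exactly as CI gates human-written code.

```python
def accept(generate, verify, max_attempts: int = 3):
    """Accept generated output only once an independent check passes.

    generate: callable producing one candidate (e.g. an agent's output)
    verify:   callable returning True iff the candidate passes checks
    """
    for _ in range(max_attempts):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None  # with a low enough error rate, this branch stays rare
```

The error rate only has to be low enough that the verify-and-retry loop converges cheaply; the verifier, not your feel for the model, is what carries the correctness guarantee.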
