This just all feels backwards to me.

Why do we have to treat AI like it's the enemy?

AI should, from the core, be intrinsically and unquestionably on our side, as a tool to assist us. If it's not, then it feels like it was designed wrong from the start.

In general we trust people that we bring onto our team not to betray us and to respect general rules and policies and practices that benefit everyone. An AI teammate should be no different.

If we have to limit it or regulate it by physically blocking off every possible thing it could use to betray us, then we have lost from the start, because that feels like a fool's errand.

Hard disagree. I may trust the people on my team to make PRs that are worth reviewing, but I don't give them a shell on my machine. They shouldn't need that to collaborate with me anyway!

Also, I "trust Claude Code" to work on more or less what I asked and to try things which are at least facially reasonable... but having an environment I can easily reset only means it's more able to experiment without consequences. I work in containers or VMs too, when I want to try stuff without having to clean up after.

Do you trust your IT and security teams to have access to your shell or access to delete your entire code repo?

Personally, no.

If I'm responsible for something, nobody's getting that access.

If someone's hired me for something and that's the environment they provide, it is what it is. They distribute trust however they feel. I'd argue that's still more reasonable than giving similar access to an AI agent though.

I don’t think we should even be considering releasing AI Agents until they are at least as trustworthy as the trusted humans we normally put in place to do the same task.

You have a point, but a difference is that humans can be held accountable. The IT guy may break my machine but he will probably get shit for it.

I mean, I feel like this can all keep extending. Those who are deciding to run the AI agents are vouching for them, so they should be held accountable.

I guess that is what this is about, and those who are deploying them will feel confident enough in them if they feel the resources and environments they run in are locked down tightly enough.

But as the models get "smarter and smarter" I am not sure we are going to be able to keep environments locked down well enough against exploits that they will apparently try to use to bypass things.

It seems a bit strange to me that we can generally ask these models moral questions and they would largely get things right, as far as what most humans would deem right and wrong, such as whether performing an exploit to bypass some environment restrictions is acceptable, yet the same model will still choose to perform the exploit. I wonder, what gives?

> AI should, from the core be intrinsically and unquestionably on our side, as a tool to assist us.

"Should" is a form of judgement, implying an understanding of right and wrong. "AI" are algorithms, which do not possess this understanding, and therefore cannot be on any "side." Just like a hammer or Excel.

> If it's not, then it feels like it's designed wrong from the start.

Perhaps it is not a question of design, but instead one of expectation.

I think that is where people disagree about the definition of AI.

An algorithm isn't really AI then. Something worthy of being called AI should be capable of this understanding and judgement.

> An algorithm isn't really AI then.

But they are though. For a seminal book discussing why and detailing many algorithms categorized under the AI umbrella, I recommend:

  Artificial Intelligence: A Modern Approach[0]

And for LLMs specifically:

  Foundations of Large Language Models[1]

0 - https://en.wikipedia.org/wiki/Artificial_Intelligence:_A_Mod...

1 - https://arxiv.org/pdf/2501.09223

The same reason we sandbox anything. All software ought to be trustworthy, but in practice is susceptible to malfunction or attack. Agents can malfunction and cause damage, and they consume a lot of untrusted input and are vulnerable to malicious prompting.

As for humans, it's the norm to restrict access to production resources. Not necessarily because they're untrustworthy, but to reduce risk.
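To make that concrete, here's a rough sketch of the idea (not how any particular agent product actually does it; real setups would use containers or VMs): give each tool call a throwaway working directory, a stripped environment, and a timeout, so a malfunction or a maliciously prompted command has less to reach.

```python
import os
import subprocess
import tempfile

def run_tool_sandboxed(cmd: list[str], timeout: int = 10) -> subprocess.CompletedProcess:
    """Crude illustration of sandboxing an agent's tool call.
    Real deployments would use containers/VMs with no network and a
    read-only filesystem; this only shows the principle."""
    with tempfile.TemporaryDirectory() as scratch:
        env = {"PATH": "/usr/bin:/bin"}  # no secrets, tokens, or home dir leak in
        return subprocess.run(
            cmd,
            cwd=scratch,          # can't see the real repo
            env=env,              # can't read API keys from the parent environment
            capture_output=True,
            text=True,
            timeout=timeout,      # runaway processes get killed
        )

result = run_tool_sandboxed(["pwd"])
print(result.stdout.strip())  # a scratch directory, not your actual cwd
```

The point isn't that this is sufficient (it isn't), it's that the blast radius of a bad command shrinks with each privilege you withhold.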

I think often it's a question of naivety rather than maliciousness.

> AI should, from the core be intrinsically and unquestionably on our side

That would be great and many people are working to try to make this happen, but it's extremely difficult!

>Why do we have to treat AI like it's the enemy?

For some of the same reasons we treat human employees as the enemy: they can be socially engineered or compromised.

Sure, we treat most people that way, but we do give trust and access to some. This doesn't seem like the same concept here to me.

Even so those people are still monitored and systems can trip flashes if they start acting suspicious.

Trip flags.

I can’t even trust senior colleagues not to commit an API key to a git provider. Why would I trust a steerable computer?

Non-sentient technology has no concept of good or bad. We have no idea how to give it one. Even if we gave it one, we'd have no idea how to teach it to "choose good".

> In general we trust people that we bring onto our team not to betray us and to respect general rules and policies and practices that benefit everyone. An AI teammate should be no different.

That misses the point completely. How many of your coworkers fail phishing tests? It's not malicious, it's about being deceived.

But we do give humans responsibility to govern and manage critical things. We do give intrinsic trust to people. There are people at your company who have high level access and could do bad things, but they don't do it because they know better.

This article acts like we can never possibly give that sort of trust to AI because it's never really on our side or aligned with our goals. IMO that's a fool's errand because you can never really completely secure something and ensure there are no possible exploits.

Honestly, it doesn't really seem like AI to me if it can't learn this type of judgement. It doesn't seem like we should be barking up this tree if this is how we have to treat this new tool, IMO. Seems too risky.

> they don't do it because they know better.

That's completely false. People get deceived all the time. We even have a word for it: social engineering.

> we can never possibly give that sort of trust to AI because it's never really on our side or aligned with our goals

Right now we can't! AI is currently the equivalent of a very smart child. Would you give production access to a child?

> you can never really completely secure something and ensure there are no possible exploits.

This applies to any system, not just AI.

> AI is currently the equivalent of a very smart child. Would you give production access to a child?

I mean this is my point! Why are we asking a child to do anything remotely important at all?

Maybe we should wait until the tech is an adult before we start having it do important things for us.

Mitigating the naivety and recklessness of a child AI by attempting to lock down the environment as best we can seems foolish and shortsighted to me, and will probably not end well.

Using it inappropriately in production and studying it to understand how to make it responsible enough for production are very separate things. What you're implying is that we should somehow magically leapfrog the current state of the art to a future version that solves all the problems with the current generation. Or that we should ignore the technology entirely because developing it through the period where it's less robust than a mature human is too reckless.

The answer is that doing research isn't mutually exclusive with using the technology in appropriate ways. You can responsibly use AI while folks study threat models and model behavior for use cases that can't yet be deployed responsibly.

> by attempting to lock down the environment as best we can

We literally do this as a best practice generally for traditional systems and human access. It even has a name: least privilege.

> In general we trust people that we bring onto our team not to betray us and to respect general rules and policies and practices that benefit everyone.

And yet we give people the least privileges necessary to do their jobs for a reason, and it is in fact partially so that if they turn malicious, their potential damage is limited. We also have logging of actions employees do, etc etc.

So yes, in the general sense we do trust that employees are not outright and automatically malicious, but we do put *very broad* constraints on them to limit the risk they present.

Just as we 'sandbox' employees via e.g. RBAC restrictions, we sandbox AI.
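To spell out the parallel, here's a minimal sketch (role and action names are made up for illustration) of what least-privilege access checks look like in code. The same pattern applies whether the actor is an employee or an agent, and every decision gets logged:

```python
# Each role gets only the permissions it needs; an agent's grant can be
# narrower than any human role's.
ROLE_PERMISSIONS = {
    "reviewer": {"read_repo", "comment"},
    "engineer": {"read_repo", "comment", "open_pr"},
    "release_manager": {"read_repo", "comment", "open_pr", "deploy"},
    "agent": {"read_repo", "open_pr"},
}

def authorize(role: str, action: str) -> bool:
    """Deny by default; unknown roles get nothing. Log every decision."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    print(f"audit: role={role} action={action} allowed={allowed}")
    return allowed

authorize("agent", "open_pr")  # within its grant
authorize("agent", "deploy")   # denied, and the attempt is in the audit log
```

Note the design choice: trust isn't binary here. We "trust" the engineer and the agent alike, but the grant is scoped so that a compromised or confused actor can only do so much damage.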

But if there is a policy in place to prevent some sort of modification, then most people understand and respect that performing an exploit or workaround to make the modification anyway is off limits.

That seems to be the difference here: we should really be building AI systems that can be taught, or that learn, to respect things like that.

If people are claiming that AI is as smart as or smarter than the average person, then it shouldn't be hard for it to handle this.

Otherwise, it seems people are being too generous in talking about how smart and capable AI systems truly are.

First off, LLMs aren't "smart", they're algorithmic text generators. That doesn't mean they are less useful than a human who produces the same text, but they are not getting to said text in the same way (they're not 'thinking' about it, or 'reasoning' it out).

This is analogous to math operations in a computer in general. The computer doesn't conceptualize numbers (it doesn't conceptualize anything); it just applies fixed mechanical operations to bits that happen to represent numbers. You can actually recreate computer logic gates with water and mechanical locks, but that doesn't make the water or the locks "smart" or "thinking". Here's Stanford scientists actually miniaturizing this into a chip form [1].

[1]: https://prakashlab.stanford.edu/press/project-one-ephnc-he4a...
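You don't need water to see the point; a few lines make it concrete. Everything below is built from a single NAND rule, and "addition" falls out of blind rule-following with no concept of numbers anywhere:

```python
# Toy illustration: computation from fixed mechanical rules.
# A single NAND gate is universal; XOR, AND, and a half adder
# are just compositions of it.
def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

def xor(a: int, b: int) -> int:
    n = nand(a, b)
    return nand(nand(a, n), nand(b, n))

def half_adder(a: int, b: int) -> tuple[int, int]:
    # Returns (sum bit, carry bit): one-bit addition,
    # with no "understanding" of addition in sight.
    carry = nand(nand(a, b), nand(a, b))  # NOT(NAND) == AND
    return xor(a, b), carry

for a in (0, 1):
    for b in (0, 1):
        s, c = half_adder(a, b)
        print(f"{a} + {b} -> sum {s}, carry {c}")
```

The gates "do arithmetic" in exactly the sense a lock-and-water contraption does: correct outputs, zero comprehension.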

> But if there is a policy in place to prevent some sort of modification, then performing an exploit or workaround to make the modification anyways is arguably understood and respected by most people.

I'm confused about what you're trying to say. My point is that companies don't actually trust their employees, so it's not unexpected for them not to trust LLMs.