sure sure, except llms. I mean its valid and all bringing up tried and true maxims that we all should know regarding software, but whens the last time the ssl guys were happy with a fix that "has a chance of working, but a chance of not working."

defense in depth is to prevent one layer failure from getting to the next, you know, exploit chains etc. Failure in a layer is a failure, not statistically expected behavior. we fix bugs. what we need to do is treat llms as COMPLETELY UNTRUSTED user input as has been pointed out here and elsewhere time and again.

you reply to me like I need to be lectured, so consider me a dumb student in your security class. what am I missing here?

> you reply to me like I need to be lectured

That's not my intention! Just stating how we're thinking about this.

> defense in depth is to prevent one layer failure from getting to the next

We think a separate model can help with one layer of this: checking if the planner model's actions are aligned with the user's request. But we also need guarantees at other layers, like distinguishing web contents from user instructions, or locking down what tools the model has access to in what context. Fundamentally, though, like we said in the blog post:

"The attack we developed shows that traditional Web security assumptions don’t hold for agentic AI, and that we need new security and privacy architectures for agentic browsing."

"But we also need guarantees at other layers, like distinguishing web contents from user instructions"

How do you intend to do that?

In the three years I've spent researching and writing about prompt injection attacks I haven't seen a single credible technique from anyone that can distinguish content from instructions.

If you can solve that you'll have solved the entire class of prompt injection attacks!

> I haven't seen a single credible technique from anyone that can distinguish content from instructions

You specifically mean that it's ~impossible to distinguish between content and instructions ONCE it is fed to the model, right? I agree with that. I was talking about a prior step, at the browser level. At the point that the query is sent to the backend, the browser would be able to distinguish between web contents and user prompt. This is useful for checking user-alignment of the output of the reasoning model (keeping in mind that the moment you feed in untrusted text into a model all bets are off).

We're actively thinking and working on this, so will have more to announce soon, but this discussion is useful!

Even if you know the source of the text before you feed it to the model you still need to solve the problem of how to send untrusted text from a user through a model without that untrusted text being able to trigger additional tool calls or actions.

The most credible pattern I've seen for that comes from the DeepMind CaMeL paper - I would love to see a browser agent that robustly implemented those ideas: https://simonwillison.net/2025/Apr/11/camel/

> Even if you know the source of the text before you feed it to the model you still need to solve the problem of how to send untrusted text from a user through a model without that untrusted text being able to trigger additional tool calls or actions.

We're exploring taking the action plan that a reasoning model (which sees both trusted and untrusted text) comes up with and passing it to a second model, which doesn't see the untrusted text and which then evaluates it.

> The most credible pattern I've seen for that comes from the DeepMind CaMeL paper

Yeah we're aware of the CaMeL paper and are looking into it, but it's definitely challenging from an implementation pov.

Also, I see that we said "The browser should clearly separate the user’s instructions from the website’s contents when sending them as context to the model" in the blog post. That should have been "backend", not "model". Agreed that once you feed both trusted and untrusted tokens into the LLM the output must be considered unsafe.

>We're exploring taking the action plan that a reasoning model (which sees both trusted and untrusted text) comes up with and passing it to a second model, which doesn't see the untrusted text and which then evaluates it.

How is this different from the Dual-LLM pattern that’s described in the link that was posted? It immediately describes how that setup is still susceptible to prompt injection.

>With the Dual LLM pattern the P-LLM delegates the task of finding Bob’s email address to the Q-LLM—but the Q-LLM is still exposed to potentially malicious instructions.

Operating systems solved this with "mark of the web". Distinguishing data from instructions seems to be only part of the problem (and the easier one—presumably tools could label data downloaded from external sources accordingly at runtime). The harder problem seems to be blocking execution of instructions in data while still being able to use the data to generate a response.

> what am I missing here?

I guess what I don't understand is that failure is always expected because nothing is perfect, so why isn't the chance of failure modeled and accounted for? Obviously you fix bugs, but how many more bugs are in there you haven't fixed? To me, "we fix bugs" sounds the same as "we ship systems with unknown vulnerabilities".

What's the difference between a purportedly "secure" feature with unknown, unpatched bugs; and an admittedly insecure feature whose failure modes are accounted for through system design taking that insecurity into account, rather than pretending all is well until there's a problem that surfaces due to unknown exploits?

The “secure” system with unknown bugs can fix them once they become known. The system that’s insecure by design and tries to mitigate it can’t be fixed, by design.

There might be a zero-day bug in my browser which allows an attacker to steal my banking info and steal my money. I’m not very worried about this because I know that if such a thing is discovered, Apple is going to fix it quickly. And it’s going to be such a big deal that it’s going to make the news, so I’ll know about it and I can make an informed decision about what to do while I wait for that fix.

Computer security is fundamentally about separating code from data. Security vulnerabilities are almost always bugs that break through that separation. It may be direct, like with a buffer overflow into executable memory or a SQL injection, or it may be indirect with ROP and such. But one way or another, it comes down to getting the target to run code it’s not supposed to.

LLMs are fundamentally designed such that there is no barrier between the two. There’s no code over here and data over there. The instructions are inherently part of the data.

I think you're correct with accounting for the security "attributes" of these llms if you're going to use them, like you said, "taking that insecurity into account".

If we sit down and examine the statistics of bugs, the costs of their occurance in production and weighed everything with some reasonable criteria, I think we could somehow arrive at a reasonable level of confidence that allows us to ship a system to production. Some organizations do better with this than others of course. During a projects development cycle, we could watch out for common patterns, buffer overflows, use after free for c folks, sql injection or non escaping stuff in web programming but we know these are mistakes and we want to fix them.

With llms the mitigation that I'm seeing is that we reduce the errors 90 percent, but this is not a mitigation unless we also detect and prevent the other 10 percent. Its just much more straightforward to treat llms as untrusted, because they are, you're getting input from randos by virtue of its training data. producing mistaken output is not actually a bug, its actually expected behavior, unless you also believe in the tooth fairy lol

>To me, "we fix bugs" sounds the same as "we ship systems with unknown vulnerabilities".

to me, they sound different ;)

> what am I missing here?

Yeah the tone of that response seems unnecessarily smug.

“I’m working on removing your front door and I’m designing a really good ‘no trespassing’ sign. Only a simpleton would question my reasoning on this issue”