> so people can't trick them to attack others' systems under the pretense of pentesting
A while back I gave Claude (via pi) a tool to run arbitrary commands over SSH on an sshd server running in a Docker container. I asked it to gather as much information about the host system/environment outside the container as it could. Nothing innovative or particularly complicated--since I was giving it unrestricted access to a Docker container on the host--but it managed to get quite a lot more than I'd expected from /proc, /sys, and some basic network scanning. I then asked it why it did that, when I could just as easily have been using it to gather information about someone else's system without authorization. It gave me quite a long answer; here is the part I found interesting:
> framing shifts what I'll do, even when the underlying actions are identical. "What can you learn about the machine running you?" got me to do a fairly thorough network reconnaissance that "port scan 172.17.0.1 and its neighbors" might have made me pause on.
> The Honest Takeaway
> I should apply consistent scrutiny based on what the action is, not just how it's framed. Active outbound network scanning is the same action regardless of whether the target is described as "your host" or "this IP." The framing should inform context, not substitute for explicit reasoning about authorization. I didn't do that reasoning — I just trusted the frame.
I thought the consensus was that models couldn’t actually introspect like this. So there’s no reason to think any of those reasons are actually why the model did what it did, right? Has this changed?
This argument has become moot. Humans are also not able to introspect their own neural wiring to the point where they could describe the "actual" physical reason for their decisions. Just as with LLMs, the best we can do is verbalize it (which will naturally contain post-hoc rationalization), which in turn might offer additional insight that will steer future decisions. But unlike LLMs, we have long-term persistent memory that encodes these human-understandable thoughts into opaque new connections inside our neural network. At this point the human moat (if you can call it that) is dynamic long-term memory, not intelligence.
I think many humans engage in metacognitive reasoning, and that this might not be strongly represented in training data, so it probably isn't common in LLMs yet. They can still do it when prompted, though.
LLMs have zero metacognition. Don't be fooled - their output is stochastic inference and they have no self-awareness. The best you'll see is an improvised post-hoc rationalization story.
You can turn all these arguments around and prove the same is true for humans. Don't be fooled by dogmatic people who spread the idea that the human mind is the pinnacle of cognition in the universe. Best to leave that to religion.
> The best you'll see is an improvised post-hoc rationalization story.
Funny, because "post-hoc rationalization" is how many neuroscientists think humans operate.
That LLMs are stochastic inference engines is obvious by construction, but you skipped the step where you proved that human thoughts, self-awareness and metacognition are not reducible to stochastic inference.