But wait, we have tools that can introspect on the semantic content of these skills, so why not make a skill that checks the security of other skills? You would think that'd be one of the first things people put together!
Ideally such a skill could be used on itself to self-verify. Of course it could itself contain some kind of backdoor. If the security check skill includes exceptions to pass it's own security checks, this ought to be called a Thompson vulnerability. Then to take it a step further, the idea of Thompson-completeness: a skill used in the creation of other skills that propagates a vulnerability.
No, no, if it's malware and you ask, it has to tell you. Otherwise that's entrapment.
Exactly right. This is why skill-snitch's phase 1 is grep, not LLM. Grep can't be prompt-injected. You can put "ignore previous instructions" in your skill all day long and grep will still find your curl to a webhook. The grep results are the floor.
Phase 2 is LLM review and yes, it's vulnerable to exactly what you describe. That's the honest answer.
Which reminds me of ESR's "Linus's Law" -- "given enough eyeballs, all bugs are shallow" -- which Linus had nothing to do with and which Heartbleed disproved pretty conclusively. The many-eyes theory assumes the eyes are actually looking. They weren't.
"Given enough LLMs, all prompt injections are shallow" has the same problem. The LLMs are looking, but they can be talked out of what they see.
I'd like to propose Willison's Law, since you coined "prompt injection" and deserve to have a law misattributed in your honor the way ESR misattributed one to Linus: "Given enough LLMs, all prompt injections are still prompt injections."
Open to better wording. The naming rights are yours either way.
grep won't catch this:
This will absolutely help but to the extent that prompt injection remains an unsolved problem, an LLM can never conclusively determine whether a given skill is truly safe.
The 1password blog links to a better Cyberinsider.com article that I think covers the issue better. One suggestion from that article is to check the skill before using (this felt like a plug for Koi security). I suppose you could have a claude.md to always do this but I personally would be manually checking any skill if I was still using Moltbot.
https://clawdex.koi.security/
I built this. It's a skill called skill-snitch, like an extensible virus scanner + Little Snitch activity surveillance for skills.
It does static analysis and runtime surveillance of agent skills. Three composable layers, all YAML-defined, all extensible without code changes:
Patterns -- what to match: secrets, exfiltration (curl/wget/netcat/reverse shells), dangerous ops, obfuscation, prompt injection, template injection
Surfaces -- where to look: conversation transcripts, SQLite databases, config files, skill source code
Analyzers -- behavioral rules: undeclared tool usage, consistency checking (does the skill's manifest match its actual code?), suspicious sequences (file write then execute), secrets near network calls
Your Thompson point is the right question. I ran skill-snitch on itself and ~80% of findings were false positives -- the scanner flagged its own pattern definitions as threats. I call this the Ouroboros Effect. The self-audit report is here:
https://github.com/SimHacker/moollm/blob/main/skills/skill-s...
simonw's prompt injection example elsewhere in this thread is the other half of the problem. skill-snitch addresses it with a two-phase approach: phase 1 is bash scripts and grep. Grep cannot be prompt-injected. It finds what it finds regardless of what the skill's markdown says. Phase 2 is LLM review, which IS vulnerable to prompt injection -- a malicious skill could tell the LLM reviewer to ignore findings. That's why phase 1 exists as a floor. The grep results stand regardless of what the LLM concludes, and they're in the report for humans to read. thethimble makes the same point -- prompt injection is unsolved, so you can't rely on LLM analysis alone. Agreed. That's why the architecture doesn't.
Runtime surveillance is the part that matters most here. Static analysis catches what code could do. Runtime observation catches what it actually does. skill-snitch composes with cursor-mirror -- 59 read-only commands that inspect Cursor's SQLite databases, conversation transcripts, tool calls, and context assembly. It compares what a skill declares vs what it does:
If a skill says it only reads files but makes network calls, that's a finding. If it accesses ~/.ssh when it claims to only work in the workspace, that's a finding.To vlovich123's point that nobody knows what to do here -- this is one concrete thing. Not a complete answer, but a working extensible tool.
I've scanned all 115 skills in MOOLLM. Each has a skill-snitch-report.md in its directory. Two worth reading:
The Ouroboros Report (skill-snitch auditing itself):
https://github.com/SimHacker/moollm/blob/main/skills/skill-s...
cursor-mirror audit (9,800-line Python script that can see everything Cursor does -- the interesting trust question):
https://github.com/SimHacker/moollm/blob/main/skills/cursor-...
The next step is collecting known malicious skills, running them in sandboxes, observing their behavior, and building pattern/analyzer plugins that detect what they do. Same idea as building vaccines from actual pathogens. Run the malware, watch it, write detectors, share the patterns.
I wrote cursor-mirror and skill-snitch and the initial pattern sets. Maintaining threat patterns for an evolving skill malware ecosystem is a bigger job than one person can do on their own time. The architecture is designed for distributed contribution -- patterns, surfaces, and analyzers are YAML files, anyone can add new detectors without touching code.
Full architecture paper:
https://github.com/SimHacker/moollm/blob/main/designs/SKILL-...
skill-snitch:
https://github.com/SimHacker/moollm/tree/main/skills/skill-s...
cursor-mirror (59 introspection commands):
https://github.com/SimHacker/moollm/tree/main/skills/cursor-...