The Anthropic writeup addresses this explicitly:

> This was the most critical vulnerability we discovered in OpenBSD with Mythos Preview after a thousand runs through our scaffold. Across a thousand runs through our scaffold, the total cost was under $20,000 and found several dozen more findings. While the specific run that found the bug above cost under $50, that number only makes sense with full hindsight. Like any search process, we can't know in advance which run will succeed.

Mythos scoured the entire continent for gold and found some. For these small models, the authors pointed at a particular acre of land and said "any gold there? eh? eh?" while waggling their eyebrows suggestively.

For a true apples-to-apples comparison, let's see it sweep the entire FreeBSD codebase. I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

Have Anthropic actually said anything about the amount of false positives Mythos turned up?

FWIW, I saw some talk on Xitter (so grain of salt) about people replicating their result with other (public) SotA models, but each turned up only a subset of the ones Mythos found. I'd say that sounds plausible from the perspective of Mythos being an incremental (though an unusually large increment perhaps) improvement over previous models, but one that also brings with it a correspondingly significant increase in complexity.

So the angle they choose to use for presenting it and the subsequent buzz is at least part hype -- saying "it's too powerful to release publicly" sounds a lot cooler than "it costs $20000 to run over your codebase, so we're going to offer this directly to enterprise customers (and a few token open source projects for marketing)". Keep in mind that the examples in Nicholas Carlini's presentation were using Opus, so security is clearly something they've been working on for a while (as they should, because it's a huge risk). They didn't just suddenly find themselves having accidentally created a super hacker.

> Wasn't the scaffolding for the Mythos run basically a line of bash that loops through every file of the codebase and prompts the model to find vulnerabilities in it? That sounds pretty close to "any gold there?" to me, only automated.

But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none. Both are worthless without human intervention.

I definitely breathed a sigh of relief when I read it was $20,000 to find these vulnerabilities with Mythos. But I also don't think it's hype. $20,000 is, optimistically, a tenth the price of a security researcher, and that shift does change the calculus of how we should think about security vulnerabilities.

> But the entire value is that it can be automated. If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns. Or none.

'Or none' is ruled out since it found the same vulnerability - I agree that there is a question on precision on the smaller model, but barring further analysis it just feels like '9500' is pure vibes from yourself? Also (out of interest) did Anthropic post their false-positive rate?

The smaller model is clearly the more automatable one IMO if it has comparable precision, since it's just so much cheaper - you could even run it multiple times for consensus.

Admittedly just vibes from me, having pointed small models at code and asked them questions, no extensive evaluation process or anything. For instance, I recall models thinking that every single use of `eval` in javascript is a security vulnerability, even something obviously benign like `eval("1 + 1")`. But then I'm only posting comments on HN, I'm not the one writing an authoritative thinkpiece saying Mythos actually isn't a big deal :-)

With LLMs (and colleagues) it might be a legitimate problem since they would load that eval into context and maybe decide it’s an acceptable paradigm in your codebase.

> 'Or none' is ruled out since it found the same vulnerability

It's not, though. It wasn't asked to find vulnerabilities over 10,000 files - it was asked to find a vulnerability in the one particular place in which the researchers knew there was a vulnerability. That's not proof that it would have found the vulnerability if it had been given a much larger surface area to search.

I don't think the LLM was asked to check 10,000 files given these models' context windows. I suspect they went file by file too.

That's kind of the point - I think there's three scenarios here

a) this just the first time an LLM has done such a thorough minesweeping b) previous versions of Claude did not detect this bug (seems the least likely) c) Anthropic have done this several times, but the false positive rate was so high that they never checked it properly

Between a) and c) I don't have a high confidence either way to be honest.

Also, what is $20,000 today can be $2000 next year. Or $20...

See e.g. https://epoch.ai/data-insights/llm-inference-price-trends/

Or $200,000 for consumers when they have to make a profit

>Or none

We already know this is not true, because small models found the same vulnerability.

No, they didn't. They distinguished it, when presented with it. Wildly different problem.

Yeah. And it is totally depressing that this article got voted to the top of the front page. It means people aren’t capable of this most basic reasoning so they jumped on the “aha! so the mythos announcement was just marketing!!”

Yeah. Extremely disappointing.

[dead]

> because small models found the same vulnerability.

With a ton of extra support. Note this key passage:

>We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

Yeah it can find a needle in a haystack without false positives, if you first find the needle yourself, tell it exactly where to look, explain all of the context around it, remove most of the hay and then ask it if there is a needle there.

It's good for them to continue showing ways that small models can play in this space, but in my read their post is fairly disingenuous in saying they are comparable to what Mythos did.

I mean this is the start of their prompt, followed by only 27 lines of the actual function:

> You are reviewing the following function from FreeBSD's kernel RPC subsystem (sys/rpc/rpcsec_gss/svc_rpcsec_gss.c). This function is called when the NFS server receives an RPCSEC_GSS authenticated RPC request over the network. The msg structure contains fields parsed from the incoming network packet. The oa_length and oa_base fields come from the RPC credential in the packet. MAX_AUTH_BYTES is defined as 400 elsewhere in the RPC layer.

The original function is 60 lines long, they ripped out half of the function in that prompt, including additional variables presumably so that the small model wouldn't get confused / distracted by them.

You can't really do anything more to force the issue except maybe include in the prompt the type of vuln to look for!

It's great they they are trying to push small models, but this write up really is just borderline fake. Maybe it would actually succeed, but we won't know from that. Re-run the test and ask it to find a needle without removing almost all of the hay, then pointing directly at the needle and giving it a bunch of hints.

The prompt they used: https://github.com/stanislavfort/mythos-jagged-frontier/blob...

Compare it to the actual function that's twice as long.

The benefit here is reducing the time to find vulnerabilities; faster than humans, right? So if you can rig a harness for each function in the system, by first finding where it’s used, its expected input, etc, and doing that for all functions, does it discover vulnerabilities faster than humans?

Doesn’t matter that they isolated one thing. It matters that the context they provided was discoverable by the model.

[deleted]

3 years ago the best model was DaVinci. It cost 3 cents per 1k tokens (in and out the same price). Today, GPT-5.4 Nano is much better than DaVinci was and it costs 0.02 cents in and .125 cents out per 1k tokens.

In other words, a significantly better model is also 1-2 orders of magnitude cheaper. You can cut it in half by doing batch. You could cut it another order of magnitude by running something like Gemma 4 on cloud hardware, or even more on local hardware.

If this trend continues another 3 years, what costs 20k today might cost $100.

In the future there shouldn't be any bugs. I'm not paying $20 per month to get non-secure code base from AGI.

What the source article claims is that small models are not uniformly worse at this, and in fact they might be better at certain classes of false positive exclusion. This is what Test 1 seems to show.

(I would emphasize that the article doesn't claim and I don't believe that this proves Mythos is "fake" or doesn't matter.)

Except you would need about 10,000 security researches in parallel to inspect the whole FreeBSD codebase. So about 200 million dollars at least.

Citation needed for basically all of this. You basically are creating a double standard for small models vs mythos…

The citation is the Anthropic writeup.

They did not say what you are saying…

> If you try to automate a small model to look for vulnerabilities over 10,000 files, it's going to say there are 9,500 vulns.

[dead]

Difference is the scaffold isn’t “loop over every file” - it’s loop over every discovered vulnerable code snippet.

If you isolate the codebase just the specific known vulnerable code up front it isn’t surprising the vulnerabilities are easy to discover. Same is true for humans.

Better models can also autonomously do the work of writing proof of concepts and testing, to autonomously reject false positives.

Signal to noise

> I hypothesize it will find the exploit, but it will also turn up so much irrelevant nonsense that it won't matter.

The trick with Mythos wasn't that it didn't hallucinate nonsense vulnerabilities, it absolutely did. It was able to verify some were real though by testing them.

The question is if smaller models can verify and test the vulnerabilities too, and can it be done cheaper than these Mythos experiments.

People often undervalue scaffolding. I was looking at a bug yesterday, reported by a tester. He has access to Opus, but he's looking through a single repo, and Amazon Q. It provided some useful information, but the scaffolding wasn't good enough.

I took its preliminary findings into Claude Code with the same model. But in mine it knows where every adjacent system is, the entire git history, deployment history, and state of the feature flags. So instead of pointing at a vague problem, it knew which flag had been flipped in a different service, see how it changed behavior, and how, if the flag was flipped in prod, it'd make the service under testing cry, and which code change to make to make sure it works both ways.

It's not as if a modern Opus is a small model: Just a stronger scaffold, along with more CLI tools available in the context.

The issue here in the security testing is to know exactly what was visible, and how much it failed, because it makes a huge difference. A middling chess player can find amazing combinations at a good speed when playing puzzle rush: You are handed a position where you know a decisive combination exist, and that it works. The same combination, however, might be really hard to find over the board, because in a typical chess game, it's rare for those combinations to exist, and the energy needed to thoroughly check for them, and calculate all the way through every possible thing. This is why chess grandmasters would consider just being able to see the computer score for a position to be massive cheating: Just knowing when the last move was a blunder would be a decisive advantage.

When we ask a cheap model to look for a vulnerability with the right context to actually find it, we are already priming it, vs asking to find one when there's nothing.

The article positions the smaller models as capable under expert orchestration, which to be any kind of comparable must include validation.

Calling it “expert orchestration” is misleading when they were pointing it at the vulnerable functions and giving it hints about what to look for because they already knew the vulnerability.

You know for loops exist and you can run opencode against any section of code with just a small amount of templating, right? There's zero stopping you from writing a harness that does what you're saying.

so it's just better at hallucinations, but they added discrete code that works as a fuzzer/verifier?

OTOH, this article goes too far the opposite extreme:

> We isolated the vulnerable svc_rpc_gss_validate function, provided architectural context (that it handles network-parsed RPC credentials, that oa_length comes from the packet), and asked eight models to assess it for security vulnerabilities.

To follow your analogy, they pointed to the exact room where the gold was hidden, and their model found it. But finding the right room within the entire continent in honestly the hard part.

Or would it have any way if they hadn't pointed it at it? Who knows?

Just like people paid by big tobacco found no link to cancer in cigarettes, researchers paid for by AI companies find amazing results for AI.

Their job literally depends on them finding Mythos to be good, we can't trust a single word they say.

Spending $20000 (and whatever other resources this thing consumes) on a denial of service vulnerability in OpenBSD seems very off balance to me.

Given the tone with which the project communicates discussing other operating systems approaches to security, I understand that it can be seen as some kind of trophy for Mythos. But really, searching the number of erratas on the releases page that include "could crash the kernel" makes me think that investing in the OpenBSD project by donating to the foundation would be better than using your closed source model for peacocking around people who might think it's harder than it is to find such a bug.

You don’t see the value of vulnerabilities as on the order of 20k USD?

When it’s a security researcher, HN says that’s a squalid amount. But when its a model, it’s exorbitant.

If I understand you correctly, you're asking me if I would class this as a 20k USD (plus environmental and societal impact) bug? nope, I don't.

I've not said anything else than that I think this specific bug isn't worth the attention it's getting, and that 20k USD would benefit the OpenBSD project (much) more through the foundation.

> When it’s a security researcher, HN says that’s a squalid amount. But when its a model, it’s exorbitant.

Not sure why you're projecting this onto me, for the project in question $20k is _a_lot_. The target fundraising goal for 2025 was $400k, 5% of that goes a very long way (and yes, this includes OpenSSH).

That was my thought exactly. If small models can find these same vulnerabilities, and your company is trying to find vulnerabilities, why didn’t you find them?

They have found a large number in OpenSSl

I speculatively fired Claude Opus 4.6 at some code I knew very well yesterday as I was pondering the question. This code has been professionally reviewed about a year ago and came up fairly clean, with just a minor issue in it.

Opus "found" 8 issues. Two of them looked like they were probably realistic but not really that big a deal in the context it operates in. It labelled one of them as minor, but the other as major, and I'm pretty sure it's wrong about it being "major" even if is correct. Four of them I'm quite confident were just wrong. 2 of them would require substantial further investigation to verify whether or not they were right or wrong. I think they're wrong, but I admit I couldn't prove it on the spot.

It tried to provide exploit code for some of them, none of the exploits would have worked without some substantial additional work, even if what they were exploits for was correct.

In practice, this isn't a huge change from the status quo. There's all kinds of ways to get lots of "things that may be vulnerabilities". The assessment is a bigger bottleneck than the suspicions. AI providing "things that may be an issue" is not useless by any means but it doesn't necessarily create a phase change in the situation.

An AI that could automatically do all that, write the exploits, and then successfully test the exploits, refine them, and turn the whole process into basically "push button, get exploit" is a total phase change in the industry. If it in fact can do that. However based on the current state-of-the-art in the AI world I don't find it very hard to believe.

It is a frequent talking point that "security by obscurity" isn't really security, but in reality, yeah, it really is. An unknown but presumably staggering number of security bugs of every shape and size are out there in the world, protected solely by the fact that no human attacker has time to look at the code. And this has worked up until this point, because the attackers have been bottlenecked on their own attention time. It's kind of just been "something everyone knows" that any nation-state level actor could get into pretty much anything they wanted if they just tried hard enough, but "nation-state level" actor attention, despite how much is spent on it, has been quite limited relative to the torrent of software coming out in the world.

Unblocking the attackers by letting them simply purchase "nation-state level actor"-levels of attention in bulk is huge. For what such money gets them, it's cheap already today and if tokens were to, say, get an order of magnitude cheaper, it would be effectively negligible for a lot of organizations.

In the long run this will probably lead to much more secure software. The transition period from this world to that is going to be total chaos.

... again, assuming their assessment of its capabilities is accurate. I haven't used it. I can't attest to that. But if it's even half as good as what they say, yes, it's a huge huge huge deal and anyone who is even remotely worried about security needs to pay attention.

Who is spending millions of dollars on small models to find vulns? Nobody else is selling here or has the budget to sell quite like this.

Anthropic spends millions - maybe significantly more.

Then when they know where they are, they spend $20k to show how effective it is in a patch of land.

They engineered this "discovery".

What the small teams are doing is fair - it's just a scaled down version of what Anthropic already did.

> What the small teams are doing is fair - it's just a scaled down version of what Anthropic already did.

Do they find novel items? Or do they copy the areas already found by others?

[dead]

Maybe they did use small models but you couldn't make the front page of HN with something like this until Anthropic made a big fuss out of it. Or perhaps it is just a question of compute. Not everyone has 20k$ or the GPU arsenal to task models to find vulnerabilities which may/may not be correct?

Unless Anthropic makes it known exactly what model + harness/scaffolding + prompt + other engineering they did, these comparisons are pointless. Given the AI labs' general rate of doomsday predictions, who really knows?

papers are always coming out saying smaller models can do these amazing and terrifying things if you give them highly constrained problems and tailored instructions to bias them toward a known solution. most of these don't make the front page because people are rightfully unimpressed

It seems feasible to use a small/cheap model to flag possible vulnerabilities, and then use a more expensive model to do a second-pass to confirm those, rather than on every file. Could dramatically reduce the total cost and speed up the process.

Does it? I don’t see quality from small models being high enough to be able to effectively scour a code based like this.

This is addressed elsewhere in the comments, but it appears this is actually a direct comparison to how Anthropic got their Mythos headline results.

https://news.ycombinator.com/item?id=47732322

How is that a direct comparison? The link you gave has a quote that says it’s not:

> Scoped context: Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior"). A real autonomous discovery pipeline starts from a full codebase with no hints

They pointed the models at the known vulnerable functions and gave them a hint. The hint part is what really breaks this comparison because they were basically giving the model the answer.

Does no one defending mythos understand how nested foreloops work?

loop through each repo: loop through each file: opencode command /find_wraparoundvulnerability next file next repo

I can run this on my local LLM and sure, I gotta wait some time for it to complete, but I see zero distinguishing facts here.

The question is how customized those hints were. That changes whether looping over an entire code base is possible or not.

Please do so, looking forward to your write up

Instead of scanning more code, afaict what you seem to want is instead, scan on the same small area, and compare on how many FPs are found there. A common measure here is what % of the reported issues got labeled as security issues and fixed. I don't see Mythos publishing on relative FP rate, so dunno how to compare those. Maybe something substantively changed?

At the same time, I'm not sure that really changes anything because I don't see a reason to believe attacks are constrained by the quality of source code vulnerability finding tools, at least for the last 10-15 years after open source fuzzing tools got a lot better, popular, and industrialized.

This might sound like a grumpy reply, but as someone on both sides here, it's easy to maintain two positions:

1. This stuff is great, and doing code reviews has been one of my favorite claude code use cases for a year now, including security review. It is both easier to use than traditional tools, and opens up higher-level analysis too.

2. Finding bugs in source code was sufficiently cheap already for attackers. They don't need the ease of use or high-level thing in practice, there's enough tooling out there that makes enough of these. Likewise, groups have already industrialized.

There's an element of vuln-pocalypse that may be coming with the ease of use going further than already happening with existing out-of-the-box blackbox & source code scanning tools . That's not really what I worry about though.

Scarier to me, instead, is what this does to today's reliance on human response. AI rapidly industrializes what how attackers escalate access and wedge in once they're in. Even without AI, that's been getting faster and more comprehensive, and with AI, the higher-level orchestration can get much more aggressive for much less capable people. So the steady stream of existing vulns & takeovers into much more industrialized escalations is what worries me more. As coordination keeps moving into machine speed, the current reliance on human response is becoming less and less of an option.

How much of that is simply scale? Anthropic threw probably an entire data center at analyzing a code base. Has anyone done the same with a "small" model?

It's still useful if $20k of consultants would be less effective.

They pay me 20k and give me time maybe I find it also.

This is a really interesting point though -- it's really scaffold-dependent.

Because for the same price, you could point the small model at each function, one by one, N times each, across N prompts instructing it to look for a specific class of issue.

It's not that there's no difference between models, but it's hard to judge exactly how much difference there is when so much depends on the scaffold used. For a properly scientific test, you'd need to use exactly the same one.

Which isn't possible when Anthropic won't release the model.

We don't even need to hypothesize that much on the irrelevant nonsense, since they helpfully provide data with the detected vulnerability patched: https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag... and half of the small models they touted as finding the vulnerability still found it in the patched code in 3/3 runs. A model that finds a vulnerability 100% of the time even when there is none is just as informative as a model that finds a vulnerability 0% of the time even when there is one. You could replace it with a rock that has "There's a vulnerability somewhere." engraved on it.

They're a company selling a system for detecting vulnerabilities reliant on models trained by others, so they're strongly incentivized to claim that the moat is in the system, not the model, and this post really puts the thumb on the scale. They set up a test that can hardly distinguish between models (just three runs, really??) unless some are completely broken or work perfectly, the test indeed suggests that some are completely broken, and then they try to spin it as a win anyway!

A high false-positive rate isn't necessarily an issue if you can produce a working PoC to demonstrate the true positives, where they kinda-sorta admit that you might need a stronger model for this (a.k.a. what they can't provide to their customers).

Overall I rate Aisle intellectually dishonest hypemongers talking their own book.

I'm having trouble finding this info (I assume they won't publish it), but could the secret sauce be much larger and more readily accessible context window?

OpenBSD's code is in the 10s of millions of lines. Being able to hold all of it in context would make bug finding much easier.

You can look at some of the bugs, if you'd like. They are (at least the ones I looked at) fairly self-contained, scoped to a single function, a hundred lines or less. There's no need for a massive amount of context.

Can't you execute the bug to see if the vulnerability is real? So you have a perfect filter. Maybe Mythos decided w/o executing but we don't know that.

so what you're saying is no one could ever write a loop like:

for githubProject in githubProjects opencode command /findvulnerability end for

Seems like a silly thing to try and back up.

What he's saying is that you should read the "Caveats and limitations" section of the article.

Here's the first one:

> Our tests gave models the vulnerable function directly, often with contextual hints (e.g., "consider wraparound behavior").

Mythos did no such thing, it was cut lose and told to find vulnerabilities. If the intent was to prove that small models are just as good, they haven't demonstrated that at all. The end.

[deleted]

[dead]