I’m not sure how to reconcile Anthropic’s update / some of the exuberant comments here with recent feedback like the following from curl maintainer Daniel Stenberg:
“I see no evidence that this setup [Mythos] finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing.”
https://daniel.haxx.se/blog/2026/05/11/mythos-finds-a-curl-v...
You’re right, it’s a valid data point. But the U.K. government report is also a data point, and the Firefox report is a data point, and they suggest that it is, indeed, significantly better than current generation models. Maybe curl is significantly better hardened than most projects?
In any event, it barely matters. As Anthropic acknowledges, next-level models are coming, and theirs is only one of them. Current generation models are already good at things like tracing data flow through complex systems and there’s no reason to think that capability has topped out. So within a year it seems very likely we’ll have more than one commercially available model able to find vulnerabilities cheaply.
On the other hand, it seems that they’ve made much less progress on getting it to design solutions to these issues.
> Maybe curl is significantly better hardened than most projects?
Meanwhile from [1]:
"Not even half-way through this #curl release cycle we are already at 11 confirmed vulnerabilities - and there are three left in the queue to assess and new reports keep arriving at a pace of more than one/day."
"The simple reason is: the (AI powered) tools are this good now. And people use these tools against curl source code. They find lots of new problems no one detected before. And none of these new ones used Mythos. Focusing on Mythos is a distraction - there are plenty of good models, and people who can figure out how to get those models and tools to find things."
Yeah, it looks like there are at least 11 security bugs missed by Mythos.
[1] https://www.linkedin.com/feed/update/urn:li:activity:7463481...
I’m trying to reconcile this with TFA, because the article says that the majority of vulns found by Mythos are being reported by independent researchers after validation. They never said those reports disclose that Mythos was involved, and I suspect they don’t. So did any of these 11 CVEs come from that channel?
I don't think anyone has claimed that Mythos finds all vulns in all projects. But it's very good if Mozilla's blog posts are anything to go by.
Based on the article here, and Firefox's mythos article, they had found bugs with Opus 4.6 as well but mythos is finding more that it missed.
That would align with the curl feedback you linked, they aren't using mythos but are finding bugs with other models. Presumably the expectation would be that with mythos they'd find more that were missed by other models already used.
> Based on the article here, and Firefox's mythos article, they had found bugs with Opus 4.6 as well but mythos is finding more that it missed.
It's not quite apples-to-apples. It was Opus on Firefox 148, Mythos on 150. A better test of Mythos vs Opus would have been to apply Mythos to Firefox 148. Or also re-apply Opus to Firefox 150.
Do we know all the Opus+Firefox 148 bugs are fixed in Firefox 150? Do we know the number of new bugs introduced per Firefox release?
> Do we know all the Opus+Firefox 148 bugs are fixed in Firefox 150? Do we know the number of new bugs introduced per Firefox release?
That may be parsable from their bug tracker, though I don't know if all bugs raised by Mythos are public.
I'd be particularly interested in how many of the bugs found existed in 148. Assuming most or all of them weren't newly created bugs added in 149 or 150, the comparison should still hold even though Opus and Mythos looked at different releases.
The same UK security research body ran the same CTF against GPT5.5. GPT5.5 got the same result as Mythos.
Anthropic promised us that Mythos was such an existential threat that it would compromise "every OS and browser on devices across the planet". They've held conferences and meetings with banks and govts across the world, shouting how critical this issue is.
GPT5.5 has been out for a month. Every device on earth has not been breached yet. It's very fair to criticize Anthropic's maximalist posturing when it's becoming exceedingly clear their models are fairly behind OpenAI's in capability.
In my opinion, the original commenter's statement stands, and the UK govt data point only helps support that due to the equal result between Mythos and GPT.
I'd advise reading into the specifics of what happened with Firefox; the TL;DR is that a reduced-safety version of its code was scanned by Opus 4.6 (yes, Opus), which found a multitude of bugs and 4 high-severity vulns that did not escape the sandbox. The Mythos system card test describes running Mythos against the same issues Opus found to see if it could reliably replicate them and chain together an attack.
I think for every data point, we need to know how many tokens and how much money were burned to achieve the desired outcome, and how buggy each piece of software was to start with.
I think people sometimes misunderstand Daniel's point here, though it's clearer when taken in context of the rest of his article. The tools in general are getting a lot better at finding security bugs. It was unclear to Daniel, based on his usage, whether Mythos in particular is a huge step, but the Mythos generation of LLMs definitely is. Note though that Daniel was using Mythos somewhat indirectly. Two things I've taken away from the whole Mythos debate: a) I suspect that Anthropic's GPU crunch meant that they felt they had to ration Mythos access anyway, so the calculus of whether they would release it generally was probably influenced by that, and b) finding bugs with Mythos or a similar model is still expensive: a $20K or $100K Mythos run on curl might have shown the same level of issues as other projects like Firefox, but Daniel didn't get that kind of access.
He posted a general update today on LinkedIn which I think gives the wider context:
https://www.linkedin.com/feed/update/urn:li:activity:7463481...
> Not even half-way through this #curl release cycle we are already at 11 confirmed vulnerabilities - and there are three left in the queue to assess and new reports keep arriving at a pace of more than one/day.
> 11 CVEs announced in a single release is our record from 2016 after the first-ever security audit (by Cure 53).
> This is the most intense period in #curl that I can remember ever been through.
Curl has more eyes on it, and has had more tools thrown at it, and is better tested (and developed?) than 99% of software, it's very much not the norm. I wouldn't be surprised if that has something to do with it, if there is any kind of bias there (not sure if there is, it's also possible he's just right).
Different people can have different experiences without contradiction. Maybe the curl source code was pretty clean to begin with?
imo curl is quite well maintained. there are a lot of sloppy projects out there and tools like this show who's been swimming with their pants down. not saying every project with vulnerabilities is sloppy, but when the cost of finding bugs and vulnerabilities decreases significantly, they will get exposed with enough time and tokens ($)
Fortunately, this is just a press release for their new product 'Claude Security'. Just contact sales to find out more https://claude.com/product/claude-security
Daniel has been posting for months (years?) about how much scrutiny he gets from security researchers and various automated tools. I wouldn't expect curl to be the average case for Mythos.
It is the opposite. Security people focus on curl and sudo because they are code bases that contain a lot of features and unused code from the 1990s.
They don't focus on projects where they find nothing. They certainly don't advertise when they find nothing.
Getting a lot of scrutiny is not the recommendation that it appears to be. What is the new standard? Projects that never have bugs are deemed to be suspect because they "have not been scrutinized" (they have, but null results never go public)?
So Mythos only finding one issue after other tools have found 300 this year is embarrassing. Mythos was supposed to be better and novel.
It is definitely not the case that curl has been or is now a marquee vulnerability research target. It's a CLI HTTP fetcher. It's the same with sudo. It's a big deal if a sudo vulnerability gets found, because it's an extremely load-bearing piece of software, but sudo is itself not a prime target, because it doesn't do much.
There is no claim that it is a "vulnerability research target". It is a bug finding magnet, and bugs can be found by anything from gcc warnings to AI tools.
No, it didn't attract bluepill exploit research.
The point still stands: 300 bugs found in a year is not a recommendation, whatever the pro-AI mafia suddenly claims ("because it has been analyzed!"). Maybe the AI-mafia should sell "analyzed by Mythos" labels to impress people who don't write public software or find bugs for that matter.
What’s a “bluepill exploit”?
[flagged]
You are linking to a Wikipedia page in which I am literally cited (I presented a hypervisor malware detection scheme at the Black Hat conference where Joanna Rutkowska presented this; it was a whole thing). I'm telling you that the term makes no sense in this thread. I think you meant to use a different term.
[flagged]
Stop abusing the system with new accounts. You're not cool like that.
What's with the nonstop new accounts...?
[flagged]
Did you... create a new account just to be able to respond to Thomas?
Btw, he's a security researcher. You should be more respectful.
I don't care if they're respectful, but they should try to be less confusing. "Blue Pill" isn't a kind of exploit. I assumed they meant "blue hat".
[flagged]
What am I?
Curl, by the author's own admission, is the most heavily tested and fuzzed open source library out there. So I think for him it's a different situation.
Again, the XBOW article is pretty relevant: https://xbow.com/blog/mythos-offensive-security-xbow-evaluat...
It's a weird accident of fate that curl has somehow become the reference target for LLM bugfinding. Curl is not an especially interesting project. What seems to have happened is that Stenberg made waves (legitimately) complaining about LLM slop submissions, then more waves when LLM bug reports got good, and so now everyone seems to think a good measure of a vuln researcher is how many curl findings they generate. No. Curl is a straightforward CLI HTTP client.
The Linux kernel is the right reference target, if you need one.
Or SSH, OpenSSL, Envoy, Nginx, etc. Curl has a real footprint, but it isn't just out there passively attackable. Linux Kernel is right as a default.
OpenSSH is a legitimately high bar, one of the hardest targets in all memory-unsafe software.
Curl is a high bar for a different reason (the same one as sudo): it doesn't do enough to be all that interesting. Stenberg is having trouble keeping up with all the inbounds, but look at the 2026 CVEs: they all seem kind of boring? Exploit developers aren't hunting for "wrong reuse of HTTP Negotiate connection". Like, yes, these are legitimate bugs, important that they get fixed, but none of them are prizes.
By rights, OpenSSH should be a smoking crater. It's not, I believe because of sheer engineering excellence.
If I said what I think, dang would tell me to read the site's guidelines.
He already scanned the codebase with Codex Security and a whole bunch of other AI tools, and fixed 200-300 bugs and CVEs. That Mythos found 1 more bug and 1 more CVE on top of that is already impressive.
I believe that the real difference is the token burning to analyze entire code bases.
What I think, based on the various things I've read, is that Mythos is a standard advance in raw capability that was heavily trained on the process of being a security researcher. If you already had the skills to find and exploit bugs then Mythos is not a game changer; if you're an ordinary programmer it is a game changer, because it's been so well tuned to wear the security researcher hat that you don't have to give it much feedback at all.
I'll say it. From the language of his post it doesn't seem like he was using Mythos with the correct harness / the way you're supposed to. A friend lent (?) it to him.
Yes, moving the goalposts, holding it wrong, yes that's what I believe
> I’m not sure how to reconcile Anthropic’s update ...
Why not? TFA says 23 000 findings "of all severities" and then, in the end, only 88 security advisories published.
What we'd really need is how many security advisories not related to Mythos findings have been published in the same time. If it's, say, 500 security advisories (just making a number up), wouldn't Anthropic's update in TFA and Daniel Stenberg's comments be easy to reconcile?
Like, yup, we've got a new tool to find exploits. It's a tool. It's new. We already had tools. Let's make the software world a bit more secure.
Now if you tell me that 100 security advisories have been published in that timespan and that 88 were due to Anthropic's Mythos: now I'd have to say that it's hard to reconcile Daniel Stenberg's position with TFA.
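To make the arithmetic in that comparison concrete, here is a rough Python sketch. The 23 000 findings and 88 advisories are the figures quoted from TFA above; the totals of 500 and 100 are the made-up numbers from this comment, not real data, and the function name `share_of_advisories` is purely illustrative.

```python
def share_of_advisories(mythos_advisories: int, total_advisories: int) -> float:
    """Fraction of all advisories in a period that trace back to Mythos findings."""
    if total_advisories <= 0:
        raise ValueError("total_advisories must be positive")
    return mythos_advisories / total_advisories


MYTHOS_FINDINGS = 23_000  # quoted from TFA: findings "of all severities"
MYTHOS_ADVISORIES = 88    # quoted from TFA: security advisories published

# Hypothetical totals (made up in the comment above, not measured data).
for total in (500, 100):
    share = share_of_advisories(MYTHOS_ADVISORIES, total)
    print(f"total advisories = {total}: Mythos share = {share:.1%}")

# How many raw findings ended up as published advisories.
print(f"advisories per finding = {MYTHOS_ADVISORIES / MYTHOS_FINDINGS:.2%}")
```

With a hypothetical total of 500 the Mythos share would be about 17.6%, which fits the "one more tool" reading; with a total of 100 it would be 88%, which would be much harder to square with the curl experience. Neither total is known, so this only shows which missing number decides the question.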