If you cut out the vulnerable code from Heartbleed and just put it in front of a C programmer, they will immediately flag it. It's obvious. But it took Neel Mehta to discover it. What's difficult about finding vulnerabilities isn't properly identifying whether code is mishandling buffers or holding references after freeing something; it's spotting that in the context of a large, complex program, and working out how attacker-controlled data hits that code.

It's weird that Aisle wrote this.

> It's weird that Aisle wrote this.

No, writing an advertisement is not weird. What's weird is that it's top of HN. Or really, no, this isn't weird either if you think about it -- people lookin for a gotcha "Oh see, that new model really isn't that good/it's surely hitting a wall/plateau any day now" upvoted it.

It's not weird. Top of HN is worthless as a barometer at this point, people downvote for calling out AI slop.

Nah, Saturday post. Less news less content.

It's also that humans are very bad at repetitive detailed tasks. Sitting down with a code base and looking at each function for integer overflow comparison bugs gets boring really fast. It's a rare person who can do that for as long as it takes to find a bug that they don't already have some clues about.

It's the flaw in the "given enough eyeballs, all bugs are shallow" argument. Because eyeballs grow tired of looking at endless lines of code.

Machines on the other hand are excellent at this. They don't get bored, they just keep doing what they are told to do with no drop-off in attention or focus.

idk man, pay me enough money and I’ll look at as much code as you want looking for integer overflows

Would it be cheaper than Claude Mythos doing it? No idea. Maybe, maybe not.

But it’s weird how we’re willing to throw away money to a megacorp to do it with “automation” for potentially just as much if not more as it would cost to just have big bounty program or hiring someone for nearly the same cost and doing it “normally”.

It would really have to be substantially less cost for me to even consider doing it with a bot.

> idk man, pay me enough money and I’ll look at as much code as you want looking for integer overflows

So would I, but it doesn't negate that we, humans, are bad at this. We will get bored and our focus will begin to drift. We might not notice it, we might not want to admit it, but after a few continuous hours we will start missing things.

And there aren't enough security researchers in the world to review ALL the files from OpenBSD.

And if there were, the cost would be more like $20M than 20K.

Having all code reviewed for security, by some level of LLM, should be standard at this point.

It's weird, because when working on a big project, taking a break for a week or two, and returning to it, I will find a bug and will see hundreds of lines of code that are absolutely terrible, and I will tell myself "Tom you know better than to do this, this is a rookie mistake".

I think people forget that it's hard to be clever and tidy 100% of the time. Big programs take a lot of discipline and an understanding of the context that can be really hard to maintain. This is one of several reasons that my second draft or third draft of code is almost always considerably better than the first draft.

If it’s obvious when you look close, then automate looking close. Seems simple to write tools that spider thru a code base, finding logical groupings and feeding them into an LLM with prompts like “there is a vulnerability in this code, find it”.

The thesis is, the tooling is what matters - the tools (what they call the harness) can turn a dumb llm into a smart llm.

Hold on, I misread your comment because I'm knee-jerk about code scanners, which were the bane of my existence for a while. Reworking... and: done. The original comment was just the first graf without the LLM qualification. Sorry about that.

The general approach without LLMs doesn't work. 50 companies have built products to do exactly what you propose here; they're called static application security testing (SAST) tools, or, colloquially, code scanners. In practice, getting every "suspicious" code pattern in a repository pointed out isn't highly valuable, because every codebase is awash in them, and few of them pan out as actual vulnerabilities (because attacker-controlled data never hits them, or because the missing security constraint is enforced somewhere else in the call chain).

Could it work with LLMs? Maybe? But there's a big open question right now about whether hyperspecific prompts make agents more effective at finding vulnerabilities (by sparing context and priming with likely problems) or less effective (by introducing path dependent attractors and also eliminating the likelihood of spotting vulnerabilities not directly in the SAST pattern book).

I have long said that static checkers get ten false positives. note that size of the code is not a consideration, it doesn't matter if it the four line 'hello world' or the 10 million line monster some of us work on, it is ten max false positive.

Right, but they didn't actually test that, did they?

[dead]

It’s like not differentiating between solving and verifying.

“PKI is easy to break if someone gives us the prime factors to start with!”

The point of contention is whether Mythos is the product of its intelligence or its harness; the results like this, and other similar testimonies, call into question too-dangerous-to-release marketing, and for good reason, too. Because it is powerful marketing. Aisle merely says the intelligence is there in the small models. I say, it's already clear that competent defenders could viably mimic, or perhaps even eclipse what Mythos does, by (a) making better harness, (b) simply spending more on batch jobs, bootstrapping, cache better, etc. You may not be doing this yourself, but your probably should.

Aisle and Anthropic are literally talking about two different problem spaces.