I mean, if the claim here is that small models can accomplish the same things with good scaffolding, why didn't they demonstrate finding those problems with good scaffolding, rather than pointing the models directly at the problem?
They don't have to.
A lot of people in this thread don't seem to get that.
If another model can find the vulnerability when you point it at the right place, it would also find that vulnerability if you scanned each place individually.
People are talking about false positives, but that also doesn't matter. Again, they're not thinking it through.
False positives don't matter, because you can automatically try to exploit each "exploit", and if it doesn't work, it's a false positive.
Worse, we have no idea how Mythos actually worked. It could have followed the process I've outlined above, "found" thousands of false positives, and simply discarded them by checking each one.
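To make the scan-then-verify process concrete, here is a minimal Python sketch. It is an illustration under loud assumptions, not a description of how Mythos or any real harness works: `ask_model` and `try_exploit` are hypothetical stand-ins for "ask a model about one location" and "automatically attempt the exploit", and the toy source map, model, and exploit check below are made up purely for demonstration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    location: str
    description: str


def scan_and_verify(
    locations: Iterable[str],
    ask_model: Callable[[str], list[Finding]],  # hypothetical model call
    try_exploit: Callable[[Finding], bool],     # hypothetical exploit attempt
) -> tuple[list[Finding], int]:
    """Point the model at every location in turn, keep only findings whose
    exploit attempt succeeds, and count the rest as false positives."""
    confirmed: list[Finding] = []
    discarded = 0
    for loc in locations:  # the for-each: scan each place individually
        for finding in ask_model(loc):
            if try_exploit(finding):  # automatic check: does the "exploit" work?
                confirmed.append(finding)
            else:
                discarded += 1  # failed exploit, so treat it as a false positive
    return confirmed, discarded


# Toy stand-ins (hypothetical): the "model" flags every snippet containing strcpy,
# and the "exploit" only succeeds where no length check precedes the copy.
source = {
    "parse_header": "strcpy(buf, input)",
    "copy_name": "if (len < 64) strcpy(buf, name)",
    "log_line": 'printf("%s", msg)',
}


def toy_model(loc: str) -> list[Finding]:
    return [Finding(loc, "possible overflow")] if "strcpy" in source[loc] else []


def toy_exploit(finding: Finding) -> bool:
    return "if (len" not in source[finding.location]


confirmed, discarded = scan_and_verify(source, toy_model, toy_exploit)
print([f.location for f in confirmed], discarded)  # ['parse_header'] 1
```

The noisy toy model flags two locations, and the verification step throws one of them away, which is the point being argued: a high false-positive rate costs extra exploit attempts, not correctness of the final list.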
The fundamental point is that it doesn't matter how the cheap models identified the exploit; what matters is that they can identify it.
When it turns out the harness is just acting as a glorified for-each brute force, it's not the model being intelligent; it's simply the harness covering more ground. It's millions of monkeys bashing at typewriters, not a single Shakespeare.