Maybe they did use small models, but you couldn't make the front page of HN with something like this until Anthropic made a big fuss about it. Or perhaps it is just a question of compute: not everyone has $20k or the GPU arsenal to task models with finding vulnerabilities that may or may not be correct.

Unless Anthropic discloses exactly what model, harness/scaffolding, prompt, and other engineering they used, these comparisons are pointless. Given the AI labs' general rate of doomsday predictions, who really knows?

papers are always coming out saying smaller models can do these amazing and terrifying things if you give them highly constrained problems and tailored instructions to bias them toward a known solution. most of these don't make the front page because people are rightfully unimpressed