GLM 5.2 and DeepSeek V4 Pro seem to approach security research differently. The benchmark linked here was run with GLM 5.1, but the patterns are similar: https://dualuse.dev/posts/deepseek-v4-thinks-different
Overall, I still think GLM 5.2 is the much stronger performer. It's hard to tell the difference between GLM 5.2 and Opus at <120k tokens.
I have found that some models consistently find or miss specific bugs, and which bugs are hard doesn't completely line up across models, so I believe that. I just completely refactored the security bug-finding harness I've been working on (not checked in yet, currently testing it) to strongly encourage "multi-model, multi-pass" scans and make them easy to orchestrate, with de-dupe and false-positive weeding done by a strong model, rather than using one model or doing just one pass over each file. Giving a model a second attempt increases its findings by 20-30%, and a third attempt adds another 10-15%.
I'm inclined to use DeepSeek V4 Pro the most, because it is consistently extremely strong, very fast, and very cheap, with excellent caching and cheap-as-free cached input tokens (something like 80% of token usage is cached when I'm using it for security scanning). So my probable "pair" of frontline security researchers will be DeepSeek V4 Pro and a self-hosted Gemma 4 31B (another shockingly strong contender, competitive with the best models once you let it loop on the same file a few times). But I won't be surprised if GLM 5.2 turns out better than DeepSeek V4 Pro, though it costs quite a bit more.
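For anyone who wants to picture the "multi-model, multi-pass" merge step, here is a minimal sketch. It is not the harness described above (that isn't public); `Finding`, `Scanner`, and `multi_pass_scan` are hypothetical names, the scanners are stand-ins for real model calls, and the de-dupe is a naive exact match on (path, line, kind) rather than a strong model judging duplicates.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Set


@dataclass(frozen=True)
class Finding:
    """Hypothetical finding record; frozen so it can be used as a dict key."""
    path: str
    line: int
    kind: str  # e.g. "overflow", "use-after-free"


# A scanner takes (file path, pass number) and returns findings.
# In a real harness this would wrap a model API call (assumption).
Scanner = Callable[[str, int], Iterable[Finding]]


def multi_pass_scan(
    path: str, scanners: Dict[str, Scanner], passes: int = 3
) -> Dict[Finding, Set[str]]:
    """Run every scanner `passes` times on one file and merge the results.

    Returns each unique finding mapped to the set of scanner names that
    reported it. Exact-match de-dupe only; near-duplicates are not merged.
    """
    merged: Dict[Finding, Set[str]] = {}
    for name, scan in scanners.items():
        for pass_number in range(passes):
            for finding in scan(path, pass_number):
                merged.setdefault(finding, set()).add(name)
    return merged
```

The merged dict is what you would then hand to a strong model for false-positive weeding and prioritization. Findings reported by several models could reasonably be ranked higher, but that is a design guess, not something measured here.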
So it's like: run 3 loops of “here's the project, find bugs” with all the good models, then dedupe and prioritize with a SOTA model?
The loop is "look at this file in this repo, find bugs" iterated over every file in a project, with the ability to look at the rest of the repo for cross-file bugs related to the file the model was told to focus on, but yes. The Anthropic folks have basically said that's how they're doing security audits (Nicholas Carlini is an Anthropic employee and he's done talks about it), so I assume that's how Mythos found its bugs.
I've benchmarked it, and the "here's a repo, find bugs" approach finds far fewer bugs. Like, dramatically fewer. Models are good and contexts have expanded, but focus still wins with hard problems. You could probably tell the good models to make a plan to audit the repo, and it would end up making its own "loop" in the form of a checklist of files to look at over several sessions or via subagents, I assume.
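To make the per-file loop concrete, here is a rough sketch of how the focused prompts could be generated. The function name, the suffix list, and the prompt wording are all assumptions for illustration; the actual harness and the Anthropic workflow may look quite different.

```python
from pathlib import Path
from typing import Iterator, Set, Tuple

# Assumed set of extensions worth scanning; adjust per project.
SOURCE_SUFFIXES = {".c", ".h", ".py", ".go", ".rs", ".js"}


def focused_prompts(
    repo_root: str, suffixes: Set[str] = SOURCE_SUFFIXES
) -> Iterator[Tuple[str, str]]:
    """Yield (relative_path, prompt) for every source file in the repo.

    Each prompt points the model at one file while still allowing it to
    read the rest of the repo for cross-file issues.
    """
    root = Path(repo_root)
    for path in sorted(root.rglob("*")):
        rel = path.relative_to(root)
        if not path.is_file() or path.suffix not in suffixes:
            continue
        if ".git" in rel.parts:  # skip version-control internals
            continue
        rel_str = rel.as_posix()
        yield rel_str, (
            f"Look at the file {rel_str} in this repo and find security bugs. "
            "You may read other files in the repo to confirm cross-file "
            "issues related to this file."
        )
```

Each yielded prompt would be one model session (or one subagent task), repeated per model and per pass.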
Ah this is an important distinction, thanks!
Not sure if this is helpful, but in my experience, when something a bit more complex needs to be done, manually making the model read the context I know it will need (like having it consume all the project docs first) gives a more satisfactory result than only giving it the task and letting it look around for whatever context it thinks it needs.
Will test your bug finding method in a current project of mine both with my "manual" context preloading and without.