The "best" model finds 4/9 bugs. It would be interesting to see if all models find the _same_ bugs. Does a collection of models exist that can cover all 9?
Also, it seems to me that pointing a model at a bug and asking it to fix it is somewhat easier than what Mythos did, which, if I understand correctly, was to look at a codebase in general and find any bug. Even so, the non-Mythos models only managed to fix 4/9 of these bugs.
I think the article makes the point that Mythos is at a different level.