The source article claims that small models are not uniformly worse at this, and that they might in fact be better at certain classes of false-positive exclusion. This is what Test 1 seems to show.
(I would emphasize that the article doesn't claim, and I don't believe, that this proves Mythos is "fake" or doesn't matter.)