I think you're missing the point. Everything you said is theoretically correct, but the parent comment was talking about the concrete circumstance of pentesting with the top models today.

Let's just take GPT 5.5 and Opus 4.8 as an example. Both are worse than Mythos 5, but they're capable of quite a bit when the guardrails are lifted and they're paired with a skilled human operator. They're more than "good enough" to reach the same result with the addition of some human effort.