It seems harsh to critique guardrails and take them into account in the scoring when GPT-5.5 seems to have been explicitly whitelisted to remove most of said guardrails. A fairer comparison would be a vanilla GPT account.

I agree fully and hope someone else is able to do this test! For me it was a matter of cost and quotas that stopped me from changing to a new account.

Also just to mention:

Claude guardrails -> that session is terminated.

GPT guardrails -> your whole account is slowed down.

Does it matter when you can’t have the Opus 4.8 guardrails removed? With GPT at least you can, and they’re quick about it.

I mean, yes. Most people aren’t security researchers, and either way it’s apples to oranges at that point if you’re counting “the guardrails stopped me” as a negative for one but not the other.

But should developers be barred from asking an LLM to try to secure their own app? It’s no different from finding exploits...

That is a completely separate question and discussion.