Someone should try this anywhere from 10 to 1,000 times per model and compare the results. Then we could come up with an average success/fail rate for each one.
Since responses to the same prompt are non-deterministic, sharing your anecdotes is fun, but it doesn't say much about the models' abilities.
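
For what it's worth, here is a rough sketch of what that comparison could look like. It assumes every run can be graded automatically as pass/fail. The names `success_rate`, `wilson_interval`, and `fake_model` are made up for illustration, and `fake_model` is only a random stand-in for a real model call (its pass probabilities are invented, not measurements). The Wilson score interval is just one common way to show how uncertain the average is at a given number of runs.

```python
import math
import random
from typing import Callable, Tuple


def success_rate(trial: Callable[[], bool], n: int) -> float:
    """Run `trial` n times and return the fraction of runs that passed."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sum(1 for _ in range(n) if trial()) / n


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate (z = 1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def fake_model(p_pass: float, rng: random.Random) -> Callable[[], bool]:
    """Hypothetical stand-in for 'send the prompt, grade the answer'."""
    return lambda: rng.random() < p_pass


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the simulation is repeatable
    # Assumed pass probabilities, purely for the simulation.
    models = {"model_a": 0.6, "model_b": 0.7}
    for n in (10, 1000):
        for name, p_pass in models.items():
            rate = success_rate(fake_model(p_pass, rng), n)
            low, high = wilson_interval(round(rate * n), n)
            print(f"n={n:4d} {name}: {rate:.2f} (95% CI {low:.2f} to {high:.2f})")
```

At 10 runs the interval is wide enough that two fairly different models can look the same, which is basically the problem with anecdotes. With real models you would replace `fake_model` with an actual API call plus a grader.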