That's not really a fair test because you're leading the model pretty hard, even if the prompt doesn't specifically say there's a bug to be found. It's basically the same objection people raised in the thread where someone claimed current models are just as good as mythos.
Right, exactly, but clearly it's possible to elicit the behavior we want from the model, which means the capabilities are there!
The more interesting question is, how many issues will this prompt report to you in random code that is perfectly fine?
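To make that concrete, here's a rough sketch of how you could measure it: run the prompt over a set of snippets you believe are clean and count how often it reports anything at all. Everything here is hypothetical. `toy_reviewer` is just a stand-in for "send the prompt to the model and parse its findings", and the snippets are made up.

```python
from typing import Callable, Iterable


def false_positive_rate(
    snippets: Iterable[str],
    report_issues: Callable[[str], list[str]],
) -> float:
    """Fraction of known-clean snippets where the reviewer reports at least one issue."""
    snippets = list(snippets)
    if not snippets:
        raise ValueError("need at least one snippet")
    # Every snippet is assumed clean, so any reported issue counts as a false positive.
    flagged = sum(1 for code in snippets if report_issues(code))
    return flagged / len(snippets)


def toy_reviewer(code: str) -> list[str]:
    """Hypothetical stand-in for the model call: flags anything that uses range()."""
    return ["possible off-by-one"] if "range(" in code else []


# Made-up snippets that we assume contain no bugs.
clean_snippets = [
    "x = 1 + 2",
    "for i in range(3): print(i)",
    "def f(a): return a",
    "y = [n for n in range(5)]",
]

fpr = false_positive_rate(clean_snippets, toy_reviewer)
print(f"false positive rate: {fpr:.2f}")
```

The hard part in practice is the "known clean" assumption, since real code that looks fine can still have bugs, so treat a number from something like this as an upper-ish estimate rather than ground truth.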