The 8% one-shot number is honestly better than I expected for a model this capable. The real question is what sits around the model. If you're running agents in production, you need monitoring and kill switches anyway; the model being "safe enough" is necessary but never sufficient. Nobody should be deploying computer-use agents without observability into what they're actually doing.
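
To make "monitoring and kill switches" concrete, here's a rough sketch of the kind of wrapper I mean. Everything in it is made up for illustration (`MonitoredExecutor`, `is_allowed`, the action dict shape); it's not any particular framework's API, just one way to put an audit log and a hard stop between the model and the machine:

```python
import threading
from dataclasses import dataclass, field
from typing import Callable


class KillSwitchTripped(RuntimeError):
    """Raised when the executor refuses to run any further actions."""


@dataclass
class MonitoredExecutor:
    # Hypothetical hooks: `execute` performs the action, `is_allowed` is your policy check.
    execute: Callable[[dict], str]
    is_allowed: Callable[[dict], bool]
    # Operators (or the policy check) can set this event to halt the agent.
    kill: threading.Event = field(default_factory=threading.Event)
    # In-memory audit trail; a real deployment would ship this to external storage.
    audit_log: list = field(default_factory=list)

    def run(self, action: dict) -> str:
        # Refuse everything once the switch has been tripped.
        if self.kill.is_set():
            self.audit_log.append(("refused", action))
            raise KillSwitchTripped("kill switch is set")
        # A policy violation trips the switch so later actions are refused too.
        if not self.is_allowed(action):
            self.audit_log.append(("blocked", action))
            self.kill.set()
            raise KillSwitchTripped(f"policy violation: {action!r}")
        # Record the action before executing, so the log survives a crash mid-action.
        self.audit_log.append(("allowed", action))
        return self.execute(action)


# Example wiring with a toy policy that forbids shell actions.
executor = MonitoredExecutor(
    execute=lambda action: "ok:" + action["type"],
    is_allowed=lambda action: action["type"] != "shell",
)
```

The point is that the check and the log live outside the model, so they still work when the model is the thing that's wrong.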