Could this task be a nice benchmark for computer use models?

Would be interesting to see the success rate for Claude Cowork or Codex’s equivalent feature.

Good point, this could be a solid benchmark. The sites are adversarially built to resist automation, and success is verifiable later when the records actually disappear, so it would be harder to game than WebArena.