depends on bet size. small, scoped tasks with tight specs: agents are reliable. "build this feature" with no constraints: yeah, that's gambling. I'm 90% sure most agent failures I see come from vague task definitions, not model limitations. basically the fix is better scoping, not better models