Indeed, according to METR, Mythos only achieved an 80% success rate on 3-hour tasks. https://metr.org/time-horizons/

I use both Opus and Fable on tasks that are well beyond "things that would take a human 3 hours".

It fails all the time - as in it ends up doing something I want to change.

But this doesn't actually matter - if it takes 3 or 4 iterations on something that would have taken me a week, it might add up to a day of human work, but that's still 5 times better than doing it by hand.

This seems like the obviously correct frame of mind with which to approach these tools. If it works for three hours on a task that would have taken me three work weeks, and 20% of the time it gets the task wrong, then I can just ask it to do it again with adjusted instructions. It will be much more likely to get it right the second time, and I’m still ahead of where I would have been by 14 days and 2 hours.
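
For anyone who wants to check that arithmetic, here is a rough sketch. It assumes an 8-hour work day and a 5-day work week (the comment states neither), and it treats every attempt as independent with the same 80% success rate, which is probably pessimistic if the adjusted instructions actually help. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the "retry and still come out ahead" numbers.
# Assumptions (not from the comment): 8-hour work days, 5-day work weeks,
# and independent attempts that each succeed with the same probability.

HOURS_PER_WORK_DAY = 8
DAYS_PER_WORK_WEEK = 5


def expected_attempts(success_rate: float) -> float:
    """Mean number of attempts until the first success (geometric distribution)."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


def hours_saved(manual_weeks: float, run_hours: float, attempts: float) -> float:
    """Work hours saved versus doing the task by hand."""
    manual_hours = manual_weeks * DAYS_PER_WORK_WEEK * HOURS_PER_WORK_DAY
    return manual_hours - run_hours * attempts


def as_days_and_hours(hours: float) -> tuple[int, float]:
    """Split a number of work hours into whole work days plus leftover hours."""
    days, leftover = divmod(hours, HOURS_PER_WORK_DAY)
    return int(days), leftover


# Two 3-hour runs against three work weeks: 120 h - 6 h = 114 h.
two_run_savings = as_days_and_hours(hours_saved(3, 3, attempts=2))

# Average case under the independence assumption: 1 / 0.8 attempts per success.
average_run_hours = 3 * expected_attempts(0.8)

print(two_run_savings, average_run_hours)
```

Under those assumptions, two runs leave you 14 work days and 2 hours ahead, which matches the figure above, and the average cost per finished task is under 4 hours of model time.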

Or in two words, managing variance.

Play some hold'em, folks, and keep track of how many times you lose with pocket aces.

Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.

That’s even smaller then!