The research from METR, and my comment, are exclusively about software development tasks.

Re-reading my comment, I realise I left out the most important part: the question.

What examples can you give of "real world situations" where they fail?

Obviously I don't want to use them for whatever that is.