That is my experience too. I don't know what others are building, but the more novel the task is, the worse these models perform.