I don't really understand what you mean by this. The claim is that the same prompt with the same question produces worse results when the model's context already holds more than 200k tokens. That doesn't have much to do with how skillfully the model is being used.