> Couldn't LLM providers just fine-tune their models for these tasks specifically - since they are static - to get ad value?

Yes, this is an inherent problem with the whole idea of LLMs. They're pattern-recognition "students", but the important thing, the one all the providers like to sell, is their reasoning. So a good test is a reasoning test. I'll try to find a link and update with a reference.