Can I ask a stupid question? Why is this so much worse than the crawling that traditional search engines were already doing to gather articles? I assume they are gathering pretty much the same data? It is the same articles, no?

I just realized these are outbound requests made by the LLM on behalf of the client. I can see how this is problematic, but it does seem like there should be a way to cache that.
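For what it's worth, here is a minimal sketch of the kind of caching I mean, in Python using only the standard library. The cache directory and the one-day TTL are just assumptions for illustration, not anything the providers actually do:

```python
import hashlib
import time
import urllib.request
from pathlib import Path

CACHE_DIR = Path("page_cache")   # hypothetical cache location
TTL_SECONDS = 24 * 3600          # assumed freshness window: one day

def fetch_cached(url: str) -> bytes:
    """Return the page body, re-downloading only when the cached copy is stale."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(url.encode()).hexdigest()
    path = CACHE_DIR / key
    # Serve from disk if we fetched this URL recently enough.
    if path.exists() and time.time() - path.stat().st_mtime < TTL_SECONDS:
        return path.read_bytes()
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    path.write_bytes(body)
    return body
```

Even a dumb TTL cache like this would collapse repeated hits on the same source into one fetch per day.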

No, the traffic is not caused by client requests (like when your ChatGPT session does a search and checks some sources); it is caused by training runs. The difference is that AI companies are not storing the data they scrape: they let the model ingest the data, then throw it away. When they train the next model, they scrape the entire Internet again. At least that's how I understand it.

There are many factors, but the largest is that there weren't many search companies, and they weren't that well capitalised. This meant there wasn't much competition for "freshness" in your results. There are many, many AI companies, and even more AI data companies providing the data to those doing the actual training.

Finally, search engines don't actually cache all the text; they do something akin to computing embeddings/keywords, plus things like PageRank, which only uses links. AI companies, however, want ALL the text/image/video data, and it's too expensive to store it all. It is, however, cheap to download it every time you need it. (Data ingress is usually free, as opposed to data egress.)
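To make the "PageRank just uses links" point concrete, here's a toy power-iteration version in Python. The link graph and damping factor are made up for illustration; notice that the page text never appears anywhere, which is part of why a search index can be far smaller than the raw crawl:

```python
# Toy link graph: page -> pages it links to (made-up example).
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over the link graph; no page content is ever read."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

print(pagerank(links))  # "c" wins: everything links to it, directly or not
```

The contrast with training crawlers is that they want the actual bytes of every page, every run, so there's nothing comparably compact for them to keep around.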