Ironically, scaling limits and evidence that quality vastly outweighs quantity suggests that all that web data is much less useful than buying and scanning a book. Most work with the Common Crawl data, for example, has ended up focusing on filtering out vast amounts of data as being mostly useless for training purposes.
There was a hot minute in 2023 where it looked like we could just data and compute scale to the moon. Shockingly, it turns out there are limits to that approach.