I used to think this, but no one I have read believes data is the problem.

Amodei explains it with a chemical reaction analogy: if data, model size, and compute all scale up linearly together, then the reaction happens.

I don't understand why data wouldn't be a problem, but it seems like if it were, we would have run into it already, and any shortfall seems to have already been overcome with synthetic data.