Scaling curves don't need to be drawn at particularly enormous parameter counts to be useful! If you can afford a 300M and a 1.2B run (as the authors do here), then you can afford 150M, 300M, 600M, and 1.2B runs for only about 50% more compute (assuming cost scales roughly linearly with parameter count at a fixed token budget), and get a much better sense of whether effects amplify or diminish as scale increases.
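
A quick sanity check of the 50% figure, as a rough sketch rather than a real cost model. It assumes training compute is proportional to parameters times tokens (the common C ≈ 6·N·D rule of thumb, whose constant cancels in the ratio). The function name `relative_cost` and the second scenario, where the token budget grows in proportion to model size, are illustrative assumptions and not something the paper or the comment above specifies.

```python
def relative_cost(param_counts_m, tokens_scale_with_params=False):
    """Relative training cost of a set of runs, in arbitrary units.

    param_counts_m: model sizes in millions of parameters.
    tokens_scale_with_params: if False, every run sees the same number of
        tokens, so cost is proportional to N. If True, tokens grow in
        proportion to N, so cost is proportional to N**2 (an assumption).
    """
    exponent = 2 if tokens_scale_with_params else 1
    return sum(n ** exponent for n in param_counts_m)


baseline = [300, 1200]            # the two runs mentioned above
ladder = [150, 300, 600, 1200]    # the four-point scaling ladder

# Extra compute needed for the ladder, as a fraction of the baseline cost.
fixed_tokens_overhead = relative_cost(ladder) / relative_cost(baseline) - 1
scaled_tokens_overhead = (
    relative_cost(ladder, tokens_scale_with_params=True)
    / relative_cost(baseline, tokens_scale_with_params=True)
    - 1
)

print(f"fixed token budget:       +{fixed_tokens_overhead:.0%}")
print(f"tokens scale with params: +{scaled_tokens_overhead:.0%}")
```

With a fixed token budget the ladder costs 2250 units against 1500 for the baseline, which is the 50% quoted above. If the token budget instead grows with model size, the small runs get cheaper still and the overhead drops to 25%.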

Exactly. Good peer reviewers understand that you can also move down the scaling curve, not just up. It's also laughable to attempt a "yolo" run without first validating a scaling ladder/curve.