I'm terribly sorry, but scaling curves or GTFO. Any random pile of linear algebra works fine-ish at small scales. Very few random piles of linear algebra push the Pareto envelope at large scales.

Do you want to see scaling curves wrt data and param size? I agree that 1.2B params and 10B tokens is not representative, but what scale of parameters and dataset sizes would be convincing?

Not to sound facetious, but perhaps enough runs at different param/token sizings to define a curve?

Not everyone can afford to spend millions to publish a paper.

That's why you do several small- and medium-scale tests, fit a curve, and ideally show that the trend persists at several scales. Not a single large or medium run - see the other comments downthread for example sizes.
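
For anyone wondering what "fit a curve" could look like in practice, here is a minimal sketch. It assumes a pure power law, loss ≈ a * params^(-b), fitted by least squares in log-log space. The function name `fit_power_law` is made up for illustration, and the loss values are synthetic (generated from a known power law, not taken from any paper). Real fits usually add an irreducible-loss term as well.

```python
import numpy as np

def fit_power_law(params, losses):
    """Fit loss ~= a * params**(-b) via a straight line in log-log space."""
    slope, intercept = np.polyfit(np.log(params), np.log(losses), 1)
    return float(np.exp(intercept)), float(-slope)

# Synthetic example data (an assumption, not real results):
# four model sizes with losses drawn from a known power law.
params = np.array([150e6, 300e6, 600e6, 1.2e9])
losses = 20.0 * params ** -0.1

a, b = fit_power_law(params, losses)
print(f"fitted a={a:.3f}, b={b:.3f}")
```

Running the same fit for a baseline and for the proposed method, then comparing the exponents, gives a rough sense of whether the gap widens or shrinks with scale.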

This exact mentality is cancer for peer review/the industry. We all know who you are if you are using 1000+ TPUs, and yes you do get a "buff" to your peer review scores because people know where you work.

Fuck your scaling curves. More research labs need to #yolo and try stuff that doesn't have good scaling behavior proven yet. State space models have continued to take forever to proliferate despite being objectively good because only the god dang Chinese understand that you actually need to #yolo sometimes, like making some of your layers state space layers, as in Hunyuan-T1.

Scaling curves don't need to be drawn at particularly enormous parameter counts to be useful! If you can do a 300M and 1.2B run (like the authors do here), then you can do 150M, 300M, 600M, and 1.2B runs with only 50% more resources, and get a much better sense for whether effects seem to amplify or diminish as scale increases.
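
A quick sanity check of the "50% more resources" figure, as a sketch under stated assumptions: training cost is approximated with the common rule of thumb of about 6 FLOPs per parameter per token, and the helper names (`train_flops`, `ladder_cost_ratio`) and the two token schedules are hypothetical choices for illustration. With a fixed token budget per run the four-run ladder costs 1.5x the two-run setup; if tokens are scaled in proportion to parameters, the small runs get even cheaper and the ratio drops to 1.25x.

```python
def train_flops(params, tokens):
    # Rule-of-thumb estimate: roughly 6 FLOPs per parameter per token.
    return 6.0 * params * tokens

def ladder_cost_ratio(ladder, baseline, tokens_for):
    """Total cost of the ladder runs divided by total cost of the baseline runs."""
    ladder_cost = sum(train_flops(n, tokens_for(n)) for n in ladder)
    base_cost = sum(train_flops(n, tokens_for(n)) for n in baseline)
    return ladder_cost / base_cost

ladder = [150e6, 300e6, 600e6, 1.2e9]   # sizes suggested in the comment
baseline = [300e6, 1.2e9]               # the two runs the authors did

# Assumption 1: every run sees the same number of tokens (10B here).
fixed = ladder_cost_ratio(ladder, baseline, lambda n: 10e9)

# Assumption 2: tokens scale with parameters (20 tokens per parameter).
proportional = ladder_cost_ratio(ladder, baseline, lambda n: 20 * n)

print(f"fixed tokens: {fixed:.2f}x, proportional tokens: {proportional:.2f}x")
```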

Exactly. Good peer reviewers understand that you can also move down on the scaling curve, not just up. Also laughable to try a "yolo" run without validating a scaling ladder/curve.