More evidence that the original Transformer authors didn't really know what they were doing, but they did have access to more cheap compute than anyone else.

Can you share the specific part of this work that demonstrates better scaling than the original Transformer? Also note that many of the changes to that architecture that have been proven at actual scale came from members of the original team, most notably Noam Shazeer.