> There's probably some interesting new architectures already in the works either from postdocs or in tiny startups
It is not clear to me why we will have a breakthrough after virtually no movement on this front for decades. Backpropagation is literally 1960s technology.
Your comment sort of implies that all of this is some super standardized flow that is well studied and highly optimized, but in my experience most ML work is closer to the edge of being broken than to any kind of local maximum.
An ungodly number of engineering decisions go into making ML work, and any number of stupid things lying around can cause it to fail.
Something stupid like: your normalization was bad, your data mix was bad, your learning rates were bad, you have precision issues, your model has a bad init, some architectural problem quietly hurts training, or there are straight-up bugs somewhere, like your batching doing something silly or a numerically unstable division or sqrt, etc.
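To make the "numerically unstable division or sqrt" point concrete, here is a toy PyTorch sketch (hypothetical code, not from any real training run) of how a misplaced epsilon in an RMSNorm-style layer fails silently:

```python
import torch

# Hypothetical toy example: an RMSNorm-style layer where the epsilon is
# missing from the denominator. For all-zero rows (e.g. fully padded
# positions) the naive version divides 0 by 0 and produces NaNs that then
# spread through training without any exception being raised.

def rms_norm_buggy(x):
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True))

def rms_norm_fixed(x, eps=1e-6):
    # eps inside the sqrt keeps the denominator strictly positive
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

x = torch.zeros(2, 8)        # stand-in for padded sequence positions
print(rms_norm_buggy(x))     # all NaN
print(rms_norm_fixed(x))     # all zeros, as intended
```

Nothing errors out; the run just quietly degrades or NaNs later, and you get to spend a day bisecting which of the last hundred changes did it.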
At scale, with stupid issues like hardware faults added on top, I imagine this only gets exponentially worse.
And then on the product side, when integrating everything, more bugs sneak in: so many labs released open-source LLMs with broken or incorrectly configured chat templates that massively tanked performance.
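As a sketch of how easy that chat-template failure mode is to hit (the model id below is a placeholder and the strings are illustrative), compare what the tokenizer's own template produces with a plausible hand-rolled prompt:

```python
from transformers import AutoTokenizer

# Placeholder model id, standing in for any chat-tuned release.
tok = AutoTokenizer.from_pretrained("some-org/some-chat-model")

messages = [{"role": "user", "content": "What is 2 + 2?"}]

# What the model was actually trained on: its own chat template, with the
# right special tokens and the generation prompt appended.
correct = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# A plausible-looking hand-rolled prompt that silently mismatches it.
broken = "User: What is 2 + 2?\nAssistant:"

print(correct)
print(broken)
```

The broken version still produces answers, just noticeably worse ones, which is exactly why it ships.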
Or they set some sampling parameters wrong and generation gets stuck in loops or hallucinates constantly.
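A sketch of what that looks like in practice, using the common Hugging Face generate() kwargs (values are illustrative, not taken from any particular release):

```python
# Greedy decoding with no repetition penalty: prone to degenerate loops.
loop_prone_sampling = dict(
    do_sample=False,
)

# A more typical configuration for chat-style generation.
reasonable_sampling = dict(
    do_sample=True,
    temperature=0.7,          # some randomness breaks degenerate cycles
    top_p=0.9,                # nucleus sampling trims the unreliable tail
    repetition_penalty=1.1,   # mild penalty against verbatim repetition
)

# Usage would be something like: model.generate(**inputs, **reasonable_sampling)
```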
In his 2025 Hot Chips keynote, Noam Shazeer (GDM VP) even says that you need hardware determinism because there are just so many bugs in ML experiments that you need to be able to tweak things and test them reproducibly.
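You obviously can't get the hardware-level determinism he is talking about from user code, but as a rough software-level sketch of what "rerun it and get the same bits" buys you when debugging in PyTorch (standard flags, nothing exotic):

```python
import torch

# If reruns are bit-identical, a bad metric can be bisected to an actual
# code or data change instead of being written off as run-to-run noise.
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)   # error out on nondeterministic kernels
torch.backends.cudnn.benchmark = False     # disable per-run kernel autotuning
```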
Also, there are just so many known issues with the conventional GPT-2-style setup: softmax causing attention sinks at punctuation and dispersing attention over longer sequences because of low sharpness, and the whole privileged-basis thing making common information take up a lot of model capacity.
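The dispersion part is easy to see with a toy calculation (purely illustrative: i.i.d. random logits standing in for one query's attention scores):

```python
import torch

torch.manual_seed(0)

def attn_entropy(n, scale=1.0):
    # Toy attention scores for a single query over n keys.
    logits = scale * torch.randn(n)
    p = torch.softmax(logits, dim=-1)     # softmax forces the weights to sum to 1
    return -(p * p.log()).sum().item()    # entropy in nats

for n in (16, 256, 4096):
    print(n, round(attn_entropy(n), 2))   # entropy climbs roughly like log(n)
```

With fixed-scale logits the attention entropy grows roughly like log(n), which is the low-sharpness problem in miniature, and the sum-to-1 constraint is part of why heads end up parking spare mass on sink tokens like punctuation or the first token.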
I'd like to add that in the recent Y Combinator podcast with Anthropic's head of pretraining, bugs are brought up as a major problem [1].
It is so easy to have good ideas broken by random bugs everywhere...
[1] https://youtu.be/YFeb3yAxtjE?t=2919
Because tremendous rewards will spur a huge increase in research?