Hacker News

As others have pointed out, these phenomena are well known to many folks across companies in the AI infra space. It doesn't really break new ground. This article is a good exposition of the basic strategies though.

What I would have loved is a discussion around collectives/multi-node setups. And showing how to get determinism at low performance penalty for multi-node reduction collectives.