I have no submission for this, but I joined the hype in my own way by optimizing the training loop. These tiny models are not really well suited to frameworks like PyTorch, and with highly patient AI agents we can now just inline the whole thing into C++ to see what happens, which I do below:

https://www.reidatcheson.com/transformer/llm/ml/cuda%20graph...