training inspired by nanochat, for diffusion models: https://github.com/ZHZisZZ/dllm

now someone needs to make it work with vLLM or something