That's a fair point TBH. I said in my post that this LLM is first of all a learning project, and I did skip an important step: the training loop. But on the other hand, how many data scientists write their own training loops? Is it even worth it? And how much learning do you want from one project, I mean, where do you stop? Why use Hugging Face Transformers when you can write it from scratch, for learning? Why use Torch when you can write it from scratch, for learning? Why use Python when you can write it in C? And so on. It's all cheating, right? In my case, I decided to skip the training loop and focus on the data processing, the hyperparameters, and the rest of the higher-level steps, which took a ton of time anyway, and that reduced the friction. I do get your point tho. Now that I know how to train an LLM, maybe I'll write a training loop from scratch as a separate project, to learn how to do it.