In my opinion, we need more models trained on fully traceable and clean data instead of closed models that we later find out were trained on Reddit and Facebook discussion threads.

I want to see something trained _only_ on stuff like encyclopedias, programming books, etc. I'm interested in how different it would be compared to something with a lot of social media in it.

Better to do a fine tune or a LoRA than a full retraining from scratch