I think at this point what the Netherlands, and any other country that wants a good model in its language, should do is gather up every piece of text ever written in that language and license it to the big AI labs/companies for training. I'm sure there are vast libraries of books and other text that haven't been digitized and aren't a priority for the big labs.
I think they should just treat it as a national security matter and gather every piece of text in every language.
Yeah, except replace "license it to the companies for training" with "pay the companies to train on it".
Oh, I didn’t mean charging them at all. I meant licensing in the sense of granting rights for the purpose of training. Most labs would probably be fine adding the language to their training for free, as long as the dataset quality is high and it improves the results. But yes, pay them if that’s what it takes for them to use it.