European Portuguese is the 13th most populous language in Europe. That's not that small; many other European languages in use are much smaller.

https://en.wikipedia.org/wiki/List_of_languages_by_number_of...

What makes Portugal's situation unique is that its small population is eclipsed in models by the much heavier weight of Brazil's far larger population.

Yes, there are much smaller European countries, but those are generally the only source of truth for their specific language, so the context of an LLM query in that language steers the LLM towards facts from that country. For example, if I ask a big generic LLM something in Latvian, it will most likely answer with something relevant to the context of Latvia. But Portugal, being by far the smaller user of its own language, has the somewhat unique problem that if I ask a generic model something in Portuguese, it will probably answer with something related to Brazil instead of Portugal.

Maybe the UK and Spain have somewhat similar struggles, but I suspect that neither has it as bad as Portugal in that regard.

> Maybe the UK and Spain have somewhat similar struggles, but I suspect that neither has it as bad as Portugal in that regard.

There are ~26x more Portuguese speakers worldwide than in Portugal, but only 13x more Spanish speakers worldwide than in Spain. Depending on how you count (English is really widespread as a native-but-second language), there are about 20x more English speakers worldwide than in the UK.

So yes, Portugal has it pretty bad by the numbers.

I guess Americanisms bleed over into English-language LLMs used in Britain in a similar way.

Should we also expect to see Indian English bleed over into generic English LLMs? Or is it not large enough relative to American English to force that, unlike Brazil relative to Portugal?

It is pretty small when considering content output. It is only 11 million people, and only a fraction of them will be writing something that could be used in training datasets. If you look at countries by scientific contribution, for example [1], Portugal is in 28th position, while Brazil is in 14th with more than double the number of contributions.

Don't get me wrong, it is definitely impressive given Portugal's actual size, but I believe there's a hard limit for population and size that will be difficult to cross.

[1]: https://en.wikipedia.org/wiki/List_of_countries_by_number_of...

> European Portuguese is the 13th most populous language in Europe

that's not impressive

Hello from 23rd