It is definitely an interesting problem, because Portugal is a small enough country that the actual total corpus of available texts in (non-Brazilian) Portuguese is potentially problematic.

I would have to imagine this might not actually be as bad as it seems; at the very least there should be a giant corpus of translated EU texts.

I don't think so. Portugal the country might be small, with a small population, but there are ~250 million "Lusophones" (native Portuguese speakers), making it the fifth-most spoken native language in the world. I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and speakers understand each other, so it's not like the text from one cannot be used to train a model for the other, or vice versa.

All in all, I don't think that's a major issue here.

The authors are pretty clearly trying to draw only from European Portuguese sources - I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to).

I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).

Man, there’s an attitude up here in Trás-os-Montes that the rest of Portugal has spoken unrecognisable trash for a century. It took me years to realise I’d learned hilariously antique Portuguese by moving there.

Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their concents to retreat to if they so choose.

> I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).

That's easy to say when you're not on the other end of US defaultism.

To be fair, it is only natural: Portuguese itself only came to be because the Roman Empire conquered the Lusitanian lands [1], a lot of English comes from Norman French via the Norman conquest [2], the Americas didn't speak European languages until 500 years ago or so [3], etc.

Given enough time, all languages will change, and some of them will change because of major political upheavals or conquests.

[1]: https://en.wikipedia.org/wiki/Paleohispanic_languages

[2]: https://en.wikipedia.org/wiki/Influence_of_French_on_English

[3]: https://en.wikipedia.org/wiki/Indigenous_languages_of_the_Am...

> That's easy to say when you're not on the other end of US defaultism.

I mean, I’m a Brit who lived a long time in the US, so that’s a dynamic with which I am rather familiar

Right, but most of those speak Brazilian Portuguese. There's so much less European Portuguese text that it becomes impossible for a model not to speak Brazilian Portuguese unless it is trained in a way that ignores Brazilian sources.

You're missing African Portuguese.

Portugal has a growing xenophobic attitude towards immigrants, especially Brazilians, and this is reflected in linguistic prejudice.

They have concerns about Portuguese children learning to "speak Brazilian" because there is a lot more video content being produced in Brazil than in Portugal, and stuff like movies, videogames and software in general is available in a Brazilian localization/adaptation first.

As a Portuguese immigrant who has lived in a few European countries, I find this growing attitude quite sad.

It starts with us emigrating all over the place. When something happens to a Portuguese person abroad due to xenophobic attitudes, it is all over the news, and they squeeze the juice until there is nothing left to talk about.

Then some folks decide to do exactly the same to others who, like us abroad, decide to try their luck in Portugal.

And yes, I have experienced what it means to be shown that Portuguese aren't welcome.

We have the same thing happening, on multiple levels, here too. First, some Spanish parents are afraid the children aren't listening to and watching enough Spanish media. Then additionally, some Catalan parents are afraid the children don't get to use Catalan in school, so they don't become proficient enough to use it in society.

Spain also took the route of dubbing foreign media, whereas Portugal tends to subtitle instead. This sort of exacerbates the situation, since it means that typically any Portuguese dubs of American media will be Brazilian.

AFAIK there is no Brazilian dubbing in Portugal; the only commonly dubbed media are animated movies, and they are always dubbed by Portuguese VAs.

The Catalan situation is completely different and unrelated, being a completely different language and not endangered (with or without scare quotes, as you prefer) by an ex-colony that became independent. Actually many Catalans would like to be such an ex-colony.

> The Catalan situation is completely different and unrelated

I'm not saying it's the same, but there are definitely similarities in that parents are worrying about what language their children use. And yeah, unrelated, I wasn't trying to claim it's the same or better/worse or anything, just another similar situation other (curious) people might want to learn more about, regardless of what you think Catalans want or not.

As a father of 3, I’m kind of guilty of that prejudice myself.

It is not towards Brazilians themselves, whom I frankly respect, but because of the low-quality stuff my kids are exposed to on the internet. You just can’t avoid it, or avoid having the kids gravitate towards YouTube instead of better entertainment channels.

Random dumb YouTubers doing shit for giggles and overly sexualized funk music. And the shorts, oh the shorts crap everywhere, in all languages.

I don’t have a problem with my kids watching “Manual do Mundo” or “Paula stefania”. Or “Porta dos Fundos” myself. Good stuff.

On the other hand, Apple developer relations supports Brazilian Portuguese only, while they do not distinguish between variants of English, French and Spanish: I had to submit an English translation of a plain, simple and clear document because my Portuguese version was rejected.

The whole point of this project is to have an LLM that speaks European Portuguese, not Brazilian Portuguese.

Right, and my point is that if you use 80% Brazilian Portuguese during base model training + 20% European Portuguese as post-training, you pretty much get exactly that, except with a ton more of available training data.
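To make the proposed split concrete, here is a minimal sketch of how a token budget would divide under that 80/20 idea. Everything in it is an assumption for illustration: the `Stage` class, the `split_budget` helper, the stage names and the 1,000,000-token budget are all hypothetical, and it says nothing about whether the resulting model would actually end up sounding European.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str    # hypothetical stage label
    corpus: str  # which variety of Portuguese the stage draws from
    tokens: int  # token budget assigned to the stage


def split_budget(total_tokens: int, base_fraction: float = 0.8) -> list[Stage]:
    """Split a token budget into a base stage and a post-training stage."""
    if not 0.0 <= base_fraction <= 1.0:
        raise ValueError("base_fraction must be between 0 and 1")
    base_tokens = round(total_tokens * base_fraction)
    return [
        Stage("base", "pt-BR", base_tokens),                 # Brazilian-heavy base training
        Stage("post", "pt-PT", total_tokens - base_tokens),  # European-only post-training
    ]


# Illustrative budget only; not a figure from the project.
plan = split_budget(1_000_000)
for stage in plan:
    print(stage.name, stage.corpus, stage.tokens)
```

The only point of the sketch is that the European Portuguese data is held back for the last stage instead of being diluted into the base mix.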

What's your evidence for that?

And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English or a mixture of languages, which is essentially what they did by starting with EuroLLM?

Evidence? Not so much, I didn't realize I was defending a PhD thesis here.

I speak Spanish, and have talked with people who only speak Portuguese, in either of the variants, and have also talked with Portuguese people about how they see their language compared with Brazilian Portuguese, and vice versa. So basically based on vibes and experience.

> And if the first 80% doesn't bias the language after post-training (which I think is what you're claiming) why not go for English

I'm not sure how many languages you speak or encountered in the wild before, but some languages are VERY different from each other, some are a bit different and others are basically the same with some differences. Doing what I describe for languages that are similar is easier than languages that are very different, for what I hope are obvious reasons.

> I'm not sure how many languages you speak or encountered in the wild before, but some languages are VERY different from each other, some are a bit different and others are basically the same with some differences.

I'm a dual citizen of Portugal and Brazil and I live in the US now, so that's my linguistic background. (Also studied bits of French, Russian, Latin and Greek.)

> Doing what I describe for languages that are similar is easier than languages that are very different, for what I hope are obvious reasons.

Not only are your reasons not obvious, your conclusion is actually wrong.

If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals), it might actually make more sense to train it in any other language BUT Brazilian Portuguese (say, English), then fine-tune it for European Portuguese.

LLMs have been shown to be very good at generalizing across languages (the transformer architecture literally comes from work on translators IIRC).

> If the goal is to create an LLM with minimal Brazilian Portuguese bias (which was one of their main goals)

Oh, I wasn't aware that was their goal. It would certainly be intuitive to avoid Brazilian Portuguese if that's the case, although I'm still not sure it actually makes sense to avoid it 100% in pre-training even if you're trying to avoid Brazilian bias; you can "skew" things pretty heavily in post-training if you so wish.

Where can I read more about this goal? It doesn't seem to be mentioned in the submission article, apart from a short off-hand remark about one of the benchmarks, so I'm guessing there is some other resource where they talk about it more specifically?

Never heard the term Lusophone before. TIL

It is broader than that: it means all the countries that have Portuguese as an official language, and there are quite a few.

Usually only Portugal and Brazil come up in conversations.

In reality the list is wider: https://en.wikipedia.org/wiki/Portuguese-speaking_world

African Portuguese is also closer in way of speaking to European Portuguese than to Brazilian Portuguese, as we also tend to share some common slang that comes in from creole.

Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is not so much “US vs British English” as “Guyanese English vs British English”. Like, fundamental points of grammar differ, the spoken rhythm and syllabic stress differ (poetry does not translate well between them), never mind just vocabulary. Continental Portuguese people tend to find it easier to understand brasileiros than vice versa, largely due to mostly one-way cultural exports, but to try to roll both into a single model would create a creole at best.

I agree, they're not the same. But they're far closer than languages that don't come from the same family.

European Portuguese is the 13th most populous language in Europe. Not that small, there are many other European languages in use that are much smaller.

https://en.wikipedia.org/wiki/List_of_languages_by_number_of...

What makes Portugal's situation unique is that it is a small population that is eclipsed in models by the bigger weight of the much bigger population of Brazil.

Yes, there are much smaller European countries, but those are generally the only source of truth for their specific language, so the context of an LLM query in that language steers the LLM towards facts from that country. For example, if I ask a big generic LLM something in Latvian, it will most likely answer with something relevant to the context of Latvia. But Portugal, being by far the smaller user of its language, has the somewhat unique problem that if I ask a generic model something in Portuguese, it will probably answer with something related to Brazil instead of Portugal.

Maybe the UK and Spain have somewhat similar struggles, but I suspect that neither has it as bad as Portugal in that regard.

> Maybe the UK and Spain have somewhat similar struggles, but I suspect that none has it as bad as Portugal in that regard.

There are ~26x more Portuguese speakers worldwide than in Portugal. Only 13x more Spanish speakers worldwide than in Spain. Depending on how you count (English is really widespread as a native-but-second language), there are about 20x more English speakers worldwide than in the UK.

So yes, Portugal has it pretty bad by the numbers.
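For anyone who wants to redo that arithmetic with their own counts, a quick sketch. The `speaker_ratio` helper is just an illustrative name, and the inputs below are the round figures quoted elsewhere in this thread (~250 million Lusophones, 11 million people in Portugal), not whatever counts produced the ~26x above, so the result comes out somewhat lower.

```python
def speaker_ratio(worldwide: float, home_country: float) -> float:
    """How many times more speakers exist worldwide than in the home country."""
    if home_country <= 0:
        raise ValueError("home_country must be positive")
    return worldwide / home_country


# Round figures taken from other comments in this thread, not from a census.
ratio = speaker_ratio(250e6, 11e6)
print(f"{ratio:.1f}x")  # prints "22.7x"
```

Different sources count speakers differently (native vs. total, which year), so the exact multiple moves around, but the ordering of the three languages is what matters for the argument.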

I guess Americanisms bleeding over into English LLMs as used in Britain happens similarly.

Should we also be expecting to see bleed-over of Indian English into generic English LLMs? Or is it not relatively large enough compared to America to force it, unlike Brazil to Portugal?

It is pretty small when considering content output. It is only 11 million people, and only a fraction of them will be writing something that could be used in training datasets. If you look at the countries by scientific contribution, for example [1], Portugal is in 28th position, while Brazil is in 14th with more than double the number of contributions.

Don't get me wrong, it is definitely impressive given Portugal's actual size, but I believe there's a hard limit imposed by population and size that will be difficult to cross.

[1]: https://en.wikipedia.org/wiki/List_of_countries_by_number_of...

> European Portuguese is the 13th most populous language in Europe

that's not impressive

Hello from 23rd