Hacker News

Kimi and Qwen come out of China, which means that their training material may be biased, e.g. on topics relating to Taiwan [1]. In addition, there is no way to determine what input went into the training, if it was properly licensed, if it was legal (e.g. not contaminated by CSAM), or how the human component of RLHF was sourced - for US models, for example, stories about exploitation like [2] have been floating around for years.

Assuming we Europeans finally get our act together, I think it is better for our long-term future (and for dealing with the ethical problems) if we manage to build a baseline of training input and data ourselves, from scratch, with everything being ethically sourced.

Oh and, while we're at it, the EU has 24 official languages plus a host of minority languages. Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best. A European model with actual funding and proper data sources might be able to significantly reduce that gap.

[1] https://www.taiwannews.com.tw/news/6245677

[2] https://www.theguardian.com/technology/2024/apr/16/techscape...

vintermann 13 hours ago

The Chinese models are almost certainly taught to comply with "Chinese values" in the RLHF step, not from filtering the training data. There may be a few things which are too radioactive to be allowed even in the training material - but that's more likely to be things like child abuse images for a visual model, things non-Chinese values also have an issue with.

I'm pretty sure no country taking a stab at making their own model for sovereignty purposes will let "proper licensing" stand in their way.

jampekka 13 hours ago

> Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best

Current frontier models (closed and open) are already really good at small languages too. I use them in Finnish sometimes, and the language is immaculate. They understand even somewhat obscure dialects. Multilinguality seems to be a mostly solved problem.

KronisLV 11 hours ago

This already exists https://eurollm.io/

How do people not know about it and keep making stuff from scratch?

Alexander-Barth 9 hours ago

I did not know about EuroLLM. I had a look at the paper (https://arxiv.org/abs/2602.05879) describing it:

Specifically, we discard documents shorter than 200 characters (Xue et al., 2021a), and any page containing the phrase “lorem ipsum,” the word “javascript,” or curly brackets (Raffel et al., 2023)....

It is quite surprising/funny to see all documents with javascript removed.
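For anyone curious how blunt these heuristics are, here is a minimal sketch of the quoted rules in Python. It is not EuroLLM's actual code: the function name `keep_document`, the case-insensitive matching, and the whole-word match on "javascript" are my own assumptions; only the 200-character threshold and the three banned patterns come from the quote.

```python
import re

# Threshold taken from the quoted passage.
MIN_CHARS = 200

# Assumption: "the word javascript" means a whole-word, case-insensitive match.
JAVASCRIPT_WORD = re.compile(r"\bjavascript\b", re.IGNORECASE)


def keep_document(text: str) -> bool:
    """Return True if a document survives the quoted filtering heuristics."""
    if len(text) < MIN_CHARS:
        return False  # shorter than 200 characters
    if "lorem ipsum" in text.lower():
        return False  # placeholder text (case-insensitivity is an assumption)
    if JAVASCRIPT_WORD.search(text):
        return False  # any mention of the word "javascript"
    if "{" in text or "}" in text:
        return False  # curly brackets, a crude proxy for code
    return True


if __name__ == "__main__":
    filler = "This is ordinary prose about European languages. " * 6
    samples = {
        "plain prose": filler,
        "too short": "Hello.",
        "mentions JavaScript": filler + "Please enable JavaScript to continue.",
        "contains braces": filler + "Use {name} as a placeholder.",
    }
    for label, doc in samples.items():
        print(f"{label}: {'keep' if keep_document(doc) else 'drop'}")
```

Under these assumptions, a page is dropped for merely mentioning JavaScript, including "please enable JavaScript" notices and any programming tutorial, and the curly-bracket rule removes most pages containing code.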

gnerd00 20 hours ago

> Most LLMs focus on the English, German, French and Chinese languages, but everything else is... left behind at best.

That is not true, so please read before forming an opinion. The French Mistral project shipped seven+ years ago with 140 languages, for example. Language translation was the first LLM task, from 2015.

selcuka 18 hours ago

One example is not the same as "most LLMs". My experience is the same with most LLMs. Especially the smaller ones are English oriented (probably makes sense given the size constraints).

altmanaltman 17 hours ago

It really doesn't matter if the model sucks and doesn't perform well. Given the funding amount and their lofty ambitions, it seems very unlikely they will be able to pull it off properly.

Yeah, China and US models have biases, but so will any model. The biases do not get in the way of the product though. You don't open those models just to ask what happened in Tiananmen Square or if Taiwan is a state. You don't ask ChatGPT to generate CSAM. But they are very good at the tasks you actually expect from an LLM. If you fail at that, nobody will use your model no matter how "ethically sourced" a colonizer-based entity like Europe made it.

edg5000 15 hours ago

> no matter how "ethically sourced" a colonizer-based entity like Europe made it

The attempt is laughable, but every country should at least try to keep up with frontier technology, even if they fail massively or are massively underfunded.

On the other hand, it's arguably wasteful for an incompetent govt to do something like this, since the money will almost certainly not be well spent. It will just go to people who are good with MS Word. That's the likely failure mode for such NL innovation projects. The actual solution is a culture shift, but that is much harder, if not impossible, to pull off and requires decades. But we (NL people and govt) should work towards that. Most likely all these govt-led innovation attempts are a sad waste of tax money.

bigfudge 13 hours ago

The culture shift that has generated this is the same one that causes the other story on HN this morning about xAI's gas generators being a national security issue. I.e. one towards corruption, graft and the public ill.

I don’t want Europe to model itself on the US, whatever the economic gain. Hopefully we are large enough to find a third way between China and the US.

dr_dshiv 21 hours ago

There is an OCR error rate somewhere north of 8%. That will hurt model quality!

siva7 21 hours ago

Uh, some would say it's easy to determine what input went into the training for Kimi and Qwen, since they were caught stealing it from American labs. Some cultural cliches may never change.

janc_ 20 hours ago

It's well-known that all commercial models are based on stolen content. That doesn't mean there is no filtering/censoring, just that the censoring likely depends on where it's happening…

> It's well-known that all commercial models are based on stolen content.

Does that mean that Chinese models are the "Robin Hood"s of the AI era?

ignoramous 20 hours ago

> since they were caught stealing it from American labs. Some cultural cliches may never change.

Has a formal lawsuit been brought? Given that Anthropic & OpenAI are being dragged through the courts for copyright violation (or stealing, as you'd call it if the companies involved were culturally Chinese) by newspapers, publishing houses etc., one would think they'd pass on some of that medicine to Alibaba, which does have business entities registered in the US.

kouteiheika 15 hours ago

> since they were caught stealing it from American labs

...and the "good guys", the American labs, were caught stealing from authors all over the world[1].

[1]: www.npr.org/2025/09/05/g-s1-87367/anthropic-authors-settlement-pirated-chatbot-training-material

j_french 12 hours ago

.... Anthropic began buying books in bulk, tearing off the bindings and scanning each page before feeding the digitized versions into its AI model, according to court documents.

Wow. This image of Anthropic employees ripping books apart to use them to train models is a powerful one. It seems like an inflection point in the history of information.

basisword 10 hours ago

>> Some cultural cliches may never change.

Let’s just gloss over the monstrous amount of copyrighted and pirated material the American labs trained on. China bad. American good. Some cultural cliches never change.

mschuster91 8 hours ago

How about, both China and the US bad, Europe at least somewhat decent because we lack the financial incentives to behave like utter arseholes?