This model is a waste of public funds.
There is no public website to use it, free or paid; the dataset is not public; the code is not public (the GitHub URL in the article returns 404); and the claimed model intelligence is so low that it is pretty much useless at 32K context and massively inferior to GPT‑4o.
As per tradition in Portugal, some people managed to get 5.5 million to produce nothing, and no one is asking questions.
You want a better idea? Just fine-tune the open-source Kimi 2.6 on an open-source Portuguese dataset; the cost would be under a million and we would get something useful.
It would be really nice to know what happened to the 5.5 million, given that they cannot even provide a functional website to use the model.
As a Portuguese person living abroad, I thought exactly the same thing while reading the article, even though I wasn't even aware of the project.
As a pt-BR speaker from across the pond: https://soberania.ai/
Similar waste.
Interestingly, the mobile version of the website contains a hamburger menu with an "Equipe" (Team) link that returns a 404 error (https://soberania.ai/equipe).
This link is absent from the desktop version.
Isn't it a bit odd that the team responsible for it is nowhere to be credited?
Given that their publication says the dataset is freely available on Hugging Face, that's at least something, I guess.
Why?
It’s a way to suck all the money out of the room in the name of nationalism, and it’s happening all over Europe. It’s the only idea everyone seems to have had.
I'm not arguing with the rest of your points, but...
> Just fine tune the open source Kimi 2.6 with an open source Portuguese dataset
I think that the tokenizers of all popular models are heavily biased towards English (or English and Mandarin).
And I don't think it is possible to replace the tokenizer without a full retraining.
You are right that most tokenizers are heavily biased towards English, but the situation is not so bad for Portuguese. Here are some results on the Goldfish corpus [1] with a few different tokenizers. This measures #subwords in tokenized corpus / #characters in corpus, so lower is better.
```
             Llama3   Gemma4   Kimi2.6
english      0.216    0.219    0.214
portuguese   0.285    0.246    0.310
italian      0.287    0.249    0.308
greek        0.592    0.537    0.716
```
Portuguese is certainly worse than English, but it is on par with Italian (which I think has more overlap with English) and much better than Greek (which doesn't use the Latin script and is definitely not prioritized in tokenizer construction).
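For what it's worth, a ratio like this is easy to reproduce yourself. Here is a minimal sketch using the Hugging Face `transformers` library; the model name and sample sentences are placeholders I picked for illustration, not the Goldfish setup behind the numbers above, so the values will differ.

```
# Rough sketch: compute (# subword tokens) / (# characters) per language.
# Model and sample texts are illustrative placeholders only.
from transformers import AutoTokenizer

def subwords_per_char(tokenizer, texts):
    """Return number of subword tokens divided by number of characters."""
    n_tokens = sum(len(tokenizer.encode(t, add_special_tokens=False)) for t in texts)
    n_chars = sum(len(t) for t in texts)
    return n_tokens / n_chars

samples = {
    "english": ["The weather is nice today and we are going for a walk."],
    "portuguese": ["O tempo está bom hoje e vamos dar um passeio."],
}

# Any tokenizer works here; this repo is gated, so substitute your own.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
for lang, texts in samples.items():
    print(lang, round(subwords_per_char(tok, texts), 3))
```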
On your second point, tokenizer transfer allows for extending/modifying a tokenizer without retraining the model from scratch. The simplest version of this is tokenizer extension + continual pretraining, where you just add a bunch more tokens to the vocab for the language/domain that you want to improve and train a little more. It's been done for Japanese [2] and Indic languages, but afaik not Portuguese.
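As a minimal sketch of what tokenizer extension looks like in practice with Hugging Face `transformers` (the base model and the added tokens below are placeholders; a real extension would mine thousands of subwords from a Portuguese corpus and then continue pretraining):

```
# Sketch: tokenizer extension + embedding resize. Model name and new tokens
# are illustrative placeholders, not a recommended recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Meta-Llama-3-8B"  # placeholder; any open-weights causal LM
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical Portuguese-heavy subwords mined from a target corpus.
new_tokens = ["ção", "ções", "mente", "português"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so the new ids get (randomly initialised) rows;
# those rows are then learned during the continual pretraining phase.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size: {len(tokenizer)}")
```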
So I think continual pretraining of a large base model would probably have been fine for this case, with huge cost savings. But it is good to have the ability to train your own base models, so I don't think this is such a bad idea.
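The continual-pretraining step itself is not exotic either. A rough sketch with `transformers`/`datasets`, where the corpus file, base model and hyperparameters are all placeholders chosen for illustration:

```
# Sketch: continual pretraining on a Portuguese text corpus.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder corpus file; in practice a large cleaned Portuguese crawl.
raw = load_dataset("text", data_files={"train": "portuguese_corpus.txt"}, split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="pt-continual",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,   # small LR to limit forgetting in the base model
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```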
-----------------------
[1]: https://huggingface.co/datasets/goldfish-models/fish-food
[2]: https://arxiv.org/abs/2404.17790