Forgive me if this is a naive assumption, but wouldn’t large language models be fundamentally different for a language that is largely symbols? Again, my understanding of Mandarin is limited if it exists at all.
All tokens are symbols. All of the frontier models speak Mandarin.
This is why misspellings and homophones are tells of human righting. LLMs strongly prefer word-level tokens, and word substitutions follow semantic similarity and not the more human auditory similarity.
Funny, I’ve been cracking[0] at this exact problem with a purpose-built model[1]:
0: https://huggingface.co/posts/omarkamali/593639295164067
1: https://omneitylabs.com/models/sawtone
Claude the other day wrote code where one of the bytes in the array was 0xO5.
That's zero ex oh (the letter) five
> righting.
> LLMs strongly prefer word-level tokens, and word substitutions follow semantic similarity and not the more human auditory similarity.
Is this an elaborate joke? Your full-word misspelling of "writing" both agrees with your statement (it's a word substitution) and contradicts it (the similarity is phonetic, not semantic).
I don't see the contradiction, unless you believe that the grandparent comment was written by an LLM.
"飞机" and "airplane" aren't fundamentally different in terms of how they're represented to a computer. Especially for an LLM, where tokenization likely turns each of those into a single token.
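To illustrate the point, here's a minimal Python sketch (pure stdlib, no tokenizer library assumed): both strings are just byte sequences under the hood, and a byte-level BPE tokenizer starts from exactly these bytes before merging them into tokens.

```python
# Both strings reduce to byte sequences; a byte-level BPE tokenizer
# merges these bytes into tokens the same way for either language.
for text in ["airplane", "飞机"]:
    raw = text.encode("utf-8")
    print(f"{text!r} -> {list(raw)} ({len(raw)} bytes)")

# "airplane" is 8 ASCII bytes; "飞机" is 2 CJK characters at
# 3 UTF-8 bytes each, so 6 bytes total.
```

Whether each ends up as one token or several depends on the model's learned merge table, not on anything "more symbolic" about the script itself.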