On the topic of cost per token: is it accurate to think of a token as, ideally, a composable atomic unit of information? But because we're (often) using English as the encoding format, it can only be as efficient as English can encode the data.
Does this mean that other languages might offer better information density per token? And does this mean that we could invent a language that’s more efficient for these purposes, and something humans (perhaps only those who want a job as a prompt engineer) could be taught?
Kevin speak good? https://youtu.be/_K-L9uhsBLM?si=t3zuEAmspuvmefwz
But also, arguably, Lojban is the language you want to use for LLMs. Especially for the chain of thought.
And the interesting property of Lojban is that it has an unambiguous grammar that can be syntax-checked by tools, enforced by schemas, and machine-translated back to English. I experimented with it a bit and found that large SOTA models can generate reasonably accurate translations if you give them tools like a dictionary and a parser and tell them to iterate until they get a syntactically valid translation that parses into what they meant to say. So perhaps there is a way to generate a large enough dataset to train a model on; I wish I had enough $$$ to try this on a lark.
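For anyone curious, the iterate-until-it-parses loop is roughly this sketch. The `camxes` invocation and `generate_translation()` are stand-ins for whatever Lojban parser and model API you actually have, so treat the exact calls as assumptions, not a working recipe:

    import subprocess

    def parses_as_lojban(text: str) -> bool:
        # Hypothetical invocation: pipe the candidate sentence into a
        # command-line Lojban parser and treat exit code 0 as "valid".
        result = subprocess.run(["camxes"], input=text,
                                capture_output=True, text=True)
        return result.returncode == 0

    def translate_to_lojban(english: str, generate_translation, max_attempts: int = 5):
        prompt = f"Translate to Lojban: {english}"
        for _ in range(max_attempts):
            candidate = generate_translation(prompt)
            if parses_as_lojban(candidate):
                return candidate
            # Feed the failure back so the model can iterate, as described above.
            prompt = (f"That didn't parse. Try again.\n"
                      f"English: {english}\nPrevious attempt: {candidate}")
        return None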
Human speech has a bit rate of around 39 bits per second, no matter how quickly you speak. Assuming reading is similar, I guess more "dense" tokens would just take longer for humans to read.
https://www.science.org/content/article/human-speech-may-hav...
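Back-of-the-envelope (the bits-per-token figure is just an assumption for illustration, not a measured value): at ~39 bits/s, packing more information into each token only slows the reader down per token.

    BITS_PER_SECOND = 39     # approximate human speech information rate
    bits_per_token = 12      # assumed information content of one "dense" token

    tokens_per_second = BITS_PER_SECOND / bits_per_token
    print(f"~{tokens_per_second:.1f} tokens/s")                       # ~3.2 tokens/s
    print(f"~{500 * bits_per_token / BITS_PER_SECOND:.0f} s")         # ~154 s for a 500-token prompt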
Sure, but that link has Japanese at 5 bits per syllable & Vietnamese at 8 bits per syllable, so if billing were based on syllables per prompt, you'd want Vietnamese prompts.
Granted, English is probably going to give better-quality output based on training data size.
In principle, yes. And you could do the same for programming languages.
In practice, the problem is that any such constructed language wouldn't have a corpus large enough to train on.
It's really unfortunate that we ended up with English as the global lingua franca right at the time generative AI came about, because it is effectively cementing that dominance. Even Chinese models are trained mostly on English AFAIK.
English often has a lot of redundancy; you could rewrite your comment like this and still have it convey the original meaning:
Regarding cost per token: is a token ideally a composable, atomic unit of information? Since English is often used as an encoding format, efficiency is limited by English's encoding capacity.
Could other languages offer higher information density per token? Could a more efficient language be invented for this purpose, one teachable to humans, especially aspiring prompt engineers?
67 tokens vs 106 for the original.
Many languages don't have articles; you could probably strip them from this and still understand what it's saying.
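If you want to reproduce the counts, something like this with OpenAI's tiktoken works. Exact numbers depend on which encoding you pick, so they may not match the 67/106 above exactly:

    import tiktoken

    original = "On the topic of cost per token, ..."  # paste the full original comment here
    rewrite = "Regarding cost per token: ..."         # paste the condensed version here

    enc = tiktoken.get_encoding("cl100k_base")  # a different encoding would give different counts
    print(len(enc.encode(original)), len(enc.encode(rewrite)))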
IIRC, in linguistics there's a "uniform information density" hypothesis that languages seem to follow at the human level (denser languages are spoken more slowly, sparser languages more quickly), so you might have to go for an artificial encoding that maps efficiently to English.
English (and any of the dominant languages you could use in its place) works significantly better than other languages purely by having a significantly larger body of work for the LLM to draw from.
Yeah I was wondering about it basically being a dialect or the CoffeeScript of English.
Maybe even something anyone can read and maybe write… so… Kevin English.
Job applications will ask for how well one can read and write Kevin.
Sure; for example, Korean is Unicode-heavy, e.g. 경찰 = police, but it's just 2 Unicode chars. I'm not too familiar with how things are encoded, but it could be more efficient.
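Character count and token count aren't the same thing, though: a BPE tokenizer trained mostly on English text will often split Korean or Vietnamese into more tokens per character than English, so Unicode compactness doesn't automatically mean cheaper prompts. A quick way to check, assuming tiktoken and the cl100k_base encoding (actual counts vary by tokenizer):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["police", "경찰", "cảnh sát"]:
        toks = enc.encode(text)
        print(f"{text!r}: {len(text)} chars -> {len(toks)} tokens")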