I don't like that the comparison isn't apples to apples: fewer bits, so of course it'll take fewer tokens.

> Where UUIDs cost ~23 tokens and get hallucinated by LLMs

How does this solve the hallucination problem?

Just removing the - from the example UUID takes it from 26 tokens to 18
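For what it's worth, dropping the dashes is lossless, so it can be undone after inference. A minimal sketch using only Python's standard `uuid` module (it says nothing about token counts, which depend on the tokenizer):

```python
import uuid

original = uuid.uuid4()

# .hex is the same 128-bit value written as 32 hex characters, no dashes.
compact = original.hex

# The UUID constructor accepts the dashless form, so nothing is lost.
restored = uuid.UUID(compact)

print(len(str(original)), len(compact))
print(restored == original)
```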

LLMs are good at predicting words, since each word in the id is ~1 BPE token. But UUIDs are random hex characters, and this is where LLMs struggle to output the right ids.

You can use the .from method to convert a UUID or any text to an id-agent based id: https://github.com/vostride/id-agent/#idagentfrominput-opts

Then do the LLM inference and convert the result back to the UUID.
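The round trip described here (swap ids before inference, swap them back after) can be sketched without the library. This does not use id-agent or its `.from` method; `shorten`, `restore`, and the `id-N` aliases are hypothetical stand-ins that only illustrate the shape of the workflow:

```python
import re

# Matches the canonical dashed 8-4-4-4-12 UUID form.
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)
ALIAS_RE = re.compile(r"id-\d+")


def shorten(text):
    """Replace each distinct UUID with a short alias; return new text and the reverse map."""
    uuid_to_alias = {}
    alias_to_uuid = {}

    def replace(match):
        key = match.group(0).lower()  # UUIDs are normalized to lowercase
        if key not in uuid_to_alias:
            alias = f"id-{len(uuid_to_alias) + 1}"
            uuid_to_alias[key] = alias
            alias_to_uuid[alias] = key
        return uuid_to_alias[key]

    return UUID_RE.sub(replace, text), alias_to_uuid


def restore(text, alias_to_uuid):
    """Swap known aliases back to UUIDs; unknown aliases are left untouched."""
    return ALIAS_RE.sub(lambda m: alias_to_uuid.get(m.group(0), m.group(0)), text)


a = "123e4567-e89b-12d3-a456-426614174000"
b = "9b2f6c1e-0d4a-4f7b-8c3d-5e6f7a8b9c0d"
prompt = f"move {a} to {b}, then delete {a}"

short_prompt, table = shorten(prompt)
print(short_prompt)
print(restore(short_prompt, table) == prompt)
```

An alias the model makes up that isn't in the table comes back unchanged, so a hallucinated id at least shows up as an unresolved alias rather than as a plausible-looking UUID.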

But shouldn't you have picked words that also have a single-token representation when prefixed with a dash? Or are there fewer than 4096 such words? That would get your token count for the 10-word variant (the most honest benchmark) from 17 tokens to 10.

> Just removing the - from the example UUID takes it from 26 tokens to 18

And according to the table below, an id-agent with 120 bits of entropy (still 2 bits fewer than a UUID) uses 17 tokens on average. So unless you purposefully want to reduce the entropy, this whole scheme is about as good as simply removing the dashes from UUIDs. But that wouldn't make for a resume-worthy project (sorry, got a bit cynical there).
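A quick check of that arithmetic, assuming each word is drawn uniformly and independently from a 4096-word list (the figure implied upthread) and taking the token counts quoted in this thread at face value:

```python
import math

WORDLIST_SIZE = 4096      # assumed list size, from the "4096 such words" question
WORDS_PER_ID = 10         # the 10-word variant
UUID4_RANDOM_BITS = 122   # 128 bits minus 4 version bits and 2 variant bits

bits_per_word = math.log2(WORDLIST_SIZE)   # 12 bits per word
id_bits = WORDS_PER_ID * bits_per_word     # entropy of a 10-word id

# Token counts as quoted in the thread, not measured here.
ID_AGENT_TOKENS = 17
DASHLESS_UUID_TOKENS = 18

print(id_bits, UUID4_RANDOM_BITS - id_bits)
print(round(id_bits / ID_AGENT_TOKENS, 2), "bits per token (10-word id)")
print(round(UUID4_RANDOM_BITS / DASHLESS_UUID_TOKENS, 2), "bits per token (dashless UUID)")
```

On those numbers the two come out within a fraction of a bit per token of each other.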