The ease of dealing with arbitrary bit-width integers and packed structs is actually one of the 'killer features' for me in zig.
Zig natively supports arbitrary bit-width integers. The ABI is defined, and you can simply think of one as a slice of the next larger backing integer.
The [3]u8 to u24 bitCast will simply be backed by a 32-bit int, using the same ABI. As you have u1 through u65535, sometimes the backing storage is multiple words.
The 24 Bits (3 Bytes) [3]u8 to u24 example is exactly related to utf-8 that covers all the languages but excludes the emojis.
There are very valid use cases when you want to limit utf-8 to U+0000-U+FFFF, and it is valuable if your language allows you to make those decisions.
Remember, in zig packed structs are just integers and integers are just a group of logically consecutive bits.
Arrays like []u24 do not have the same ABI: arrays are not bit/byte packed, are not universally LSB across archs, etc.
The compiler isn't producing unaligned code, don't confuse the abstraction with the concrete implementation. And yes [8]u1 and [8]u8 are exactly the same size and shape, even though they are arrays.
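A rough illustration of the points above. This is only a sketch, assuming Zig 0.11 or later (where `@bitCast` takes its result type from context); the `Header` struct, its fields, and the helper names are made up for the example and do not come from any real format.

```zig
const std = @import("std");

// Hypothetical header, not taken from any real file format.
// 4 + 4 + 16 = 24 bits, so the backing integer is a u24.
const Header = packed struct {
    kind: u4,
    flags: u4,
    length: u16,
};

// A packed struct reinterprets as its backing integer; the first field
// occupies the least significant bits.
fn headerBits(h: Header) u24 {
    return @bitCast(h);
}

// Same number of bits on both sides, so this is a plain reinterpretation.
// The numeric result depends on the target's byte order.
fn bytesToU24(bytes: [3]u8) u24 {
    return @bitCast(bytes);
}

pub fn main() void {
    const h = Header{ .kind = 0x1, .flags = 0x2, .length = 0xABCD };
    std.debug.print("header bits: 0x{x}\n", .{headerBits(h)});
    std.debug.print("bytes as u24: 0x{x}\n", .{bytesToU24([3]u8{ 0x12, 0x34, 0x56 })});

    // Arrays are laid out element by element, so a u1 element still takes a byte.
    std.debug.print("@sizeOf([8]u1) = {d}, @sizeOf([8]u8) = {d}\n", .{
        @as(usize, @sizeOf([8]u1)),
        @as(usize, @sizeOf([8]u8)),
    });
    // Logical width versus storage size of a u24.
    std.debug.print("@bitSizeOf(u24) = {d}, @sizeOf(u24) = {d}\n", .{
        @as(usize, @bitSizeOf(u24)),
        @as(usize, @sizeOf(u24)),
    });
}
```

The packed struct result is defined by field order, while the array-to-integer result follows the host's memory layout, which is why the second print is left without an expected value.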
My current project is parsing ELF/Mach-O files. I can easily have zero allocations in my hot path with zig; the same is far more challenging in C, so I am biased, especially with zig allowing methods on structs.
And yes, I do use that crazy casting to 0xdeadbeef and other ascii metadata that is in those files.
To be clear here, I am not trying to prove you wrong, this is one of the places zig is very different and (IMHO) useful. Especially with streaming data or where you have network ordering etc... It is so nice to only cast what you need to but it does take a little while to wrap your head around how this interacts with buffers which are not your native endianness. At least for me, once I figured out to separate the shape of those data streams from their values it was super useful.
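To make the "shape versus value" split concrete, here is a minimal sketch, again assuming Zig 0.11 or later. `readBe32` and the sample field are hypothetical names for illustration; the only library call used is `std.mem.bigToNative`.

```zig
const std = @import("std");

// Reinterpret the 4 bytes as they sit in the buffer (the "shape"), then
// convert from big-endian to the host's byte order (the "value").
fn readBe32(bytes: [4]u8) u32 {
    const raw: u32 = @bitCast(bytes);
    return std.mem.bigToNative(u32, raw);
}

pub fn main() void {
    // Hypothetical 4-byte magic field as it would appear in a big-endian stream.
    const field = [4]u8{ 0xDE, 0xAD, 0xBE, 0xEF };
    std.debug.print("magic = 0x{x}\n", .{readBe32(field)});
}
```

The cast itself never changes, only the byte-order conversion applied afterwards, so a little-endian stream would need nothing more than a different conversion in the same place.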
> The 24 Bits (3 Bytes) [3]u8 to u24 example is exactly related to utf-8 that covers all the languages but excludes the emojis.
I'm not familiar with Zig, so maybe it's doing something weird here, but that doesn't really make sense with Unicode in general.
First, the largest Unicode codepoint that will ever be allocated is U+10FFFF [0], which is less than 2^21, so all Unicode characters will fit in a 24-bit integer. Perhaps you're thinking of UCS-2 or UTF-16 without surrogates, which are both 16 bits wide and are limited to the BMP [1] [2] (and therefore don't include most emojis).
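The width claim is easy to check with Zig's own arbitrary-width integers. A small sketch (the constant name is mine, not from any library):

```zig
const std = @import("std");

const max_code_point = 0x10FFFF; // upper limit of the Unicode code space

comptime {
    // Needs more than 20 bits...
    std.debug.assert(max_code_point > std.math.maxInt(u20));
    // ...but fits in 21, and therefore also in 24.
    std.debug.assert(max_code_point <= std.math.maxInt(u21));
    // 16 bits only reach the end of the BMP.
    std.debug.assert(std.math.maxInt(u16) == 0xFFFF);
}

pub fn main() void {
    std.debug.print("max code point = U+{X}, maxInt(u21) = 0x{X}\n", .{
        @as(u21, max_code_point),
        @as(u21, std.math.maxInt(u21)),
    });
}
```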
Second, while the characters needed for most languages lie within the BMP, not all of them do [3], so it isn't really possible to support all languages while excluding emoji, aside from using the Unicode character database to exclude certain categories [4] [5].
[0]: https://www.unicode.org/faq/utf_bom.html#gen0
[1]: https://www.unicode.org/faq/utf_bom.html#utf16-11
[2]: https://en.wikipedia.org/wiki/Universal_Coded_Character_Set
[3]: https://en.wikipedia.org/wiki/Plane_(Unicode)#Supplementary_...
[4]: https://www.unicode.org/reports/tr44/tr44-34.html#General_Ca...
[5]: https://en.wikipedia.org/wiki/Unicode_character_property#Gen...
Note the utf-8[0] in my response: the answers are on the pages you linked, but not in the sections you linked.
utf-8 encodes code points in one to four bytes, it is byte oriented vs utf-16 etc. In zig u8 is a byte, and is also (by convention) a char, although there isn't an explicit char type in zig. Technically there are chars in languages that need all 4 bytes in utf-8, but almost all of them are historical or emoji's in utf-8.
24bits (3 bytes) in utf-8 gets you Chinese, Japanese, Korean. 16 bits (2 bytes) gets you Latin letters with diacritics, Greek, and Arabic scripts. With 8 bits (1 byte) getting you Standard ASCII etc...
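The byte counts above follow directly from the code point ranges. A minimal sketch (`utf8Len` is a hand-written helper for illustration and it ignores the surrogate range U+D800 to U+DFFF, which is not valid in UTF-8):

```zig
const std = @import("std");

// Number of bytes UTF-8 uses for a code point, by range.
fn utf8Len(cp: u21) u3 {
    if (cp < 0x80) return 1; // ASCII
    if (cp < 0x800) return 2; // Latin with diacritics, Greek, Arabic, ...
    if (cp < 0x10000) return 3; // rest of the BMP, including the common CJK ideographs
    return 4; // supplementary planes
}

pub fn main() void {
    // A, e-acute, Greek Omega, Arabic Ain, CJK U+4E2D, emoji U+1F600
    const samples = [_]u21{ 0x41, 0xE9, 0x3A9, 0x639, 0x4E2D, 0x1F600 };
    for (samples) |cp| {
        std.debug.print("U+{X:0>4} -> {d} byte(s)\n", .{ cp, utf8Len(cp) });
    }
}
```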
There is a point you could make that it may have been better to use utf-16 etc... and that we should have dropped ascii/latin-1 support, but once again go up to the 'Basic Multilingual Plane' in your [3] and notice that is covered by 24bits (3 bytes) in utf-8 encoding.
[0] https://en.wikipedia.org/wiki/UTF-8
> ... but almost all of them are historical or emoji's in utf-8.
I just posted a comment, five minutes after you wrote that, which I won't repeat here since it was quite long. But one of the languages whose alphabet is found in the higher multilingual plane is Fulani, spoken natively by 37 million people (plus another two and a half million who have learned it as a second language). While it can be written in other alphabets (both Latin and Arabic have been used to write it in the past, for example), other alphabets don't usually represent all the sounds of the language properly, making it awkward. There's a reason why the Adlam script was invented to write Fulani with; and that invention was recent enough that it was assigned the U+1E900 to U+1E95F block, since the basic multilingual plane was full by then.
So although it's easy to think that the astral planes are only used for emoji and historical languages, that's not actually true. There are languages spoken by millions of people in those astral planes as well (yes, languages plural; Fulani isn't the only one, it's just the largest).
To be clear, I was talking about a use case, not all use cases.
There are very real times where you have to support all 4 bytes; there are others where other drivers require you to restrict the domain of discourse.
It doesn't change the value/cost of bit casting in a language with arbitrary bit-width integers, especially when combined with the fact that int overflows are detectable illegal behaviour and you have saturating and wrapping operators.
This is in addition to the ease of using packed structs I mentioned above.
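For anyone who hasn't seen those operators, a short sketch on a u12 (the three wrapper functions are just illustrative names; `+|`, `+%` and `std.math.add` are the actual language and standard library features):

```zig
const std = @import("std");

fn saturating(a: u12, b: u12) u12 {
    return a +| b; // clamps at the u12 maximum, 4095
}

fn wrapping(a: u12, b: u12) u12 {
    return a +% b; // wraps modulo 4096
}

fn checked(a: u12, b: u12) error{Overflow}!u12 {
    return std.math.add(u12, a, b); // reports overflow as an error value
}

pub fn main() void {
    const max: u12 = std.math.maxInt(u12);
    std.debug.print("saturating: {d}\n", .{saturating(max, 1)});
    std.debug.print("wrapping:   {d}\n", .{wrapping(max, 1)});
    if (checked(max, 1)) |sum| {
        std.debug.print("checked:    {d}\n", .{sum});
    } else |err| {
        std.debug.print("checked:    {s}\n", .{@errorName(err)});
    }
}
```

A plain `a + b` that overflows is the "detectable illegal behaviour" case: it is caught in safety-checked build modes rather than silently wrapping.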
A list of some advantages:
* Zig's arbitrary-sized integers have a fully defined ABI for padding
* Allows for strict domain modeling using them as platform independent refinement types
* Precise memory packing, allowing more utilization of register space etc...
* OOB compile time checks
* Bit masking optimization, where sequential changes to packed values are often merged into a small number of and/or masks
To move to a more information theory example:
DNA nucleotides (A, C, G, T) form a quaternary alphabet: each position is in one of four states.
If you wanted to store an array of 1,000 DNA nucleotides, each symbol is one of 4 bases, requiring exactly 2 bits of information. The Shannon Information would be: 1000 * 2bits = 2000 bits.
With uint8_t this would take 8k bits, vs 2k bits of u2. That is 300% more for uint8_t.
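One way to actually reach the 2,000-bit figure, sketched under the assumption that a plain array of u2 is not bit-packed (as noted earlier in the thread), so four bases are grouped into a one-byte packed struct. `Base`, `Quad` and `getBase` are hypothetical names for this example.

```zig
const std = @import("std");

const Base = enum(u2) { A, C, G, T };

// Four bases per byte: a packed struct of four 2-bit fields is backed by a u8.
const Quad = packed struct {
    b0: Base,
    b1: Base,
    b2: Base,
    b3: Base,
};

// Look up base number i in a sequence stored four to a byte.
fn getBase(seq: []const Quad, i: usize) Base {
    const q = seq[i / 4];
    return switch (i % 4) {
        0 => q.b0,
        1 => q.b1,
        2 => q.b2,
        else => q.b3,
    };
}

pub fn main() void {
    // 1,000 bases need 250 Quads.
    const seq = [_]Quad{.{ .b0 = .A, .b1 = .C, .b2 = .G, .b3 = .T }} ** 250;
    std.debug.print("[1000]u8:  {d} bits\n", .{@as(usize, @sizeOf([1000]u8)) * 8});
    std.debug.print("[250]Quad: {d} bits\n", .{@as(usize, @sizeOf([250]Quad)) * 8});
    std.debug.print("base 2 is {s}\n", .{@tagName(getBase(&seq, 2))});
}
```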
It is still horses for courses, but as an example consider a 12-bit sensor reading stored in a standard u16: the data type allows invalid states. Ensuring safety then requires manual defensive logic throughout your program in the traditional C/Rust/... model.
That traditional model in zig:
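A minimal sketch of what such a traditional model might look like, assuming a 12-bit reading carried in a u16. The names `max_reading`, `ReadingError`, `validate` and `average` are hypothetical and chosen only for this illustration.

```zig
const std = @import("std");

const max_reading: u16 = 0x0FFF; // largest value a 12-bit sensor can report

const ReadingError = error{OutOfRange};

// The u16 can also hold 0x1000..0xFFFF, so every entry point has to validate.
fn validate(raw: u16) ReadingError!u16 {
    if (raw > max_reading) return ReadingError.OutOfRange;
    return raw;
}

fn average(a: u16, b: u16) ReadingError!u16 {
    const x = try validate(a);
    const y = try validate(b);
    // Both operands are at most 0x0FFF here, so the sum fits in a u16.
    return (x + y) / 2;
}

pub fn main() void {
    const avg = average(0x0FFF, 0x0001) catch |err| {
        std.debug.print("rejected: {s}\n", .{@errorName(err)});
        return;
    };
    std.debug.print("average = {d}\n", .{avg});
}
```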
And the lower overall Kolmogorov complexity (cherry picked) form:
C23 does have _BitInt types for structs, which can help if bit packing is your primary need, but IMHO it doesn't offer the same advantages. As an example, and I may be wrong, but I think you can't easily perform checked arithmetic or use standard overflow operations on individual C bit-fields without copying them out into standard types (like int), modifying them, masking them, and copying them back.
With Zig the invariant is maintained implicitly at the type layer, removing runtime validation branches, error paths, and testing code
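A sketch of the narrower-type form, assuming Zig 0.11 or later. `Reading`, `average` and `fromRaw` are hypothetical names; `std.math.cast` is the standard library's checked narrowing.

```zig
const std = @import("std");

const Reading = u12; // every bit pattern is a valid 12-bit reading

fn average(a: Reading, b: Reading) Reading {
    // Widen to u13 so the sum cannot overflow, then narrow back.
    const sum = @as(u13, a) + @as(u13, b);
    return @intCast(sum / 2);
}

// Data arriving as a wider integer still needs one checked narrowing at the boundary.
fn fromRaw(raw: u16) ?Reading {
    return std.math.cast(Reading, raw);
}

pub fn main() void {
    const a: Reading = 4095;
    const b: Reading = 1;
    std.debug.print("average = {d}\n", .{average(a, b)});
    // const bad: Reading = 5000; // rejected at compile time: 5000 does not fit in a u12
}
```

The range check does not vanish entirely; it moves to the single point where raw data enters, and everything downstream can rely on the type.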
Does it solve all problems? No. Is @bitCast, a zero-runtime-overhead, compile-time-checked bit reinterpretation, and [3]u8 to u24 useless and silly? No.
Yes, there are certainly use cases where you know the data you're parsing will only come from a narrow range of Unicode, such as U+0000 to U+007F — or from just the letters GCAT, as you mentioned. The overhead of converting 8-bit input to 7-bit might not be worth the cost, but the benefit of storing your input in just 2 bits per "letter" is definitely worth it.
I mostly wanted to make sure people know that the upper multilingual planes are a very real use case, and you need to test them. This is more important for languages such as C# where UTF-16 is the norm: many programmers don't know that they're handling surrogate pairs wrong until someone tries to backspace over an emoji character and it turns into something weird. It's probably less relevant to Zig, which didn't make the mistake that C# and Java did by starting out with UCS-2 (to be fair to them, they were designed in the era where people still thought that 65,536 codepoints would be enough for every language and Unicode would never need more than 16 bits). But the upper planes are important, and need to be tested no matter what language your code is written in.
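For anyone who wants a test case, this is roughly how a supplementary-plane code point splits into a UTF-16 surrogate pair. A sketch only; `toSurrogatePair` is a hand-written helper, not a standard library function.

```zig
const std = @import("std");

// Split a supplementary-plane code point into its UTF-16 surrogate pair.
fn toSurrogatePair(cp: u21) [2]u16 {
    std.debug.assert(cp >= 0x10000 and cp <= 0x10FFFF);
    const v = cp - 0x10000; // 20 significant bits
    const high: u16 = 0xD800 + @as(u16, @intCast(v >> 10)); // top 10 bits
    const low: u16 = 0xDC00 + @as(u16, @intCast(v & 0x3FF)); // bottom 10 bits
    return [2]u16{ high, low };
}

pub fn main() void {
    const pair = toSurrogatePair(0x1F600); // an emoji
    std.debug.print("U+1F600 -> {X} {X}\n", .{ pair[0], pair[1] });
}
```

Code that treats each u16 as one character will cut such a pair in half, which is the backspace bug described above.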
> utf-8 encodes code points in one to four bytes, it is byte oriented vs utf-16 etc. In zig u8 is a byte, and is also (by convention) a char, although there isn't an explicit char type in zig. […]
> 24bits (3 bytes) in utf-8 gets you Chinese, Japanese, Korean. 16 bits (2 bytes) gets you Latin letters with diacritics, Greek, and Arabic scripts. With 8 bits (1 byte) getting you Standard ASCII etc...
Ah ok, so if I understand you correctly, you're taking a variable-length encoding (UTF-8), and limiting and/or padding it to 3 octets (24 bits)? In that case, what you said in your original post makes sense, but I'm not really sure why you'd ever want to encode something this way: you have to deal with the complexities of a variable-length encoding to parse each u24, you have the poor space usage of a fixed-length encoding, and you're using 24 bits to encode only 0xFFFF characters (even though you can fit all of Unicode in only 21 bits).
> Technically there are chars in languages that need all 4 bytes in utf-8, but almost all of them are historical or emoji's in utf-8.
Yes, the majority of the characters in the non-BMP planes are for archaic languages, but that's not really the right way to look at it, since most languages only need <100 characters, and there are more dead languages than living ones. Instead, I'd look at it from the reverse lens of how many living languages need non-BMP characters. This sibling comment [0] gives one example, but there are lots more [1] [2] [3] [4] [5] [6].
Now, it's fine to not support these characters, but the argument in that case should be that you've decided that the characters aren't important enough to outweigh the technical challenges, not that nobody needs the characters.
> 24bits (3 bytes) in utf-8 gets you Chinese, Japanese, Korean.
It gets you a subset of CJK that's probably sufficient for many purposes, but there are nearly 75k CJK characters outside of the BMP.
> There is a point you could make that it may have been better to use utf-16 etc... and that we should have dropped ascii/latin-1 support, but once again go up to the 'Basic Multilingual Plane' in your [3] and notice that is covered by 24bits (3 bytes) in utf-8 encoding.
If you are willing and able to use a 24-bit encoding, then I'd argue that you should just use UCS-3/UTF-24, since those allow you to encode every Unicode character. The only downside is that these encodings aren't formally-defined so other programs won't understand them, but if that's an issue you can use UCS-4/UTF-32.
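A sketch of what such a fixed-width 24-bit form could look like. Since these encodings aren't formally defined, the layout here (one code point per 3 bytes, most significant byte first) is purely my assumption, as are the names `encode24` and `decode24`.

```zig
const std = @import("std");

// Hypothetical fixed-width form: one code point per 3 bytes, most
// significant byte first. The byte order is an assumption of this sketch.
fn encode24(cp: u21) [3]u8 {
    return [3]u8{
        @as(u8, @intCast(cp >> 16)),
        @as(u8, @truncate(cp >> 8)),
        @as(u8, @truncate(cp)),
    };
}

// Returns a u24 because arbitrary input bytes may exceed the 21-bit range;
// checking against U+10FFFF is left to the caller.
fn decode24(bytes: [3]u8) u24 {
    return (@as(u24, bytes[0]) << 16) | (@as(u24, bytes[1]) << 8) | @as(u24, bytes[2]);
}

pub fn main() void {
    const enc = encode24(0x1E900); // first code point of the Adlam block
    std.debug.print("U+1E900 -> {X:0>2} {X:0>2} {X:0>2}\n", .{ enc[0], enc[1], enc[2] });
    std.debug.print("decoded: U+{X}\n", .{decode24(enc)});
}
```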
[0]: https://news.ycombinator.com/item?id=48682043
[1]: https://en.wikipedia.org/wiki/Unified_Canadian_Aboriginal_Sy...
[2]: https://en.wikipedia.org/wiki/Chakma_(Unicode_block)
[3]: https://en.wikipedia.org/wiki/Mro_(Unicode_block)
[4]: https://en.wikipedia.org/wiki/Kirat_Rai_(Unicode_block)
[5]: https://en.wikipedia.org/wiki/Nag_Mundari_(Unicode_block)
[6]: https://en.wikipedia.org/wiki/Ethiopic_Extended-B
> ... utf-8 that covers all the languages but excludes the emojis ...
Ah, but the U+0000 to U+FFFF plane does not cover all the languages. You might think that only historical and archaic languages are found in Unicode's astral planes (e.g., U+20000 to U+2A6DF is used for historical Chinese characters no longer used today), but in fact there are modern languages found in the U+10000 plane.
You might not care about Osage (the language of the Osage Nation of northern Oklahoma) since its last native speaker passed away in 2005, but there is a revival program trying to teach Osage to people. Osage's script was developed quite recently as part of the revival program, so it couldn't fit into the U+0000 to U+FFFF block and it was assigned U+104B0 to U+104FF.
The Toto language of Bengal, on the other hand, is still active: over 1000 speakers, all living in the village of Totopara. It also never had an alphabet until recently, so its Unicode block is U+1E290 to U+1E2BF.
Then there's Wancho, spoken by about 60,000 people in India. Its alphabet was created between 2001 and 2012, and added to Unicode in 2019. It was assigned the U+1E2C0 to U+1E2FF block (immediately after the Toto language, you might notice).
Then there's the Ho language spoken by over a million people in India. Wikipedia cites a 2001 census as having 2.2 million speakers, and a 2011 census as having 1.4 million speakers. I very much doubt that both of those are accurate (you don't lose half a million people from an ethnic group in just ten years without some kind of war or genocide, and the Wikipedia article would have at least mentioned that if such a thing had happened), but to be safe, let's go with the lower estimate and say that at least one and a half million people speak Ho. It can be written with the Latin alphabet, but its own alphabet is Warang Chiti (sometimes spelled Warang Citi), which was added to Unicode in 2014 and assigned the U+118A0 to U+118FF block.
And then there's the Adlam script for writing Fulani, the language of the Fula people of western Africa. Fulani is spoken natively by 37 million people, and as a second language by another 2.7 million. Adlam's Unicode block is U+1E900 to U+1E95F.
So if you restrict your program to only working with the basic multilingual plane, it's not just emoji you'll be leaving out. It's also modern languages, spoken by anywhere from 1000 people to 37 million. How many speakers of a language are enough to draw the line and say "No, I won't ever translate my software into your language"?
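If you want a quick sanity check for your own code, the block starts mentioned above make a handy test table. A sketch, with `Block`, `blocks` and `inBmp` being names I made up:

```zig
const std = @import("std");

const Block = struct { name: []const u8, first: u21 };

// First code points of the blocks listed above.
const blocks = [_]Block{
    .{ .name = "Osage", .first = 0x104B0 },
    .{ .name = "Warang Citi", .first = 0x118A0 },
    .{ .name = "Toto", .first = 0x1E290 },
    .{ .name = "Wancho", .first = 0x1E2C0 },
    .{ .name = "Adlam", .first = 0x1E900 },
};

// True if the code point fits in the 16-bit basic multilingual plane.
fn inBmp(cp: u21) bool {
    return cp <= 0xFFFF;
}

pub fn main() void {
    for (blocks) |b| {
        std.debug.print("{s}: U+{X} in BMP? {}\n", .{ b.name, b.first, inBmp(b.first) });
    }
}
```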
Now, if your software is only targeting one language and you never intend to translate it, then yes, you'll only lose out on emoji if you stick to the U+0000 to U+FFFF range of the basic multilingual plane.
But realize that the higher planes are not just for dead languages. Living languages have ended up there too, and there are likely to be more in the future. It's quite possible that right now, someone somewhere is saying "Hey, why doesn't my language have its own alphabet instead of using Latin characters to write it? The Latin characters don't express the sounds of my language very well." And when they do get that alphabet worked out and manage to get it accepted into Unicode, it'll certainly land in one of the higher planes. Most likely the U+10000 to U+1FFFF plane which isn't at all full yet, but who knows. If you want to be able to handle every language spoken (and written) in the world today, you must be able to accept the full range of Unicode, not just the 16-bit range.
> You don't lose half a million people from an ethnic group in just ten years without some kind of war or genocide.
Nothing happened to the people; their numbers are growing year on year. But languages can die very easily if governments don't put effort into teaching them to children. That is exactly what happened to the Ho language. There is no advantage in learning these small regional languages, so children put their effort into more popular languages like Hindi, Odia and English.
Here is a good article on this topic:
https://www.vogue.in/content/when-languages-in-india-disappe...
I'm familiar with the phenomenon, as my wife is a linguist who did her master's thesis on the phonology of a small language spoken by about 7000 people: many of the kids don't want to learn it, and just want to learn the majority language of the country since that's what they have to use in school. But I didn't think that could be the explanation for a 25% decline in ten years: new people may not be learning the language, but the only way people stop speaking their mother tongue is if they immigrate to a new country and fully adapt to it (happens to a few people, usually who immigrated as children) or if they die (by far the most common reason for language-use decline: the old people are dying and the young people aren't learning it). If the decline was a couple hundred thousand that would be the outside limit of probability, as far as I know.
More likely, in my opinion, is that both are happening: yes, the language is declining, but either the earlier census overcounted speakers (e.g. counting children as speaking it when they weren't actually learning it) or else the later census undercounted speakers; either way the language decline would look larger than it actually is. Given that Ethnologue (https://www.ethnologue.com/language/hoc/) rates the language vitality as "Stable" — "The language is not being sustained by formal institutions, but it is still the norm in the home and community that all children learn and use the language" — and they usually know what they're talking about, I suspect the language decline isn't that fast and a census counting mistake is a more likely explanation for the discrepancy over ten years.