UTF-8 is indeed a genius design. But of course it's crucially dependent on the decision to make ASCII only 7 bits, which even in 1963 was kind of an odd choice.
Was this just historical luck? Is there a world where the designers of ASCII grabbed one more bit of code space for some nice-to-haves, or did they have code pages or other extensibility in mind from the start? I bet someone around here knows.
I don't know if this is the reason or if the causality goes the other way, but: it's worth noting that we didn't always have 8 general-purpose bits. 7 bits + 1 parity bit (or flag bit, or something else) was really common, enough so that e-mail to this day still uses quoted-printable [1] to squeeze arbitrary octets through 7-bit channels. A communication channel being able to transmit all 8 bits in a byte unchanged is called being 8-bit clean [2], and that wasn't always a given.
In a way, UTF-8 is just one of many good uses for that spare 8th bit in an ASCII byte...
[1] https://en.wikipedia.org/wiki/Quoted-printable
[2] https://en.wikipedia.org/wiki/8-bit_clean
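To make the quoted-printable point concrete, here's a tiny illustration using the Python stdlib (purely for the sake of example): it maps arbitrary octets onto 7-bit-safe printable ASCII so the text survives channels that are not 8-bit clean.

    import quopri

    utf8_text = "café".encode("utf-8")          # b'caf\xc3\xa9' -- two bytes >= 0x80
    wire_safe = quopri.encodestring(utf8_text)  # b'caf=C3=A9'   -- pure 7-bit ASCII
    assert quopri.decodestring(wire_safe) == utf8_text
    print(wire_safe)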
"Five characters in a 36 bit word" was a fairly common trick on pre-byte architectures too.
5 characters?
I thought it was normally six 6-bit characters?
The relevant Wikipedia page (https://en.wikipedia.org/wiki/36-bit_computing) indicates that 6x6 was the most common, but that 5x7 was sometimes used as well.
... However I'm not sure how much I trust it. It says that 5x7 was "the usual PDP-6/10 convention" and was called "five-seven ASCII", but I can't find the phrase "five-seven ASCII" anywhere on Google except for posts quoting that Wikipedia page. It cites two references, neither of which contain the phrase "five-seven ascii".
Though one of the references (RFC 114, for FTP) corroborates that PDP-10 could use 5x7:
To me, it seems like 5x7 was one of multiple conventions you could use to store character data on a PDP-10 (and probably other 36-bit machines), and Wikipedia hallucinated that the name for this convention is "five-seven ASCII". (For niche topics like this, I sometimes see authors just stating their own personal terminology for things as fact; be sure to check sources!)

I like challenges like this. First, the edit that introduced "five-seven ascii" is [1] (2010) by Pete142, with the explanation "add a name for the PDP-6/10 character-packing convention". The user Pete142 cites his web page www.pwilson.net, which no longer serves his content. It can still be accessed with archive.org, and from the resume there the earliest year mentioned is 1986 ("MS-DOS/ASM/C drivers Technical Leader: ..."). I suspect he himself may have used the term at work, and that this bit of jargon simply didn't survive into any reliable book or research.
[1] https://en.wikipedia.org/w/index.php?title=36-bit_computing&...
You do better with a search for "PDP-10 packed ascii". In point of fact the PDP-10 had explicit instructions for managing strings of 7-bit ascii characters like this.
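For anyone who hasn't seen the trick, here's a rough Python sketch of what that packing amounts to (the left-to-right ordering and the position of the spare bit are illustrative, not a claim about the exact PDP-10 byte-pointer conventions):

    def pack_5x7(chars):
        assert len(chars) == 5 and all(ord(c) < 128 for c in chars)
        word = 0
        for c in chars:
            word = (word << 7) | ord(c)   # 5 * 7 = 35 bits of character data
        return word << 1                  # the 36th bit is simply left spare

    print(f"{pack_5x7('HELLO'):036b}")    # 36 bits, low bit unused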
I've run into 5-7 encoding in some ancient serial protocol. Layers of cruft.
That was true at the system level on ITS, file and command names were all 6 bit. But six bits doesn't leave space for important code points (like "lower case") needed for text processing. More practical stuff on PDP-6/10 and pre-360 IBM played other tricks.
Not an expert but I happened to read about some of the history of this a while back.
ASCII has its roots in teletype codes, which were a development from telegraph codes like Morse.
Morse code is variable length, so this made automatic telegraph machines or teletypes awkward to implement. The solution was the 5 bit Baudot code. Using a fixed length code simplified the devices. Operators could type Baudot code using one hand on a 5 key keyboard. Part of the code's design was to minimize operator fatigue.
Baudot code is why we refer to the symbol rate of modems and the like in Baud btw.
Anyhow, the next change came when, instead of telegraph machines signaling directly on the wire, a typewriter was used to create a punched tape of codepoints, which would then be loaded into the telegraph machine for transmission. Since the keyboard was now decoupled from the wire code, there was more flexibility to add additional code points. This is where stuff like "Carriage Return" and "Line Feed" originates. This got standardized by Western Union and internationally.
By the time we get to ASCII, teleprinters are common, and the early computer industry adopted punched cards pervasively as an input format. And they initially did the straightforward thing of just using the telegraph codes. But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.
So zooming out here the story is that we started with binary codes, then adopted new schemes as technology developed. All this happened long before the digital computing world settled on 8 bit bytes as a convention. ASCII as bytes is just a practical compromise between the older teletype codes and the newer convention.
> But then someone at IBM came up with a new scheme that would be faster when using punch cards in sorting machines. And that became ASCII eventually.
Technically, the punch card processing technology was patented by inventor Herman Hollerith in 1884, and the company he founded wouldn't become IBM until 40 years later (though it was folded with 3 other companies into the Computing-Tabulating-Recording company in 1911, which would then become IBM in 1924).
To be honest though, I'm not clear how ASCII came from anything used by the punch card sorting machines, since it wasn't proposed until 1961 (by an IBM engineer, but 32 years after Hollerith's death). Do you know where I can read more about the progression here?
It's right there in the history section of the wiki page: https://en.wikipedia.org/wiki/ASCII#History
> Work on the ASCII standard began in May 1961, when IBM engineer Bob Bemer submitted a proposal to the American Standards Association's (ASA) (now the American National Standards Institute or ANSI) X3.2 subcommittee.[7] The first edition of the standard was published in 1963,[8] contemporaneously with the introduction of the Teletype Model 33. It later underwent a major revision in 1967,[9][10] and several further revisions until 1986.[11] In contrast to earlier telegraph codes such as Baudot, ASCII was ordered for more convenient collation (especially alphabetical sorting of lists), and added controls for devices other than teleprinters.[11]
Beyond that I think you'd have to dig up the old technical reports.
IBM also notably used EBCDIC instead of ASCII for most of their systems
And just for fun, they also support what must be the most weird encoding system -- UTF-EBCDIC (https://www.ibm.com/docs/en/i/7.5.0?topic=unicode-utf-ebcdic).
Post that stuff with a content warning, would you?
> The base EBCDIC characters and control characters in UTF-EBCDIC are the same single byte codepoint as EBCDIC CCSID 1047 while all other characters are represented by multiple bytes where each byte is not one of the invariant EBCDIC characters. Therefore, legacy applications could simply ignore codepoints that are not recognized.
Dear god.
That says roughly the following when applied to UTF-8:
"The base ASCII characters and control characters in UTF-8 are the same single byte codepoint as ISO-8859-1 while all other characters are represented by multiple bytes where each byte is not one of the invariant ASCII characters. Therefore, legacy applications could simply ignore codepoints that are not recognized."
(I know nothing of EBCDIC, but this seems to mirror UTF-8 design)
*EBCDIC
Thanks, fixed
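If it helps, here's the same property demonstrated for UTF-8 in a few lines of Python: every byte of a multi-byte sequence has the high bit set, so none of them can be mistaken for a 7-bit ASCII character by a legacy program.

    for ch in "é€😀":
        encoded = ch.encode("utf-8")
        print(ch, [hex(b) for b in encoded])
        assert all(b >= 0x80 for b in encoded)  # no byte falls in the ASCII range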
Fun fact: ASCII was a variable length encoding. No really! It was designed so that one could use overstrike to implement accents and umlauts, and also underline (which still works like that in terminals). I.e., á would be written a BS ' (or ' BS a), à would be written as a BS ` (or ` BS a), ö would be written o BS ", ø would be written as o BS /, ¢ would be written as c BS |, and so on and on. The typefaces were designed to make this possible.
This lives on in compose key sequences, so instead of a BS ' one types compose-' a and so on.
And this all predates ASCII: it's how people did accents and such on typewriters.
This is also why Spanish used to not use accents on capitals, and still allows capitals to not have accents: that would require smaller capitals, but typewriters back then didn't have them.
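If you want to see what those overstrike sequences look like on the wire, here's a small Python sketch. Whether they render as composed characters depends entirely on the output device; modern terminal emulators mostly just overwrite the cell, though `less` still honors nroff-style underline and bold overstrikes.

    BS = "\b"  # ASCII 0x08, backs the carriage up one cell

    def overstrike(base, mark):
        return base + BS + mark

    print(overstrike("a", "'"))   # intended to print as á on an overstriking device
    print(overstrike("o", '"'))   # ö
    print(overstrike("c", "|"))   # ¢
    print(overstrike("_", "x"))   # underlined x, as nroff output still does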
The use of 8-bit extensions of ASCII (like the ISO 8859-x family) was ubiquitous for a few decades, and arguably still is to some extent on Windows (the standard Windows code pages). If ASCII had been 8-bit from the start, but with the most common characters all within the first 128 integers, which would seem likely as a design, then UTF-8 would still have worked out pretty well.
The accident of history is less that ASCII happens to be 7 bits, but that the relevant phase of computer development happened to primarily occur in an English-speaking country, and that English text happens to be well representable with 7-bit units.
Most languages are well representable with 128 characters (7 bits) if you do not include the English letters among them (e.g. repurpose those 52 codepoints and some of the control/punctuation/symbol slots).
This is easily proven by the success of all the ISO-8859-*, Windows and IBM CP-* encodings, and all the *SCII (ISCII, YUSCII...) extensions — they fit one or more languages in the upper 128 characters.
It's mostly CJK out of large languages that fail to fit within 128 characters as a whole (though there are smaller languages too).
Many of the extended characters in ISO 8859-* can be implemented using pure ASCII with overstriking. ASCII was designed to support overstriking for this purpose. Overstriking was how one typed many of those characters on typewriters.
Before this happened, 7-bit ASCII variants based on ISO 646 were widely used.
Historical luck. Though "luck" is probably pushing it in the way one might say certain math proofs are historically "lucky" based on previous work. It's more an almost natural consequence.
Before ASCII there was BCDIC, which was six bits and non-standardized (there were variants, just as technically there are a number of ASCII variants, with the common one just referred to as ASCII these days).
BCDIC was the capital English letters plus common punctuation plus numbers. 2^6 is 64, and capital letters + numbers give you 36; a few common punctuation marks put you around 50. IIRC the original by IBM was around 45 or something. Slash, period, comma, etc.
So when there was a decision to support lowercase, they added a bit because that's all that was necessary, and I think the printers around at the time couldn't handle more than 128 distinct characters anyway. There wasn't any ó or ö or anything printable, so why support it?
But eventually that yielded to 8-bit encodings (various extended ASCIIs like Latin-1, etc., that had ñ and so on).
Crucially, UTF-8 is only compatible with the 7-bit ASCII. All those 8-bit ASCIIs are incompatible with UTF-8 because they use the eighth bit.
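A quick Python illustration of that compatibility point: the 7-bit ASCII bytes are identical in Latin-1 and UTF-8, but a Latin-1 byte with the 8th bit set isn't valid UTF-8 on its own.

    ascii_only = "plain text"
    assert ascii_only.encode("ascii") == ascii_only.encode("utf-8")  # identical bytes

    latin1 = "ñ".encode("latin-1")   # b'\xf1' -- uses the 8th bit
    try:
        latin1.decode("utf-8")
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)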
7 bits isn't that odd. Baudot was 5 bits and found insufficient, so 6-bit codes were developed; they were found insufficient, so 7-bit ASCII was developed.
IBM had standardized on 8-bit bytes with its System/360, so they developed the 8-bit EBCDIC encoding. Other computing vendors didn't have consistent byte lengths... 7 bits was weird, but characters didn't necessarily fit nicely into system words anyway.
I don't really say this to disagree with you, but I feel weird about the phrasing "found insufficient", as if we reevaluated and said 'oops'.
It's not like 5-bit codes forgot about numbers and 80% of punctuation, or like 6-bit codes forgot about having upper and lower case letters. They were clearly 'insufficient' for general text even as the tradeoff was being made, it's just that each bit cost so much we did it anyway.
The obvious baseline by the time we were putting text into computers was to match a typewriter. That was easy to see coming. And the symbols on a typewriter take 7 bits to encode.
Also, statefulness. Baudot has two codes used for switching into one of two modes: figures and letters.
Typewriters have some statefulness, too, like "shift lock". Baudot needed to encode the actions of a typewriter in order to control it, not the output.
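A minimal Python sketch of that statefulness (the LTRS/FIGS shift values are the standard ITA2 ones; the three-entry tables are only an illustrative subset, not the full code):

    LTRS, FIGS = 0x1F, 0x1B                       # the two ITA2 shift codes
    letters = {0x03: "A", 0x19: "B", 0x0E: "C"}   # illustrative subset
    figures = {0x03: "-", 0x19: "?", 0x0E: ":"}   # same 5-bit codes, other mode

    def decode(codes):
        table, out = letters, []
        for c in codes:
            if c == LTRS:
                table = letters      # everything after this is a letter
            elif c == FIGS:
                table = figures      # everything after this is a figure
            else:
                out.append(table.get(c, "?"))
        return "".join(out)

    print(decode([0x03, FIGS, 0x03, LTRS, 0x19]))  # "A-B": one code, two meanings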
In fact, Baudot originally used a 6-bit code and later shortened it to 5.
The idea was that the free bit would be repurposed, likely for parity.
This is not true. ASCII (technically US-ASCII) was a fixed-width encoding of 7 bits. There was no 8th bit reserved. You can read the original standard yourself here: https://ia600401.us.archive.org/23/items/enf-ascii-1968-1970...
Crucially, "the 7-bit coded character set" is described on page 6 using only seven total bits (1-indexed, so don't get confused when you see b7 in the chart!).
There is an encoding mechanism to use 8 bits, but it's for storage on a type of magnetic tape, and even that still is silent on the 8th bit being repurposed. It's likely, given the lack of discussion about it, that it was for ergonomic or technical purposes related to the medium (8 is a power of 2) rather than for future extensibility.
Notably, it is mentioned that the 7-bit code is developed "in anticipation of" ISO requesting such a code, and we see in the addenda attached at the end of the document that ISO began to develop 8-bit codes extending the base 7-bit code shortly after it was published.
So, it seems that ASCII was kept to 7 bits primarily so "extended ASCII" sets could exist, with additional characters for various purposes (such as other languages, but also for things like mathematical symbols).
Mackenzie claims that parity was an explicit concern in selecting a 7-bit code for ASCII. He cites the X3.2 subcommittee, although he does not reference exactly which document, but considering that he was a member of those committees (as far as I can tell), I would put some weight on his word.
https://hcs64.com/files/Mackenzie%20-%20Coded%20Character%20... sections 13.6 and 13.7
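For concreteness, a tiny Python sketch of the 7-data-bits-plus-parity arrangement that concern points at (even parity chosen purely for illustration):

    def with_even_parity(code7):
        assert code7 < 0x80                     # a 7-bit ASCII code
        parity = bin(code7).count("1") & 1      # 1 if an odd number of bits are set
        return code7 | (parity << 7)            # the 8th bit makes the total count even

    for ch in "ACE":
        print(ch, f"{with_even_parity(ord(ch)):08b}")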
When ASCII was invented, 36-bit computers were popular, which would fit five ASCII characters with just one unused bit per 36-bit word. Before, 6-bit character codes were used, where a 36-bit word could fit six of them.
I would love to think this is true, and it makes sense, but do you have any actual evidence for this you could share with HN?
I'm not sure, but it does seem like a great bit of historical foresight. It stands as a lesson to anyone standardizing something: wanna use a 32 bit integer? Make it 31 bits. Just in case. Obviously, this isn't always applicable (e.g. sizes, etc..), but the idea of leaving even the smallest amount of space for future extensibility is crucial.
https://www.sensitiveresearch.com/Archive/CharCodeHist/X3.4-...
Looks to me like serendipity - they thought 8 bits would be wasteful; they didn't have a need for that many characters.