Unconfusing Unicode: What is Unicode?
Unicode can be very confusing, and I see a lot of questions and problems based on the same misconceptions: that Unicode is an encoding, or that it is a character set. Both are kinda correct, but only in extremely unhelpful ways. These misconceptions are not helped by Unicode originally being designed as both a character set and an encoding. So this is an attempt to clear things up, not by explaining how it is according to strict definitions, but by giving you a mental model to Unconfuse Unicode.
Here is an utterly incorrect but still Useful Mental Model of Unicode (hereafter called UMMU):
Unicode is a way to handle textual data.
It’s not a character set, it’s not an encoding.
Unicode is text, everything else is binary.
Yes, even ASCII text is binary data.
Unicode uses the UCS character set.
But UCS is not Unicode.
Unicode can be encoded to binary data with UTF.
But UTF is not Unicode.
Now, if you know a bit about Unicode, you will go “Uhm, well, yeah, that’s not really true”. So the rest of this post will take a more in-depth and less inaccurate look at the issue and explain why the mental model is still useful, even though it’s wrong. Let’s start with character sets:
About character sets
For computers to be able to handle text you need to map graphemes, the squiggly things you write on paper, to numbers. Such a mapping defines a set of characters, gives each one a number, and is called a “character set”. Characters don’t have to map specifically to graphemes; there are, for example, control characters like “BEL” that will make your computer go “Ping!” The number of characters in a character set is usually 256, because that’s how many will fit in a byte. There are character sets that are only six bits, and for a long time the seven-bit ASCII character set dominated computing, but 8-bit is the most common.
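If you want to poke at that mapping yourself, here is a quick sketch in Python (my choice of language here; the ideas aren’t tied to any language), where ord() and chr() go between characters and their numbers:

```python
# A character set maps characters to numbers; Python's ord() and chr()
# expose that mapping for the Unicode code space.
print(ord('H'))   # 72
print(chr(72))    # 'H'

# Control characters are characters too: BEL has the number 7,
# and is the '\a' escape that makes a terminal go "Ping!".
assert chr(7) == '\a'
```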
But 256 characters don’t go a long way towards covering all the characters you need in the world, and that is the reason for Unicode. When I said that Unicode is not a character set, I lied. Unicode was originally a 16-bit character set, but although the Unicode people were of the opinion that 16 bits were enough (and they were right about that), other people were of the opinion that it wasn’t enough at all (and they were also right). The other people ended up creating a competing character set called the Universal Character Set, UCS. After a while the groups decided to merge their efforts, so the two character sets became equivalent. The end result of this, for the purpose of UMMU, is that you can see UCS as the character set used by Unicode. That’s also a lie; in fact both standards define their own character set, they just happen to be the same (although the UCS standard tends to lag behind).
But for the purpose of UMMU, Unicode is not a character set.
UCS is a 31-bit character set containing over 100,000 characters. It is 31 bits so that you don’t have to deal with signed vs. unsigned integer problems. Since less than five thousandths of a percent of the total available code space is currently in use, that extra bit isn’t exactly necessary anyway.
Although 16 bits were not enough to cover all characters in human history, they are enough as long as you limit yourself to common use of non-dead scripts. So most of the characters of UCS are crammed into the first 65536 code points. That’s called the “Basic Multilingual Plane”, or BMP, and is essentially the original 16-bit Unicode character set, although each version of UCS has of course extended it with more characters. The BMP becomes relevant when talking about encodings, further down.
Each character in UCS has a name and a number. The character ‘H’ has the name ‘LATIN CAPITAL LETTER H’ and the number 72. The number is usually expressed in hexadecimal, often with a ‘U+’ prefix and 4, 5 or 6 digits to show that it’s meant to be a Unicode character. So the ‘H’ character’s number is usually written as U+0048 rather than 72, but it means the same. Another example of a character is ‘—’, named ‘EM DASH’, or U+2014. The ‘乨’ character has the name ‘CJK UNIFIED IDEOGRAPH-4E68’, often written as U+4E68. The character ‘𐌰’ is called ‘GOTHIC LETTER AHSA’ and is written U+10330, using five digits as it’s outside the BMP.
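Those names and numbers are available programmatically; in Python the standard unicodedata module exposes them:

```python
import unicodedata

# Every character has a name and a number (code point).
print(unicodedata.name('H'))           # 'LATIN CAPITAL LETTER H'
print(hex(ord('H')))                   # '0x48', usually written U+0048

print(unicodedata.name('\u2014'))      # 'EM DASH'
print(unicodedata.name('\u4e68'))      # 'CJK UNIFIED IDEOGRAPH-4E68'
print(unicodedata.name('\U00010330'))  # 'GOTHIC LETTER AHSA'
```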
Since the names and numbers for each character in Unicode and UCS are the same, for the purpose of UMMU we’ll say that UCS is not Unicode, but that Unicode uses UCS. That’s a lie, but it’s a useful lie that enables a separation of concerns between Unicode and the character set.
So, a character set is a bunch of characters that each have a number. But how should you store this, or send it to another computer? For an 8-bit character set it’s easy, you just send one byte per character, but UCS is 31-bit, so you need four bytes per character, which is both inefficient and gives you problems with endianness. Also, not all other software can handle all the characters that exist in Unicode, but we still need to talk to them.
The solution is to use encodings: ways to convert Unicode text data to 8-bit binary data. Notable here is that ASCII is an encoding, and that ASCII data, from the viewpoint of UMMU, is therefore binary data. You might be tricked into believing it’s text, as it sure looks like it, but don’t believe it! In UMMU only Unicode is text; everything else is binary data!
In most cases an encoding is also a character set, and named after the character set it encodes. This goes for Latin-1, ISO-8859-15, cp1252, ASCII, etc. Although most character sets also are encodings, this gets confusing with UCS, which is a character set but not an encoding. It’s also confusing because UCS is something you decode to or encode from, while all other character sets are something you decode from or encode to. So you need to think of character sets and encodings as different things, even though the words are often used interchangeably.
Since an encoding is a way to convert Unicode text to a binary sequence for the purpose of storage or transfer that means that Unicode is not an encoding! And neither is UCS.
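You can see the text/binary split of UMMU directly in Python 3, where str is Unicode text and bytes is binary data; a small sketch:

```python
# str is Unicode text, bytes is binary data.
text = 'Hëllo'

# Encoding turns text into binary data; decoding turns it back.
data = text.encode('utf-8')
assert isinstance(data, bytes)
assert data.decode('utf-8') == text

# An encoding tied to a small character set can't represent all of UCS:
try:
    '乨'.encode('latin-1')
except UnicodeEncodeError:
    print('latin-1 cannot represent this character')
```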
Most encodings are tied to a character set that is only a small subset of UCS. This becomes a problem for multilingual data, so an encoding is needed that encompasses all of UCS. Encoding an 8-bit character set is trivial, as you get one character per byte, but UCS is 31-bit so you need four bytes to encode it. That gives you byte ordering problems, as some systems are big-endian and others little-endian, and in addition almost all of those bytes would always be zero, so it’s a big waste of space. You can make cleverer encodings by giving different characters different lengths, but then the encoding becomes efficient for some scripts and inefficient for others.
The solution to these conundrums is to have several different encodings you can choose from. They are called Unicode Transformation Formats, or UTF.
UTF-8 is the most common encoding on the Internet. It maps all of ASCII to one byte, and all other characters in UCS to between 2 and 4 bytes. That’s very efficient for languages using scripts derived from Latin, which are mostly ASCII, and it’s reasonably efficient for Greek, Coptic, Cyrillic, Hebrew, Armenian, Arabic, Syriac and Tāna, which use two bytes per character. It’s inefficient for everything else in the Basic Multilingual Plane, because you’ll end up using three bytes per character, and you’ll need four bytes for the rest of the things in UCS, like the Gothic script.
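You can verify those byte counts yourself; a quick Python check:

```python
# UTF-8 length depends on the script:
print(len('H'.encode('utf-8')))           # 1 byte (ASCII)
print(len('ю'.encode('utf-8')))           # 2 bytes (Cyrillic)
print(len('乨'.encode('utf-8')))          # 3 bytes (CJK, in the BMP)
print(len('\U00010330'.encode('utf-8')))  # 4 bytes (Gothic, outside the BMP)
```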
UTF-16 will map all characters in the Basic Multilingual Plane as one 16-bit word, and all other UCS characters as two 16-bit words. So if the script you are using isn’t amongst the ones mentioned above, then UTF-16 is more efficient. But since UTF-16 uses 16-bit words, you get byte order problems. This is solved by having three variants, UTF-16BE for big-endian, UTF-16LE for little-endian and just plain UTF-16, that can be either UTF-16BE or UTF-16LE and where the encoding starts with a marker to tell you which it is. That marker is called a byte order mark, or a “BOM”.
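The BOM and the three variants are easy to observe in Python (the byte order of plain UTF-16 depends on your platform, so the sketch only checks that a BOM is present):

```python
# Plain UTF-16 starts with a byte order mark (BOM):
data = 'H'.encode('utf-16')
assert data[:2] in (b'\xff\xfe', b'\xfe\xff')  # LE or BE marker

# The BE/LE variants have a fixed byte order and no BOM:
assert 'H'.encode('utf-16-be') == b'\x00H'
assert 'H'.encode('utf-16-le') == b'H\x00'

# Characters outside the BMP take two 16-bit words (4 bytes):
assert len('\U00010330'.encode('utf-16-be')) == 4
```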
There is also UTF-32, which has the same BE and LE variants as UTF-16, and stores each Unicode character as one 32-bit integer. This is inefficient for almost any script except the extinct ones, which use 4 bytes no matter what, but it’s easy to handle computationally, as you always have 4 bytes per character.
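The fixed width is easy to confirm; a Python sketch:

```python
# UTF-32 always uses 4 bytes per character, regardless of script:
for ch in ('H', '乨', '\U00010330'):
    assert len(ch.encode('utf-32-be')) == 4
```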
For practical purposes (that I intend to discuss more in a later installment) it’s important to separate the encoded data from the Unicode data. Therefore it is important to not think of UTF-8/16/32 data as Unicode. So although the UTF encodings are defined in the Unicode standard, for the purposes of UMMU we’ll say that UTF is not Unicode.
UCS has combining characters, such as the combining diaeresis, which puts two dots above the preceding character. This leads to ambiguity, as you can express the same grapheme (letter or sign) using different characters. Take for example ‘ö’, which can be represented both as the single character LATIN SMALL LETTER O WITH DIAERESIS and as two characters, LATIN SMALL LETTER O followed by COMBINING DIAERESIS.
But in real life you can’t just follow any character with a diaeresis. There is no sense in putting two dots above a Euro sign, for example. So Unicode adds rules for these things, saying that indeed, you can express ‘ö’ in both ways, and that they should indeed be equivalent, but if you stick a combining diaeresis after a Euro sign you probably didn’t really mean it. So which characters can successfully be combined with other characters is a part of the Unicode standard.
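One place where these equivalence rules show up in code is Unicode normalization, which the standard defines; in Python it lives in the unicodedata module:

```python
import unicodedata

# Two ways to express the same grapheme 'ö':
composed = '\u00f6'     # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = 'o\u0308'  # LATIN SMALL LETTER O + COMBINING DIAERESIS

assert composed != decomposed  # different code point sequences...

# ...but normalization converts between the equivalent forms:
assert unicodedata.normalize('NFC', decomposed) == composed
assert unicodedata.normalize('NFD', composed) == decomposed
```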
There are also collating rules (us normal humans call it sorting), as well as rules on how to split a text into sentences and words (if you think that’s easy, consider that many Asian languages have no spaces between words), and loads of other rules about things that are immensely subtle in how scripts are written and handled. Usually you don’t need to know about these things unless they bite you, which is unusual, although apparently less unusual with Asian scripts.
So, the UMMU view here is: Unicode is UCS plus rules for how to handle text. Or in other words: Unicode is a way to handle textual data, no matter what script it is in. In Unicode, ‘H’ is not just a character; it has some sort of meaning. A character set only says that the character with number 72 is an ‘H’, but Unicode will tell you that ‘H’ should be sorted before ‘I’ and that you can put two dots over it to make an ‘Ḧ’.
So Unicode is not an encoding, and not a character set, it’s a way.