Zac Anger's Blog

Unicode

13 January, 2016

Unicode is not a 16-bit system. Unlike ASCII, which maps letters straight to byte values (A => 0100 0001, for example), Unicode maps characters to a theoretical concept called a code point. So it's an... ideal.
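
Here's a quick sketch of what that means in practice (plain JavaScript/TypeScript, no libraries): asking for a character's code point gets you back a number, not bytes.

```ts
// codePointAt returns the abstract code point (just a number),
// not any particular bytes on disk.
'A'.codePointAt(0); // 65   -- the ideal 'A', however it ends up stored
'ع'.codePointAt(0); // 1593 -- the Arabic letter Ain (more on that number below)
```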

This 'code point' ideal is represented by the numbers we've already seen, like U+0639, and such. The numbers are hexadecimal (base 16), and the U+ prefix is for 'Unicode'.
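
Those are the same numbers in two different bases, which is easy to check:

```ts
// 1593 in hex is 639, which is why the Arabic letter Ain is written U+0639.
(1593).toString(16);         // '639'
// Code point escapes go the other direction:
String.fromCodePoint(0x639); // 'ع'
'\u{639}' === 'ع';           // true
```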

So, early on in the life of Unicode, the idea was to store those numbers in two bytes each. But then there was some high-school level drama between the big-endian clique and the little-endian gang, so now we have these two bytes, FE FF or FF FE, at the beginning of a file--that's the Unicode byte order mark (BOM), which is just the character U+FEFF written out in whichever byte order the file actually uses.
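
A little sketch of where those bytes come from (this one assumes Node.js, just for its Buffer):

```ts
// U+FEFF at the start of a file is the byte order mark.
// Little-endian puts the low byte first, so it shows up as FF FE:
Buffer.from('\uFEFF', 'utf16le'); // <Buffer ff fe>

// 'A' (U+0041) in UTF-16LE, low byte first:
Buffer.from('A', 'utf16le');      // <Buffer 41 00>

// A big-endian encoder would write FE FF and 00 41 instead
// (Node's Buffer doesn't speak utf16be directly, so take my word on that half).
```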

But then, basically because Americans were still using the same restricted set of the Latin alphabet, we came up with UTF-8. That kind of screwed everyone else over, because now code points 0-127 fit in a single byte each, and everything from 128 on up gets stored in 2 to 4 bytes (the original design allowed up to 6, but UTF-8 is capped at 4 these days). So the characters we use every day in American English still look exactly the same as they did back when we used ASCII, but a whole bunch of other scripts (and symbols) get the short end, because their characters take several bytes apiece.
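
You can watch those byte counts happen (again a Node.js Buffer sketch; TextEncoder would show the same bytes for UTF-8):

```ts
// ASCII-range characters still take one byte, with the same bit pattern ASCII used:
Buffer.from('A', 'utf8');  // <Buffer 41>          1 byte
// Everything else takes more:
Buffer.from('ع', 'utf8');  // <Buffer d8 b9>       2 bytes
Buffer.from('€', 'utf8');  // <Buffer e2 82 ac>    3 bytes
Buffer.from('🎉', 'utf8'); // <Buffer f0 9f 8e 89> 4 bytes
```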
