Unicode 101: An Introduction to the Unicode Standard

by Beshar Bahjat and Igor Cavalleri

What is Unicode?

In computer systems, characters are stored as numbers (sequences of bits) that the processor can handle. A code page is an encoding scheme that maps each bit sequence to a character. The pre-Unicode world was populated with hundreds of different encoding schemes, each assigning its own numbers to letters and other characters. Many of these code pages contained only 256 characters, each requiring 8 bits of storage. While this was relatively compact, it was insufficient for ideographic character sets containing thousands of characters, such as those of Chinese and Japanese, and it did not allow the character sets of many languages to coexist.
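
The ambiguity of code pages is easy to demonstrate. The following Python sketch is an illustration only; the byte value 0xE9 and the three code pages are arbitrary choices, not taken from any particular system. It decodes the same byte under three legacy encodings and obtains three different characters.

```python
import unicodedata

# The same 8-bit value means different characters in different code pages.
raw = bytes([0xE9])

# Codec names are Python's standard aliases for these legacy code pages.
decoded = {codec: raw.decode(codec) for codec in ("latin-1", "iso8859_7", "cp1251")}

for codec, ch in decoded.items():
    print(f"{codec:>10}: U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Under Latin-1 the byte is a Latin e with acute accent, under ISO-8859-7 it is the Greek letter iota, and under Windows-1251 it is the Cyrillic letter short i.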

Unicode is an attempt to unify all these different schemes in one universal text-encoding standard.

The importance of Unicode

Unicode provides a single mechanism that covers the character repertoires of regionally popular encoding systems, such as the ISO-8859 variants in Europe, Shift-JIS in Japan, or Big5 for Traditional Chinese.

From a translation/localization point of view, Unicode is an important step towards standardization, at least from a tools and file format standpoint.

  • Unicode enables a single software product or a single website to be designed for multiple platforms, languages and countries (no need for re-engineering) which can lead to a significant reduction in cost over the use of legacy character sets.
  • Unicode data can be used through many different systems without data corruption.
  • Unicode represents a single encoding scheme for all languages and characters.
  • Unicode is a common point in the conversion between other character encoding schemes. Since it is a superset of all of the other common character encoding systems, you can convert from one encoding scheme to Unicode, and then from Unicode to the other encoding scheme.
  • Unicode is the preferred encoding scheme used by XML-based tools and applications.
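
The idea of Unicode as a common point for conversion can be sketched in a few lines of Python. This is a minimal illustration, assuming Shift-JIS as the source and EUC-JP as the target; the helper name transcode is hypothetical and not part of any library.

```python
def transcode(data, source, target):
    """Convert bytes between two legacy encodings, using Unicode as the pivot."""
    text = data.decode(source)    # legacy bytes -> Unicode string
    return text.encode(target)    # Unicode string -> target bytes

# Three kanji (the Japanese word for "Japanese language"), written as escapes.
sample = "\u65e5\u672c\u8a9e"

sjis_bytes = sample.encode("shift_jis")
euc_bytes = transcode(sjis_bytes, "shift_jis", "euc_jp")

print(sjis_bytes.hex(), "->", euc_bytes.hex())
```

The byte sequences differ between the two encodings, but both decode to the same Unicode string, so no conversion table between Shift-JIS and EUC-JP is needed.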

Bottom line: Unicode is a worldwide character-encoding standard, published by the Unicode Consortium. Computers store numbers that represent a character; Unicode provides a unique number for every character.
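
As a small illustration of "a unique number for every character", the Python sketch below lists the code points of four sample characters. The characters are arbitrary examples; ord, chr, and unicodedata.name are standard Python facilities.

```python
import unicodedata

# Latin A, e with acute, euro sign, and a CJK ideograph, written as escapes.
text = "A\u00e9\u20ac\u4e2d"

# ord() returns the Unicode code point of a character; chr() is its inverse.
code_points = [ord(ch) for ch in text]

for cp in code_points:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```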

The Unicode Standard

The Unicode Standard is the universal character-encoding standard used for representation of text for computer processing.

Versions of the Unicode Standard are fully compatible and synchronized with the corresponding versions of International Standard ISO/IEC 10646, which defines the Universal Character Set character encoding.

In other words, Unicode contains all the same characters and code points as ISO/IEC 10646:2003 and, as of that edition, provides codes for 96,447 characters, more than enough to encode all of the world's alphabets, ideograms, and symbols.

It is platform, program, and language independent.

However, Unicode is a standard scheme for representing plain text - it is not a scheme for representing rich text.

Acronyms and Definitions

Often, while reading about Unicode you will encounter acronyms such as UCS-*, UTF-*, and BOM. Let's clarify those abbreviations.

UCS-* and UTF-*

The two most common encoding schemes store each character of Unicode text as a fixed-width unit of either 2 or 4 bytes. The official terms for these encodings are UCS-2 and UCS-4, respectively.

UCS stands for Universal Character Set as specified by ISO/IEC 10646.

The number indicates the number of octets (an octet is 8 bits) in the coded character set. UCS-2 indicates two octets, while UCS-4 indicates four octets.

UTF stands for Unicode Transformation Format.

To remain compatible with older systems that did not support Unicode, the Unicode Consortium defined encoding forms, each of which specifies how a character's code point is represented as a sequence of bits.

The number indicates the size, in bits, of the code unit that the encoding form uses: UTF-8 uses 8-bit code units, while UTF-16 uses 16-bit code units.

What is a BOM?

BOM stands for Byte Order Mark, and is the encoding signature for the file - a particular sequence of bytes at the beginning of the file that indicates the encoding and the byte order.
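
A program can use the BOM as a signature by comparing the first bytes of a file against the known sequences. The following Python sketch is one possible approach, not a complete encoding detector; the function name sniff_bom is hypothetical, while the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longer signatures are checked first, because the UTF-32 Little Endian BOM
# (FF FE 00 00) begins with the UTF-16 Little Endian BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

encoding, bom_length = sniff_bom(b"\xef\xbb\xbfhello")
print(encoding, bom_length)   # prints "utf-8 3"
```

This is a heuristic: a UTF-16 Little Endian file whose first character after the BOM is U+0000 would be reported as UTF-32, and a file without a BOM is not identified at all.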

UTF-32, UTF-16, and UTF-8

Let's take a closer look at the three character encoding forms known as Unicode Transformation Formats (UTF).

UTF-32

UTF-32 represents each 21-bit code point value as a single 32-bit code unit. (In Unicode, code points are 21-bit integers.) It is optimized for systems where 32-bit values are easier or faster to process and memory space is of little concern, but fixed-width, single-code-unit access to characters is desired.
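
The fixed width is what makes direct access simple. The Python sketch below, using an arbitrary four-character sample, encodes a string as UTF-32 and reads the third character straight from its byte offset. The Big Endian byte order is an assumption made here so that the output is the same on every machine.

```python
# Latin A, e with acute, a CJK ideograph, and an emoji outside the 16-bit range.
text = "A\u00e9\u4e2d\U0001F600"

# Naming the byte order explicitly means no BOM is prepended.
encoded = text.encode("utf-32-be")

# Every character occupies exactly 4 bytes, so character n starts at byte 4*n.
n = 2
third = int.from_bytes(encoded[4 * n : 4 * n + 4], "big")

print(len(text), len(encoded), f"U+{third:04X}")   # prints "4 16 U+4E2D"
```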

UTF-16

UTF-16 represents each 21-bit code point value as a sequence of one or two 16-bit code units.

The vast majority of characters are represented with single 16-bit code units, making it a good general-use compromise between UTF-32 and UTF-8. UTF-16 is the oldest Unicode encoding form and is the form specified by the Java and JavaScript programming languages and by XML Document Object Model APIs.
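
The "one or two code units" rule can be made concrete. Code points below 0x10000 fit in a single 16-bit unit; higher code points are split into a pair of units known as surrogates. The Python sketch below follows that scheme; the function name utf16_code_units is hypothetical, and the input is assumed to be a valid code point that is not itself a surrogate.

```python
def utf16_code_units(cp):
    """Return the UTF-16 code units (as integers) for one code point."""
    if cp < 0x10000:
        return [cp]                      # fits in a single 16-bit unit
    offset = cp - 0x10000                # 20 bits remain to be stored
    high = 0xD800 + (offset >> 10)       # upper 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits -> low surrogate
    return [high, low]

print([f"{u:04X}" for u in utf16_code_units(0x4E2D)])    # prints ['4E2D']
print([f"{u:04X}" for u in utf16_code_units(0x1F600)])   # prints ['D83D', 'DE00']
```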

Each code unit has 16 bits (two bytes, 2*8 = 16 bits), so the order in which the two bytes are stored matters. With UTF-16, we therefore distinguish between Big Endian and Little Endian.

Consider the number 1077 (base 16).

Two bytes are necessary to store this number:

  • one byte stores the most significant part (10)
  • the other byte stores the least significant part (77).

But the order of the bytes in memory depends on the processor of the computer. In other words, some computers write "10, 77", and others write "77, 10".

The first method is referred to as "Big Endian" and the second as "Little Endian".
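
The two byte orders for the number 1077 (base 16) can be reproduced in Python. This is a small illustrative sketch; it relies only on the built-in int.to_bytes method and the standard UTF-16 codecs.

```python
value = 0x1077

big = value.to_bytes(2, "big")         # most significant byte first
little = value.to_bytes(2, "little")   # least significant byte first

print(big.hex(), little.hex())         # prints "1077 7710"

# The UTF-16 codecs order the bytes of code point U+1077 in the same way.
utf16_be = chr(value).encode("utf-16-be")
utf16_le = chr(value).encode("utf-16-le")
```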

Based on that, there are two UTF-16 representations:

  • UTF-16 Big Endian
  • UTF-16 Little Endian.

Microsoft Word's Unicode, for example, uses Little Endian, while Big Endian is recommended for the Internet. Files that are encoded as UTF-16 usually start with a byte order mark (BOM) that identifies them as Big Endian or Little Endian.

UTF-8

UTF-8 is the default encoding form for a wide variety of Internet standards and represents each 21-bit code point value as a sequence of one to four 8-bit code units. The W3C (World Wide Web Consortium) specifies that all XML processors must read UTF-8 and UTF-16 encoding.

XML documents encoded in UTF-16 must begin with a BOM, which programs use as an encoding signature to differentiate between UTF-8 and UTF-16. In UTF-8, the BOM is optional.

Since this encoding scheme has gained widespread acceptance, let's take a deeper look into it:

UTF-8 is a compact, efficient Unicode encoding scheme. It is a multi-byte encoding that distributes a code point's bit pattern across 1, 2, 3, or 4 bytes, depending on the size of the value.
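
How the bits are distributed can be shown with a hand-written encoder. The Python sketch below follows the standard UTF-8 bit layout; the function name utf8_encode is hypothetical, and the input is assumed to be a valid code point (no surrogates, nothing above 0x10FFFF). In practice, the built-in str.encode("utf-8") does the same job.

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 by spreading its bits over 1 to 4 bytes."""
    if cp < 0x80:          # up to 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```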

UTF-8 encodes ASCII in only 1 byte. This means that languages that use Latin-based scripts can be represented with roughly 1.1 bytes per character on average. Other languages may require more bytes per character. Only the Asian scripts have significant encoding overhead in UTF-8 as compared to UTF-16.

UTF-8 is optimized for byte-oriented systems or systems where backward compatibility with ASCII is important.

For European languages, UTF-8 is more compact than UTF-16; for Asian languages, UTF-16 is more compact than UTF-8.
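
The size difference can be checked directly. The Python sketch below compares the encoded size of two short sample strings, one German and one Japanese; both samples are arbitrary, the helper name encoded_sizes is hypothetical, and real texts will give different ratios depending on their mix of characters.

```python
def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

samples = {
    # "Greetings from Munich" in German; umlauts and sharp s as escapes.
    "German": "Gr\u00fc\u00dfe aus M\u00fcnchen",
    # "Japanese text" in Japanese: three kanji, one hiragana, four katakana.
    "Japanese": "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8",
}

for language, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(f"{language}: UTF-8 {utf8_size} bytes, UTF-16 {utf16_size} bytes")
```

For the German sample UTF-8 is the smaller of the two; for the Japanese sample UTF-16 is.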

UTF-8 is useful for legacy systems that want Unicode support because developers do not need to drastically modify text-processing code. Code that assumes single byte code units typically doesn't fail completely when provided UTF-8 text instead of ASCII or even Latin-1.

Finally, unlike some legacy encoding schemes, UTF-8 is easy to parse. So-called "lead" and "trail" bytes are easily distinguished. Moving forward or backwards in a text string is easier in UTF-8 than in many other multi-byte encodings.
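
Trail bytes always have the bit pattern 10xxxxxx, which no lead byte or ASCII byte shares, so a program can find the start of a character from any position. The Python sketch below steps backwards through a byte string on that basis. The function names are hypothetical, and the input is assumed to be well-formed UTF-8 with an index greater than zero.

```python
def is_trail_byte(b):
    """True if the byte has the form 10xxxxxx, i.e. continues a character."""
    return (b & 0xC0) == 0x80

def previous_char_start(data, index):
    """Return the start index of the character that ends just before index."""
    index -= 1
    while index > 0 and is_trail_byte(data[index]):
        index -= 1
    return index

# Letter a, euro sign (3 bytes in UTF-8), letter b: 61 E2 82 AC 62
data = "a\u20acb".encode("utf-8")

pos = len(data)
while pos > 0:
    pos = previous_char_start(data, pos)
    print(pos)   # prints 4, then 1, then 0
```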

Each encoding form has its advantages and drawbacks, but all three need at most 4 bytes (32 bits) of data for each character.

What makes Unicode so beneficial? Arguably, that it enables any character, letter, or symbol to be represented using a single standard.

References

Richard Gillam, Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard. Addison-Wesley, 2002.