Unicode 101: An Introduction to the Unicode Standard

by Beshar Bahjat and Igor Cavalleri

What is Unicode?

In computer systems, characters are transformed and stored as numbers (sequences of bits) that can be handled by the processor. A code page is an encoding scheme that maps a specific sequence of bits to its character representation. The pre-Unicode world was populated with hundreds of different encoding schemes that assigned a number to each letter or character. Many such schemes included code pages that contained only 256 characters - each character requiring 8 bits of storage. While this was relatively compact, 256 code values cannot hold ideographic character sets such as Chinese and Japanese, which contain thousands of characters. It also did not allow the character sets of different languages to co-exist in the same text, because the same number meant a different character in each code page.
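To make the co-existence problem concrete, here is a minimal Python sketch that decodes one and the same byte under three legacy code pages. The byte value 0xE9 and the three codecs are illustrative choices, not taken from the text above.

```python
# One byte, three legacy code pages, three different characters.
raw = bytes([0xE9])

decoded = {
    "latin-1": raw.decode("latin-1"),        # Western European (ISO-8859-1)
    "cp1251": raw.decode("cp1251"),          # Cyrillic (Windows-1251)
    "iso8859_7": raw.decode("iso8859_7"),    # Greek (ISO-8859-7)
}

for name, char in decoded.items():
    print(f"{name:10} 0x{raw[0]:02X} -> {char} (U+{ord(char):04X})")
```

Without knowing which code page was used to write the byte, a program cannot tell which of these characters was meant.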

Unicode is an attempt to unify all of these different schemes into one universal text-encoding standard.

The importance of Unicode

Unicode provides a single mechanism that covers the character repertoires of the regionally popular encoding systems - such as the ISO-8859 variants in Europe, Shift-JIS in Japan, or Big5 in Taiwan and Hong Kong.

From a translation/localization point of view, Unicode is an important step towards standardization, at least from a tools and file format standpoint.

Bottom line: Unicode is a worldwide character-encoding standard, published by the Unicode Consortium. Computers store numbers that represent a character; Unicode provides a unique number for every character.
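The "unique number for every character" can be inspected directly. The sketch below uses Python's built-in ord() and chr(); the three sample characters are arbitrary examples.

```python
# Each character has exactly one code point, independent of any byte encoding.
samples = ["A", "\u00e9", "\u65e5"]   # Latin A, e with acute accent, CJK "sun/day"

for char in samples:
    code_point = ord(char)            # character -> number
    assert chr(code_point) == char    # number -> the same character
    print(f"{char!r} is U+{code_point:04X}")

code_points = [ord(c) for c in samples]
```

The numbers printed here are code points. How they are stored as bytes is a separate question, answered by the encoding forms described later.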

The Unicode Standard

The Unicode Standard is the universal character-encoding standard used for representation of text for computer processing.

Versions of the Unicode Standard are fully compatible and synchronized with the corresponding versions of International Standard ISO/IEC 10646, which defines the Universal Character Set character encoding.

In other words, at the time of writing Unicode contains all the same characters and code points as ISO/IEC 10646:2003 and provides codes for 96,447 characters, more than enough to encode all of the world's alphabets, ideograms, and symbols. Later versions of the standard add further characters.

It is platform, program, and language independent.

However, Unicode is a standard scheme for representing plain text - it is not a scheme for representing rich text.

Acronyms and Definitions

Often, while reading about Unicode you will encounter acronyms such as UCS-*, UTF-*, and BOM. Let's clarify those abbreviations.

UCS-* and UTF-*

Two fixed-width encoding schemes store each character of Unicode text as either 2 or 4 bytes. The official terms for these encodings are UCS-2 and UCS-4, respectively.

UCS stands for Universal Character Set as specified by ISO/IEC 10646.

The number indicates the number of octets (an octet is 8 bits) in the coded character set. UCS-2 indicates two octets, while UCS-4 indicates four octets.

UTF stands for Unicode Transformation Format.

To remain compatible with older systems that did not support Unicode, the Unicode Consortium defined encoding forms, each of which specifies how a character's code point is represented as a sequence of bits.

The number indicates the size of the code unit used by the encoding form: UTF-8 uses 8-bit code units, while UTF-16 uses 16-bit code units.
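As a rough illustration of what the numbers in these names mean, the following Python sketch counts how many code units each encoding form needs for a few sample characters. The helper name code_units is hypothetical. Python ships no codec named UCS-2 or UCS-4, so the sketch uses the UTF-16 and UTF-32 codecs, which give the same bytes for characters that fit in two octets.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units used to encode text in the given encoding form."""
    return len(text.encode(encoding)) // unit_bytes

# The "-be" codec variants are used so that no BOM is added to the output.
samples = ["A", "\u00e9", "\u65e5", "\U0001F600"]
for ch in samples:
    print(
        f"U+{ord(ch):04X}:",
        code_units(ch, "utf-8", 1), "x 8-bit,",
        code_units(ch, "utf-16-be", 2), "x 16-bit,",
        code_units(ch, "utf-32-be", 4), "x 32-bit",
    )
```

The last sample lies outside the range that two octets can address, which is why it needs two 16-bit code units.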

What is a BOM?

BOM stands for Byte Order Mark, and is the encoding signature for the file - a particular sequence of bytes at the beginning of the file that indicates the encoding and the byte order.
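A program can look for this signature before decoding a file. The following is a minimal sketch, not a complete encoding detector; the function name sniff_bom is hypothetical, while the BOM constants come from Python's standard codecs module.

```python
import codecs

# Longer signatures first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"\xff\xfeh\x00i\x00"))
```

A file without a BOM yields (None, 0), in which case the encoding has to be known from some other source. One caveat of this simple approach: UTF-16 Little Endian text whose first character after the BOM is U+0000 would be mistaken for UTF-32.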

UTF-32, UTF-16, and UTF-8

Let's take a closer look at the three character encoding forms known as Unicode Transformation Formats (UTF).

UTF-32

UTF-32 represents each 21-bit code point value as a single 32-bit code unit. (In Unicode, code points are 21-bit integers.) UTF-32 suits systems where 32-bit values are easier or faster to process: it is popular where memory space is of little concern but fixed-width, single-code-unit access to characters is desired.
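Fixed-width access can be sketched in a few lines of Python. The helper code_point_at is a hypothetical name, and the example assumes big-endian UTF-32 bytes without a BOM.

```python
text = "A\u65e5\U0001F600"           # three code points
buf = text.encode("utf-32-be")       # no BOM, four bytes per code point

def code_point_at(buf: bytes, index: int) -> int:
    """Read the index-th code point from big-endian UTF-32 bytes."""
    start = 4 * index                # fixed width makes this a plain multiply
    return int.from_bytes(buf[start:start + 4], "big")

print(len(buf), hex(code_point_at(buf, 2)))
```

Because every code point occupies exactly four bytes, finding the n-th one needs no scanning, at the cost of twelve bytes for a three-character string.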

UTF-16

UTF-16 represents each 21-bit code point value as a sequence of one or two 16-bit code units.
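The "one or two" rule works as follows: code points below U+10000 fit in a single code unit, and higher ones are split into a pair of so-called surrogate code units. Below is a simplified Python sketch; the function name to_utf16_units is hypothetical, and the code does not validate its input (for example, it does not reject values above U+10FFFF).

```python
def to_utf16_units(code_point: int) -> list:
    """Split a code point into one or two 16-bit code units."""
    if code_point < 0x10000:
        return [code_point]                  # fits in a single code unit
    offset = code_point - 0x10000            # 20 bits remain
    high = 0xD800 + (offset >> 10)           # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)          # low 10 bits -> low surrogate
    return [high, low]

print([hex(u) for u in to_utf16_units(0x65E5)])
print([hex(u) for u in to_utf16_units(0x1F600)])
```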

The vast majority of characters are represented with single 16-bit code units, making it a good general-use compromise between UTF-32 and UTF-8. UTF-16 is the oldest Unicode encoding form and is the form specified by the Java and JavaScript programming languages and by XML Document Object Model APIs.

Each code unit has 16 bits (two bytes, 2*8 = 16 bits). Because a code unit spans two bytes, UTF-16 distinguishes between Big Endian and Little Endian byte order.

Consider the hexadecimal number 1077.

Two bytes are necessary to store this number: 10 and 77.

But the order of the bytes in memory depends on the processor of the computer. In other words, some computers write "10, 77", and others write "77, 10".

The first method is referred to as "Big Endian" and the second as "Little Endian".

Based on that, there are two UTF-16 representations: UTF-16BE (Big Endian) and UTF-16LE (Little Endian).

Microsoft Word's Unicode, for example, uses Little Endian, while Big Endian is recommended for the Internet. Files that are encoded as UTF-16 usually start with a Byte Order Mark (BOM), which identifies them as Big Endian or Little Endian.
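The byte-order example with the number 1077 (base 16) can be reproduced in Python. This is only an illustration; the variable names are arbitrary.

```python
value = 0x1077

big = value.to_bytes(2, "big")         # most significant byte first: 10, 77
little = value.to_bytes(2, "little")   # least significant byte first: 77, 10
print(big.hex(), little.hex())

# The same ordering applies to the 16-bit code unit of code point U+1077.
ch = chr(value)
utf16_be = ch.encode("utf-16-be")
utf16_le = ch.encode("utf-16-le")
```

The two UTF-16 byte strings contain the same code unit and differ only in the order of its two bytes.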

UTF-8

UTF-8 is the default encoding form for a wide variety of Internet standards and represents each 21-bit code point value as a sequence of one to four 8-bit code units. The W3C (World Wide Web Consortium) specifies that all XML processors must read UTF-8 and UTF-16 encoding.

A BOM can serve as an encoding signature that lets programs differentiate between UTF-8 and UTF-16. In XML, a document encoded in UTF-16 must begin with a BOM, while for UTF-8 the BOM is optional.

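Since this encoding scheme has gained widespread acceptance, let's take a deeper look into it. Code points U+0000 to U+007F (the ASCII range) take one byte, U+0080 to U+07FF take two bytes, U+0800 to U+FFFF take three bytes, and U+10000 to U+10FFFF take four bytes. Plain ASCII text is therefore already valid UTF-8.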
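How a code point is spread across one to four bytes can be sketched in Python. The function utf8_encode below is a hypothetical, simplified encoder written for illustration; it does not reject surrogate code points, and real programs should use the built-in str.encode("utf-8") instead.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point as UTF-8 by hand (no surrogate checks)."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 11110xxx and three 10xxxxxx bytes
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")

for cp in (0x41, 0xE9, 0x65E5, 0x1F600):
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```

The leading bits of the first byte tell a decoder how many bytes belong to the character, and every following byte starts with the bits 10, so the start of a character can always be found.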

Each encoding scheme has its advantages and drawbacks, but all three encoding forms need at most 4 bytes (or 32 bits) of data for each character.

What makes Unicode so beneficial? Arguably, it is that any character, letter, or symbol can be represented using one single standard.


 
