UTF-8

From Hydrogenaudio Knowledgebase

UTF-8 stands for UCS Transformation Format, 8-bit. It is an upward-compatible extension of ASCII that can portably encode every character of the Universal Character Set (UCS), and therefore text in any language.

The following ASCII control characters (Range: 0x00...0x1F, 0x7F) are allowed:

  • 0x0A: Line feed (Unix Way)
  • 0x0C: Form feed (with intrinsic line feed)


UTF-8 Properties

  • UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility).
    This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
  • All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.
    Therefore, no ASCII byte (0x00 to 0x7F) can appear as part of any other character.
  • The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF.
    This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
  • UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are at most three bytes long. (The current standard, RFC 3629, limits UTF-8 to U+10FFFF and therefore to four bytes.)
  • The sorting order of big-endian UCS-4 byte strings is preserved.
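  • The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
    UTF-8 is a plain byte sequence and has no endianness of its own. The pair 0xFE 0xFF (or 0xFF 0xFE) is the byte order mark of UTF-16, so its presence at the start of a file indicates UTF-16 rather than UTF-8.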
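
The resynchronization property can be illustrated with a minimal sketch. It relies only on the fact stated above that continuation bytes lie in the range 0x80 to 0xBF; the function names are hypothetical and chosen for this example, and the code does not validate the input.

```c
#include <assert.h>
#include <stddef.h>

/* Continuation bytes have the bit pattern 10xxxxxx (0x80 to 0xBF). */
int utf8_is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Starting at an arbitrary offset pos, skip forward to the first byte
 * that begins a character. Returns len if no such byte is found. */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL || len == 0);
    while (pos < len && utf8_is_continuation(buf[pos]))
        pos++;
    return pos;
}

/* Count characters: every byte that is not a continuation byte
 * starts exactly one character (no validation is performed). */
size_t utf8_count_chars(const unsigned char *buf, size_t len)
{
    size_t i, n = 0;

    assert(buf != NULL || len == 0);
    for (i = 0; i < len; i++)
        if (!utf8_is_continuation(buf[i]))
            n++;
    return n;
}
```

Because no state has to be carried between characters, a reader that starts in the middle of a stream loses at most one character before it is back in step.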

API Calls

Windows API

MultiByteToWideChar - convert a multibyte string to a wide-character string

   int
   MultiByteToWideChar ( UINT    CodePage,        // code page
                         DWORD   dwFlags,         // character-type options
                         LPCSTR  lpMultiByteStr,  // address of string to map
                         int     cchMultiByte,    // number of bytes in string
                         LPWSTR  lpWideCharStr,   // address of wide-character buffer
                         int     cchWideChar );   // size of buffer, in wide characters

Convert the string from the current locale's multibyte encoding to Unicode (wide characters), then encode the result to UTF-8 using the simple generic scheme below. When CodePage is CP_ACP, the behaviour of the function depends on the locale settings of LOCAL_MACHINE/CURRENT_USER.


ISO API

mbstowcs - convert a multibyte string to a wide character string

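mbsrtowcs - convert a multibyte string to a wide character string (restartable: the conversion state is kept in the caller-supplied mbstate_t)

mbsnrtowcs - like mbsrtowcs, but reads at most nms bytes from the source (a POSIX extension, not part of ISO C)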

   #include <stdlib.h>
   size_t  mbstowcs ( wchar_t* dst, const char* src, size_t maxlen );
   #include <wchar.h>
   size_t  mbsrtowcs  ( wchar_t* dst, const char** src,             size_t maxlen, mbstate_t* ps );
   size_t  mbsnrtowcs ( wchar_t* dst, const char** src, size_t nms, size_t maxlen, mbstate_t* ps );


The interface is very similar to the Windows API, but more cryptic and thus more difficult to understand.

Convert the string from the current locale's multibyte encoding to Unicode (wide characters), then encode the result to UTF-8 using the simple generic scheme below. The behaviour of the functions depends on the locale category LC_CTYPE, which is usually taken from the environment variable $LC_CTYPE.
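
The first step (locale encoding to wide characters) might look like the following sketch. The helper name locale_to_wide is hypothetical; it uses mbsrtowcs with a null destination to measure the required length first. Whether wchar_t actually holds Unicode code numbers is implementation-defined (implementations that guarantee it define __STDC_ISO_10646__), so this is an assumption of the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts a multibyte string in the current
 * LC_CTYPE encoding to a newly allocated wide-character string.
 * Returns NULL on failure; the caller releases the result with free(). */
wchar_t *locale_to_wide(const char *src)
{
    mbstate_t state;
    const char *p = src;
    size_t n;
    wchar_t *dst;

    assert(src != NULL);

    /* First pass: with dst == NULL only the length is computed. */
    memset(&state, 0, sizeof state);
    n = mbsrtowcs(NULL, &p, 0, &state);
    if (n == (size_t)-1)
        return NULL;                /* invalid multibyte sequence */

    dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Second pass: convert, including the terminating L'\0'. */
    memset(&state, 0, sizeof state);
    p = src;
    if (mbsrtowcs(dst, &p, n + 1, &state) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

A program would normally call setlocale(LC_CTYPE, "") once at startup so that the conversion follows the user's environment; without that call the default "C" locale is in effect.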


Conversion scheme

Unicode Glyph             Binary Representation of Glyph in Unicode  Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
U-00000000... U-0000007F  00000000 00000000 00000000 0xxxxxxx        0xxxxxxx
U-00000080... U-000007FF  00000000 00000000 00000xxx xxyyyyyy        110xxxxx  10yyyyyy
U-00000800... U-0000FFFF  00000000 00000000 xxxxyyyy yyzzzzzz        1110xxxx  10yyyyyy  10zzzzzz
U-00010000... U-001FFFFF  00000000 000xxxyy yyyyzzzz zzuuuuuu        11110xxx  10yyyyyy  10zzzzzz  10uuuuuu
U-00200000... U-03FFFFFF  000000xx yyyyyyzz zzzzuuuu uuvvvvvv        111110xx  10yyyyyy  10zzzzzz  10uuuuuu  10vvvvvv
U-04000000... U-7FFFFFFF  0xyyyyyy zzzzzzuu uuuuvvvv vvssssss        1111110x  10yyyyyy  10zzzzzz  10uuuuuu  10vvvvvv  10ssssss


Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
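
The two rules above (length from the leading 1 bits, shortest form only) can be sketched as a decoder for a single character. The function name utf8_decode is hypothetical, and the sketch follows the six-row scheme of this article rather than the four-byte limit of RFC 3629.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decodes one character from s (len bytes available).
 * Stores the code number in *cp and returns the number of bytes consumed,
 * or 0 if the sequence is truncated, malformed, or not the shortest form. */
size_t utf8_decode(const unsigned char *s, size_t len, unsigned long *cp)
{
    /* Smallest code number that really needs n bytes, indexed by n. */
    static const unsigned long min_cp[7] = {
        0UL, 0x0UL, 0x80UL, 0x800UL, 0x10000UL, 0x200000UL, 0x4000000UL
    };
    unsigned long value;
    size_t n, i;

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    /* The number of leading 1 bits in the first byte gives the length. */
    if      (s[0] < 0x80) { n = 1; value = s[0]; }
    else if (s[0] < 0xC0) { return 0; }               /* stray continuation byte */
    else if (s[0] < 0xE0) { n = 2; value = s[0] & 0x1F; }
    else if (s[0] < 0xF0) { n = 3; value = s[0] & 0x0F; }
    else if (s[0] < 0xF8) { n = 4; value = s[0] & 0x07; }
    else if (s[0] < 0xFC) { n = 5; value = s[0] & 0x03; }
    else if (s[0] < 0xFE) { n = 6; value = s[0] & 0x01; }
    else                  { return 0; }               /* 0xFE and 0xFF are never valid */

    if (len < n)
        return 0;                                     /* truncated sequence */

    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                                 /* not a continuation byte */
        value = (value << 6) | (unsigned long)(s[i] & 0x3F);
    }

    if (value < min_cp[n])
        return 0;                                     /* longer than necessary */

    *cp = value;
    return n;
}
```

Rejecting over-long forms matters in practice: the sequence 0xC0 0x80 would otherwise decode to U+0000 and could slip a NUL past code that only checks for the byte 0x00.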

Examples: The Unicode character U+00A9 = 00010 101001 (copyright sign) is encoded in UTF-8 as

 11000010 10101001 = 0xC2 0xA9 

and character U+2260 = 0010 001001 100000 (not equal to) is encoded as:

 11100010 10001001 10100000 = 0xE2 0x89 0xA0
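
The conversion scheme and the two examples can be reproduced with a short encoder. This is a sketch under the assumptions of the table in this article (code numbers up to U-7FFFFFFF, up to six bytes); the name utf8_encode is hypothetical, and an encoder that follows RFC 3629 would additionally reject values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encodes the code number cp into out (room for six
 * bytes) and returns the number of bytes written, or 0 if cp is outside
 * the range covered by the scheme. */
size_t utf8_encode(unsigned long cp, unsigned char out[6])
{
    size_t n, i;
    unsigned char lead;

    assert(out != NULL);

    /* Pick the shortest row of the table that can hold cp. */
    if      (cp < 0x80UL)         { out[0] = (unsigned char)cp; return 1; }
    else if (cp < 0x800UL)        { n = 2; lead = 0xC0; }
    else if (cp < 0x10000UL)      { n = 3; lead = 0xE0; }
    else if (cp < 0x200000UL)     { n = 4; lead = 0xF0; }
    else if (cp < 0x4000000UL)    { n = 5; lead = 0xF8; }
    else if (cp <= 0x7FFFFFFFUL)  { n = 6; lead = 0xFC; }
    else                          { return 0; }

    /* Fill the continuation bytes from the end, six bits at a time. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    /* The remaining high bits go into the first byte after the length marker. */
    out[0] = (unsigned char)(lead | cp);
    return n;
}
```

For U+00A9 this writes 0xC2 0xA9, and for U+2260 it writes 0xE2 0x89 0xA0, matching the examples above.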


Additional Reading