UTF-8: Difference between revisions

Latest revision as of 21:36, 11 September 2006

UTF-8 stands for UCS Transformation Format 8 bit. It is an upward-compatible (ASCII-compatible) way to portably encode the characters of all languages covered by UCS/Unicode.

Of the ASCII control characters (range 0x00...0x1F and 0x7F), the following are allowed:

0x0A: Line feed (Unix Way)
0x0C: Form feed (with intrinsic line feed)

UTF-8 Properties

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility).
This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.
Therefore, no ASCII byte (0x00 to 0x7F) can appear as part of any other character.
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD, and it indicates how many bytes follow for this character (0xC0 and 0xC1 could only begin overlong sequences, so they never occur in valid data). All further bytes in a multibyte sequence are in the range 0x80 to 0xBF.
This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters need at most three bytes. (RFC 3629 restricts UTF-8 to code points up to U+10FFFF, that is, to at most four bytes.)
The sorting order of big-endian UCS-4 byte strings is preserved.
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
As a result, the UTF-16 byte order marks (0xFE 0xFF and 0xFF 0xFE) can never be mistaken for UTF-8 text. UTF-8 itself is a plain byte sequence and has no endianness.
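
The lead-byte and continuation-byte ranges listed above can be turned into a small classifier. The following is a minimal sketch, not a reference implementation: the function names `utf8_sequence_length` and `utf8_next_boundary` are hypothetical, and the length table follows the six-byte scheme described on this page rather than the stricter four-byte limit of RFC 3629.

```c
#include <stddef.h>

/* Hypothetical helper: returns the total sequence length announced by a
   lead byte (six-byte scheme), or 0 for bytes that cannot start a
   sequence: continuation bytes (0x80..0xBF) and 0xFE/0xFF. */
int utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;   /* ASCII */
    if (b < 0xC0) return 0;   /* continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return 0;                 /* 0xFE and 0xFF are never used */
}

/* Hypothetical helper illustrating resynchronization: starting at an
   arbitrary offset, skip continuation bytes until the next character
   boundary (or the end of the buffer) is reached. */
size_t utf8_next_boundary(const unsigned char *s, size_t len, size_t pos)
{
    while (pos < len && (s[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}
```

Because continuation bytes are recognizable on their own, a reader that starts in the middle of a character loses at most that one character.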

API Calls

Windows API

MultiByteToWideChar - convert a multibyte string to a wide character string

   int
   MultiByteToWideChar ( UINT    CodePage,        // code page of the source string (e.g. CP_ACP, CP_UTF8)
                         DWORD   dwFlags,         // character-type options
                         LPCSTR  lpMultiByteStr,  // address of string to map
                         int     cchMultiByte,    // number of bytes in string, or -1 if null-terminated
                         LPWSTR  lpWideCharStr,   // address of wide-character buffer
                         int     cchWideChar );   // size of buffer, in wide characters

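Converts a string from the current locale's encoding (multibyte) to Unicode (WideChar); the result can then be encoded to UTF-8 using the simple generic scheme below. The behaviour of the function depends on the CodePage argument and, for the default code page (CP_ACP), on the locale settings of the LOCAL_MACHINE/CURRENT_USER.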

ISO API

mbstowcs - convert a multibyte string to a wide character string

mbsrtowcs - convert a multibyte string to a wide character string (restartable)

   #include <stdlib.h>
   size_t  mbstowcs ( wchar_t* dst, const char* src, size_t maxlen );
   #include <wchar.h>
   size_t  mbsrtowcs  ( wchar_t* dst, const char** src,             size_t maxlen, mbstate_t* ps );
   size_t  mbsnrtowcs ( wchar_t* dst, const char** src, size_t nms, size_t maxlen, mbstate_t* ps );   /* POSIX.1-2008, not ISO C */

The interface is very similar to the Windows API, but more cryptic and thus more difficult to understand.

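Converts a string from the current locale's encoding (multibyte) to wide characters (Unicode on platforms where wchar_t holds UCS code points); the result can then be encoded to UTF-8 using the simple generic scheme below. The behaviour of the functions depends on the LC_CTYPE locale category, which a program typically takes from the environment variable $LC_CTYPE by calling setlocale.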
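
As an illustration of the ISO interface, the sketch below converts a multibyte string to a newly allocated wide string in two passes with mbsrtowcs: one pass to measure, one to convert. The helper name `mb_to_wide` is hypothetical. The sketch assumes the caller has already selected a locale, for example with `setlocale(LC_CTYPE, "")`; without that call the default "C" locale is in effect and only ASCII input is guaranteed to convert.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts src (in the current LC_CTYPE encoding)
   to a malloc'ed, null-terminated wide string. Stores the number of wide
   characters (without terminator) in *out_len if out_len is not NULL.
   Returns NULL on an invalid sequence or allocation failure. */
wchar_t *mb_to_wide(const char *src, size_t *out_len)
{
    mbstate_t state;
    const char *p = src;
    size_t n;
    wchar_t *dst;

    /* Pass 1: with dst == NULL, mbsrtowcs only counts the wide characters. */
    memset(&state, 0, sizeof state);
    n = mbsrtowcs(NULL, &p, 0, &state);
    if (n == (size_t)-1)
        return NULL;

    dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Pass 2: convert for real, including the terminating null. */
    memset(&state, 0, sizeof state);
    p = src;
    if (mbsrtowcs(dst, &p, n + 1, &state) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    if (out_len != NULL)
        *out_len = n;
    return dst;
}
```

The caller owns the returned buffer and releases it with `free`.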

Conversion scheme

Unicode Glyph	Binary Representation of Glyph in Unicode	Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
U-00000000... U-0000007F	00000000 00000000 00000000 0xxxxxxx	0xxxxxxx
U-00000080... U-000007FF	00000000 00000000 00000xxx xxyyyyyy	110xxxxx	10yyyyyy
U-00000800... U-0000FFFF	00000000 00000000 xxxxyyyy yyzzzzzz	1110xxxx	10yyyyyy	10zzzzzz
U-00010000... U-001FFFFF	00000000 000xxxyy yyyyzzzz zzuuuuuu	11110xxx	10yyyyyy	10zzzzzz	10uuuuuu
U-00200000... U-03FFFFFF	000000xx yyyyyyzz zzzzuuuu uuvvvvvv	111110xx	10yyyyyy	10zzzzzz	10uuuuuu	10vvvvvv
U-04000000... U-7FFFFFFF	0xyyyyyy zzzzzzuu uuuuvvvv vvssssss	1111110x	10yyyyyy	10zzzzzz	10uuuuuu	10vvvvvv	10ssssss

Only the shortest possible multibyte sequence that can represent the code number of the character may be used; longer ("overlong") encodings of the same character are invalid. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
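
Both rules in the preceding paragraph can be checked while decoding. The sketch below is one possible decoder for the six-byte scheme in the table, with the hypothetical name `utf8_decode`. It counts the leading 1 bits to find the sequence length and rejects any sequence whose value would have fit into a shorter one. It does not apply the additional restrictions of RFC 3629 (upper limit U+10FFFF, no surrogates).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one sequence from s (len bytes available)
   into *cp. Returns the number of bytes consumed, or 0 if the input is
   malformed, truncated, or not in shortest form. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* min_cp[n] is the smallest code point that needs n bytes. */
    static const uint32_t min_cp[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    size_t n, i;
    uint32_t v;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {               /* single ASCII byte */
        *cp = s[0];
        return 1;
    }

    /* The number of leading 1 bits equals the sequence length. */
    n = 0;
    while (n < 8 && (s[0] & (0x80 >> n)))
        n++;
    if (n < 2 || n > 6 || n > len)   /* continuation byte, 0xFE/0xFF, or truncated */
        return 0;

    v = s[0] & (0x7F >> n);          /* payload bits of the first byte */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)   /* every further byte must be 10xxxxxx */
            return 0;
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min_cp[n])               /* overlong: a shorter sequence exists */
        return 0;

    *cp = v;
    return n;
}
```

For instance, the two-byte sequence 0xC0 0x80 would decode to U+0000, which fits in one byte, so the sketch rejects it.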

Examples: The Unicode character U+00A9 = 00010 101001 (copyright sign) is encoded in UTF-8 as

 11000010 10101001 = 0xC2 0xA9

and character U+2260 = 0010 001001 100000 (not equal to) is encoded as:

 11100010 10001001 10100000 = 0xE2 0x89 0xA0
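
The two worked examples can be reproduced with a short encoder that follows the conversion table row by row. This is a minimal sketch with a hypothetical function name, `utf8_encode`. It covers the full six-byte range of the table; code that has to conform to RFC 3629 would additionally reject values above U+10FFFF and the surrogate range.

```c
#include <stdint.h>

/* Hypothetical helper: encodes cp into out, which must have room for
   6 bytes. Returns the number of bytes written, or 0 if cp is larger
   than 0x7FFFFFFF. */
int utf8_encode(uint32_t cp, unsigned char *out)
{
    int len, i;

    if (cp < 0x80) {                 /* ASCII: stored unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    else if (cp < 0x800)       len = 2;
    else if (cp < 0x10000)     len = 3;
    else if (cp < 0x200000)    len = 4;
    else if (cp < 0x4000000)   len = 5;
    else if (cp <= 0x7FFFFFFF) len = 6;
    else return 0;

    /* Fill the continuation bytes from the end, six payload bits each. */
    for (i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* First byte: len leading 1 bits, a 0 bit, then the remaining payload. */
    out[0] = (unsigned char)(((0xFF00u >> len) & 0xFF) | cp);
    return len;
}
```

Calling `utf8_encode(0x00A9, buf)` writes the bytes 0xC2 0xA9 and returns 2; `utf8_encode(0x2260, buf)` writes 0xE2 0x89 0xA0 and returns 3, matching the examples above.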

Additional Reading

Unicode.org
Glyph tables
Markus G. Kuhn's Unicode Page (University of Cambridge, UK)
UTF-8 sampler (Web browser test)
Codepages used by OS/2 and Windows

@@ Line 1: / Line 1: @@
-==UTF-8==
+'''UTF-8''' stands for '''UCS Transformation Format 8 bit'''. It is a upward compatible way to portably encode all languages on this planet.
-UTF-8 stands for UCS Transformation Format 8 bit. It is a upward compatible way to portable encode all languages on this planet.
+The following ASCII control characters (Range: 0x00...0x1F, 0x7F) are allowed:
-Another remark:
-The following control characters (Range: 0x00...0x1F, 0x7F) are allowed:
 *0x0A: Line feed (Unix Way)
 *0x0C: Form feed (with intrinsic line feed)
-===UTF-8 has the following properties:===
+==UTF-8 Properties==
-* UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility).
-*This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
-*All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00 to 0x7F) can appear as part of any other character.
-*The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
-*UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.
-*The sorting order of Bigendian UCS-4 byte strings is preserved.
-*The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
-Used in [[APE Tag Item]]s
-===Weblinks:===
+* UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). <br />This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
+* All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.<br />Therefore, no ASCII byte (0x00 to 0x7F) can appear as part of any other character.
-*[http://www.unicode.org/ Unicode.org]
+* The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF.<br />This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
-*[http://www.unicode.org/charts/ Glyph tables]
+* UTF-8 encoded characters may theoretically be up to six bytes long, however 16-bit BMP characters are only up to three bytes long.
-*[http://www.cl.cam.ac.uk/~mgk25/unicode.html Markus G. Kuhn's Unicode Page] (University of Cambridge, UK)
+* The sorting order of Bigendian UCS-4 byte strings is preserved.
-*[http://www.columbia.edu/kermit/utf8.html UTF-8 sampler] (Web browser test)
+* The bytes 0xFE and 0xFF are never used in the UTF-8 encoding. <br />Instead, they have the important function to indicate endian-ness of the UTF-8 encoded file.
-*[http://www.microsoft.com/typography/unicode/cscp.htm Codepages used by OS/2 and Windows]
+==API Calls==
 ===Windows API===
@@ Line 59: / Line 45: @@
-Interface is very similar to Windows API, but mr crptc t b mr dffclt t ndrstnd.
+Interface is very similar to Windows API, but more cryptic thus more diffcult to understand.
 Convert current locale (multibyte) to Unicode (wide character) and then encode to UTF-8 using the simple generic scheme below. Behaviour of function depends on locale settings of the enviroment variable $LC_CTYPE.
-===Conversion scheme===
+==Conversion scheme==
 {|border="1" cellspacing="1"
@@ Line 100: / Line 86: @@
 |-
 |U-00010000... U-001FFFFF
-||<p style="font-family: 'fontname'>00000000 000xxxyy yyyyzzzz zzuuuuuu</p>
+||00000000 000xxxyy yyyyzzzz zzuuuuuu
 ||11110xxx
 ||10yyyyyy
@@ Line 113: / Line 99: @@
 ||10uuuuuu
 ||10vvvvvv
+|-
+|U-04000000... U-7FFFFFFF
+||0xyyyyyy zzzzzzuu uuuuvvvv vvssssss
+||1111110x
+||10yyyyyy
+||10zzzzzz
+||10uuuuuu
+||10vvvvvv
+||10ssssss
+|}
+Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
+Examples: The Unicode character U+00A9 = 00010 101001 (copyright sign) is encoded in UTF-8 as
+  11000010 10101001 = 0xC2 0xA9
+and character U+2260 = 0010 001001 100000 (not equal to) is encoded as:
+  11100010 10001001 10100000 = 0xE2 0x89 0xA0
+==Additional Reading==
+* [http://www.unicode.org/ Unicode.org]
+* [http://www.unicode.org/charts/ Glyph tables]
+* [http://www.cl.cam.ac.uk/~mgk25/unicode.html Markus G. Kuhn's Unicode Page] (University of Cambridge, UK)
+* [http://www.columbia.edu/kermit/utf8.html UTF-8 sampler] (Web browser test)
+* [http://www.microsoft.com/typography/unicode/cscp.htm Codepages used by OS/2 and Windows]
+[[Category:Technical]]