UTF-8
Latest revision as of 21:36, 11 September 2006
UTF-8 stands for UCS Transformation Format 8 bit. It is an upward-compatible (ASCII-compatible) way to portably encode the characters of all written languages.
The following ASCII control characters (Range: 0x00...0x1F, 0x7F) are allowed:
- 0x0A: Line feed (Unix Way)
- 0x0C: Form feed (with intrinsic line feed)
UTF-8 Properties
- UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility).
This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
- All UCS characters >U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set.
Therefore, no ASCII byte (0x00 to 0x7F) can appear as part of any other character.
- The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF.
This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
- UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are only up to three bytes long.
RFC 3629 later restricted UTF-8 to the range U+0000 to U+10FFFF, so conforming encoders today emit at most four bytes per character.
- The sorting order of big-endian UCS-4 byte strings is preserved.
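- The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
UTF-8 is a byte-oriented encoding and has no endianness of its own. The byte pairs 0xFE 0xFF and 0xFF 0xFE are the byte order mark of UTF-16 (big-endian and little-endian respectively), so UTF-8 data can never be mistaken for it. An optional UTF-8 signature is the three-byte sequence 0xEF 0xBB 0xBF.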
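The lead-byte and continuation-byte ranges listed above can be sketched in C as follows. This is only an illustration of the original six-byte scheme described on this page; the function names utf8_sequence_length and utf8_resync are made up for this example and are not part of any standard API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: total length of the sequence announced by a lead byte.
   Returns 0 if the byte cannot start a character (continuation byte, 0xFE, 0xFF). */
int utf8_sequence_length(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    if (b < 0xFC) return 5;   /* 111110xx */
    if (b < 0xFE) return 6;   /* 1111110x */
    return 0;                 /* 0xFE and 0xFF never occur in UTF-8 */
}

/* Hypothetical helper: resynchronize after a lost or damaged byte.
   Starting at pos, skip continuation bytes and return the index of the
   next byte that is not a continuation byte (or len if none is left). */
size_t utf8_resync(const unsigned char *s, size_t len, size_t pos)
{
    assert(s != NULL);
    while (pos < len && (s[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}
```

Because continuation bytes are recognizable on their own, a reader that starts in the middle of a character loses at most that one character before it is back in step.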
API Calls
Windows API
MultiByteToWideChar - convert a MultiByte string to a WideChar string

    int MultiByteToWideChar (
        UINT   CodePage,        // code page
        DWORD  dwFlags,         // character-type options
        LPCSTR lpMultiByteStr,  // address of string to map
        int    cchMultiByte,    // number of bytes in string
        LPWSTR lpWideCharStr,   // address of wide-character buffer
        int    cchWideChar      // size of buffer (in wide characters)
    );

Converts a string from the given code page (MultiByte) to Unicode (WideChar); the result can then be encoded to UTF-8 using the simple generic scheme below. When CodePage is CP_ACP, the behaviour of the function depends on the locale settings of the LOCAL_MACHINE/CURRENT_USER.
ISO API
mbstowcs - convert a multibyte string to a wide character string
mbsrtowcs - convert a multibyte string to a wide character string (restartable)
mbsnrtowcs - like mbsrtowcs, but reads at most nms bytes from the source (a POSIX extension, not part of ISO C)

    #include <stdlib.h>
    size_t mbstowcs ( wchar_t* dst, const char* src, size_t maxlen );

    #include <wchar.h>
    size_t mbsrtowcs ( wchar_t* dst, const char** src, size_t maxlen, mbstate_t* ps );
    size_t mbsnrtowcs ( wchar_t* dst, const char** src, size_t nms, size_t maxlen, mbstate_t* ps );

The interface is very similar to the Windows API, but more cryptic and thus more difficult to understand.
Converts a string from the current locale (multibyte) to Unicode (wide character); the result can then be encoded to UTF-8 using the simple generic scheme below. The behaviour of the functions depends on the LC_CTYPE locale category, which is normally taken from the environment variable $LC_CTYPE once the program has called setlocale.
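A minimal sketch of how the ISO functions might be used, assuming a C99 compiler. The helper names use_environment_locale and to_wide are invented for this example. It relies on the documented behaviour that mbsrtowcs returns the required length when dst is a null pointer.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: select the LC_CTYPE locale named by the environment.
   Returns 1 on success, 0 if that locale is not available. */
int use_environment_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: convert a multibyte string in the current locale
   to a newly allocated wide character string. The caller frees the result.
   Returns NULL on an invalid multibyte sequence or allocation failure. */
wchar_t *to_wide(const char *src)
{
    assert(src != NULL);

    mbstate_t st;
    const char *p = src;

    /* First pass: dst == NULL, so only the required length is computed. */
    memset(&st, 0, sizeof st);
    size_t n = mbsrtowcs(NULL, &p, 0, &st);
    if (n == (size_t)-1)
        return NULL;

    wchar_t *dst = malloc((n + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Second pass: convert n characters plus the terminating null. */
    p = src;
    memset(&st, 0, sizeof st);
    if (mbsrtowcs(dst, &p, n + 1, &st) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

A program would typically call use_environment_locale() once at startup and then to_wide() for each string; without the setlocale call the "C" locale is in effect and only ASCII input is handled portably.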
Conversion scheme
Unicode code point range | Binary representation of the code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6
U-00000000 ... U-0000007F | 00000000 00000000 00000000 0xxxxxxx | 0xxxxxxx | | | | |
U-00000080 ... U-000007FF | 00000000 00000000 00000xxx xxyyyyyy | 110xxxxx | 10yyyyyy | | | |
U-00000800 ... U-0000FFFF | 00000000 00000000 xxxxyyyy yyzzzzzz | 1110xxxx | 10yyyyyy | 10zzzzzz | | |
U-00010000 ... U-001FFFFF | 00000000 000xxxyy yyyyzzzz zzuuuuuu | 11110xxx | 10yyyyyy | 10zzzzzz | 10uuuuuu | |
U-00200000 ... U-03FFFFFF | 000000xx yyyyyyzz zzzzuuuu uuvvvvvv | 111110xx | 10yyyyyy | 10zzzzzz | 10uuuuuu | 10vvvvvv |
U-04000000 ... U-7FFFFFFF | 0xyyyyyy zzzzzzuu uuuuvvvv vvssssss | 1111110x | 10yyyyyy | 10zzzzzz | 10uuuuuu | 10vvvvvv | 10ssssss
Only the shortest possible multibyte sequence which can represent the code number of the character can be used. Note that in multibyte sequences, the number of leading 1 bits in the first byte is identical to the number of bytes in the entire sequence.
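The conversion scheme can be sketched as a small C function. This is an illustrative encoder for the original six-byte scheme shown on this page, not a reference implementation. The name utf8_encode is an assumption of this example, and the function does not apply the later RFC 3629 restrictions (no code points above U+10FFFF, no surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder: writes the shortest UTF-8 sequence for cp into out
   (which must have room for 6 bytes) and returns its length,
   or 0 if cp is above 0x7FFFFFFF and therefore not encodable. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    assert(out != NULL);

    size_t len;
    if (cp < 0x80) {              /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    else if (cp < 0x800)       len = 2;
    else if (cp < 0x10000)     len = 3;
    else if (cp < 0x200000)    len = 4;
    else if (cp < 0x4000000)   len = 5;
    else if (cp <= 0x7FFFFFFF) len = 6;
    else return 0;

    /* Continuation bytes, filled from the end: 10xxxxxx, six bits each. */
    for (size_t i = len - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    /* Lead byte: len leading 1 bits, a 0 bit, then the remaining high bits. */
    out[0] = (unsigned char)(((0xFF << (8 - len)) & 0xFF) | cp);
    return len;
}
```

Choosing the length from the size of the code point, as done here, is what guarantees that only the shortest possible sequence is produced.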
Examples: The Unicode character U+00A9 = 00010 101001 (copyright sign) is encoded in UTF-8 as
11000010 10101001 = 0xC2 0xA9
and character U+2260 = 0010 001001 100000 (not equal to) is encoded as:
11100010 10001001 10100000 = 0xE2 0x89 0xA0
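The two examples can also be checked in the opposite direction. The following decoder is a sketch under the same assumptions as the table above (sequences of up to six bytes); the name utf8_decode is hypothetical. It rejects overlong forms, in line with the rule that only the shortest sequence may be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder: reads one character from s (len bytes available),
   stores its code point in *cp and returns the number of bytes consumed.
   Returns 0 for malformed, truncated or overlong input. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    /* min_cp[n] is the smallest code point that needs n bytes. */
    static const uint32_t min_cp[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };

    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;

    unsigned char b = s[0];
    size_t n;
    uint32_t v;

    if (b < 0x80) { *cp = b; return 1; }          /* ASCII */
    else if (b < 0xC0) return 0;                  /* stray continuation byte */
    else if (b < 0xE0) { n = 2; v = b & 0x1F; }
    else if (b < 0xF0) { n = 3; v = b & 0x0F; }
    else if (b < 0xF8) { n = 4; v = b & 0x07; }
    else if (b < 0xFC) { n = 5; v = b & 0x03; }
    else if (b < 0xFE) { n = 6; v = b & 0x01; }
    else return 0;                                /* 0xFE or 0xFF */

    if (len < n)
        return 0;                                 /* truncated sequence */

    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                             /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }

    if (v < min_cp[n])
        return 0;                                 /* overlong encoding */

    *cp = v;
    return n;
}
```

Decoding 0xC2 0xA9 with this function yields U+00A9, and 0xE2 0x89 0xA0 yields U+2260, matching the examples; the overlong pair 0xC0 0x80 is refused.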
Additional Reading
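- Unicode.org (http://www.unicode.org/)
- Glyph tables (http://www.unicode.org/charts/)
- Markus G. Kuhn's Unicode Page (University of Cambridge, UK) (http://www.cl.cam.ac.uk/~mgk25/unicode.html)
- UTF-8 sampler (Web browser test) (http://www.columbia.edu/kermit/utf8.html)
- Codepages used by OS/2 and Windows (http://www.microsoft.com/typography/unicode/cscp.htm)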