UTF-8: Difference between revisions

Revision as of 15:59, 27 April 2005

UTF-8

UTF-8 stands for UCS Transformation Format, 8-bit. It is an upward-compatible (ASCII-compatible) way to portably encode text in all languages on this planet.

Note: Of the control characters (range 0x00...0x1F, plus 0x7F), the following are allowed:

0x0A: Line feed (Unix-style line ending)
0x0C: Form feed (implies a line feed)

UTF-8 has the following properties:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility).
This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00 to 0x7F) can appear as part of any other character.
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes. (RFC 3629, which forbids overlong forms and code points above U+10FFFF, narrows the valid first bytes to 0xC2 to 0xF4.)
UTF-8 encoded characters may theoretically be up to six bytes long; however, 16-bit BMP characters are at most three bytes long, and RFC 3629 limits UTF-8 to code points up to U+10FFFF, that is, to at most four bytes.
The sorting order of big-endian UCS-4 byte strings is preserved.
The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
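
The lead-byte and continuation-byte ranges listed above can be sketched in C as follows. This is only an illustration under the original six-byte scheme described here; the function names utf8_seq_len and utf8_resync are hypothetical, and a strict modern decoder would additionally reject 0xC0, 0xC1 and 0xF5 to 0xFD as first bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the total sequence length announced by a first byte under the
   original six-byte scheme, or 0 if the byte cannot start a sequence
   (continuation byte 0x80..0xBF, or the unused bytes 0xFE/0xFF). */
int utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (b < 0xE0) return 2;   /* 110xxxxx */
    if (b < 0xF0) return 3;   /* 1110xxxx */
    if (b < 0xF8) return 4;   /* 11110xxx */
    if (b < 0xFC) return 5;   /* 111110xx */
    if (b < 0xFE) return 6;   /* 1111110x */
    return 0;                 /* 0xFE, 0xFF: never used */
}

/* Resynchronization: starting at an arbitrary offset, skip continuation
   bytes until the start of the next character or the end of the buffer. */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    assert(buf != NULL && pos <= len);
    while (pos < len && (buf[pos] & 0xC0) == 0x80)
        pos++;
    return pos;
}
```

Because every continuation byte matches the bit pattern 10xxxxxx, a reader that lands in the middle of a character only has to skip at most five bytes to find the next character boundary; no decoder state is needed.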

Used in APE Tag Items

Weblinks:

Unicode.org
Glyph tables
Markus G. Kuhn's Unicode Page (University of Cambridge, UK)
UTF-8 sampler (Web browser test)
Codepages used by OS/2 and Windows

Windows API

MultiByteToWideChar - convert a multibyte string to a wide-character string

   int
   MultiByteToWideChar ( UINT    CodePage,        // code page
                         DWORD   dwFlags,         // character-type options
                         LPCSTR  lpMultiByteStr,  // address of string to map
                         int     cchMultiByte,    // number of bytes in string (-1 if NUL-terminated)
                         LPWSTR  lpWideCharStr,   // address of wide-character buffer
                         int     cchWideChar );   // size of buffer, in wide characters

Convert the string from the current locale's multibyte encoding to Unicode (WideChar), then encode it as UTF-8 using the simple generic scheme below. The behaviour of the function depends on the locale settings of the LOCAL_MACHINE/CURRENT_USER.
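
A minimal sketch of the first step (locale multibyte to WideChar) might look like the following. The helper name ansi_to_wide is hypothetical; the choice of CP_ACP (the system ANSI code page) and of 0 for dwFlags are assumptions made for the illustration. The code is Windows-only and is therefore wrapped in an _WIN32 guard.

```c
#ifdef _WIN32   /* Windows-only; compiles to nothing elsewhere */
#include <assert.h>
#include <stdlib.h>
#include <windows.h>

/* Hypothetical helper: converts a NUL-terminated string in the system
   ANSI code page to a newly allocated wide string.
   Returns NULL on failure; the caller frees the result with free(). */
wchar_t *ansi_to_wide(const char *src)
{
    int n;
    wchar_t *dst;

    assert(src != NULL);

    /* First call: a buffer size of 0 asks for the required size in wide
       characters. With cchMultiByte == -1 the count includes the NUL. */
    n = MultiByteToWideChar(CP_ACP, 0, src, -1, NULL, 0);
    if (n == 0)
        return NULL;

    dst = malloc((size_t)n * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Second call: perform the actual conversion into the buffer. */
    if (MultiByteToWideChar(CP_ACP, 0, src, -1, dst, n) == 0) {
        free(dst);
        return NULL;
    }
    return dst;
}
#endif
```

The two-call pattern avoids guessing the buffer size: the first call only measures, the second converts.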

ISO API

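mbstowcs - convert a multibyte string to a wide-character string

mbsrtowcs - convert a multibyte string to a wide-character string (restartable)

mbsnrtowcs - like mbsrtowcs, but reads at most nms bytes from src (a POSIX extension, not part of ISO C)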

   #include <stdlib.h>
   size_t  mbstowcs ( wchar_t* dst, const char* src, size_t maxlen );
   #include <wchar.h>
   size_t  mbsrtowcs  ( wchar_t* dst, const char** src,             size_t maxlen, mbstate_t* ps );
   size_t  mbsnrtowcs ( wchar_t* dst, const char** src, size_t nms, size_t maxlen, mbstate_t* ps );

The interface is very similar to the Windows API, but more cryptic, which makes it more difficult to understand.

Convert the string from the current locale's multibyte encoding to Unicode (wide character), then encode it as UTF-8 using the simple generic scheme below. The behaviour of the function depends on the locale settings of the environment variable $LC_CTYPE.
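
As a rough sketch of the first step (locale multibyte to wide character), the following calls mbsrtowcs twice: once with a null destination to count the characters, once to convert. The helper name locale_to_wide is hypothetical. Note that whether wchar_t values are actually Unicode code points is implementation-defined, so this is an assumption about the platform.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: converts a NUL-terminated string in the current
   locale's multibyte encoding to a newly allocated wide string.
   Returns NULL on an invalid sequence or on allocation failure;
   the caller frees the result with free(). */
wchar_t *locale_to_wide(const char *src)
{
    mbstate_t state;
    const char *p = src;
    size_t len;
    wchar_t *dst;

    assert(src != NULL);

    /* Pass 1: a null destination only counts the wide characters
       (excluding the terminating NUL); nothing is stored. */
    memset(&state, 0, sizeof state);   /* initial conversion state */
    len = mbsrtowcs(NULL, &p, 0, &state);
    if (len == (size_t)-1)
        return NULL;

    dst = malloc((len + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Pass 2: convert, with room for the terminating NUL. */
    p = src;
    memset(&state, 0, sizeof state);
    if (mbsrtowcs(dst, &p, len + 1, &state) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

A caller would typically run setlocale(LC_CTYPE, "") beforehand so that the environment's locale takes effect; without it the program stays in the "C" locale.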

Conversion scheme

Unicode range              Binary representation (UCS-4)        Byte 1    Byte 2    Byte 3    Byte 4    Byte 5    Byte 6
U-00000000 ... U-0000007F  00000000 00000000 00000000 0xxxxxxx  0xxxxxxx
U-00000080 ... U-000007FF  00000000 00000000 00000xxx xxyyyyyy  110xxxxx  10yyyyyy
U-00000800 ... U-0000FFFF  00000000 00000000 xxxxyyyy yyzzzzzz  1110xxxx  10yyyyyy  10zzzzzz
U-00010000 ... U-001FFFFF  00000000 000xxxyy yyyyzzzz zzuuuuuu  11110xxx  10yyyyyy  10zzzzzz  10uuuuuu
U-00200000 ... U-03FFFFFF  000000xx yyyyyyzz zzzzuuuu uuvvvvvv  111110xx  10yyyyyy  10zzzzzz  10uuuuuu  10vvvvvv
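
The rows of the table above can be turned into a small encoder. The following is a sketch only: the name utf8_encode is hypothetical, it covers exactly the five rows listed (up to U-03FFFFFF), and it does not apply the stricter checks of a modern encoder, which would also reject surrogates (U+D800 to U+DFFF) and anything above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Encodes the code point cp according to the table rows above.
   Writes 1 to 5 bytes to out (which must have room for 5) and returns
   the number of bytes written, or 0 if cp lies beyond U-03FFFFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    /* First-byte markers, indexed by sequence length. */
    static const unsigned char lead[] = { 0x00, 0x00, 0xC0, 0xE0, 0xF0, 0xF8 };
    size_t n, i;

    assert(out != NULL);

    if      (cp <= 0x7FUL)      n = 1;
    else if (cp <= 0x7FFUL)     n = 2;
    else if (cp <= 0xFFFFUL)    n = 3;
    else if (cp <= 0x1FFFFFUL)  n = 4;
    else if (cp <= 0x3FFFFFFUL) n = 5;
    else return 0;

    if (n == 1) {               /* ASCII: stored unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }

    /* Fill the continuation bytes from the end: 10xxxxxx, six bits each. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    /* The remaining high bits go into the first byte after its marker. */
    out[0] = (unsigned char)(lead[n] | cp);
    return n;
}
```

For instance, U+20AC falls in the third row, so it is written as the three bytes 0xE2 0x82 0xAC.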
