Thursday, March 09, 2006

Back to Unicode support in Java

I got confused by Unicode again. Some of the terms used are:
1. Coded Character Set
A character set (a collection of characters) in which each character has been assigned a unique number. E.g., the Unicode character set, where every character is assigned a number that is conventionally written in hexadecimal.
2. Code Points
The numbers that can be used in a coded character set. Valid code points for the Unicode character set are U+0000 to U+10FFFF (Unicode 4.0 standard).
3. Supplementary Characters
Characters that could not be represented in the original 16-bit design of Unicode. The code points U+0000 to U+FFFF are referred to as the Basic Multilingual Plane (BMP); characters with code points from U+10000 to U+10FFFF are supplementary characters.
4. Character Encoding Scheme
Mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. e.g., UTF-32, UTF-16, and UTF-8
5. Character Encoding
Mapping from a set of characters to sequences of code units. e.g., UTF-8, ISO-8859-1, GB18030, Shift_JIS.
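
To make the difference between code points and code units concrete, here is a small sketch in Java. The class and method names (UnicodeTerms, utf8Length, utf16Units, codePoints) are made up for illustration; only the java.lang and java.nio.charset calls are real API. It assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class UnicodeTerms {
    // Number of 8-bit code units (bytes) the string needs in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of 16-bit code units; String.length() counts UTF-16 code units.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+00E9 is in the BMP; U+1D11E is a supplementary character.
        String bmp = "\u00E9";
        String supplementary = new String(Character.toChars(0x1D11E));

        for (String s : new String[] { bmp, supplementary }) {
            System.out.println("code point U+"
                    + Integer.toHexString(s.codePointAt(0)).toUpperCase()
                    + ": UTF-8 bytes=" + utf8Length(s)
                    + ", UTF-16 units=" + utf16Units(s)
                    + ", code points=" + codePoints(s));
        }

        // The same character in a different character encoding: one byte in ISO-8859-1.
        System.out.println("ISO-8859-1 bytes for U+00E9: "
                + bmp.getBytes(StandardCharsets.ISO_8859_1).length);
    }
}
```

The point of the sketch is that one code point can take a different number of code units depending on the encoding scheme, which is why length() and codePointCount() can disagree.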

UTF-16
UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: the values U+D800 to U+DFFF are reserved for use in UTF-16, and no characters are assigned to them as code points. This means software can tell, for each individual code unit in a string, whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
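
The surrogate mechanism described above can be sketched in Java as follows. SurrogateDemo, toSurrogates, and classify are hypothetical names used only for this illustration; the arithmetic is the standard UTF-16 rule (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves), and Character.isHighSurrogate, Character.isLowSurrogate, and Character.toChars are existing methods (since Java 5).

```java
import java.util.Arrays;

public class SurrogateDemo {
    // Encode a supplementary code point as a high and a low surrogate.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20 bits remain
        char high = (char) (0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return new char[] { high, low };
    }

    // Classify a single code unit without looking at its neighbours.
    static String classify(char unit) {
        if (Character.isHighSurrogate(unit)) return "high surrogate";
        if (Character.isLowSurrogate(unit)) return "low surrogate";
        return "BMP character";
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1D11E);
        // The hand-rolled result should match the library's own encoding.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1D11E)));

        // Each unit of "A" followed by U+1D11E can be classified on its own.
        String text = "A" + new String(pair);
        for (int i = 0; i < text.length(); i++) {
            System.out.println(Integer.toHexString(text.charAt(i)) + " -> "
                    + classify(text.charAt(i)));
        }
    }
}
```

Because the surrogate ranges never overlap with assigned characters, classify needs only the one code unit it is given, which is the property the paragraph above contrasts with older multi-byte encodings.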

