In computing, UTF-16 (16-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, capable of encoding the entire Unicode repertoire. The encoding form maps code points (characters) into a sequence of 16-bit words, called code units. For characters in the Basic Multilingual Plane (BMP) the resulting encoding is a single 16-bit word. For characters in the other planes, the encoding results in a pair of 16-bit words, together called a surrogate pair. All possible code points from U+0000 through U+10FFFF, except for the surrogate code points U+D800–U+DFFF (which are not characters), are uniquely mapped by UTF-16 regardless of the code point's current or future character assignment or use.

As many uses in computing require units of bytes (octets), there are three related encoding schemes which map to octet sequences instead of words: UTF-16, UTF-16BE, and UTF-16LE. They differ only in the byte order chosen to represent each 16-bit unit and in whether they make use of a Byte Order Mark (BOM). All of the schemes result in either a 2-byte or a 4-byte sequence for any given character.

UTF-16 is officially defined in Annex Q of the international standard ISO/IEC 10646-1. It is also described in The Unicode Standard version 3.0 and higher, as well as in the IETF's RFC 2781.

UCS-2 (2-byte Universal Character Set) is an obsolete character encoding which is a predecessor to UTF-16. The UCS-2 encoding form is nearly identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value. As with UTF-16, there are three related encoding schemes (UCS-2, UCS-2BE, UCS-2LE) that map characters to a specific byte sequence.

Because of the technical similarities and upwards compatibility from UCS-2 to UTF-16, the two encodings are often erroneously conflated and used as if interchangeable, so that strings encoded in UTF-16 are sometimes misidentified as being encoded in UCS-2.
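The difference between the two encodings can be illustrated with a minimal Python sketch. Python has no built-in UCS-2 codec, so the function name `encode_ucs2_be` is hypothetical; it relies on the fact stated above that, for BMP-only text, the UCS-2BE and UTF-16BE byte sequences are the same.

```python
def encode_ucs2_be(text: str) -> bytes:
    """Encode text as UCS-2BE, rejecting anything UCS-2 cannot represent.

    Hypothetical helper for illustration: it reuses Python's UTF-16BE codec,
    which produces the same bytes as UCS-2BE for BMP-only text.
    """
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Needs a surrogate pair, which UCS-2 does not support.
            raise ValueError(f"U+{cp:04X} is outside the BMP")
        if 0xD800 <= cp <= 0xDFFF:
            # Surrogate code points are not characters.
            raise ValueError(f"U+{cp:04X} is a surrogate code point")
    return text.encode("utf-16-be")
```

A string that passes this check is valid in both encodings; a string containing a supplementary character is valid UTF-16 only, which is why labelling UTF-16 data as UCS-2 can be wrong.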

For both UTF-16 and UCS-2, all 65,536 code points contained within the BMP (Plane 0), excluding the 2,048 special surrogate code points, are assigned to code units in a one-to-one correspondence with the 16-bit non-negative integers with the same values. Thus code point U+0000 is encoded as the number 0, and U+FFFF is encoded as 65535 (which is FFFF in hexadecimal).

Encoding of characters outside the BMP

The improvement that UTF-16 made over UCS-2 is its ability to encode characters in planes 1–16, not just those in plane 0 (BMP).

UTF-16 represents non-BMP characters (those from U+10000 through U+10FFFF) using a pair of 16-bit words, known as a surrogate pair. First, 0x10000 is subtracted from the code point to give a 20-bit value. This is then split into two separate 10-bit values, each of which is represented as a surrogate, with the most significant half placed in the first surrogate. To allow safe use of simple word-oriented string processing, separate ranges of values are used for the two surrogates: 0xD800–0xDBFF for the first, most significant surrogate and 0xDC00–0xDFFF for the second, least significant surrogate.

For example, the character at code point U+10000 becomes the code unit sequence 0xD800 0xDC00, and the character at U+10FFFD, the last code point available for characters (U+10FFFE and U+10FFFF are noncharacters), becomes the sequence 0xDBFF 0xDFFD. Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code value from a surrogate pair never represents a character.
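The splitting rule described above can be sketched in Python as follows. The function name `encode_utf16_code_units` is an illustrative choice, not part of any standard library; the sketch returns code units as integers and leaves byte order aside.

```python
from typing import List


def encode_utf16_code_units(code_point: int) -> List[int]:
    """Return the UTF-16 code units (16-bit integers) for one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if code_point <= 0xFFFF:
        # BMP character: a single code unit with the same value.
        return [code_point]
    offset = code_point - 0x10000        # 20-bit value
    first = 0xD800 | (offset >> 10)      # most significant 10 bits
    second = 0xDC00 | (offset & 0x3FF)   # least significant 10 bits
    return [first, second]


print([hex(u) for u in encode_utf16_code_units(0x10000)])   # ['0xd800', '0xdc00']
print([hex(u) for u in encode_utf16_code_units(0x10FFFD)])  # ['0xdbff', '0xdffd']
```

Because the two surrogate ranges do not overlap each other or any BMP character, a decoder can tell from a single code unit whether it is a complete character, the first half of a pair, or the second half.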

Byte order encoding schemes

The UTF-16 and UCS-2 encoding forms produce a sequence of 16-bit words or code units. These are not directly usable as a byte or octet sequence because the endianness of these words varies according to the computer architecture, which may be either big-endian or little-endian. To account for this choice of endianness, each encoding form defines three related encoding schemes: for UTF-16 there are the schemes UTF-16, UTF-16BE, and UTF-16LE, and for UCS-2 there are the schemes UCS-2, UCS-2BE, and UCS-2LE.

The UTF-16 (and UCS-2) encoding scheme allows either endian representation to be used, but mandates that the byte order should be explicitly indicated by prepending a Byte Order Mark before the first serialized character. This BOM is the encoded version of the Zero-Width No-Break Space (ZWNBSP) character, code point U+FEFF, chosen because it should never legitimately appear at the beginning of any character data. This results in the byte sequence FE FF (in hexadecimal) for big-endian architectures, or FF FE for little-endian. The BOM at the beginning of UTF-16 or UCS-2 encoded data is considered to be a signature separate from the text itself; it is for the benefit of the decoder. Technically, with the UTF-16 scheme the BOM prefix is optional, but omitting it is not recommended; when no BOM is wanted, UTF-16LE or UTF-16BE should be used instead. If the BOM is missing, barring any indication of byte order from higher-level protocols, big endian is to be used or assumed. The BOM is not optional in the UCS-2 scheme.

The UTF-16BE and UTF-16LE encoding schemes (and correspondingly UCS-2BE and UCS-2LE) are similar to the UTF-16 (or UCS-2) encoding scheme. However, rather than using a BOM prepended to the data, the byte order used is implicit in the name of the encoding scheme (LE for little-endian, BE for big-endian). Since a BOM is specifically not to be prepended in these schemes, an encoded ZWNBSP character found at the beginning of data encoded by these schemes is not to be considered a BOM, but is instead part of the text itself. In practice most software will ignore these "accidental" BOMs.
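The rules for the plain UTF-16 scheme can be sketched in Python. The function name `decode_utf16_scheme` is hypothetical; it is written by hand rather than calling Python's own "utf-16" codec, because that codec falls back to the platform's native byte order when no BOM is present, whereas the rule described above falls back to big-endian.

```python
import codecs


def decode_utf16_scheme(data: bytes) -> str:
    """Decode bytes labelled 'UTF-16': honour a BOM, else assume big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):      # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):      # FF FE
        return data[2:].decode("utf-16-le")
    # No BOM and no higher-level protocol information: assume big-endian.
    return data.decode("utf-16-be")


print(decode_utf16_scheme(b"\xff\xfe\x7a\x00"))   # z
print(decode_utf16_scheme(b"\x00\x7a"))           # z

# Under the UTF-16BE scheme a leading U+FEFF is part of the text, not a BOM.
print(repr(b"\xfe\xff\x00\x7a".decode("utf-16-be")))  # '\ufeffz'
```

The last line shows the distinction drawn above: Python's "utf-16-be" codec keeps the leading U+FEFF as a character, while the BOM-aware function strips it as a signature.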

The IANA has approved UTF-16, UTF-16BE, and UTF-16LE for use on the Internet, by those exact names (case insensitively). The aliases UTF_16 or UTF16 may be meaningful in some programming languages or software applications, but they are not standard names in Internet protocols.

Use in major operating systems and environments

UTF-16 is the native internal representation of text in the Microsoft Windows 2000/XP/2003/Vista/CE and Qualcomm BREW operating systems; the Java and .NET bytecode environments; Mac OS X's Cocoa and Core Foundation frameworks; and the Qt cross-platform graphical widget toolkit.[1][2]

Symbian OS, used in Nokia S60 handsets and Sony Ericsson UIQ handsets, uses UCS-2.

Older Windows NT systems (prior to Windows 2000) only support UCS-2.[3] The Python language environment has used UCS-2 internally since version 2.1, although newer versions can use UCS-4 (UTF-32) to store supplementary characters (instead of UTF-16).

Examples

code point           character          UTF-16 code value(s)   glyph*
122 (hex 7A)         small Z (Latin)    007A                   z
27700 (hex 6C34)     water (Chinese)    6C34                   水
119070 (hex 1D11E)   musical G clef     D834 DD1E              𝄞

"水z𝄞" (water, z, G clef), UTF-16 encoded

labeled encoding   byte order                byte sequence
UTF-16LE           little-endian             34 6C, 7A 00, 34 D8 1E DD
UTF-16BE           big-endian                6C 34, 00 7A, D8 34 DD 1E
UTF-16             little-endian, with BOM   FF FE, 34 6C, 7A 00, 34 D8 1E DD
UTF-16             big-endian, with BOM      FE FF, 6C 34, 00 7A, D8 34 DD 1E

* Appropriate font and software are required to see the correct glyphs.
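The byte sequences for the string "水z𝄞" can be reproduced with Python's built-in codecs. This is a small sketch; the helper `hex_bytes` is an illustrative name, and the BOM is prepended by hand so that the byte order does not depend on the machine running the code.

```python
import codecs

text = "\u6c34z\U0001d11e"  # water (U+6C34), small z (U+007A), G clef (U+1D11E)


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hexadecimal pairs."""
    return " ".join(f"{b:02X}" for b in data)


le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

print(hex_bytes(le))                         # 34 6C 7A 00 34 D8 1E DD
print(hex_bytes(be))                         # 6C 34 00 7A D8 34 DD 1E
print(hex_bytes(codecs.BOM_UTF16_LE + le))   # FF FE 34 6C 7A 00 34 D8 1E DD
print(hex_bytes(codecs.BOM_UTF16_BE + be))   # FE FF 6C 34 00 7A D8 34 DD 1E
```

The three characters occupy 2, 2, and 4 bytes respectively, which matches the statement that every character takes either one or two 16-bit code units.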

Example UTF-16 encoding procedure

The character at code point U+64321 (hexadecimal) is to be encoded in UTF-16. Since it is above U+FFFF, it must be encoded with a surrogate pair, as follows:

v  = 0x64321
v′ = v - 0x10000
   = 0x54321
   = 0101 0100 0011 0010 0001

vh = 0101010000 // higher 10 bits of v′
vl = 1100100001 // lower  10 bits of v′
w1 = 0xD800 // the 1st word starts from the base of the first (high) surrogate range
w2 = 0xDC00 // the 2nd word starts from the base of the second (low) surrogate range

w1 = w1 | vh
   = 1101 1000 0000 0000 |
            01 0101 0000
   = 1101 1001 0101 0000
   = 0xD950

w2 = w2 | vl
   = 1101 1100 0000 0000 |
            11 0010 0001
   = 1101 1111 0010 0001
   = 0xDF21

The correct UTF-16 encoding for this character is thus the following word sequence:

0xD950 0xDF21

Since the character is above U+FFFF, the character cannot be encoded in UCS-2.
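The procedure can be checked by running it in reverse. The following Python sketch, with the illustrative function name `decode_surrogate_pair`, recombines the two words into the original code point.

```python
def decode_surrogate_pair(w1: int, w2: int) -> int:
    """Recombine a UTF-16 surrogate pair into a single code point."""
    if not 0xD800 <= w1 <= 0xDBFF:
        raise ValueError("first word is not in the range 0xD800-0xDBFF")
    if not 0xDC00 <= w2 <= 0xDFFF:
        raise ValueError("second word is not in the range 0xDC00-0xDFFF")
    vh = w1 - 0xD800            # higher 10 bits of the 20-bit value
    vl = w2 - 0xDC00            # lower 10 bits of the 20-bit value
    return 0x10000 + ((vh << 10) | vl)


print(hex(decode_surrogate_pair(0xD950, 0xDF21)))  # 0x64321
```

The range checks reject swapped or unpaired words, which is the error a decoder has to handle when UTF-16 data is truncated or processed as if it were UCS-2.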

References

  1. Unicode. microsoft.com. Retrieved on 2008-02-01.
  2. Surrogates and Supplementary Characters. microsoft.com. Retrieved on 2008-02-01.
  3. Description of storing UTF-8 data in SQL Server. microsoft.com (December 7, 2005). Retrieved on 2008-02-01.


© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org