

Punycode is an encoding syntax by which a Unicode string of characters can be translated into the more limited character set permitted in network host names. The encoding syntax is published on the Internet in Request for Comments (RFC) 3492.

The encoding is used as part of IDNA, a system that enables the use of internationalized domain names in all scripts supported by Unicode, where the burden of translation lies entirely with the user application (a web browser, for example). IDNA performs significant pre- and post-processing in addition to its use of Punycode. For further information on IDNA, including the pre- and post-processing and the spoofing concerns, see the Internationalized domain name article.

Encoding procedure

This section demonstrates the procedure for Punycode encoding, showing how the string "bücher" is encoded as "bcher-kva".

Separation of ASCII characters

First, all basic (ASCII) characters in the string are copied directly from input to output, skipping over other characters (e.g. "bücher" → "bcher"). If and only if one or more basic characters were copied, an ASCII hyphen is added to the output next (e.g. "bücher" → "bcher-").
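The copying step can be sketched in a few lines of Python. The function name copy_basic is a hypothetical helper chosen for this illustration, not part of any library.

```python
def copy_basic(label: str) -> str:
    """Copy the basic (ASCII) code points of label, in order.

    A hyphen delimiter is appended only if at least one basic
    character was copied. (copy_basic is an illustrative name.)
    """
    basic = "".join(ch for ch in label if ord(ch) < 128)
    return basic + "-" if basic else basic


print(copy_basic("bücher"))  # bcher-
```

For a string with no ASCII characters at all, such as "ü", the sketch returns an empty string and no hyphen is added.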

Encoding of non-ASCII character insertions as code numbers

To understand the next part of the encoding process we first need to understand the behaviour of the decoder. The decoder is a state machine with two state variables i and n. i is an index into the string ranging from zero (representing a potential insertion at the start) to the current length of the extended string (representing a potential insertion at the end).

i starts at zero while n starts at 128 (the first non-ASCII code point). The state progression is monotonic. A state change either increments i or, if i is at its maximum, resets i to zero and increments n. At each state change the code point denoted by n is either inserted or not inserted.

The code numbers generated by the encoder encode how many possibilities the decoder should skip before an insertion is made. "ü" has code point 252. The basic string "bcher" has five characters and therefore six possible insertion positions (zero to five). So before reaching the possibility of inserting ü at position one, it is necessary to skip over six potential insertions for each of the 124 preceding non-ASCII code points (128 to 251) and one possible insertion (at position zero) of code point 252. The decoder must therefore be told to skip a total of (6 × 124) + 1 = 745 possible insertions before reaching the required one.
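The state machine described above can be sketched as follows. This is a simplified illustration that takes the skip counts as plain integers, leaving aside how they are written as ASCII digits; the name apply_deltas is hypothetical.

```python
def apply_deltas(basic: str, deltas: list[int]) -> str:
    """Replay the decoder's (n, i) state machine.

    basic  -- the ASCII characters already copied to the output
    deltas -- for each insertion, how many possibilities to skip first
    (apply_deltas is an illustrative name, not a library function.)
    """
    out = list(basic)
    n, i = 128, 0  # first non-ASCII code point, first position
    for delta in deltas:
        i += delta
        positions = len(out) + 1   # insertion points in the current string
        n += i // positions        # each wrap past the end increments n
        i %= positions
        out.insert(i, chr(n))
        i += 1                     # resume just after the inserted character
    return "".join(out)


print(apply_deltas("bcher", [745]))  # bücher
```

Skipping 745 possibilities advances n by 745 // 6 = 124 to 252 and leaves i at 745 % 6 = 1, which is exactly the insertion of "ü" at position one.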

Re-encoding of code numbers as ASCII sequences

Punycode uses generalized variable-length integers to represent these values. For example, this is how "kva" is used to represent the code number 745:

A number system with little-endian ordering is used, which allows variable-length codes without separate delimiters: a digit lower than a threshold value marks that it is the most significant digit, and hence the end of the number. The threshold value depends on the position in the number and also on previous insertions, to increase efficiency. Correspondingly, the weights of the digits (like the third digit from the right in ordinary numbers having a weight of 100) vary.
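A minimal sketch of writing a number in this form is shown below. It assumes the parameter values given in RFC 3492 (base 36, minimum threshold 1, maximum threshold 26, initial bias 72), which are not stated in this article; the name encode_varint is hypothetical, and the bias adaptation that follows each insertion is left out.

```python
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"
BASE, TMIN, TMAX, INITIAL_BIAS = 36, 1, 26, 72  # values from RFC 3492


def encode_varint(value: int, bias: int = INITIAL_BIAS) -> str:
    """Write value as a little-endian generalized variable-length integer."""
    out = []
    k = BASE
    q = value
    while True:
        # Threshold for this digit position, clamped to [TMIN, TMAX].
        t = min(max(k - bias, TMIN), TMAX)
        if q < t:
            break  # q fits below the threshold: it becomes the final digit
        out.append(DIGITS[t + (q - t) % (BASE - t)])
        q = (q - t) // (BASE - t)
        k += BASE
    out.append(DIGITS[q])
    return "".join(out)


print(encode_varint(745))  # kva
```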

In this case a "number system" with 36 "digits" is used, with the case-insensitive 'a' through 'z' equal to the numbers 0 through 25, and '0' through '9' equal to 26 through 35. Thus "kva" corresponds to "10 21 0". The second digit has a weight of 35 instead of 36 because, in a number with more than one digit, the first (least significant) digit is in the range b–9; an "a" would mark the end of the number. Therefore "kva" represents the number 10 + 35 × 21 = 745.
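The arithmetic above can be reproduced with a short decoding sketch. As an assumption beyond this article, it uses the parameters from RFC 3492 (base 36, thresholds clamped between 1 and 26, initial bias 72); digit_value and decode_varint are hypothetical names.

```python
BASE, TMIN, TMAX, INITIAL_BIAS = 36, 1, 26, 72  # values from RFC 3492


def digit_value(ch: str) -> int:
    """Map 'a'..'z' to 0..25 and '0'..'9' to 26..35, ignoring case."""
    ch = ch.lower()
    if "a" <= ch <= "z":
        return ord(ch) - ord("a")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0") + 26
    raise ValueError(f"not a base-36 digit: {ch!r}")


def decode_varint(digits: str, bias: int = INITIAL_BIAS) -> tuple[int, int]:
    """Decode one number; return (value, number of digits consumed)."""
    value, weight, k = 0, 1, BASE
    for pos, ch in enumerate(digits):
        d = digit_value(ch)
        value += d * weight
        t = min(max(k - bias, TMIN), TMAX)  # threshold at this position
        if d < t:
            return value, pos + 1           # digit below threshold ends the number
        weight *= BASE - t                  # e.g. 36 - 1 = 35 for the second digit
        k += BASE
    raise ValueError("input ended in the middle of a number")


print(decode_varint("kva"))  # (745, 3)
```

Here 'k' and 'v' are both at or above their threshold of 1, so decoding continues; 'a' (0) is below the third threshold of 26 and terminates the number, contributing nothing to the sum.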

For the insertion of a second special character in "bücher", the first possibility is "büücher" with code "bcher-kvaa", the second "bücüher" with code "bcher-kvab", etc. After "bücherü" with code "bcher-kvae" comes "ýbücher" with code "bcher-kvaf", etc.
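These examples can be checked with the "punycode" codec that ships with Python's standard library. The snippet is only a verification aid and assumes that codec follows RFC 3492.

```python
# Encoding the single-insertion example.
encoded = "bücher".encode("punycode")
print(encoded)  # b'bcher-kva'

# Decoding the two-insertion examples from the text.
for code in (b"bcher-kvaa", b"bcher-kvab", b"bcher-kvae", b"bcher-kvaf"):
    print(code.decode("ascii"), "->", code.decode("punycode"))
```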

To keep the encoding and decoding algorithms simple, no attempt has been made to prevent some encoded values from encoding inadmissible Unicode values; these should, however, be checked for and detected during decoding.

Compare the ASCII 'punycoded' URL http://xn--tdali-d8a8w.lv/ with its Unicode form http://tūdaliņ.lv, which contains the Latvian "u with a macron" and "n with cedilla" instead of the unmarked base characters.

Punycode is designed to work across all scripts, and to be self-optimizing by attempting to adapt to the character set ranges within the string as it operates. It is optimized for the case where the string is composed of zero or more ASCII characters and, in addition, characters from only one other script system, but it will cope with any arbitrary Unicode string. Note that for DNS use, the domain name string is assumed to have been normalized using Nameprep and (for top-level domains) filtered against an officially registered language table before being punycoded, and that the DNS protocol sets limits on the acceptable lengths of the output Punycode string.
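As a rough illustration of Nameprep followed by Punycode, Python's standard library includes an "idna" codec. It implements the older IDNA 2003 rules, and the "xn--" prefix it adds to each converted label comes from IDNA rather than from Punycode itself. The host name bücher.example is made up for this sketch.

```python
# Nameprep case-folds the label before it is punycoded,
# so the capital "B" does not survive the round trip.
ascii_name = "Bücher.example".encode("idna")
print(ascii_name)  # b'xn--bcher-kva.example'

unicode_name = ascii_name.decode("idna")
print(unicode_name)  # bücher.example
```

Labels that are already pure ASCII, such as "example", pass through unchanged.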

External links

International Components for Unicode (ICU) is an open source project of mature C/C++ and Java libraries for Unicode support.
© 2009 citizendia.org; parts available under the terms of GNU Free Documentation License, from http://en.wikipedia.org