Mojibake is the phenomenon of incorrect, unreadable characters (garbage characters) shown when computer software fails to render a text correctly according to its associated character encoding. It is a loanword from Japanese.
The Japanese word 文字化け (mojibake, [moʥibake]) is composed of 文字 (moji), which means letter, character, and 化け (bake), from the verb 化ける (bakeru), which means to appear in disguise, to take the form of, to change for the worse. Literally, it means "character changing".
Mojibake is often caused by forced display of writing systems or character encodings that are "foreign" to the user's computer system: if a computer does not have the software required to process a foreign language's characters, it will attempt to process them in its default language encoding, usually resulting in gibberish. Messages transferred between different encodings of the same language can also have mojibake problems. Japanese language users, with several different encodings historically employed, would encounter this problem relatively often. For example, the intended word "文字化け", encoded in UTF-8, is incorrectly displayed roughly as "æ–‡å­—åŒ–ã‘" in software that is configured to expect text in the Windows-1252 or ISO-8859-1 encodings, usually labeled Western. (The exact result varies, because some of the bytes involved have no printable character in those encodings.)
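The UTF-8 example can be reproduced with a short Python sketch. It assumes only the standard codecs that ship with Python; the variable names are illustrative. Python's `cp1252` codec treats byte 0x81 as undefined, so the sketch passes `errors="replace"` for that codec, while `latin-1` maps every byte to some code point and therefore allows the damage to be reversed.

```python
# Encode the intended word as UTF-8, then misread the bytes as a Western encoding.
original = "文字化け"
raw = original.encode("utf-8")          # 12 bytes, three per character

# ISO-8859-1 (latin-1) assigns a code point to every byte, so decoding never fails.
garbled_latin1 = raw.decode("latin-1")

# Windows-1252 leaves a few bytes undefined (0x81 here), so ask for U+FFFD instead
# of an exception.
garbled_cp1252 = raw.decode("cp1252", errors="replace")

# Because latin-1 is lossless byte-for-byte, the mistake can be undone.
repaired = garbled_latin1.encode("latin-1").decode("utf-8")

print(repr(garbled_latin1))
print(garbled_cp1252)
print(repaired)
```

The round trip works only because no byte was lost; once a replacement character has been substituted, as in the Windows-1252 case, the original text can no longer be recovered from the garbled string alone.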
A web browser may not be able to distinguish a page encoded in EUC-JP from one encoded in Shift-JIS if the encoding is not declared explicitly, either in the HTTP headers sent along with the document or in the HTML document's meta tags, which substitute for missing HTTP headers when the server cannot be configured to send the proper ones. Heuristics can be applied to guess at the character set, but these are not always successful.
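The lookup order described above (HTTP header first, then a meta tag) can be sketched as follows. This is a simplified illustration, not how any particular browser implements it: the function name `declared_charset` and the regular expression are assumptions made for the example, and real HTML parsers apply considerably more detailed rules.

```python
import re
from email.message import Message

# Hypothetical, simplified pattern: finds charset=... inside a <meta ...> tag.
META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)',
                          re.IGNORECASE)

def declared_charset(content_type_header, html_bytes):
    """Return the declared encoding name in lower case, or None if none is declared."""
    if content_type_header:
        msg = Message()
        msg["Content-Type"] = content_type_header
        charset = msg.get_content_charset()   # None when no charset parameter exists
        if charset:
            return charset
    # Fall back to a meta tag near the start of the document.
    match = META_CHARSET.search(html_bytes[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return None   # the caller would have to fall back on heuristics

print(declared_charset("text/html; charset=EUC-JP", b""))
print(declared_charset("text/html", b'<meta charset="Shift_JIS">'))
print(declared_charset("text/html", b"<p>no declaration</p>"))
```

When the function returns `None`, nothing in the document says which encoding was used, which is exactly the situation where guessing, and therefore mojibake, comes into play.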
In the mid 1990s, as this problem became common, several websites featured mojibake not as a problem to be tackled but simply for amusement. Words and even sentences were "deciphered" with meanings made up to deliver funny messages.
Mojibake can also occur between what appear to be the same encoding. For example, Windows and Macintosh both use the name ISO-8859-1 for a character encoding. However, each system includes characters in its encoding that are not in ISO-8859-1, and these extra characters are not compatible across systems. Many people are unaware of them and use them in websites, e-mails, blogs, and so on as common characters, and as a result, mojibake occurs.
The difficulty of resolving an instance of mojibake varies depending on the application in which it occurs and on its cause. Two of the most common applications in which mojibake may occur are web browsers and word processors. Modern browsers and word processors often support a wide array of character encodings. Browsers often allow the user to change the rendering engine's encoding setting on the fly, while word processors allow the user to select the appropriate encoding when opening a file. It may take some trial and error for users to find the correct encoding.
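The trial-and-error process can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the helper `try_encodings` and the candidate list are made up for the example, and the sample bytes are a Russian word encoded as KOI8-R. A strict decode that fails rules a candidate out, but a decode that succeeds is not proof of correctness, so a human still has to judge which result reads as real text.

```python
def try_encodings(data, candidates):
    """Decode data with each candidate; map the name to the text, or None on failure."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)   # strict decoding
        except UnicodeDecodeError:
            results[name] = None                # bytes are not valid in this encoding
    return results

# Bytes of unknown origin (here secretly KOI8-R) and a hypothetical candidate list.
mystery = "крокозябры".encode("koi8_r")
results = try_encodings(mystery, ["utf-8", "koi8_r", "cp1251", "cp866", "iso8859_5"])

for name, text in results.items():
    print(f"{name:10} {text!r}")
```

In this run UTF-8 is rejected outright, while the single-byte encodings all produce some string; only one of them is readable Russian.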
The problem gets more complicated when it occurs in an application that does not normally support a wide range of character encodings, such as a non-Unicode computer game. In this case, the user must change the operating system's encoding settings to match those of the game. However, changing the system-wide encoding settings can also cause mojibake in pre-existing applications. In Windows XP or later, a user also has the option to use Microsoft AppLocale, an application that allows locale settings to be changed per application. Even so, changing the operating system encoding settings is not possible on earlier operating systems such as Windows 98; to resolve this issue on earlier operating systems, a user would have to use third-party font rendering applications.
Mojibake rarely happens in English, since most encodings agree with ASCII on the encoding of the English alphabet.
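A quick way to see this agreement is to encode the same unaccented word under several encodings and compare the bytes. The list of encodings below is an arbitrary selection chosen for illustration, not an exhaustive survey.

```python
# An arbitrary selection of encodings that all build on ASCII for basic Latin letters.
ENCODINGS = ["ascii", "utf-8", "latin-1", "cp1252", "cp1251",
             "koi8_r", "shift_jis", "euc_jp"]

word = "mojibake"
encoded = {name: word.encode(name) for name in ENCODINGS}
all_same = len(set(encoded.values())) == 1   # identical bytes under every encoding

# A single accented letter is enough to break the agreement.
accented = {name: "ä".encode(name) for name in ("utf-8", "latin-1")}

print(all_same)
print(accented)
```

Because the bytes for plain English letters are identical, a wrong guess between these encodings leaves such text untouched; the disagreement only shows up for characters outside the ASCII range.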
In Japanese, the phenomenon is, as mentioned, called mojibake (文字化け). It is often encountered by non-Japanese users when attempting to run software written for the Japanese market.
In Chinese, this phenomenon is called luanma (simplified Chinese: 乱码; traditional Chinese: 亂碼; pinyin: luànmǎ; literally "haphazard code").
Users of Central and Eastern European languages can also be affected. Because most computers were not connected to any network during the mid- to late 1980s, there were different character encodings for every language with diacritical characters.
In Russian, mojibake is called krokozyabry (крокозя́бры). During the 1990s, several different encodings for the Cyrillic alphabet (Unix KOI8-R, Windows CP-1251, DOS 866, standard ISO 8859-5, and several others) competed. Poorly configured servers and lack of compatibility made garbled text a common and frustrating experience. Many e-mail servers stripped the 8th bit from the characters as permitted by earlier standards (which renders UTF-8 unreadable, as well as all of the above). For this reason many Cyrillic users resorted to Volapuk encoding. An even more frustrating problem emerged in the early 2000s, when the popular e-mail client Microsoft Outlook started to replace correctly entered Cyrillic characters with question marks when replying to or forwarding messages created in competing encodings.
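The effect of a mail server stripping the 8th bit can be simulated by clearing the high bit of every byte. The sketch below applies this to UTF-8 text only; the helper name `strip_eighth_bit` is hypothetical and the code does not model any specific mail server.

```python
def strip_eighth_bit(data):
    """Clear the most significant bit of every byte, as a 7-bit-only relay would."""
    return bytes(b & 0x7F for b in data)

text = "крокозябры"
utf8 = text.encode("utf-8")          # every Cyrillic letter becomes two bytes >= 0x80
stripped = strip_eighth_bit(utf8)    # same length, but every byte is now below 0x80
as_ascii = stripped.decode("ascii")  # always succeeds: only 7-bit values remain

print(utf8.hex(" "))
print(stripped.hex(" "))
print(repr(as_ascii))
```

The stripped bytes are still a valid byte string, so nothing signals an error to the recipient; the text simply arrives as unrelated ASCII characters and control codes, and the cleared bits cannot be restored.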
In Bulgarian, mojibake is often called maymunitsa (маймуница), meaning monkey's alphabet. In Serbian, it is called ђубре (đubre), meaning trash. In German, Zeichensalat (character salad) is a common term for this phenomenon.
In Poland, every company selling early DOS computers created its own encoding and simply reprogrammed the EPROMs of the video cards (typically CGA, EGA or Hercules) with the corresponding character shapes. Additionally, users of then-popular home computers (such as the Atari ST) invented their own encodings, incompatible with international standards (ISO 8859-2), vendor standards (IBM CP852, Windows CP1250) and locally agreed-upon PC/MS DOS standards (Mazovia).
The situation began to improve when, after pressure from academic and user groups, ISO 8859-2 succeeded as the "Internet standard" with limited support of the dominant vendors' software (today largely replaced by Unicode). With the numerous problems caused by the variety of encodings, even today some users tend to refer to Polish diacritical characters as krzaki ("bushes").
In Nordic languages such as Finnish, Swedish, Danish and Norwegian, mojibake is not uncommon, but it is more of an annoyance than a serious problem. For example, Finnish and Swedish use the English alphabet plus three more characters (å, ä, ö), and typically these three are the only ones that become corrupted. Being vowels, these are rarely repeated (in Swedish, Norwegian and Danish), and it is usually obvious when one character gets corrupted, e.g. the second letter in "kÃ¤rlek" (kärlek, "love"). This way, even though the reader has to guess between å, ä and ö, almost all texts remain perfectly readable. However, Finnish does have repeating vowels, and in words like "HÃ¤Ã¤yÃ¶" (hääyö) this can sometimes render text very hard to read. In both Norwegian and Danish, the three letters that set off the phenomenon are æ, ø and å.
Another type of mojibake occurs when text is erroneously parsed in a multi-byte encoding, such as one of the East Asian encodings. With this kind of mojibake, more than one character (typically two) is corrupted at once, e.g. "k舐lek" (kärlek) in Swedish, where "är" is parsed as "舐". Compared to the mojibake described above, this is harder for humans to read, since letters unrelated to the problematic å, ä and ö are now also missing; it is especially problematic for short words starting with å, ä or ö, such as "än" (which becomes, for example, "舅"). Also, since two letters are combined, the mojibake seems more random (over 50 variants compared to the normal 3, not counting the rarer capitals).
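This two-into-one effect can be reproduced by encoding the Swedish word in a single-byte encoding and decoding it as Shift-JIS. The sketch assumes the text was stored as ISO-8859-1 and uses Python's `shift_jis` codec; which kanji appears depends on that codec's tables, so the code checks only the shape of the result rather than a specific character.

```python
word = "kärlek"
single_byte = word.encode("latin-1")   # six bytes, one per letter; "ä" is 0xE4

# In Shift-JIS, 0xE4 is the first byte of a two-byte character, so it swallows
# the following "r" and the pair is displayed as one kanji.
garbled = single_byte.decode("shift_jis", errors="replace")

print(garbled)
print(len(word), "letters became", len(garbled), "characters")
```

The unaffected letters on either side survive, but the "r" after the "ä" has disappeared into the merged character, which is why this form is harder to read than the single-byte kind.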