In cryptanalysis, frequency analysis is the study of the frequency of letters or groups of letters in a ciphertext. The method is used as an aid to breaking classical ciphers.
Frequency analysis is based on the fact that, in any given stretch of written language, certain letters and combinations of letters occur with varying frequencies. Moreover, there is a characteristic distribution of letters that is roughly the same for almost all samples of that language. For instance, in a sample of English text, E tends to be very common, while X is very rare. Likewise, ST, NG, TH, and QU are common pairs of letters (termed bigrams or digraphs), while NZ and QJ are rare. The phrase "ETAOIN SHRDLU" encodes the 12 most frequent letters in typical English language text.
In some ciphers, such properties of the natural language plaintext are preserved in the ciphertext, and these patterns have the potential to be exploited in a ciphertext-only attack.
In a simple substitution cipher, each letter of the plaintext is replaced with another, and any particular letter in the plaintext will always be transformed into the same letter in the ciphertext. For instance, if all occurrences of the letter e turn into the letter X, a ciphertext message containing numerous instances of the letter X would suggest to a cryptanalyst that X represents e.
The basic use of frequency analysis is to first count the frequency of ciphertext letters and then associate guessed plaintext letters with them. If X occurs more often than any other ciphertext letter, this suggests that X corresponds to e in the plaintext, but this is not certain; t and a are also very common in English, so X might represent either of them instead. It is unlikely to be a plaintext z or q, which are much less common. Thus the cryptanalyst may need to try several combinations of mappings between ciphertext and plaintext letters.
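The counting step can be sketched in a few lines of Python. This is only an illustration, not a complete solver: the function names are hypothetical, and the plaintext ranking is simply the "ETAOIN SHRDLU" order mentioned earlier. Pairing letters purely by rank gives a starting guess that will usually need correcting by hand.

```python
from collections import Counter

# Assumption: the approximate "ETAOIN SHRDLU" order of the twelve most
# common English letters, used here as the plaintext ranking.
ENGLISH_ORDER = "etaoinshrdlu"

def letter_counts(ciphertext):
    """Count alphabetic characters, ignoring case, spaces and punctuation."""
    return Counter(ch for ch in ciphertext.upper() if ch.isalpha())

def first_guess(ciphertext, order=ENGLISH_ORDER):
    """Pair ciphertext letters, most common first, with plaintext letters
    in the given frequency order. Letters beyond len(order) get no guess."""
    ranked = [letter for letter, _ in letter_counts(ciphertext).most_common()]
    return dict(zip(ranked, order))
```

Because the true mapping rarely follows the ranking exactly, the result is best treated as a list of candidates (X may be e, t or a) rather than as a key.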
More complex uses of statistics are possible, such as counting pairs of letters (bigrams), triplets (trigrams), and so on. This gives the cryptanalyst more information; for instance, Q and U nearly always occur together in that order in English, even though Q itself is rare.
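Counting groups of letters works the same way as counting single letters. The sketch below uses a hypothetical helper, `ngram_counts`, that slides a window of length n over the letters of a text; it assumes the text is one continuous stream, so word boundaries are ignored.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count overlapping groups of n consecutive letters (n=2 for bigrams,
    n=3 for trigrams), ignoring case and non-letter characters."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    return Counter(
        "".join(letters[i:i + n]) for i in range(len(letters) - n + 1)
    )

# Example: the three most common bigrams of a short sample string.
top_bigrams = ngram_counts("XLIXLIXL", 2).most_common(3)
```

Calling `ngram_counts(ciphertext, 2).most_common(5)` and `ngram_counts(ciphertext, 3).most_common(5)` on an intercepted message gives the kind of bigram and trigram rankings a cryptanalyst would compare against those of the plaintext language.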
Suppose Eve has intercepted the cryptogram below, and it is known to be encrypted using a simple substitution cipher:
LIVITCSWPIYVEWHEVSRIQMXLEYVEOIEWHRXEXIPFEMVEWHKVSTYLXZIXLIKIIXPIJVSZEYPERRGERIMWQLMGLMXQERIWGPSRIHMXQEREKIETXMJTPRGEVEKEITREWHEXXLEXXMZITWAWSQWXSWEXTVEPMRXRSJGSTVRIEYVIEXCVMUIMWERGMIWXMJMGCSMWXSJOMIQXLIVIQIVIXQSVSTWHKPEGARCSXRWIEVSWIIBXVIZMXFSJXLIKEGAEWHEPSWYSWIWIEVXLISXLIVXLIRGEPIRQIVIIBGIIHMWYPFLEVHEWHYPSRRFQMXLEPPXLIECCIEVEWGISJKTVWMRLIHYSPHXLIQIMYLXSJXLIMWRIGXQEROIVFVIZEVAEKPIEWHXEAMWYEPPXLMWYRMWXSGSWRMHIVEXMSWMGSTPHLEVHPFKPEZINTCMXIVJSVLMRSCMWMSWVIRCIGXMWYMX
For this example, uppercase letters are used to denote ciphertext, lowercase letters are used to denote plaintext (or guesses at such), and X~t is used to express a guess that ciphertext letter X represents the plaintext letter t.
Eve could use frequency analysis to help solve the message along the following lines: counts of the letters in the cryptogram show that I is the most common single letter, XL is the most common bigram, and XLI is the most common trigram. In English, e is the most common letter, th is the most common bigram, and the is the most common trigram. This strongly suggests that X~t, L~h and I~e. The second most common letter in the cryptogram is E; since the first and second most frequent letters in English, e and t, are already accounted for, Eve guesses that E~a, the third most frequent letter. Tentatively making these assumptions, she obtains the following partially decrypted message.
heVeTCSWPeYVaWHaVSReQMthaYVaOeaWHRtatePFaMVaWHKVSTYhtZetheKeetPeJVSZaYPaRRGaReMWQhMGhMtQaReWGPSReHMtQaRaKeaTtMJTPRGaVaKaeTRaWHatthattMZeTWAWSQWtSWatTVaPMRtRSJGSTVReaYVeatCVMUeMWaRGMeWtMJMGCSMWtSJOMeQtheVeQeVetQSVSTWHKPaGARCStRWeaVSWeeBtVeZMtFSJtheKaGAaWHaPSWYSWeWeaVtheStheVtheRGaPeRQeVeeBGeeHMWYPFhaVHaWHYPSRRFQMthaPPtheaCCeaVaWGeSJKTVWMRheHYSPHtheQeMYhtSJtheMWReGtQaROeVFVeZaVAaKPeaWHtaAMWYaPPthMWYRMWtSGSWRMHeVatMSWMGSTPHhaVHPFKPaZeNTCMteVJSVhMRSCMWMSWVeRCeGtMWYMt
Using these initial guesses, Eve can spot patterns that confirm her choices, such as "that". Moreover, other patterns suggest further guesses. "Rtate" might be "state", which would mean R~s. Similarly "atthattMZe" could be guessed as "atthattime", yielding M~i and Z~m. Furthermore, "heVe" might be "here", giving V~r. Filling in these guesses, Eve gets:
hereTCSWPeYraWHarSseQithaYraOeaWHstatePFairaWHKrSTYhtmetheKeetPeJrSmaYPassGaseiWQhiGhitQaseWGPSseHitQasaKeaTtiJTPsGaraKaeTsaWHatthattimeTWAWSQWtSWatTraPistsSJGSTrseaYreatCriUeiWasGieWtiJiGCSiWtSJOieQthereQeretQSrSTWHKPaGAsCStsWearSWeeBtremitFSJtheKaGAaWHaPSWYSWeWeartheStherthesGaPesQereeBGeeHiWYPFharHaWHYPSssFQithaPPtheaCCearaWGeSJKTrWisheHYSPHtheQeiYhtSJtheiWseGtQasOerFremarAaKPeaWHtaAiWYaPPthiWYsiWtSGSWsiHeratiSWiGSTPHharHPFKPameNTCiterJSrhisSCiWiSWresCeGtiWYit
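The substitution step Eve repeats after each round of guesses can be sketched as follows. The function name `apply_guesses` is hypothetical; it follows the convention of this example, in which unsolved ciphertext letters stay uppercase and guessed plaintext letters appear in lowercase. Only the first 28 letters of the cryptogram are used, to keep the example short.

```python
def apply_guesses(ciphertext, guesses):
    """Replace each ciphertext letter that has a guess with its lowercase
    plaintext letter; leave all other letters unchanged (uppercase)."""
    return "".join(guesses.get(ch, ch) for ch in ciphertext)

# First 28 letters of the cryptogram above.
prefix = "LIVITCSWPIYVEWHEVSRIQMXLEYVE"

# Initial guesses: X~t, L~h, I~e, E~a.
guesses = {"X": "t", "L": "h", "I": "e", "E": "a"}
print(apply_guesses(prefix, guesses))  # heVeTCSWPeYVaWHaVSReQMthaYVa

# Further guesses: R~s, M~i, Z~m, V~r.
guesses.update({"R": "s", "M": "i", "Z": "m", "V": "r"})
print(apply_guesses(prefix, guesses))  # hereTCSWPeYraWHarSseQithaYra
```

Keeping the guesses in a dictionary makes it easy to withdraw one that turns out to be wrong: deleting its entry restores the uppercase ciphertext letter on the next pass.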
In turn, these guesses suggest still others (for example, "remarA" could be "remark", implying A~k) and so on, and it is relatively straightforward to deduce the rest of the letters, eventually yielding the plaintext.
hereuponlegrandarosewithagraveandstatelyairandbroughtmethebeetlefromaglasscaseinwhichitwasencloseditwasabeautifulscarabaeusandatthattimeunknowntonaturalistsofcourseagreatprizeinascientificpointofviewthereweretworoundblackspotsnearoneextremityofthebackandalongoneneartheotherthescaleswereexceedinglyhardandglossywithalltheappearanceofburnishedgoldtheweightoftheinsectwasveryremarkableandtakingallthingsintoconsiderationicouldhardlyblamejupiterforhisopinionrespectingit
At this point, it would be a good idea for Eve to insert spaces:
Hereupon Legrand arose with a grave and stately air and brought me the beetle from a glass case in which it was enclosed. It was a beautiful scarabaeus and at that time unknown to naturalists; of course a great prize in a scientific point of view. There were two round black spots near one extremity of the back and a long one near the other. The scales were exceedingly hard and glossy with all the appearance of burnished gold. The weight of the insect was very remarkable and taking all things into consideration I could hardly blame Jupiter for his opinion respecting it.
In this example from The Gold-Bug, Eve's guesses were all correct. This would not always be the case, however; the variation in statistics for individual plaintexts can mean that initial guesses are incorrect. It may be necessary to backtrack from incorrect guesses or to analyze the available statistics in much more depth than the somewhat simplified justifications given in the above example.
It is also possible that the plaintext does not exhibit the expected distribution of letter frequencies. Shorter messages are likely to show more variation. It is also possible to construct artificially skewed texts. For example, entire novels have been written that omit the letter "e" altogether, a form of literature known as a lipogram.
The first known recorded explanation of frequency analysis (indeed, of any kind of cryptanalysis) was given by the 9th-century Arab polymath Abu Yusuf Yaqub ibn Ishaq al-Sabbah Al-Kindi in A Manuscript on Deciphering Cryptographic Messages [1]. It has been suggested that close textual study of the Qur'an first brought to light that Arabic has a characteristic letter frequency. The use of frequency analysis spread, and similar systems were widely used in European states by the time of the Renaissance. By 1474 Cicco Simonetta had written a manual on deciphering encryptions of Latin and Italian text [2].
Several schemes were invented by cryptographers to defeat this weakness in simple substitution encryptions. These included:
A disadvantage of all these attempts to defeat frequency counting attacks is that they complicate both enciphering and deciphering, leading to mistakes. Famously, a British Foreign Secretary is said to have rejected the Playfair cipher because, even if schoolboys could cope successfully, as Wheatstone and Playfair had shown, 'our attachés could never learn it!'.
The rotor machines of the first half of the 20th century (for example, the Enigma machine) were essentially immune to straightforward frequency analysis. However, other kinds of analysis ("attacks") successfully decoded messages from some of those machines.
Frequency analysis requires only a basic understanding of the statistics of the plaintext language and some problem-solving skills, and, if performed by hand, some tolerance for extensive letter bookkeeping. During World War II (WWII), both the British and the Americans recruited codebreakers by placing crossword puzzles in major newspapers and running contests to see who could solve them the fastest. Several of the ciphers used by the Axis powers were breakable using frequency analysis (for example, some of the consular ciphers used by the Japanese). Mechanical methods of letter counting and statistical analysis (generally IBM card type machinery) were first used in WWII, possibly by the US Army's SIS. Today, the hard work of letter counting and analysis has been replaced by computer software, which can carry out such analysis in seconds. With modern computing power, classical ciphers are unlikely to provide any real protection for confidential data.
Frequency analysis has been described in fiction. Edgar Allan Poe's "The Gold-Bug" and Sir Arthur Conan Doyle's Sherlock Holmes tale "The Adventure of the Dancing Men" are examples of stories which describe the use of frequency analysis to attack simple substitution ciphers. The cipher in the Poe story is encrusted with several deception measures, but this is more a literary device than anything significant cryptographically.