Lempel-Ziv-Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. Lossless data compression is a class of Data compression Algorithms that allows the exact original data to be reconstructed from the compressed data In Mathematics, Computing, Linguistics and related subjects an algorithm is a sequence of finite instructions often used for Calculation Abraham Lempel (אברהם למפל born 10 february 1936 in Lvov, Poland) is an Israeli computer scientist and one Jacob Ziv (יעקב זיו 1931 - is an Israeli computer scientist who along with Abraham Lempel, developed the Lossless LZ77 Terry A Welch, along with Abraham Lempel and Jacob Ziv, developed the Lossless LZW Compression algorithm which was published in It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. Year 1984 ( MCMLXXXIV) was a Leap year starting on Sunday (link displays the 1984 Gregorian calendar) LZ77 and LZ78 are the names for the two Lossless data compression Algorithms published in papers by Abraham Lempel and Jacob Ziv in Year 1978 ( MCMLXXVIII) was a Common year starting on Sunday (link displays the 1978 Gregorian calendar) The algorithm is designed to be fast to implement but is not usually optimal because it performs only limited analysis of the data.
Contents |
The compressor algorithm builds a string translation table from the text being compressed. In Computer programming and some branches of Mathematics, a string is an ordered Sequence of Symbols. In Relational databases and Flat file databases a table is a set of data elements (values that is organized using a model of vertical columns (which are The string translation table maps fixed-length codes (usually 12-bit) to strings. A bit is a binary digit, taking a value of either 0 or 1 Binary digits are a basic unit of Information storage and communication The string table is initialized with all single-character strings (256 entries in the case of 8-bit characters). For other uses see Character. In Computer and machine-based Telecommunications terminology a character is a unit of As the compressor character-serially examines the text, it stores every unique two-character string into the table as a code/character concatenation, with the code mapping to the corresponding first character. For concatenation of general lists see Append. In Computer programming, string concatenation is the operation of joining two character As each two-character string is stored, the first character is sent to the output. Whenever a previously-encountered string is read from the input, the longest such previously-encountered string is determined, and then the code for this string concatenated with the extension character (the next character in the input) is stored in the table. The code for this longest previously-encountered string is output and the extension character is used as the beginning of the next word.
The decompressor algorithm only requires the compressed text as an input, since it can build an identical string table from the compressed text as it is recreating the original text. However, an abnormal case shows up whenever the sequence character/string/character/string/character (with the same character for each character and string for each string) is encountered in the input and character/string is already stored in the string table. When the decompressor reads the code for character/string/character in the input, it cannot resolve it because it has not yet stored this code in its table. This special case can be dealt with because the decompressor knows that the extension character is the previously-encountered character. [1]
Compressor algorithm:
w = NIL; while (read a char c) do if (wc exists in dictionary) then w = wc; else add wc to the dictionary; output the code for w; w = c; endif done output the code for w;Decompressor algorithm:
read a char k; output k; w = k; while (read a char k) do if (index k exists in dictionary) then entry = dictionary entry for k; else if (index k does not exist in dictionary && k == currSizeDict) entry = w + w[0]; else signal invalid code; endif output entry; add w+entry[0] to the dictionary; w = entry; doneThe following table shows the result of executing compressor algorithm on this input:
TOBEORNOTTOBEORTOBEORNOT
Suppose that we are using 256 (8-bit) ASCII code as the default dictionary. The length of this input is 24 characters. So this input requires 24 * 8 = 192 bits to store.
| c | w | wc | output | dictionary |
| T | <NIL> | T | ||
| O | T | TO | T | TO = <256> |
| B | O | OB | O | OB = <257> |
| E | B | BE | B | BE = <258> |
| O | E | EO | E | EO = <259> |
| R | O | OR | O | OR = <260> |
| N | R | RN | R | RN = <261> |
| O | N | NO | N | NO = <262> |
| T | O | OT | O | OT = <263> |
| T | T | TT | T | TT = <264> |
| O | T | TO | ||
| B | TO | TOB | <256> | TOB = <265> |
| E | B | BE | ||
| O | BE | BEO | <258> | BEO = <266> |
| R | O | OR | ||
| T | OR | ORT | <260> | ORT = <267> |
| O | T | TO | ||
| B | TO | TOB | ||
| E | TOB | TOBE | <265> | TOBE = <268> |
| O | E | EO | ||
| R | EO | EOR | <259> | EOR = <269> |
| N | R | RN | ||
| O | RN | RNO | <261> | RNO = <270> |
| T | O | OT | ||
| OT | <263> |
After compression we have this sequence of 9-bits codes in output:
TOBEORNOT<256><258><260><265><259><261><263>
This sequence requires 16 * 9 = 144 bits to store.
| k | w | entry | w+entry[0] | output | dictionary |
| 84 | T | T | |||
| 77 | T | O | TO | O | TO = <256> |
| 66 | O | B | OB | B | OB = <257> |
| 69 | B | E | BE | E | BE = <258> |
| 77 | E | O | EO | O | EO = <259> |
| 80 | O | R | OR | R | OR = <260> |
| 76 | R | N | RN | N | RN = <261> |
| 77 | N | O | NO | O | NO = <262> |
| 84 | O | T | OT | T | OT = <263> |
| <256> | T | TO | TT | TO | TT = <264> |
| <258> | TO | BE | TOB | BE | TOB = <265> |
| <260> | BE | OR | BEO | OR | BEO = <266> |
| <265> | OR | TOB | ORT | TOB | ORT = <267> |
| <259> | TOB | EO | TOBE | EO | TOBE = <268> |
| <261> | EO | RN | EOR | RN | EOR = <269> |
| <263> | RN | OT | RNO | OT | RNO = <270> |
The method became widely used in the program compress, which became a more or less standard utility in Unix systems circa 1986. compress is a UNIX compression program based on the LZC compression method which is an LZW implementation using variable size pointers as in Unix (officially trademarked as UNIX, sometimes also written as Unix with Small caps) is a computer (It has since disappeared from many for both legal and technical reasons. ) Several other popular compression utilities also used the method, or closely related ones.
It became very widely used after it became part of the GIF image format in 1987. Year 1987 ( MCMLXXXVII) was a Common year starting on Thursday (link displays 1987 Gregorian calendar) It may also (optionally) be used in TIFF files.
LZW compression provided a better compression ratio, in most applications, than any well-known method available up to that time. It became the first widely used universal data compression method on computers. It would typically compress large English texts to about half of their original sizes. English is a West Germanic language originating in England and is the First language for most people in the United Kingdom, the United States
Today, an implementation of the algorithm is contained within the popular Adobe Acrobat software program. Adobe Acrobat is a family of computer programs developed by Adobe Systems, designed to view create manipulate and manage files in Adobe's Portable Document
This example shows the LZW algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding the message. A dictionary coder, also sometimes known as a substitution coder, is a class of Lossless data compression algorithms which operate by searching for matches between In order to keep things clear, let us assume that we're dealing with a simple alphabet - capital letters only, and no punctuation or spaces. This example has been constructed to give reasonable compression on a very short message; when used on real data, repetition is generally less pronounced, and so the initial parts of a message will see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum. Data compression ratio, also known as compression power, is a computer-science term used to quantify the reduction in data-representation size produced by a data compression [2] A message to be sent might then look like the following:
TOBEORNOTTOBEORTOBEORNOT#
The # is a marker used to show that the end of the message has been reached. Clearly, then, we have 27 symbols in our alphabet (the 26 capital letters A through Z, plus the # character). A computer will render these as strings of bits; 5-bit strings are needed to give sufficient combinations to encompass the entire dictionary. A bit is a binary digit, taking a value of either 0 or 1 Binary digits are a basic unit of Information storage and communication As the dictionary grows, the strings will need to grow in length to accommodate the additional entries. A 5-bit string gives 25 = 32 possible combinations of bits, and so when the 33rd dictionary word is created, the algorithm will have to start using 6-bit strings (for all strings, including those which were previously represented by only five bits). Note that since the all-zero string 00000 is used, and is labeled "0", the 33rd dictionary entry will be labeled 32. The initial dictionary, then, will consist of the following:
# = 00000A = 00001B = 00010C = 00011. . . Z = 11010
If we weren't using LZW, and just sent the message as it stands (25 symbols at 5 bits each), it would require 125 bits. We will be able to compare this figure to the LZW output later. We are now in a position to apply LZW to the message.
Symbol: Bit Code: New Dictionary Entry: (= output)T 20 = 10100O 15 = 01111 28: TO <--- Don't forget, we originally had 27 symbols, so the next one is 28th. B 2 = 00010 29: OBE 5 = 00101 30: BEO 15 = 01111 31: EO <--- start using 6-bit stringsR 18 = 010010 32: ORN 14 = 001110 33: RNO 15 = 001111 34: NOT 20 = 010100 35: OTTO 28 = 011100 36: TTBE 30 = 011110 37: TOBOR 32 = 100000 38: BEOTOB 37 = 100101 39: ORTEO 31 = 011111 40: TOBERN 33 = 100001 41: EOROT 35 = 100011 42: RNO# 0 = 000000 43: OT#
This is somewhat clearer:
Current Next Output Value ExtendedSequence Char (# of bits) DictionaryNULL T T O 20 = 5 bits 27: TO <-- This IS the 28th entry, but the initial entries are numbered 0-26 so this is #27. O B 15 = 5 bits 28: OBB E 2 = 5 bits 29: BEE O 5 = 5 bits 30: EOO R 15 = 5 bits 31: OR R N 18 = 6 bits 32: RN <-- Starting at R, 6 bits are used {floor(lg2(init_dict_size + num_chars_output)) + 1}N O 14 = 6 bits 33: NO i. e. O: floor(lg2(27 + 4)) + 1 = 5 bits -> 01111O T 15 = 6 bits 34: OT R: floor(lg2(27 + 5)) + 1 = 6 bits -> 010010T T 20 = 6 bits 35: TTTO B 27 = 6 bits 36: TOBBE O 29 = 6 bits 37: BEOOR T 31 = 6 bits 38: ORTTOB E 36 = 6 bits 39: TOBEEO R 30 = 6 bits 40: EORRN O 32 = 6 bits 41: RNOOT # 34 = 6 bits 42: OT## 0 = 6 bitsTotal Length = 5*5 + 12*6 = 97 bits. In using LZW we have made a saving of 28 bits out of 125 -- we have reduced the message by almost 22%. If the message were longer, then the dictionary words would begin to represent longer and longer sections of text, allowing repeated words to be sent very compactly.
Imagine now that we have received the message produced above, and wish to decode it. We need to know in advance the initial dictionary used, but we can reconstruct the additional entries as we go, since they are always simply concatenations of previous entries. For concatenation of general lists see Append. In Computer programming, string concatenation is the operation of joining two character
Bits: Output: New Entry: Full: Partial: 10100 = 20 T 28: T? 01111 = 15 O 28: TO 29: O? 00010 = 2 B 29: OB 30: B? 00101 = 5 E 30: BE 31: E? 01111 = 15 O 31: EO 32: O? <--- start using 6-bit strings010010 = 18 R 32: OR 33: R?001110 = 14 N 33: RN 34: N?001111 = 15 O 34: NO 35: O?010100 = 20 T 35: OT 36: T?011100 = 28 TO 36: TT 37: TO? <--- for 36, only add 1st element011110 = 30 BE 37: TOB 38: BE? of next dictionary word100000 = 32 OR 38: BEO 39: OR?100101 = 37 TOB 39: ORT 40: TOB?011111 = 31 EO 40: TOBE 41: EO?100001 = 33 RN 41: EOR 42: RN?100011 = 35 OT 42: RNO 43: OT?000000 = 0 #
The only slight complication comes if the newly-created dictionary word is sent immediately. In the decoding example above, when the decoder receives the first symbol, T, it knows that symbol 28 begins with a T, but what does it end with? The problem is illustrated below. We are decoding part of a message that reads ABABA:
Bits: Output: New Entry: Full: Partial:. . . 011101 = 29 AB 46: (word) 47: AB?101111 = 47 AB? <--- what do we do here?
At first glance, this may appear to be asking the impossible of the decoder. We know ahead of time that entry 47 should be ABA, but how can the decoder work this out? The critical step is to note that 47 is built out of 29 plus whatever comes next. 47, therefore, ends with "whatever comes next". But, since it was sent immediately, it must also start with "whatever comes next", and so must end with the same symbol it starts with, namely A. This trick allows the decoder to see that 47 must be ABA.
More generally the situation occurs whenever the encoder encounters the input of the form cScSc, where c is a single character, S is a string and cS is already in the dictionary. The encoder outputs the symbol for cS putting new symbol for cSc in the dictionary. Next it sees the cSc in the input and sends the new symbol it just inserted into the dictionary. By the reasoning presented in the above example this is the only case where the newly-created symbol is sent immediately.
# Lempel-Ziv-Welch compression algorithm def compress(uncompressed): """Compress a string to a list of output symbols. Python is a general-purpose High-level programming language. Its design philosophy emphasizes programmer productivity and code readability """ # Build the dictionary. dict_size = 256 dictionary = {} for i in range(dict_size): dictionary[chr(i)] = chr(i) w = '' result = [] for c in uncompressed: wc = w + c if wc in dictionary: w = wc else: result. append(dictionary[w]) # Add wc to the dictionary. dictionary[wc] = dict_size dict_size += 1 w = c # Output the code for w. if w: result += [char for char in w] return result def decompress(compressed): """Decompress a list of output ks to a string. """ # Build the dictionary. dict_size = 256 dictionary = {} for i in range(dict_size): dictionary[chr(i)] = chr(i) w = result = compressed. pop(0) for k in compressed: if k in dictionary: entry = dictionary[k] elif k == len(dictionary): entry = w + w[0] else: raise ValueError, 'Bad compressed k: %s' % k result += entry # Add w+entry[0] to the dictionary. dictionary[dict_size] = w+entry[0] dict_size += 1 w = entry return result
How to use:
compressed = compress('TOBEORNOTTOBEORTOBEORNOT')print compresseddecompressed = decompress(compressed)print decompressed
Various patents have been issued in the United States and other countries for LZW and similar algorithms. A patent is a set of Exclusive rights granted by a State to an inventor or his assignee for a fixed period of time in exchange for a disclosure of an The United States of America —commonly referred to as the LZ78 was covered by U.S. Patent 4,464,650 by Lempel, Ziv, Cohn, and Eastman, assigned to Sperry Corporation, later Unisys Corporation, filed on August 10, 1981. Sperry Corporation (1910-1986 was a major American equipment and Electronics company whose existence spanned more than seven decades of the twentieth century Unisys Corporation ( based in Blue Bell, Pennsylvania, United States, and incorporated in Delaware, is a global provider of information technology Events 612 BC - Killing of Sinsharishkun, King of Assyrian Empire Year 1981 ( MCMLXXXI) was a Common year starting on Thursday (link displays the 1981 Two US patents were issued for the LZW algorithm: U.S. Patent 4,814,746 by Victor S. Miller and Mark N. Victor S Miller (b 3 March, 1947 in Brooklyn, New York, USA) is an American mathematician at the Center for Communications Wegman and assigned to IBM, originally filed on June 1, 1983, and U.S. Patent 4,558,302 by Welch, assigned to Sperry Corporation, later Unisys Corporation, filed on June 20, 1983. International Business Machines Corporation abbreviated IBM and nicknamed "Big Blue", is a multinational Computer Technology Events 193 - Roman Emperor Didius Julianus is Assassinated 987 - Hugh Capet is elected Year 1983 ( MCMLXXXIII) was a Common year starting on Saturday (link displays the 1983 Gregorian calendar) Events 451 - Battle of Chalons: Flavius Aetius ' defeats Attila the Hun. On June 20, 2003, this patent on the LZW algorithm expired [1]. Events 451 - Battle of Chalons: Flavius Aetius ' defeats Attila the Hun. Year 2003 ( MMIII) was a Common year starting on Wednesday of the Gregorian calendar.
US Patent 4,558,302 is the one that has caused the most controversy (See Graphics Interchange Format#Unisys and LZW patent enforcement).
Although the name of the algorithm refers to the inventors as Lempel, Ziv and Welch, some people claim that the intellectual property rightly goes to Ziv first, so the method should be called the Ziv-Lempel-Welch algorithm, and not the Lempel-Ziv-Welch algorithm. Others who distinguish between the algorithm and the code prefer calling the algorithm LZ and the code written by Welch as LZW.