| bzip2 | |
|---|---|
| File name extension | .bz2 |
| Internet media type | application/x-bzip |
| Type code | Bzp2 |
| Magic number | BZh |
| Developed by | Julian Seward |
| Type of format | Data compression |
| bzip2 | |
|---|---|
| Developed by | Julian Seward |
| Latest release | 1.0.5 / March 17, 2008 |
| OS | Cross-platform |
| Genre | data compression |
| License | Bzip2 |
| Website | bzip.org |
bzip2 is a free and open source lossless data compression algorithm and program developed by Julian Seward. Seward made the first public release of bzip2, version 0.15, in July 1996. The compressor's stability and popularity grew over the next several years, and Seward released version 1.0 in late 2000.
bzip2 compresses most files more effectively than the more traditional gzip or ZIP but is slower. In this manner it is fairly similar to other recent-generation compression algorithms. Unlike other formats such as RAR or ZIP (but similar to gzip), bzip2 is only a data compressor, not an archiver. The program itself has no facilities for multiple files, encryption or archive-splitting, in the UNIX tradition instead relying on separate external utilities such as tar and GnuPG for these tasks.
As stated on the front page of the bzip2 Web site, in most cases bzip2 is surpassed by PPM algorithms in terms of absolute compression efficiency. According to the author, bzip2 gets within ten to fifteen percent of PPM, while being roughly twice as fast at compression and six times faster at decompression.
bzip2 uses the Burrows-Wheeler transform to convert frequently recurring character sequences into strings of identical letters, and then applies a move-to-front transform and finally Huffman coding. In bzip2 the blocks are generally all the same size in plaintext, which can be selected by a command-line argument between 100 kB and 900 kB. Compression blocks are delimited by a 48-bit sequence (magic number) derived from the binary-coded decimal representation of π, 0x314159265359, with the end-of-stream similarly delimited by a value representing sqrt(π), 0x177245385090.
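The first two transforms named above can be illustrated with a deliberately naive Python sketch. It is not how bzip2 itself is implemented: the function names `bwt` and `mtf` are hypothetical, the rotation sort is quadratic in memory, and the Huffman stage is left out entirely.

```python
from typing import List, Tuple


def bwt(data: bytes) -> Tuple[bytes, int]:
    """Naive Burrows-Wheeler transform (illustrative only).

    Sorts every rotation of the input and returns the last column,
    plus the row index at which the original string appears.
    """
    if not data:
        return b"", 0
    n = len(data)
    # Each rotation is identified by its starting offset in the input.
    order = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last_column = bytes(data[(i - 1) % n] for i in order)
    return last_column, order.index(0)


def mtf(data: bytes) -> List[int]:
    """Move-to-front transform over the 256 possible byte values."""
    alphabet = list(range(256))
    out = []
    for byte in data:
        index = alphabet.index(byte)
        out.append(index)
        # Move the symbol just seen to the front of the list.
        alphabet.insert(0, alphabet.pop(index))
    return out


if __name__ == "__main__":
    transformed, origin = bwt(b"banana")
    print(transformed, origin)  # b'nnbaaa' 3
    print(mtf(transformed))     # [110, 0, 99, 99, 0, 0]
```

On this small input the block sort groups identical letters together, and the move-to-front step then turns each repeated letter into a zero, which is the kind of output that later stages can encode compactly.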
Originally, bzip2's ancestor bzip used arithmetic coding after the blocksort; this was discontinued because of patent restrictions and replaced by the Huffman coding currently used in bzip2.
bzip2 is known to be quite slow at compressing, making people opt for alternatives such as gzip when time is an issue. This problem is asymmetric, as decompression is relatively fast. Motivated by the large CPU time required for compression, a modified version was created in 2003 that supported multi-threading, giving significant speed improvements on multi-CPU and multi-core computers. As of January 2008 this functionality has not been incorporated into the main project.
bzip2 uses several layers of compression techniques stacked on top of each other, which occur in the following order during compression and the reverse order during decompression:

In the first-stage run-length encoding (RLE), for example, "AAAAAAABBBBCCCD" is replaced with "AAAA\3BBBB\0CCCD". Runs of symbols are always transformed after four consecutive symbols, even if the run-length is set to zero, to keep the transformation reversible. In the worst case, it can cause a pre-BWT expansion of 1.25 and in the best case a reduction to <0.02 of original size. Note that while the specification theoretically allows for runs of length 256–259 to be encoded, the reference encoder will not produce such output. The author of bzip2 has stated that the RLE step was a historical mistake[1] and was only intended to protect the original BWT implementation from pathological cases.

A later run-length encoding step uses two special symbols, RUNA and RUNB, which represent the run-length as a binary number greater than one (1). The sequence 0,0,0,0,0,1 would be represented as 0,RUNB,RUNA,1; RUNB and RUNA representing the value 4 in decimal. The run-length code is terminated by reaching another normal symbol. This RLE process is more flexible than the RLE of step 1, as it is able to encode arbitrarily long integers (in practice, this is usually limited by the block size, so that this step does not encode a run of more than 900000 bytes). The run-length is encoded in this fashion: assigning place values of 1 to the first bit, 2 to the second, 4 to the third, etc. in the RUNA/RUNB sequence, multiply each place value in a RUNB spot by 2, and add all the resulting place values (for RUNA and RUNB values alike) together. Thus, the sequence RUNB, RUNA results in the value (1*2 + 2) = 4. As a more complicated example:

RUNA RUNB RUNA RUNA RUNB (ABAAB)
1 2 4 8 16
1 4 4 8 32 = 49

The symbols used in the Huffman coding stage are:

0: RUNA
1: RUNB
2-257: byte values 0-255
258: end of stream, finish processing (could be as low as 2).

A .bz2 stream consists of a 4-byte header, followed by zero or more compressed blocks, immediately followed by an end-of-stream marker containing a 32-bit CRC for the plaintext whole stream processed. The compressed blocks are bit-aligned and no padding occurs.
.magic:16 = 'BZ' signature/magic number
.version:8 = 'h' for Bzip2 ('H'uffman coding), '0' for Bzip1 (deprecated)
.hundred_k_blocksize:8 = '1'..'9' block-size 100 kB-900 kB

.compressed_magic:48 = 0x314159265359 (BCD (pi))
.crc:32 = checksum for this block
.randomised:1 = 0=>normal, 1=>randomised (deprecated)
.origPtr:24 = starting pointer into BWT for after untransform
.huffman_used_map:16 = bitmap, of ranges of 16 bytes, present/not present
.huffman_used_bitmaps:0..256 = bitmap, of symbols used, present/not present (multiples of 16)
.huffman_groups:3 = 2..6 number of different Huffman tables in use
.selectors_used:15 = number of times that the Huffman tables are swapped (each 50 bytes)
*.selector_list:1..6 = zero-terminated bit runs (0..62) of MTF'ed Huffman table (*selectors_used)
.start_huffman_length:5 = 0..20 starting bit length for Huffman deltas
*.delta_bit_length:1..40 = 0=>next symbol; 1=>alter length
    { 1=>decrement length; 0=>increment length } (*(symbols+2)*groups)
.contents:2..∞ = Huffman encoded data stream until end of block

.eos_magic:48 = 0x177245385090 (BCD sqrt(pi))
.crc:32 = checksum for whole stream
.padding:0..7 = align to whole byte
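As a rough illustration of the layout above, the following Python sketch reads the 4-byte header and checks which 48-bit marker follows it. The function name `read_stream_header` and the returned dictionary keys are hypothetical. Only the marker directly after the header is byte-aligned, so this sketch does not try to locate later blocks, which are bit-aligned.

```python
import bz2

BLOCK_MAGIC = bytes.fromhex("314159265359")  # BCD digits of pi
EOS_MAGIC = bytes.fromhex("177245385090")    # BCD digits of sqrt(pi)


def read_stream_header(stream: bytes) -> dict:
    """Parse the header fields and identify the first 48-bit marker."""
    if len(stream) < 10 or stream[:2] != b"BZ":
        raise ValueError("not a bzip2 stream")
    if stream[2:3] != b"h":
        raise ValueError("unsupported version byte")
    level = stream[3] - ord("0")
    if not 1 <= level <= 9:
        raise ValueError("invalid block-size digit")

    marker = stream[4:10]
    if marker == BLOCK_MAGIC:
        first = "block"
    elif marker == EOS_MAGIC:
        first = "end_of_stream"  # a stream with zero compressed blocks
    else:
        first = "unknown"
    return {"block_size_kB": level * 100, "first_marker": first}


if __name__ == "__main__":
    # Python's standard bz2 module is used here only to produce sample input.
    print(read_stream_header(bz2.compress(b"hello", 9)))
    print(read_stream_header(bz2.compress(b"", 1)))
```

Compressing empty input gives a stream with no blocks at all, so the end-of-stream marker appears immediately after the header.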
Note for implementors: Because of the first-stage RLE compression (see above), the maximum length of plaintext that a single 900 kB bzip2 block can contain is around 46 MB (45,899,235 bytes). This can occur if the whole plaintext consists entirely of repeated values (the resulting .bz2 file in this case is 46 bytes long).[2]
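The two run-length stages described earlier can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference encoder: `rle1_encode` and `runab_value` are hypothetical names, runs are capped at 255 symbols as the text says the reference encoder does, and RUNA/RUNB sequences are written as strings of "A" and "B".

```python
def rle1_encode(data: bytes) -> bytes:
    """First-stage RLE: a run of 4..255 equal bytes becomes the first
    four bytes followed by a count byte (0..251) of further repeats."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while run < 255 and i + run < len(data) and data[i + run] == data[i]:
            run += 1
        if run >= 4:
            out += data[i:i + 4]
            out.append(run - 4)  # emitted even when zero, for reversibility
        else:
            out += data[i:i + run]
        i += run
    return bytes(out)


def runab_value(symbols: str) -> int:
    """Decode a RUNA/RUNB sequence: place values 1, 2, 4, ... where a
    RUNB ('B') position counts double and a RUNA ('A') position once."""
    total = 0
    place = 1
    for symbol in symbols:
        if symbol not in "AB":
            raise ValueError("expected only 'A' (RUNA) or 'B' (RUNB)")
        total += place * (2 if symbol == "B" else 1)
        place *= 2
    return total


if __name__ == "__main__":
    print(rle1_encode(b"AAAAAAABBBBCCCD"))  # b'AAAA\x03BBBB\x00CCCD'
    print(runab_value("BA"))     # 4
    print(runab_value("ABAAB"))  # 49
```

A run of exactly four bytes grows to five, which is the 1.25 worst case, while a run of 255 bytes shrinks to five, a ratio just under 0.02.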
In Unix, bzip2 can be used combined with or independently of tar: bzip2 file to compress and bzip2 -d file.bz2 to uncompress (the alias bunzip2 for decompression may also be used).
bzip2's command line flags are mostly the same as in gzip. So, to extract from a bzip2-compressed tar-file:
bzip2 -d <archivefile.tar.bz2 | tar -xf -
or
bunzip2 <archivefile.tar.bz2 | tar -xf -
To create a bzip2-compressed tar-file:
tar -cf - filenames | bzip2 >archivefile.tar.bz2
GNU tar supports a -j flag, which allows creation of tar.bz2 files without a pipeline:
tar -cjf archivefile.tar.bz2 file-list
Decompressing in GNU tar:
tar -xjf archivefile.tar.bz2
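For readers scripting these steps, roughly the same create and extract operations can be done with Python's standard `tarfile` module, which is a swapped-in alternative to the command-line tools above rather than part of bzip2. The helper names `create_tar_bz2` and `extract_tar_bz2` are hypothetical, and storing each file under its base name is an assumption made to keep the sketch short.

```python
import os
import tarfile


def create_tar_bz2(archive_path: str, paths: list) -> None:
    """Write the given files into a bzip2-compressed tar archive."""
    with tarfile.open(archive_path, "w:bz2") as tar:
        for path in paths:
            # Store each entry under its base name (an assumption of this sketch).
            tar.add(path, arcname=os.path.basename(path))


def extract_tar_bz2(archive_path: str, destination: str) -> None:
    """Unpack a bzip2-compressed tar archive into the destination directory."""
    with tarfile.open(archive_path, "r:bz2") as tar:
        tar.extractall(destination)
```

As with the shell pipelines, the archiving is done by tar and only the resulting byte stream is compressed by bzip2.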
libbzip2 that has SMP parallelisation "hacked in" by Konstantin Isakov.
[2] $ dd if=/dev/zero bs=45899235 count=1 | bzip2 -vvvv | wc -c
An even smaller file of 40 bytes can be achieved by using an input containing entirely values of 251, an apparent compression ratio of 1147480:1.