In
computer science and
information theory,
data compression or
source coding is the process of encoding information using fewer
bits (or other information-bearing units) than an
unencoded representation would use, through the use of specific
encoding schemes. For example, this article could be encoded with fewer bits if one were to accept the convention that the word "compression" be encoded as "comp." One popular instance of compression with which many computer users are familiar is the
ZIP file format, which, as well as providing compression, acts as an
archiver, storing many files in a single output file.
As with any communication, compressed data communication only works when both the
sender and receiver of the
information understand the encoding scheme. For example, this text makes sense only if the receiver understands that it is intended to be interpreted as characters representing the English language. Similarly, compressed data can only be understood if the decoding method is known by the receiver.
Compression is useful because it helps reduce the consumption of expensive resources, such as
hard disk space or transmission
bandwidth. On the downside, compressed data must be decompressed to be viewed (or heard), and this extra processing may be detrimental to some applications. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed (decompressing the video in full before watching it may be inconvenient and requires storage space for the decompressed video). The design of data compression schemes therefore involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (if using a lossy compression scheme), and the computational resources required to compress and decompress the data.
Lossless vs. lossy compression
A very simple kind of compression is run-length encoding, wherein large runs of consecutive identical data values are replaced by a simple code with the data value and the length of the run. This is an example of lossless data compression. It is often used to optimize disk space on office computers, or to make better use of the connection bandwidth in a computer network. For symbolic data such as spreadsheets, text, executable programs, etc., losslessness is essential because changing even a single bit cannot be tolerated (except in some limited cases).
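As an illustration, a minimal run-length encoder and decoder might look like the following Python sketch (the function names are chosen here purely for illustration):

    def rle_encode(data):
        # Replace each run of identical values with a (value, run_length) pair.
        encoded = []
        i = 0
        while i < len(data):
            run_value, run_length = data[i], 1
            while i + run_length < len(data) and data[i + run_length] == run_value:
                run_length += 1
            encoded.append((run_value, run_length))
            i += run_length
        return encoded

    def rle_decode(pairs):
        # Expand the (value, run_length) pairs back to the original sequence.
        decoded = []
        for value, length in pairs:
            decoded.extend([value] * length)
        return decoded

    # "WWWWWBBBW" becomes [('W', 5), ('B', 3), ('W', 1)] and decodes back exactly.
    assert rle_decode(rle_encode(list("WWWWWBBBW"))) == list("WWWWWBBBW")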
For visual and audio data, some loss of quality can be tolerated without losing the essential nature of the data. By taking advantage of the limitations of the human sensory system, a great deal of space can be saved while producing an output which is nearly indistinguishable from the original. These lossy data compression methods typically offer a three-way tradeoff between compression speed, compressed data size and quality loss.
Applications
Lossy
image compression is used in
digital cameras, greatly increasing their storage capacities while hardly degrading picture quality at all. Similarly,
DVDs use the lossy
MPEG-2 codec for
video compression.
In lossy
audio compression, methods of
psychoacoustics are used to remove non-audible (or less audible) components of the
signal. Compression of human speech is often performed with even more specialized techniques, so that "
speech compression" or "voice coding" is sometimes distinguished as a separate discipline from "audio compression". Different audio and speech compression standards are listed under
audio codecs. Voice compression is used in
Internet telephony, for example, while audio compression is used for CD ripping and is decoded by audio players.
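As a deliberately oversimplified sketch of that idea (real audio codecs use filter banks and elaborate psychoacoustic masking models, not a single magnitude threshold; the function name and the 10% keep fraction below are assumptions made purely for illustration):

    import numpy as np

    def toy_lossy_audio(block, keep_fraction=0.1):
        # Transform one block of samples to the frequency domain, keep only the
        # strongest components, and transform back; the discarded detail is lost.
        spectrum = np.fft.rfft(block)
        magnitudes = np.abs(spectrum)
        threshold = np.quantile(magnitudes, 1.0 - keep_fraction)
        spectrum[magnitudes < threshold] = 0
        # Only the surviving (index, coefficient) pairs would need to be stored.
        return np.fft.irfft(spectrum, n=len(block))

    # A 440 Hz tone plus faint noise: the tone survives, most of the noise does not.
    t = np.arange(1024) / 44100.0
    block = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(1024)
    approximation = toy_lossy_audio(block)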
Theory
The theoretical background of compression is provided by
information theory (which is closely related to
algorithmic information theory) and by
rate-distortion theory. These fields of study were essentially created by
Claude Shannon, who published fundamental papers on the topic in the late 1940s and early 1950s. Doyle and Carlson (
2000) wrote that data compression "has one of the simplest and most elegant design theories in all of engineering".
Cryptography and
coding theory are also closely related. The idea of data compression is deeply connected with statistical inference.
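Shannon's entropy gives the lower bound on the average number of bits per symbol that any lossless code can achieve for a given source. As a small illustrative sketch (not part of the original article), the empirical entropy of a message can be estimated from its symbol frequencies:

    import math
    from collections import Counter

    def empirical_entropy(message):
        # Entropy in bits per symbol, estimated from the observed symbol frequencies.
        counts = Counter(message)
        total = len(message)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    # A repetitive message has low entropy and is highly compressible;
    # uniformly distributed symbols give the maximum entropy for the alphabet.
    print(empirical_entropy("aaaaaaab"))   # about 0.54 bits per symbol
    print(empirical_entropy("abcdefgh"))   # 3.0 bits per symbol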
Many lossless data compression systems can be viewed in terms of a four-stage model. Lossy data compression systems typically include even more stages, including, for example, prediction, frequency transformation, and quantization.
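As a toy illustration of two such stages, the sketch below predicts each sample from the previous reconstruction and stores only a coarsely quantized residual; the step size of 4 is an arbitrary assumption, and no particular codec is implied:

    def lossy_encode(samples, step=4):
        encoded, prediction = [], 0
        for s in samples:
            residual = s - prediction          # prediction stage
            q = round(residual / step)         # quantization stage: information is lost here
            encoded.append(q)
            prediction += q * step             # track what the decoder will reconstruct
        return encoded

    def lossy_decode(encoded, step=4):
        decoded, prediction = [], 0
        for q in encoded:
            prediction += q * step
            decoded.append(prediction)
        return decoded

    # The round trip is close to, but not exactly, the original samples.
    print(lossy_decode(lossy_encode([10, 12, 13, 13, 50, 52, 53])))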
The Lempel-Ziv (LZ) compression methods are among the most popular algorithms for lossless storage.
DEFLATE is a variation on LZ which is optimized for decompression speed and compression ratio, although compression can be slow. DEFLATE is used in
PKZIP,
gzip and
PNG.
LZW (Lempel-Ziv-Welch) is used in GIF images. Also noteworthy are the LZR (LZ-Renau) methods, which serve as the basis of the Zip method. LZ methods utilize a table-based compression model where table entries are substituted for repeated strings of data. For most LZ methods, this table is generated dynamically from earlier data in the input. The table itself is often
Huffman encoded (e.g. SHRI, LZX). A current LZ-based coding scheme that performs well is
LZX, used in Microsoft's
CAB format.
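A minimal sketch of this table-based idea, using LZW (simplified: a real implementation, such as the one in GIF, packs the codes into a variable number of bits and bounds the table size):

    def lzw_encode(data):
        # Start the string table with all single-byte values.
        table = {chr(i): i for i in range(256)}
        next_code, current, output = 256, "", []
        for ch in data:
            candidate = current + ch
            if candidate in table:
                current = candidate            # keep extending the current match
            else:
                output.append(table[current])  # emit the code for the longest known string
                table[candidate] = next_code   # grow the table from earlier input
                next_code += 1
                current = ch
        if current:
            output.append(table[current])
        return output

    # Repeated substrings such as "TOBEOR" are replaced by single table indices.
    print(lzw_encode("TOBEORNOTTOBEORTOBEORNOT"))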
The very best compressors use probabilistic models whose predictions are coupled to an algorithm called
arithmetic coding. Arithmetic coding, invented by Jorma Rissanen, and turned into a practical method by Witten, Neal, and Cleary, achieves superior compression to the better-known Huffman algorithm, and lends itself especially well to adaptive data compression tasks where the predictions are strongly context-dependent. Arithmetic coding is used in the bilevel image-compression standard
JBIG, and the document-compression standard
DjVu. The text entry system Dasher is an inverse arithmetic coder.
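As a minimal floating-point sketch of the idea (illustrative only: practical arithmetic coders use integer ranges with renormalization and usually adapt their model as they go; the static three-symbol model below is an assumption made for this example), the encoder narrows [0, 1) to an interval whose width is the product of the probabilities of the symbols seen, and any number inside that interval identifies the message:

    def cumulative(probs):
        # Map each symbol to the start of its slice of [0, 1).
        cum, running = {}, 0.0
        for s in sorted(probs):
            cum[s] = running
            running += probs[s]
        return cum

    def arithmetic_encode(message, probs):
        cum = cumulative(probs)
        low, high = 0.0, 1.0
        for ch in message:
            width = high - low
            high = low + width * (cum[ch] + probs[ch])
            low = low + width * cum[ch]
        return (low + high) / 2               # any value in the final interval will do

    def arithmetic_decode(code, length, probs):
        cum = cumulative(probs)
        low, high, message = 0.0, 1.0, []
        for _ in range(length):
            width = high - low
            value = (code - low) / width
            for s in sorted(probs):
                if cum[s] <= value < cum[s] + probs[s]:
                    message.append(s)
                    high = low + width * (cum[s] + probs[s])
                    low = low + width * cum[s]
                    break
        return "".join(message)

    model = {"a": 0.6, "b": 0.3, "c": 0.1}
    code = arithmetic_encode("aabac", model)
    assert arithmetic_decode(code, 5, model) == "aabac"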
Matt Mahoney, one of the three founders of the Hutter Prize, claims that "Compression is Equivalent to General Intelligence".
Comparison of compression methods
An independent comparison of different data compression methods (results and software, in French; Airelle, 2007) ranked each method within each category of file. The original ranking table is copyrighted and is not reproduced here. PAQ8 (the kgb archiver is a Windows GUI for the older PAQ7) performs considerably better than all of the methods included in that comparison.
Overall, the three best methods tested are rk, uha and 7z. WinRK is commercial software, UHarc is freeware, and only 7-Zip is free, open-source software (LGPL licence) that can also be used on Linux.
Text files, such as .htm or .txt, can be compressed heavily.
Some files are already compressed (e.g. .mp3 or .zip), so the compression ratio for such files is poor; because of the header data the archiver adds, the output can even end up slightly larger than the original.
To be more representative of overall performance, the global score (out of 20) is calculated with a non-parametric formula based on the sum of the ranks (1 to 20) of each of the 20 tested methods.
See also
- Algorithmic complexity theory
- Information entropy
- Self-extracting archive
- Image compression
- Speech compression
- Video compression
- Multimedia compression
- Minimum description length
- Minimum message length (two-part lossless compression designed for inference)
- List of archive formats
- List of file archivers
- Comparison of file archivers
- List of Unix programs
- Free file format
- HTTP compression
- Reverse Delta

Data compression topics
Compression algorithms

Lossless data compression:
- run-length encoding
- dictionary coders
  - LZ77 & LZ78
  - LZW
- Burrows-Wheeler transform
- prediction by partial matching (also known as PPM)
- context mixing
- Dynamic Markov Compression (DMC)
- entropy encoding
  - Huffman coding (simple entropy coding; commonly used as the final stage of compression)
  - Adaptive Huffman coding
  - arithmetic coding (more advanced)
  - Shannon-Fano coding
  - range encoding (same as arithmetic coding, but looked at in a slightly different way)
  - T-code, a variant of Huffman code
  - Golomb coding (simple entropy coding for infinite input data with a geometric distribution)
  - universal codes (entropy coding for infinite input data with an arbitrary distribution)
    - Elias gamma coding
    - Fibonacci coding

Lossy data compression:
- discrete cosine transform
- fractal compression
  - fractal transform
- wavelet compression
- vector quantization
- linear predictive coding
- Distributed Source Coding Using Syndromes, for correlated data
- Modulo-N code for correlated data
- A-law Compander
- Mu-law Compander

Example implementations
Data collections, commonly used for comparing compression algorithms:
- Canterbury Corpus
- Calgary Corpus