Huffman coding

In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly used for lossless data compression. The process of finding or using such a code is Huffman coding, an algorithm developed by David A. Huffman while he was a Sc.D. student at MIT, and published in the 1952 paper "A Method for the Construction of Minimum-Redundancy Codes".^[1]

The output from Huffman's algorithm can be viewed as a variable-length code table for encoding a source symbol (such as a character in a file). The algorithm derives this table from the estimated probability or frequency of occurrence (weight) for each possible value of the source symbol. As in other entropy encoding methods, more common symbols are generally represented using fewer bits than less common symbols. Huffman's method can be implemented efficiently, finding a code in time linear in the number of input weights if these weights are already sorted.^[2] However, although optimal among methods that encode symbols separately, Huffman coding is not always optimal among all compression methods; it is replaced with arithmetic coding^[3] or asymmetric numeral systems^[4] if a better compression ratio is required.
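One well-known way to reach linear time on pre-sorted weights is a two-queue construction. The sketch below illustrates that idea and is not necessarily the method of the cited reference. The function name `huffman_cost_sorted` is hypothetical. It returns only the weighted path length (the sum of all merged node weights), not the codewords.

```python
from collections import deque

def huffman_cost_sorted(sorted_weights):
    """Weighted path length of a Huffman code for weights in ascending order."""
    leaves = deque(sorted_weights)   # original weights, smallest first
    internal = deque()               # merged weights, created in nondecreasing order

    def pop_smallest():
        # The smallest remaining node is at the front of one of the two queues.
        if not internal or (leaves and leaves[0] <= internal[0]):
            return leaves.popleft()
        return internal.popleft()

    total = 0
    while len(leaves) + len(internal) > 1:
        merged = pop_smallest() + pop_smallest()
        internal.append(merged)
        total += merged              # each merge adds one bit to every leaf below it
    return total
```

Because merged weights are produced in nondecreasing order, no heap or re-sorting is needed, and each node is enqueued and dequeued once.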

History

In 1951, David A. Huffman and his MIT information theory classmates were given the choice of a term paper or a final exam. The professor, Robert M. Fano, assigned a term paper on the problem of finding the most efficient binary code. Huffman, unable to prove any codes were the most efficient, was about to give up and start studying for the final when he hit upon the idea of using a frequency-sorted binary tree and quickly proved this method the most efficient.^[5]

In doing so, Huffman outdid Fano, who had worked with Claude Shannon to develop a similar code. Building the tree from the bottom up guaranteed optimality, unlike the top-down approach of Shannon–Fano coding.

Terminology

Huffman coding uses a specific method for choosing the representation for each symbol, resulting in a prefix code (sometimes called a "prefix-free code"), that is, a code in which the bit string representing a particular symbol is never a prefix of the bit string representing any other symbol. Huffman coding is such a widespread method for creating prefix codes that the term "Huffman code" is widely used as a synonym for "prefix code" even when such a code is not produced by Huffman's algorithm.
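The prefix property can be checked mechanically. The following is a minimal sketch, assuming codewords are given as strings of `0` and `1`; the helper name `is_prefix_free` is hypothetical.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    # In lexicographic order, a word that is a prefix of some other word
    # is also a prefix of its immediate successor, so comparing neighbours suffices.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))
```

For instance, `is_prefix_free(["0", "10", "11"])` holds, while `["0", "01", "11"]` fails because `0` is a prefix of `01`. Duplicate codewords are also rejected.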

Problem definition

Informal description

Given: A set of symbols and their weights (usually proportional to probabilities).
Find: A prefix-free binary code (a set of codewords) with minimum expected codeword length (equivalently, a tree with minimum weighted path length from the root).
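A common way to solve this problem is to merge the two lowest-weight nodes repeatedly until a single tree remains. The sketch below does this with a binary heap. The function name `huffman_code`, the dict-based interface and the tie-breaking counter are illustrative assumptions, and a non-empty input is assumed.

```python
import heapq
from itertools import count

def huffman_code(weights):
    """Map each symbol to a codeword; `weights` maps symbol -> positive weight."""
    if len(weights) == 1:
        # A lone symbol still needs one bit so that it can be written out.
        return {next(iter(weights)): "0"}
    tiebreak = count()  # unique counter so equal weights never compare the dicts
    heap = [(w, next(tiebreak), {sym: ""}) for sym, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, left = heapq.heappop(heap)    # lightest subtree
        w1, _, right = heapq.heappop(heap)   # second lightest subtree
        # Merging pushes every leaf one level down: prepend the branch bit.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w0 + w1, next(tiebreak), merged))
    return heap[0][2]
```

Different tie-breaking rules can produce different codewords for the same weights, but every result has the same minimum expected codeword length.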

Formalized description

Input.
Alphabet $A=(a_{1},a_{2},\dots ,a_{n})$, which is the symbol alphabet of size $n$.
Tuple $W=(w_{1},w_{2},\dots ,w_{n})$, which is the tuple of the (positive) symbol weights (usually proportional to probabilities), i.e. $w_{i}=\operatorname {weight} \left(a_{i}\right),\,i\in \{1,2,\dots ,n\}$.

Output.
Code $C\left(W\right)=(c_{1},c_{2},\dots ,c_{n})$, which is the tuple of (binary) codewords, where $c_{i}$ is the codeword for $a_{i},\,i\in \{1,2,\dots ,n\}$.

Goal.
Let ${\textstyle L\left(C\left(W\right)\right)=\sum _{i=1}^{n}{w_{i}\operatorname {length} \left(c_{i}\right)}}$ be the weighted path length of code $C$. Condition: $L\left(C\left(W\right)\right)\leq L\left(T\left(W\right)\right)$ for any code $T\left(W\right)$.
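The objective $L\left(C\left(W\right)\right)$ is simple to evaluate for any candidate code. The following is a minimal sketch, assuming weights and codewords are stored in dicts keyed by symbol; the name `weighted_path_length` is hypothetical.

```python
def weighted_path_length(weights, code):
    """Sum of w_i * length(c_i) over all symbols."""
    return sum(w * len(code[sym]) for sym, w in weights.items())

# Two candidate codes for the same (illustrative) weights.
weights = {"a": 0.5, "b": 0.25, "c": 0.25}
short_code = {"a": "0", "b": "10", "c": "11"}
fixed_code = {"a": "00", "b": "01", "c": "10"}
```

Here `weighted_path_length(weights, short_code)` is 1.5 and `weighted_path_length(weights, fixed_code)` is 2.0, so the variable-length code satisfies the condition against the fixed-length one.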

Example

We give an example of the result of Huffman coding for a code with five characters and given weights. We will not verify that it minimizes L over all codes, but we will compute L and compare it to the Shannon entropy H of the given set of weights; the result is nearly optimal.

Input (A, W)	Symbol ($a_{i}$)	a	b	c	d	e	Sum
Input (A, W)	Weights ($w_{i}$)	0.10	0.15	0.30	0.16	0.29	= 1
Output C	Codewords ($c_{i}$)	`010`	`011`	`11`	`00`	`10`
	Codeword length in bits ($l_{i}$)	3	3	2	2	2
	Contribution to weighted path length ($l_{i}w_{i}$)	0.30	0.45	0.60	0.32	0.58	L(C) = 2.25
Optimality	Probability budget ($2^{-l_{i}}$)	1/8	1/8	1/4	1/4	1/4	= 1.00
	Information content in bits ($-\log _{2}w_{i}$) ≈	3.32	2.74	1.74	2.64	1.79
	Contribution to entropy ($-w_{i}\log _{2}w_{i}$)	0.332	0.411	0.521	0.423	0.518	H(A) = 2.205

For any code that is biunique, meaning that the code is uniquely decodable, the sum of the probability budgets across all symbols is always less than or equal to one. In this example, the sum is exactly equal to one; as a result, the code is termed a complete code. If this is not the case, one can always derive an equivalent code by adding extra symbols (with associated zero probabilities), to make the code complete while keeping it biunique.
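The probability budget can be summed exactly with rational arithmetic. The sketch below assumes the codeword lengths are given as a list of positive integers; the name `kraft_sum` is hypothetical.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Sum of 2^(-l) over all codeword lengths, as an exact fraction."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

example_lengths = [3, 3, 2, 2, 2]   # lengths of 010, 011, 11, 00, 10
incomplete_lengths = [2, 2, 2]      # e.g. 00, 01, 10 with 11 unused
```

For the example code the sum is exactly 1, so the code is complete. For `incomplete_lengths` the sum is 3/4, which leaves room for one more two-bit codeword.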

As defined by Shannon (1948), the information content $h$ (in bits) of each symbol $a_{i}$ with non-zero probability is

$$h(a_{i})=\log _{2}{1 \over w_{i}}.$$

The entropy $H$ (in bits) is the weighted sum, across all symbols $a_{i}$ with non-zero probability $w_{i}$, of the information content of each symbol:

$$H(A)=\sum _{w_{i}>0}w_{i}h(a_{i})=\sum _{w_{i}>0}w_{i}\log _{2}{1 \over w_{i}}=-\sum _{w_{i}>0}w_{i}\log _{2}{w_{i}}.$$

(Note: A symbol with zero probability has zero contribution to the entropy, since $\lim _{w\to 0^{+}}w\log _{2}w=0$.)
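The entropy formula translates directly into code. The sketch below assumes the weights already sum to one and skips zero weights, in line with the note above; the name `entropy_bits` is hypothetical.

```python
import math

def entropy_bits(weights):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(w * math.log2(w) for w in weights if w > 0)

example_weights = [0.10, 0.15, 0.30, 0.16, 0.29]
H = entropy_bits(example_weights)   # approximately 2.205 bits
L = 2.25                            # weighted path length from the example table
redundancy = L - H                  # how far the code is above the entropy bound
```

For the example weights the redundancy is about 0.045 bits per symbol, which is the sense in which the code is nearly optimal.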

[1]

[2]

[3]

[4]

[5]

Char	Freq	Code
space	7	111
a	4	010
e	4	000
f	3	1101
h	2	1010
i	2	1000
m	2	0111
n	2	0010
s	2	1011
t	2	0110
l	1	11001
o	1	00110
p	1	10011
r	1	11000
u	1	00111
x	1	10010
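The table above can be checked against the optimum for its frequencies. The sketch below copies the rows into a list and compares the total encoded length with the minimum obtained by repeatedly merging the two smallest frequencies. The names `TABLE`, `total_bits` and `optimal_total_bits` are hypothetical.

```python
import heapq

# (symbol, frequency, codeword) rows copied from the table above.
TABLE = [
    ("space", 7, "111"), ("a", 4, "010"), ("e", 4, "000"), ("f", 3, "1101"),
    ("h", 2, "1010"), ("i", 2, "1000"), ("m", 2, "0111"), ("n", 2, "0010"),
    ("s", 2, "1011"), ("t", 2, "0110"), ("l", 1, "11001"), ("o", 1, "00110"),
    ("p", 1, "10011"), ("r", 1, "11000"), ("u", 1, "00111"), ("x", 1, "10010"),
]

def total_bits(table):
    """Bits needed to encode every occurrence with the table's codewords."""
    return sum(freq * len(code) for _, freq, code in table)

def optimal_total_bits(freqs):
    """Minimum total bits for any prefix code: the sum of all merged weights."""
    heap = list(freqs)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total
```

Both functions give 135 bits for the 36 symbol occurrences in the table, an average of 3.75 bits per symbol, so the tabulated code attains the minimum for these frequencies.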