Information theory and Shannon's source coding theorem

Entry 001 · 2026-01-25

Shannon's 1948 paper introduced both information theory and a specific quantitative statement about lossless compression: a source emitting symbols drawn i.i.d. from a distribution with entropy H can be encoded into a binary string of expected length arbitrarily close to H bits per symbol, but no scheme can do better in the long run.

The proof is in two parts. The achievability side is constructive: build a typical-set encoder. For block length n large enough, almost all observed sequences belong to a set of size roughly 2^{nH}; index that set and the index needs about nH bits. The converse side is a counting argument: if you have fewer than 2^{nH} distinct codewords, you cannot uniquely encode sequences from a set whose probability mass concentrates on 2^{nH} elements without introducing decoding errors.

The theorem is asymptotic and makes no statement about specific short messages — that's the territory of Kolmogorov complexity. It also says nothing about computability of the optimal encoder; the typical-set encoder is only mildly practical and was quickly superseded by Huffman codes (which are exactly optimal for single-symbol encodings) and later by arithmetic codes (which are essentially optimal at the block level and trivially adaptive).

The result still anchors a lot of modern compression intuition. Whenever you see a compressor advertised as "matching the entropy of the source", what is being matched is the per-symbol entropy under some assumed model; the gap to Shannon's bound is a measure of how well the model fits the actual source. The bound itself is not the achievement; the achievement is reducing the engineering problem of compression to the modelling problem of estimating H.