The Problem
Shannon was working at Bell Labs on the fundamental problem of communication: how do you transmit messages efficiently over a noisy channel? To answer this, he first needed a way to measure how much information a source produces.
Think of receiving a stream of symbols. If every symbol is always "A", there is no surprise — you already know what's coming, so the information content is zero. But if each of 26 letters is equally likely, every new letter carries maximum uncertainty. Shannon needed a single number to capture this "average surprise."
The Three Axioms
Rather than proposing a formula and hoping it works, Shannon took a different approach. He wrote down three properties that any reasonable measure of uncertainty must satisfy, then asked: what functions meet all three?
Continuity
The measure should change smoothly. A tiny adjustment to any probability should produce only a tiny change in the uncertainty value — no sudden jumps or discontinuities.
Monotonicity
When all outcomes are equally likely, more possible outcomes means more uncertainty. Choosing from 26 letters is more uncertain than choosing from 2. The measure should increase with the number of equally-probable options.
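As a rough illustration of this axiom, the sketch below assumes the result stated later in this article (n equally likely outcomes have entropy log n) and uses base 2. The function name `uniform_entropy_bits` is illustrative, not from Shannon's paper.

```python
import math

def uniform_entropy_bits(n: int) -> float:
    """Entropy in bits of n equally likely outcomes, i.e. log2(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.log2(n)

# Uncertainty should grow as the number of equally likely options grows.
counts = (1, 2, 4, 26)
values = [uniform_entropy_bits(n) for n in counts]
for n, h in zip(counts, values):
    print(f"n = {n:2d}: {h:.3f} bits")
```

A single possible outcome gives 0 bits, two give 1 bit, and 26 letters give roughly 4.7 bits, so the value rises with the number of options as the axiom requires.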
Composition (Recursion)
If a choice can be decomposed into successive sub-choices, the total uncertainty must equal the uncertainty of the first choice plus the uncertainty of each later sub-choice, weighted by the probability that the sub-choice is actually needed. For example, choosing one letter from {A, B, C} is equivalent to first choosing {A} vs {B or C}, then (if needed) choosing between {B} and {C}. The measure must give the same answer either way.
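The {A, B, C} example can be checked numerically. The sketch below assumes the three letters are equally likely (the text does not say so) and uses the entropy formula from the uniqueness theorem with base-2 logarithms. The helper `entropy_bits` is an illustrative name.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Direct choice among A, B, C, each assumed to have probability 1/3.
direct = entropy_bits([1/3, 1/3, 1/3])

# Two-stage choice: first {A} vs {B or C}, then B vs C.
# The second stage is only reached with probability 2/3, so it is weighted by 2/3.
first_stage = entropy_bits([1/3, 2/3])
second_stage = entropy_bits([1/2, 1/2])
staged = first_stage + (2/3) * second_stage

print(f"direct: {direct:.6f} bits")
print(f"staged: {staged:.6f} bits")
```

Both routes give log₂ 3 bits, which is the agreement the composition axiom demands.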
The Uniqueness Theorem
Shannon then proved, mathematically rather than by intuition, that there is exactly one form of function satisfying all three axioms:
$$H = -K \sum_{i=1}^{n} p_i \log p_i$$
where pᵢ is the probability of outcome i, n is the number of possible outcomes, and K is a positive constant whose choice determines the unit of measurement.
When K = 1 and the logarithm uses base 2, entropy is measured in bits. An entropy of H bits is the minimum average number of yes/no questions needed to identify which outcome occurred, and an optimal questioning strategy comes within one question of that bound. When the natural logarithm (ln) is used, the unit is called a nat. The choice of base doesn't affect the behaviour of the measure, only its scale.
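A minimal sketch of the formula with K = 1, showing how the logarithm base changes only the unit. The function name `entropy` and its `base` parameter are illustrative choices, not taken from Shannon's paper.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy with K = 1, using logarithms in the given base."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Terms with p = 0 are skipped, following the convention 0 * log 0 = 0.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
letters = [1 / 26] * 26  # 26 equally likely letters

print(f"fair coin: {entropy(fair_coin):.3f} bits")
print(f"letters:   {entropy(letters):.3f} bits")
print(f"letters:   {entropy(letters, base=math.e):.3f} nats")
```

The value in nats is the value in bits multiplied by ln 2, so the ordering of any two distributions is the same in either unit.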
This is a uniqueness result, not an invention. Shannon didn't choose this formula because it was convenient. He proved that, apart from the choice of the constant K, no other function satisfies the three axioms simultaneously. Anyone starting from the same requirements would inevitably arrive at the same formula.
What the Formula Captures
Entropy simultaneously measures two things:
| Property | Meaning | Example |
|---|---|---|
| Richness | How many distinct outcomes have non-zero probability | A forest with 20 species vs only 3 |
| Evenness | How equally distributed the probabilities are | Equal populations across 20 species vs 80% from one dominant species |
Entropy is maximised when all outcomes are equally likely (H = log n for n outcomes), and it equals zero when the outcome is certain (one probability = 1, all others = 0). Between these extremes, it smoothly quantifies how "spread out" the distribution is.
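The forest examples from the table can be sketched as follows. The exact splits are assumptions made for illustration: in the dominated forest, one species holds 80% and the other 19 share the remaining 20% equally.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

even_20 = [1 / 20] * 20                   # rich and even
even_3 = [1 / 3] * 3                      # poor but even
dominated_20 = [0.8] + [0.2 / 19] * 19    # rich but uneven (assumed split)
certain = [1.0, 0.0, 0.0]                 # outcome is certain

for name, dist in [("20 species, even", even_20),
                   ("3 species, even", even_3),
                   ("20 species, one at 80%", dominated_20),
                   ("one certain outcome", certain)]:
    print(f"{name:24s} {entropy_bits(dist):.3f} bits")
```

Under these assumed proportions, the dominated 20-species forest scores about 1.57 bits, slightly below the even 3-species forest at about 1.58 bits. Richness alone does not produce high entropy when evenness is low.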
From Telephones to Ecosystems
Shannon's entropy was designed for communication channels, but because it captures the abstract notion of distributional evenness, it found applications far beyond:
Information theory. Shannon publishes "A Mathematical Theory of Communication" at Bell Labs, defining entropy as the fundamental measure of information content.
Statistical mechanics. The connection to Boltzmann–Gibbs entropy in physics is formalised — the same formula describes disorder in physical systems.
Ecology. Robert MacArthur and others begin using Shannon's H to measure biodiversity — how many species are present and how evenly distributed they are in a habitat.
The key takeaway: Wherever you need to measure how evenly something is distributed across categories (species in a habitat, symbols in a message, energy states in a system), Shannon's entropy is the measure singled out by these three axioms, unique up to the choice of unit. Its power lies not in being a clever invention, but in being the only possible answer to a well-posed question.
Reference: Shannon, C.E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.