Information Theory

Resources

Basics

Fork in the road example. Without any prior knowledge, i.e. a 50/50 guess whether to go right or left, learning "right" provides 1 bit of information. With $n$ total forks, this makes $m = 2^n$ possible destinations. So to specify one of $m$ equally likely destinations, we need $n = \log_2(m)$ bits of information.
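
A minimal sketch of this relationship (variable names are just for illustration):

```python
import math

n_forks = 10          # number of 50/50 forks taken
m = 2 ** n_forks      # equally likely destinations
bits = math.log2(m)   # bits needed to specify one destination
print(m, bits)        # 1024 destinations, 10.0 bits
```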

For a coin flip with outcome $x_h$ (heads), the associated Shannon information or surprisal is

$$\text{information} = \log_2 \frac{1}{p(x_h)} \text{ bits}.$$

Or equivalently, $\text{information} = -\log_2 p(x_h)$ bits. For instance, with a 50/50 guess, $p(x_h) = 0.5$ and this gives $-\log_2(0.5) = 1$ bit. We can verify that the fork example above is an instance of this: if each of the $m$ destinations is equally probable and we observe one of them, the associated information is

$$\log_2 \frac{1}{1/m} = \log_2 \frac{1}{1/2^n} = \log_2 2^n = n.$$

The name surprisal sometimes makes more sense. Suppose the coin is unfair, such that $p(x_h) = 0.9$ (and we know this). Then the surprisal of $x_h$ is $-\log_2 0.9 = 0.152$ bits. We are less surprised when this is the outcome than if the coin were fair and we observed $x_h$. Conversely, we are more surprised to observe $x_t$, rating this as $-\log_2 0.1 = 3.32$ bits.
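
A quick sketch of these surprisal calculations (the `surprisal` helper name is just for illustration):

```python
import math

def surprisal(p):
    """Shannon information, in bits, of observing an outcome with probability p."""
    return -math.log2(p)

print(surprisal(0.5))   # fair coin, heads: 1.0 bit
print(surprisal(0.9))   # unfair coin, heads: ~0.152 bits (less surprising)
print(surprisal(0.1))   # unfair coin, tails: ~3.32 bits (more surprising)
```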

This definition of information leads directly to the quantities below.

Entropy, Self Information

Entropy is the expected value of the information, denoted $H(X)$, for random variable $X$:

$$\mathbb{E}_{x \sim X}\left[\log_2 \frac{1}{p(x)}\right].$$

For a fair coin, $p(x_h) = p(x_t) = 0.5$, and the log term evaluates to 1 for both outcomes. The average over both is $1 = H(X)$. If $p(x_h) = 0.9$ and $p(x_t) = 0.1$, the log term evaluates to 0.152 and 3.32, which have a weighted average of 0.469 bits $= H(X)$. This amount of information could represent $2^{0.469} = 1.38$ equiprobable outcomes. Note that the entropy / self-information of the unfair coin is less than that of the fair coin.
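
A small sketch of these entropy calculations (the `entropy` helper name is just for illustration):

```python
import math

def entropy(probs):
    """Expected surprisal, in bits, of a discrete distribution."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

H_fair = entropy([0.5, 0.5])
H_unfair = entropy([0.9, 0.1])
print(H_fair, 2 ** H_fair)       # 1.0 bit  -> 2 equiprobable outcomes
print(H_unfair, 2 ** H_unfair)   # ~0.469 bits -> ~1.38 equiprobable outcomes
```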

As another example, consider rolling and summing a pair of six-sided dice. While there are 11 possible sums (2 through 12), some are more likely than others; the entropy works out to about 3.27 bits, equivalent to roughly 9.7 equiprobable outcomes.
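
A sketch that checks this by enumerating the 36 equally likely dice pairs:

```python
import math
from collections import Counter

# Distribution of the sum of two fair six-sided dice.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
probs = [c / 36 for c in counts.values()]

H = sum(p * -math.log2(p) for p in probs)
print(H, 2 ** H)   # ~3.27 bits, equivalent to ~9.7 equiprobable outcomes
```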

A continuous variable provides infinitely many bits of information. This is easily seen by considering that infinitely many decimal places are needed to describe the number exactly, and any continuous distribution consists of uncountably many of these infinite-precision values. Thus, quantifying the information of continuous variables / distributions is done by comparing against something else (e.g. KL-divergence). This also highlights the connection to the discrete case above, which is really a comparison against equiprobable events.

Cross-Entropy

The cross-entropy of two (discrete) probability distributions $p$ and $q$ (over the same sample space $\Sigma$ that $X$ maps from) is

$$H(p, q) = -\sum_x p(x) \log_2 q(x) = \mathbb{E}_p\left[-\log_2 q(x)\right].$$

This is our average surprisal if we believe $x$ has distribution $q$, but it really has distribution $p$. This is not symmetric. The surprise can come from:

  1. Actual stochasticity of $p$, the true underlying distribution,
  2. Believing the wrong model, $q$, rather than $p$ (this contribution is always non-negative).

Thus,

$$H(p, q) \geq H(p).$$

Further, the KL-divergence, $D_{\mathrm{KL}}(p, q)$, measures this second source of surprisal, by subtracting off the surprise inherent in $p$:

$$D_{\mathrm{KL}}(p, q) = H(p, q) - H(p).$$

Often, we parameterize $q$ (e.g. with Generative Modeling, such as VAEs). Minimizing $D_{\mathrm{KL}}(p, q)$ over $q$ gives the same $q$ as minimizing $H(p, q)$, so we often just use the cross-entropy as our loss function, leaving out the constant term (the entropy / self-information of $p$).
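
A short sketch that checks $D_{\mathrm{KL}}(p, q) = H(p, q) - H(p)$ and $H(p, q) \geq H(p)$ numerically (helper names are illustrative):

```python
import math

def entropy(p):
    return sum(pi * -math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average surprisal when the data follow p but we believe/code with q."""
    return sum(pi * -math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)

p = [0.9, 0.1]   # true distribution (unfair coin)
q = [0.5, 0.5]   # believed model (fair coin)
print(entropy(p))            # ~0.469 bits
print(cross_entropy(p, q))   # 1.0 bit, which is >= H(p)
print(kl_divergence(p, q))   # ~0.531 bits of extra surprisal from the wrong model
```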

Maximum Entropy Distributions

Shannon's Source Coding Theorem

Suppose a source has entropy $H$ (bits per symbol) and a channel has capacity $C$ (a rate, bits per second). Then we can encode the source so that we transmit at an average rate of $C/H - \varepsilon$ symbols per second, where $\varepsilon$ is arbitrarily small and positive. It is not possible to transmit at an average rate greater than $C/H$.

The theorem does not specify how to find the encoding, or how complicated it might be.
For instance, consider again the unfair coin ($p(x_h) = 0.9$). We had $H = 0.469$ bits. For a fair coin, we have $H = 1$ bit. An optimal encoding scheme for the fair coin is just a stream of 0s and 1s representing tails and heads respectively. We can do better for the unfair coin. If we were to use this same encoding, we would have a lot of consecutive 1s. So instead of sending those individually, create a message structure like this:

00 message 00

then in the message, either include the consecutive tails directly as zeros, or include the number of consecutive heads written in binary. So instead of 1111111111111111 (sixteen heads), we would encode 00 10000 00 = 001000000, which is many digits fewer. This coding scheme still has some ambiguity that would need to be fixed, but overall we would average about 47 bits to encode the 100-flip (100-bit) message for this unfair coin, using an encoding that is not ambiguous.
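
As a different concrete illustration of the same bound (not the run-length scheme described above), here is a sketch that Huffman-codes blocks of flips, a standard technique, and shows the average cost per flip approaching $H \approx 0.469$ bits as the block length grows:

```python
import heapq
from itertools import product
from math import log2

def huffman_lengths(probs):
    """Return Huffman code lengths for a list of symbol probabilities."""
    # Each heap entry: (probability, tiebreak id, symbol indices in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:   # every merge adds one bit to these symbols' codes
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

p_heads = 0.9
H = -(p_heads * log2(p_heads) + (1 - p_heads) * log2(1 - p_heads))  # ~0.469 bits/flip

for block in (1, 2, 4, 8):
    # Probability of each possible block of `block` flips (1 = heads, 0 = tails)
    probs = [p_heads ** sum(b) * (1 - p_heads) ** (block - sum(b))
             for b in product((0, 1), repeat=block)]
    lengths = huffman_lengths(probs)
    avg_bits_per_flip = sum(p * l for p, l in zip(probs, lengths)) / block
    print(block, round(avg_bits_per_flip, 3), "vs entropy", round(H, 3))
```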

We can define the capacity of a channel in terms of a maximum (more precisely, a supremum) over possible encoding distributions; this is made precise for noisy channels below.

Mutual Information

$$I(X; Y) \equiv H(Y) - H(Y|X) = H(X) - H(X|Y).$$

$Y$ is our output from the (noisy) channel, and $X$ is the input. So, $H(Y)$ is the information of just the output. $H(Y|X)$ is the information of the output $Y$, given that we already have $X$. This is necessarily no more than $H(Y)$, since for $H(Y)$ alone we don't have the hint from $X$ about what to expect. For instance, suppose $Y = X + \eta$, with noise $\eta = 20$ or $\eta = 30$. If $Y$ can take 6 values, $\{120, 130, 220, 230, 320, 330\}$, then knowing $Y = 120$ provides $\log_2 6$ bits of information. However, if we already know $X = 100$, then we are only wondering whether $Y = 120$ or $Y = 130$, so knowing $Y$ provides only $\log_2 2 = 1$ bit of information. We can repeat the same argument for all of these values, and if they are equally probable, the mutual information is $\log_2 6 - \log_2 2$. We can also check that this definition is symmetric: $H(X) = \log_2 3$ (3 options for $X$) and $H(X|Y) = \log_2 1$ (if we know $Y$, then we know $X$ with certainty; only one possible value remains, which is 0 information). Then, $\log_2 3 - \log_2 1 = \log_2 6 - \log_2 2 = \log_2 3$. So we have $I(X;Y) = I(Y;X)$.

Again, the mutual information is how much information knowing $X$ provides for determining $Y$. If the channel were to give $Y = 0 \cdot X + \eta$, then $I(X;Y) = 0$. So, no matter what the channel is, the worst it can do is provide no information: $I(X;Y) \geq 0$.
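
A sketch that computes the mutual information for the $Y = X + \eta$ example directly from the joint distribution (assuming the 6 input/noise combinations are equally probable, as above):

```python
from math import log2
from itertools import product

# X uniform over {100, 200, 300}, noise eta uniform over {20, 30}, Y = X + eta
joint = {}
for x, eta in product((100, 200, 300), (20, 30)):
    joint[(x, x + eta)] = 1 / 6

px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (100, 200, 300)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in set(y for _, y in joint)}

I = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(I, log2(3))   # both ~1.585 bits, i.e. log2(6) - log2(2)
```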

Shannon's Noisy Channel Coding Theorem

Reducing errors/noise (e.g. via error-correcting codes) reduces the information communication rate. It seems that to drive the error rate to near zero, we would also need to make the communication rate near zero, but this is not the case.

If $H(X) \leq C$, there's a coding system such that $H(X|Y)$ is arbitrarily small (i.e. the uncertainty about $X$ once we observe the output $Y$). If $H(X) > C$, we can encode $X$ so that $H(X|Y) < H(X) - C + \varepsilon$, where $\varepsilon$ is arbitrarily small and positive. No encoding does better than $H(X|Y) \geq H(X) - C$.

Now, we can define the capacity of a noisy channel as the max/supremum of $I(X;Y) = H(X) - H(X|Y)$ over possible distributions of $X$.
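
As a sketch of this definition, here is a numerical sweep over input distributions for a binary symmetric channel (an example channel not discussed above): the maximum of $I(X;Y)$ over $p(X)$ is the capacity, and for a crossover probability of 0.1 it comes out to about 0.531 bits.

```python
from math import log2

def mutual_information(p1, flip):
    """I(X;Y) for a binary symmetric channel with P(X=1)=p1 and crossover prob `flip`."""
    px = [1 - p1, p1]
    # Joint distribution p(x, y) under the channel's transition probabilities
    joint = {(x, y): px[x] * (flip if y != x else 1 - flip)
             for x in (0, 1) for y in (0, 1)}
    py = [joint[(0, 0)] + joint[(1, 0)], joint[(0, 1)] + joint[(1, 1)]]
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

flip = 0.1
capacity = max(mutual_information(p1 / 1000, flip) for p1 in range(1, 1000))
print(capacity)   # ~0.531 bits, i.e. 1 - H(0.1); maximized by the uniform input
```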