Information Theory

Resources

Basics

Fork in the road example. Without any prior knowledge, i.e. a 50/50 guess whether to go right or left, learning "right" provides 1 bit of information. With $n$ total forks, this makes $m = 2^n$ possible destinations. So to specify one of $m$ equally likely destinations, we need $n = \log_2(m)$ bits of information.
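
A minimal sketch of this relationship (variable names are just for illustration):

```python
import math

n_forks = 10          # number of 50/50 forks taken
m = 2 ** n_forks      # equally likely destinations
bits = math.log2(m)   # bits needed to specify one destination
print(m, bits)        # 1024 destinations, 10.0 bits
```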

For a coin flip with outcome $x_h$ (heads), the associated Shannon information or surprisal is

$$\text{information} = \log_2 \frac{1}{p(x_h)} \text{ bits}.$$

Or equivalently, $\text{information} = -\log_2 p(x_h)$ bits. For instance, with a 50/50 guess, $p(x_h) = 0.5$ and this gives $-\log_2(0.5) = 1$ bit. We can verify that the fork example above is an instance of this: if each of the $m$ destinations is equally probable and we observe one of them, the associated information is

$$\log_2 \frac{1}{1/m} = \log_2 \frac{1}{1/2^n} = \log_2 2^n = n.$$

The name surprisal sometimes makes more sense. Suppose the coin is unfair, such that $p(x_h) = 0.9$ (and we know this). Then the surprisal of $x_h$ is $-\log_2 0.9 = 0.152$ bits. We are less surprised when this is the outcome than if the coin were fair and we observed $x_h$. Conversely, we are more surprised to observe $x_t$, rating this as $-\log_2 0.1 = 3.32$ bits.
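
A quick sketch of these surprisal calculations (the `surprisal` helper name is just for illustration):

```python
import math

def surprisal(p):
    """Shannon information, in bits, of observing an outcome with probability p."""
    return -math.log2(p)

print(surprisal(0.5))   # fair coin, heads: 1.0 bit
print(surprisal(0.9))   # unfair coin, heads: ~0.152 bits (less surprising)
print(surprisal(0.1))   # unfair coin, tails: ~3.32 bits (more surprising)
```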

This definition of information leads directly to the quantities below.

Entropy, Self Information

Entropy is the expected value of the information, denoted $H(X)$, for random variable $X$:

$$\mathbb{E}_{x \sim X}\left[\log_2 \frac{1}{p(x)}\right].$$

For a fair coin, $p(x_h) = p(x_t) = 0.5$, and the log term evaluates to 1 for both outcomes. The average over both is $1 = H(X)$. If $p(x_h) = 0.9$ and $p(x_t) = 0.1$, the log term evaluates to 0.152 and 3.32, which have a weighted average of 0.469 bits $= H(X)$. This amount of information could represent $2^{0.469} = 1.38$ equiprobable outcomes. Note that the entropy / self-information of the unfair coin is less than that of the fair coin.
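
A small sketch of these entropy calculations (the `entropy` helper name is just for illustration):

```python
import math

def entropy(probs):
    """Expected surprisal, in bits, of a discrete distribution."""
    return sum(p * -math.log2(p) for p in probs if p > 0)

H_fair = entropy([0.5, 0.5])
H_unfair = entropy([0.9, 0.1])
print(H_fair, 2 ** H_fair)       # 1.0 bit  -> 2 equiprobable outcomes
print(H_unfair, 2 ** H_unfair)   # ~0.469 bits -> ~1.38 equiprobable outcomes
```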

As another example, consider rolling and summing a pair of six-sided dice. While there are 11 possible sums (2 through 12), some are more likely than others; the entropy works out to about 3.27 bits, equivalent to roughly 9.7 equiprobable outcomes.
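
A sketch that checks this by enumerating the 36 equally likely dice pairs:

```python
import math
from collections import Counter

# Distribution of the sum of two fair six-sided dice.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
probs = [c / 36 for c in counts.values()]

H = sum(p * -math.log2(p) for p in probs)
print(H, 2 ** H)   # ~3.27 bits, equivalent to ~9.7 equiprobable outcomes
```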

A continuous variable provides infinitely many bits of information. This is easily seen by considering that infinitely many decimal places are needed to describe the number exactly, and any continuous distribution consists of uncountably many of these infinite-precision values. Thus, quantifying the information of continuous variables / distributions is done by comparing against something else (e.g. KL-divergence). This also highlights the connection to the discrete case above, which is really a comparison against equiprobable events.

Cross-Entropy

The cross-entropy of two (discrete) probability distributions $p$ and $q$ (over the same sample space $\Sigma$ that $X$ maps from) is

$$H(p, q) = -\sum_x p(x) \log_2 q(x) = \mathbb{E}_p\left[-\log_2 q(x)\right].$$

This is our average surprisal if we believe $x$ has distribution $q$, but it really has distribution $p$. This is not symmetric. The surprise can come from:

  1. Actual stochasticity of $p$, the true underlying distribution,
  2. Believing the wrong model, $q$, rather than $p$ (this contribution is always non-negative).

Thus,

$$H(p, q) \geq H(p).$$

Further, the KL-divergence, $D_{\mathrm{KL}}(p, q)$, measures this second source of surprisal, by subtracting off the surprise inherent in $p$:

$$D_{\mathrm{KL}}(p, q) = H(p, q) - H(p).$$

Often, we parameterize $q$ (e.g. with Generative Modeling, such as VAEs). Minimizing $D_{\mathrm{KL}}(p, q)$ over $q$ gives the same $q$ as minimizing $H(p, q)$, so we often just use the cross-entropy as our loss function, leaving out the constant term (the entropy / self-information of $p$).
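
A short sketch that checks $D_{\mathrm{KL}}(p, q) = H(p, q) - H(p)$ and $H(p, q) \geq H(p)$ numerically (helper names are illustrative):

```python
import math

def entropy(p):
    return sum(pi * -math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Average surprisal when the data follow p but we believe/code with q."""
    return sum(pi * -math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)

p = [0.9, 0.1]   # true distribution (unfair coin)
q = [0.5, 0.5]   # believed model (fair coin)
print(entropy(p))            # ~0.469 bits
print(cross_entropy(p, q))   # 1.0 bit, which is >= H(p)
print(kl_divergence(p, q))   # ~0.531 bits of extra surprisal from the wrong model
```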

Maximum Entropy Distributions

Shannon's Source Coding Theorem

Suppose a source has entropy $H$ (bits per symbol) and a channel has capacity $C$ (a rate, bits per second). Then we can encode the source so that we transmit at an average rate of $C/H - \varepsilon$ symbols per second, where $\varepsilon$ is arbitrarily small and positive. It is not possible to transmit at an average rate greater than $C/H$.

The theorem does not specify how to find the encoding, or how complicated it might be.
For instance, consider again the unfair coin ($p(x_h) = 0.9$). We had $H = 0.469$ bits. For a fair coin, we have $H = 1$ bit. An optimal encoding scheme for the fair coin is just a stream of 0s and 1s representing tails and heads respectively. We can do better for the unfair coin. If we were to use this same encoding, we would have a lot of consecutive 1s. So instead of sending those individually, create a message structure like this:

00 message 00

then in the message, either include the consecutive tails directly as zeros, or include the number of consecutive heads written in binary. So instead of 1111111111111111 (sixteen heads), we would encode 00 10000 00 = 001000000, which is many digits fewer. This coding scheme still has some ambiguity that would need to be fixed, but overall we would average about 47 bits to encode the 100-flip (100-bit) message for this unfair coin, using an encoding that is not ambiguous.
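
As a different concrete illustration of the same bound (not the run-length scheme described above), here is a sketch that Huffman-codes blocks of flips, a standard technique, and shows the average cost per flip approaching $H \approx 0.469$ bits as the block length grows:

```python
import heapq
from itertools import product
from math import log2

def huffman_lengths(probs):
    """Return Huffman code lengths for a list of symbol probabilities."""
    # Each heap entry: (probability, tiebreak id, symbol indices in the subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:   # every merge adds one bit to these symbols' codes
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

p_heads = 0.9
H = -(p_heads * log2(p_heads) + (1 - p_heads) * log2(1 - p_heads))  # ~0.469 bits/flip

for block in (1, 2, 4, 8):
    # Probability of each possible block of `block` flips (1 = heads, 0 = tails)
    probs = [p_heads ** sum(b) * (1 - p_heads) ** (block - sum(b))
             for b in product((0, 1), repeat=block)]
    lengths = huffman_lengths(probs)
    avg_bits_per_flip = sum(p * l for p, l in zip(probs, lengths)) / block
    print(block, round(avg_bits_per_flip, 3), "vs entropy", round(H, 3))
```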

We can define the capacity of a channel in terms of a maximum (more precisely, a supremum) over possible encoding distributions; this is made precise for noisy channels below.

Mutual Information

$$I(X; Y) \equiv H(Y) - H(Y|X) = H(X) - H(X|Y).$$

$Y$ is our output from the (noisy) channel, and $X$ is the input. So, $H(Y)$ is the information of just the output. $H(Y|X)$ is the information of the output $Y$, given that we already have $X$. This is necessarily no more than $H(Y)$, since for $H(Y)$ alone we don't have the hint from $X$ about what to expect. For instance, suppose $Y = X + \eta$, with noise $\eta = 20$ or $\eta = 30$. If $Y$ can take 6 values, $\{120, 130, 220, 230, 320, 330\}$, then knowing $Y = 120$ provides $\log_2 6$ bits of information. However, if we already know $X = 100$, then we are only wondering whether $Y = 120$ or $Y = 130$, so knowing $Y$ provides only $\log_2 2 = 1$ bit of information. We can repeat the same argument for all of these values, and if they are equally probable, the mutual information is $\log_2 6 - \log_2 2$. We can also check that this definition is symmetric: $H(X) = \log_2 3$ (3 options for $X$) and $H(X|Y) = \log_2 1$ (if we know $Y$, then we know $X$ with certainty; only one possible value remains, which is 0 information). Then, $\log_2 3 - \log_2 1 = \log_2 6 - \log_2 2 = \log_2 3$. So we have $I(X;Y) = I(Y;X)$.

Again, the mutual information is how much information knowing $X$ provides for determining $Y$. If the channel were to give $Y = 0 \cdot X + \eta$, then $I(X;Y) = 0$. So, no matter what the channel is, the worst it can do is provide no information: $I(X;Y) \geq 0$.
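
A sketch that computes the mutual information for the $Y = X + \eta$ example directly from the joint distribution (assuming the 6 input/noise combinations are equally probable, as above):

```python
from math import log2
from itertools import product

# X uniform over {100, 200, 300}, noise eta uniform over {20, 30}, Y = X + eta
joint = {}
for x, eta in product((100, 200, 300), (20, 30)):
    joint[(x, x + eta)] = 1 / 6

px = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (100, 200, 300)}
py = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in set(y for _, y in joint)}

I = sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
print(I, log2(3))   # both ~1.585 bits, i.e. log2(6) - log2(2)
```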

Shannon's Noisy Channel Coding Theorem

Reducing errors/noise (e.g. via error-correcting codes) reduces the information communication rate. It seems that to drive the error rate to near zero, we would also need to make the communication rate near zero, but this is not the case.

If $H(X) \leq C$, there's a coding system such that $H(X|Y)$ is arbitrarily small (i.e. the uncertainty about $X$ once we observe the output $Y$). If $H(X) > C$, we can encode $X$ so that $H(X|Y) < H(X) - C + \varepsilon$, where $\varepsilon$ is arbitrarily small and positive. No encoding does better than $H(X|Y) \geq H(X) - C$.

Now, we can define the capacity of a noisy channel as the max/supremum of $I(X;Y) = H(X) - H(X|Y)$ over possible distributions of $X$.
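
As a sketch of this definition, here is a numerical sweep over input distributions for a binary symmetric channel (an example channel not discussed above): the maximum of $I(X;Y)$ over $p(X)$ is the capacity, and for a crossover probability of 0.1 it comes out to about 0.531 bits.

```python
from math import log2

def mutual_information(p1, flip):
    """I(X;Y) for a binary symmetric channel with P(X=1)=p1 and crossover prob `flip`."""
    px = [1 - p1, p1]
    # Joint distribution p(x, y) under the channel's transition probabilities
    joint = {(x, y): px[x] * (flip if y != x else 1 - flip)
             for x in (0, 1) for y in (0, 1)}
    py = [joint[(0, 0)] + joint[(1, 0)], joint[(0, 1)] + joint[(1, 1)]]
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

flip = 0.1
capacity = max(mutual_information(p1 / 1000, flip) for p1 in range(1, 1000))
print(capacity)   # ~0.531 bits, i.e. 1 - H(0.1); maximized by the uniform input
```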