Fork in the road example. Without any prior knowledge, i.e. a 50/50 guess whether to go right or left, the answer "right" provides 1 bit of information. With $n$ total forks, this makes $2^n$ possible destinations. So to express a whole number $N$, we need $\log_2 N$ bits of information.
For a coin-flip with outcome $x$, the associated Shannon information or surprisal is
$$I(x) = -\log_2 p(x)$$
Or equivalently, $I(x) = \log_2 \frac{1}{p(x)}$. For instance, with a 50/50 guess, $p(x) = \frac{1}{2}$, and this gives $I(x) = 1$ bit. We can verify that the fork example above is an instance of this: if each of $N$ numbers is equally probable and we observe one number, then the associated information is $I = -\log_2 \frac{1}{N} = \log_2 N$ bits.
The naming of surprisal sometimes makes more sense. Suppose the coin is unfair, such that, say, $p(H) = 0.9$ (and we know this). Then, the surprisal of $H$ is $-\log_2 0.9 \approx 0.15$ bits. We are less surprised when this is the outcome than if the coin were fair and we observed $H$. Conversely, we are more surprised to observe $T$, rating this as $-\log_2 0.1 \approx 3.32$ bits.
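A quick numeric check of these surprisal values (a minimal sketch in Python; the probabilities are just the ones used above):

```python
import math

def surprisal(p):
    """Shannon information (in bits) of observing an event with probability p."""
    return -math.log2(p)

print(surprisal(0.5))    # fair coin flip: 1.0 bit
print(surprisal(0.9))    # heads on the unfair coin: ~0.15 bits (not very surprising)
print(surprisal(0.1))    # tails on the unfair coin: ~3.32 bits (quite surprising)
print(surprisal(1 / 8))  # one of N = 8 equiprobable numbers: log2(8) = 3 bits
```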
The definition of information means:
A certain event ($p = 1$) has no information. "The sun will rise tomorrow" ($p = 1$) gives no information, and this is reflected in the definition: $-\log_2 1 = 0$.
Repeated (independent) events give additive surprisal/information. Two heads in a row is twice as surprising as one head: $-\log_2\left(\frac{1}{2} \cdot \frac{1}{2}\right) = 2 = 2 \times \left(-\log_2 \frac{1}{2}\right)$.
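Both properties are one-liners to verify (sketch):

```python
import math

# A certain event (p = 1) carries no information.
assert -math.log2(1.0) == 0.0

# Independent events add: the surprisal of HH is twice the surprisal of H.
p_h = 0.5
assert math.isclose(-math.log2(p_h * p_h), 2 * -math.log2(p_h))
```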
Entropy, Self Information
Entropy is the expected value of the information, denoted $H(X)$, for a random variable $X$:
$$H(X) = \mathbb{E}[I(X)] = -\sum_x p(x) \log_2 p(x)$$
For a fair coin, $p(H) = p(T) = \frac{1}{2}$, and $-\log_2 p$ evaluates to 1 bit for both outcomes. The average over both is $H = 1$ bit. If $p(H) = 0.9$ and $p(T) = 0.1$, $-\log_2 p$ evaluates to $\approx 0.152$ and $\approx 3.32$ bits, which have a weighted average of $0.9 \times 0.152 + 0.1 \times 3.32 \approx 0.469$ bits $= H$. This amount of information could represent $2^{0.469} \approx 1.4$ equiprobable numbers. Note that the entropy / self-information of the unfair coin is less than that of the fair coin.
As another example, consider rolling and summing a pair of six-sided dice. While there are 11 outcomes (2 through 12), some are more likely than others; the entropy is about 3.27 bits, leading to an equivalent 9.65 equiprobable outcomes having the same entropy (average information).
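Both coin entropies and the dice-sum entropy can be checked by direct computation (sketch; the dice distribution is built by enumerating all 36 rolls):

```python
import math
from collections import Counter

def entropy(probs):
    """Shannon entropy in bits: H = -sum p log2 p."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))       # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))       # unfair coin: ~0.469 bits
print(2 ** entropy([0.9, 0.1]))  # equivalent number of equiprobable outcomes: ~1.4

# Sum of two six-sided dice: 11 outcomes (2..12), not equiprobable.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
h = entropy([c / 36 for c in counts.values()])
print(h)       # ~3.27 bits
print(2 ** h)  # roughly the ~9.65 equivalent equiprobable outcomes quoted above
```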
A continuous variable provides infinitely many bits of information. This is easily seen by considering that infinitely many decimal places would be needed to describe that number exactly. And any continuous distribution consists of uncountably many of these infinite-precision values. Thus, quantifying the information of continuous variables / distributions is done by comparing against something else (e.g. KL-divergence). We can also see the connection to the discrete case above, where $\log_2 N$ is really a comparison against $N$ equiprobable events.
Cross-Entropy
The cross-entropy of two (discrete) probability distributions $p$ and $q$ (over the same space $\mathcal{X}$, i.e. both map from $\mathcal{X}$ to $[0, 1]$) is
$$H(p, q) = -\sum_x p(x) \log_2 q(x)$$
This is our average surprisal if we believe $X$ has distribution $q$, but it really has distribution $p$. This is not symmetric. The surprise can come from:
Actual stochasticity of $p$, the true underlying distribution,
Believing the wrong model, $q$, rather than $p$ (this extra term is always nonnegative, and strictly positive when $q \neq p$).
Thus, $H(p, q) \ge H(p)$.
Further, the KL-divergence, $D_{\mathrm{KL}}(p \| q)$, measures this second source of surprisal, by subtracting off the surprise inherent in $p$:
$$D_{\mathrm{KL}}(p \| q) = H(p, q) - H(p) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}$$
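A small numeric illustration of the asymmetry and of $D_{\mathrm{KL}} = H(p, q) - H(p)$ (sketch; `p` and `q` are arbitrary example distributions over the same 3 outcomes):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): average surprisal under model q when data follow p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

def entropy(p):
    return cross_entropy(p, p)

def kl_divergence(p, q):
    return cross_entropy(p, q) - entropy(p)

p = [0.7, 0.2, 0.1]  # true distribution (example values)
q = [0.4, 0.4, 0.2]  # believed / model distribution (example values)

print(entropy(p))           # surprise inherent in p
print(cross_entropy(p, q))  # >= H(p): extra surprise from using the wrong model
print(kl_divergence(p, q))  # the gap between the two
print(kl_divergence(q, p))  # not the same: the divergence is asymmetric
```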
Often, we parameterize $q = q_\theta$ (e.g. in Generative Modeling, such as VAEs). Minimizing $D_{\mathrm{KL}}(p \| q_\theta)$ over $\theta$ gives the same $\theta$ as minimizing $H(p, q_\theta)$, so we often just use the cross-entropy as our loss function, leaving out the constant term $H(p)$ (the entropy / self-information of $p$).
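To see the equivalence, here is a crude grid search over a Bernoulli model $q_\theta$ (this parameterization is just an illustration, not the VAE setting):

```python
import math

p = [0.7, 0.3]  # "true" Bernoulli distribution (example value)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

thetas = [i / 100 for i in range(1, 100)]           # candidate models q_theta = Bernoulli(theta)
ce = [cross_entropy(p, [t, 1 - t]) for t in thetas]
kl = [c - cross_entropy(p, p) for c in ce]          # D_KL = H(p, q) - H(p); H(p) is constant

# Both objectives pick the same theta (here 0.7, i.e. q = p).
print(thetas[ce.index(min(ce))], thetas[kl.index(min(kl))])
```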
Maximum Entropy Distributions
Gaussian: maximum entropy for a given mean $\mu$ and variance $\sigma^2$, achieved when $X$ is normal (a numeric comparison follows this list). I.e. we want to encode our signal as normally distributed for maximal entropy/information received.
Exponential: support bounded on one side (e.g. $x \ge 0$), with a fixed mean
Uniform: support bounded on two sides (a fixed interval $[a, b]$)
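One way to make the Gaussian claim concrete is to compare differential entropies of a few distributions sharing the same variance, using their standard closed-form expressions (a sketch; comparing against Laplace and uniform is my choice of examples):

```python
import math

sigma = 1.0  # common standard deviation

# Differential entropies (in bits) of distributions with variance sigma^2.
h_normal  = 0.5 * math.log2(2 * math.pi * math.e * sigma**2)
h_laplace = math.log2(2 * math.e * (sigma / math.sqrt(2)))  # Laplace with scale b = sigma / sqrt(2)
h_uniform = math.log2(math.sqrt(12) * sigma)                # uniform with width b - a = sigma * sqrt(12)

print(h_normal, h_laplace, h_uniform)  # the Gaussian wins: ~2.05 > ~1.94 > ~1.79
```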
Not sure how to think of information for continuous variables? I get the idea that it should be relative.
Shannon's Source Coding Theorem
The source has entropy $H$ (bits per symbol), and the channel has capacity $C$ (a rate, bits per second). Then we can encode the source so that we transmit at an average rate of $\frac{C}{H} - \varepsilon$ symbols per second, where $\varepsilon$ is arbitrary and positive. No information can be transmitted faster than $\frac{C}{H}$ symbols per second, on average.
It does not specify how to find the encoding, or how complicated it might be.
For instance, consider again the unfair coin ($p(H) = 0.9$). We had $H \approx 0.469$ bits. For a fair coin, we have $H = 1$ bit. An optimal encoding scheme for the fair coin is just a stream of $0$s and $1$s, representing tails and heads respectively. We could do better for the unfair coin. If we were to use this same encoding, we would have a lot of consecutive $0$s. So instead of sending those individually, create a message structure like this:
then in the message body, either include the number of consecutive $0$s (a run of tails) as a count, or include the run of heads written as binary. So instead of sending every flip, we would encode the mostly-$0$ stream in many fewer digits. This coding scheme still has some ambiguity, so that would need to be fixed, but overall, once fixed, such a scheme averages fewer bits per flip than the raw stream for this unfair coin (and is unambiguous), approaching the $\approx 0.469$ bits per flip given by the entropy but never beating it on average.
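The exact message format above is left loose, but the claim (beating 1 bit per flip, while never beating $H \approx 0.469$ on average) can be checked with a different standard approach: Huffman coding over blocks of 4 flips. This is a sketch of that alternative, not the run-length scheme described above:

```python
import heapq, itertools, math

p_heads = 0.9  # the unfair coin from above

def huffman_lengths(probs):
    """Code length (in bits) assigned to each symbol by a Huffman code."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:  # each merge pushes these symbols one level deeper
            lengths[sym] += 1
        counter += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
    return lengths

# Probability of each block of 4 flips, e.g. ('H', 'H', 'T', 'H').
block_probs = {}
for block in itertools.product("HT", repeat=4):
    prob = 1.0
    for flip in block:
        prob *= p_heads if flip == "H" else 1 - p_heads
    block_probs[block] = prob

lengths = huffman_lengths(block_probs)
bits_per_flip = sum(block_probs[b] * lengths[b] for b in block_probs) / 4
entropy = -(p_heads * math.log2(p_heads) + (1 - p_heads) * math.log2(1 - p_heads))
print(bits_per_flip)  # well below 1 bit per flip
print(entropy)        # ~0.469: the limit no scheme can beat on average
```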
We can define the capacity of a channel in terms of a max (or supremum?) over possible encoding distributions.
Mutual Information
$Y$ is our output from the (noisy) channel, and $X$ is the input. So, $H(Y)$ is the information of just the output. $H(Y|X)$ is the information of the output $Y$, given that we already have $X$. This is necessarily less than (or equal to) $H(Y)$, since for $H(Y)$ alone we don't have the hint from $X$ about what to expect. For instance, suppose $Y = X + Z$, where $X$ takes one of 3 equally likely values (say $0$, $2$, $4$) and $Z = 0$ or $1$ with equal probability. Then $Y$ can have 6 values, so knowing $Y$ provides $\log_2 6 \approx 2.58$ bits of information. However, if we already know $X$, then we are only wondering whether $Z = 0$ or $Z = 1$. Thus, knowing $Y$ provides only $H(Y|X) = 1$ extra bit of information. With all of these values equally probable, the mutual information is given by $I(X;Y) = H(Y) - H(Y|X) = \log_2 6 - 1 = \log_2 3 \approx 1.58$ bits.

We can also check that this definition is symmetric: $H(X) = \log_2 3$ (3 options for $X$), and $H(X|Y) = 0$ (if we know $Y$, then we know $X$ certainly; only one possible value, so it is 1 option, which is $0$ information). Then, $H(X) - H(X|Y) = \log_2 3 - 0 \approx 1.58$ bits. So we have $I(X;Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$.
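The example can be checked by brute-force enumeration (a sketch, using the illustrative values $X \in \{0, 2, 4\}$ and $Z \in \{0, 1\}$ from above):

```python
import math
from collections import Counter
from itertools import product

# X takes 3 equally likely values, Z is an independent fair bit, Y = X + Z.
xs, zs = [0, 2, 4], [0, 1]
joint = Counter()
for x, z in product(xs, zs):
    joint[(x, x + z)] += 1 / (len(xs) * len(zs))

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p_x, p_y = Counter(), Counter()
for (x, y), p in joint.items():
    p_x[x] += p
    p_y[y] += p

H_X, H_Y, H_XY = H(p_x.values()), H(p_y.values()), H(joint.values())
H_Y_given_X = H_XY - H_X  # chain rule: H(X, Y) = H(X) + H(Y|X)
H_X_given_Y = H_XY - H_Y

print(H_Y, H_Y_given_X, H_Y - H_Y_given_X)  # ~2.58, 1.0, ~1.58
print(H_X, H_X_given_Y, H_X - H_X_given_Y)  # ~1.58, 0.0, ~1.58 (same: symmetry)
```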
Again, the mutual information is how much information knowing $X$ provides for determining $Y$ (and vice versa). If the channel were to output $Y$ independent of $X$ (pure noise), then $H(Y|X) = H(Y)$ and $I(X;Y) = 0$. So, no matter what the channel is, the worst it can do is provide no information.
Shannon's Noisy Channel Coding Theorem
Reducing errors/noise (e.g. via error-correcting codes) reduces the information communication rate. To push the error rate toward zero, it seems that we would need to push the communication rate toward zero as well, but this is not the case.
If $H \le C$, there's a coding system such that the equivocation $H(X|Y)$ is arbitrarily small (i.e. the uncertainty about the input $X$ once we observe the output $Y$ is nearly zero). If $H > C$, we can encode so that $H(X|Y) < H - C + \varepsilon$, where $\varepsilon$ is arbitrary and positive. Thus, in either case the equivocation cannot be made smaller than $H - C$, so the effective rate $H - H(X|Y)$ cannot exceed $C$.
Now, we can define the capacity of a noisy channel as the max/supremum of $I(X;Y)$ over possible distributions of the input $X$.
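As a concrete example of this definition (the binary symmetric channel here is my choice, not something from the theorem), brute-forcing $I(X;Y)$ over input distributions recovers the known capacity $1 - H_b(f)$ of a BSC with flip probability $f$, attained by a uniform input:

```python
import math

def h_bin(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information_bsc(p, f):
    """I(X;Y) for input X ~ Bernoulli(p) sent through a BSC with flip probability f."""
    p_y1 = p * (1 - f) + (1 - p) * f  # P(Y = 1)
    return h_bin(p_y1) - h_bin(f)     # I = H(Y) - H(Y|X), and H(Y|X) = H_b(f) for a BSC

f = 0.1
capacity = max(mutual_information_bsc(p / 1000, f) for p in range(1001))
print(capacity)      # ~0.531 bits per channel use, attained at p = 0.5
print(1 - h_bin(f))  # closed form 1 - H_b(f): the same value
```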