Suppose we have a coin that has a 3/4 chance of landing on heads (call this 0) and a 1/4 chance of landing on tails (1). Which of the 16-toss sequences below is most likely?
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0
- 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
We came across this question in an Information Theory lecture last week, and many of us thought the second sequence was most likely. It couldn’t be the third sequence: the coin is more likely to land on heads than on tails, so the first sequence must be more likely than the third. Comparing the first and second sequences, we thought the second was more likely because it had 3/4 heads and 1/4 tails, which is what you’d expect from a coin with a 3/4 chance of landing on heads and a 1/4 chance of landing on tails.
We were wrong.
Mathematical Explanation
If you work out the probability of seeing each of these sequences, we have:
- Sequence 1 = $\left(\tfrac{3}{4}\right)^{16} \approx 0.0100$
- Sequence 2 = $\left(\tfrac{3}{4}\right)^{12}\left(\tfrac{1}{4}\right)^{4} \approx 0.000124$
- Sequence 3 = $\left(\tfrac{1}{4}\right)^{16} \approx 2.33 \times 10^{-10}$

So the first sequence is $3^4 = 81$ times more likely than the second one! How can this be?
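These numbers are easy to verify. Here is a quick sketch in Python using exact fractions (the variable names are mine):

```python
from fractions import Fraction

p_heads = Fraction(3, 4)  # probability of heads (0)
p_tails = Fraction(1, 4)  # probability of tails (1)

seq1 = p_heads ** 16                  # sixteen heads
seq2 = p_heads ** 12 * p_tails ** 4   # twelve heads, four tails
seq3 = p_tails ** 16                  # sixteen tails

print(float(seq1))   # → ~0.0100
print(float(seq2))   # → ~0.000124
print(float(seq3))   # → ~2.33e-10
print(seq1 / seq2)   # → 81 (exactly)
```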
Intuitive Explanation
It’s a specific sequence, not the entire typical set
(Think of the typical set as the set of outcomes where the fraction of heads is similar to the probability of getting heads.)
The mistake in our intuition was considering the probability of all sequences with twelve heads and four tails, as opposed to the specific sequence of twelve heads and four tails. The probability of getting any sequence with twelve heads and four tails is much larger:

$\binom{16}{4} \left(\tfrac{3}{4}\right)^{12} \left(\tfrac{1}{4}\right)^{4} \approx 0.225$

This is about $22.5$ times more likely than seeing the first sequence of all heads!
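There are $\binom{16}{4} = 1820$ such sequences, and each has the same probability, so the count just multiplies. A quick sketch:

```python
from fractions import Fraction
from math import comb

p_heads = Fraction(3, 4)
one_sequence = p_heads ** 12 * (1 - p_heads) ** 4  # one specific 12-heads/4-tails ordering
any_such = comb(16, 4) * one_sequence              # all 1820 orderings together

print(float(any_such))                  # → ~0.225
print(float(any_such / p_heads ** 16))  # → ~22.5, relative to the all-heads sequence
```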
The tosses are independent
Here’s a second way of thinking about it. Look again at our two sequences:
- 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
- 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 0
Each toss is independent, i.e. the outcome of toss #3 is not affected by whether toss #2 (or any other toss) was heads or tails. So at toss #4, where we get tails in the second sequence but continue to get heads in the first, we’re actually three times as likely to get heads as tails (since $3/4 = 3 \times 1/4$). There are four tails in the second sequence, so the first sequence is $3^4 = 81$ times as likely as the second.
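The toss-by-toss argument can be checked directly: each tail makes the second sequence a factor of three less likely, and independence lets us multiply those factors together.

```python
# Each tail in sequence 2 replaces a heads outcome (prob 3/4)
# with a tails outcome (prob 1/4): a factor of 3 per tail.
ratio = 1
for _ in range(4):  # the four tail positions in sequence 2
    ratio *= 3
print(ratio)  # → 81
```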
The remarkable thing about typical sets
What we considered just now – the set of all 16-toss sequences with 12 heads and 4 tails – is an example of a typical set. It’s ‘typical’ because the proportion of heads is approximately equal to the probability of getting a head, as you’d expect.
The remarkable thing is that even though the number of elements in the typical set is much smaller than the total number of possible sequences, the probability of a generated sequence being in the typical set is high.
(If that seems confusing, think of it like this: there are fifty people in your class, and each lesson you are randomly allocated a partner (perhaps not with equal probability), where your partner this lesson does not depend at all on who you were partnered with in previous lessons. Somehow, out of fifty lessons, you get paired with person A eleven times. Weird, huh? There is likely something interesting about the way students are paired.)
In this example, there are $2^{16} = 65{,}536$ possible sequences. There are $\binom{16}{4} = 1820$ sequences in our typical set. That means only around 2.8% of all sequences are in our typical set. But the probability that a generated sequence belongs to our typical set is a whopping 22.5%!
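Those two percentages can be checked in a few lines (a sketch; variable names are mine):

```python
from math import comb

total = 2 ** 16            # all possible 16-toss sequences
typical = comb(16, 4)      # sequences with exactly four tails
print(typical / total)     # → ~0.0278, about 2.8% of all sequences

# ...yet those sequences together carry ~22.5% of the probability:
mass = typical * (3 / 4) ** 12 * (1 / 4) ** 4
print(mass)                # → ~0.225
```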
This has implications for how we encode information and can help us compress data more effectively. (More on that in a later post.)
A bit more technical: Typical sets and entropy (recognise that from neural nets?)
Disclaimer: I’ve been discussing ‘the typical set’, but what I actually mean is the typical set with a particular $\epsilon$ and sequence length $n$.

A typical set is actually a set of sequences of length $n$ whose probability is concentrated around $2^{-nH(X)}$, where $H(X)$ is the entropy of the probability distribution.
- The entropy of a random variable is the amount of uncertainty in the outcome of that random variable.
- Related: categorical cross-entropy is the loss function often used for neural network classification tasks!

$\epsilon$ is related to the amount of deviation in probability from $2^{-nH(X)}$ that we allow the set of sequences to have: a sequence $x^n$ belongs to the typical set $A_\epsilon^{(n)}$ if $2^{-n(H(X)+\epsilon)} \leq p(x^n) \leq 2^{-n(H(X)-\epsilon)}$. Thus there are many typical sets we can consider, one for each choice of $\epsilon$ and $n$.
There are some beautiful results about typical sets. E.g. as the sequence length $n$ gets longer, the probability that a generated sequence is in the typical set gets closer to 1. We can also put upper and lower bounds on the number of elements in the typical set in general.
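That first result can be illustrated numerically for our coin. Below is a minimal sketch (my own, not from the lecture) using the standard membership rule $2^{-n(H+\epsilon)} \le p(x^n) \le 2^{-n(H-\epsilon)}$ with $\epsilon = 0.1$; the function name `typical_mass` is a made-up helper:

```python
from math import comb, log2

p = 0.25  # probability of tails (a 1)
H = -(p * log2(p) + (1 - p) * log2(1 - p))  # entropy, ~0.811 bits per toss

def typical_mass(n, eps):
    """Probability that a length-n sequence of tosses lands in the
    eps-typical set: sequences x with 2^(-n(H+eps)) <= p(x) <= 2^(-n(H-eps))."""
    mass = 0.0
    for k in range(n + 1):  # k = number of tails in the sequence
        # Every sequence with k tails has the same probability,
        # so check membership once per k.
        per_symbol = -((n - k) * log2(1 - p) + k * log2(p)) / n
        if H - eps <= per_symbol <= H + eps:
            mass += comb(n, k) * (1 - p) ** (n - k) * p ** k
    return mass

for n in (16, 64, 256, 1024):
    print(n, typical_mass(n, 0.1))  # the mass creeps towards 1
```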
In summary
Each individual sequence has low probability, but the sequences whose fraction of 1s is close to the probability of a 1 have, as a set, high probability, as you’d expect.
The remarkable thing is that even though the number of elements in the typical set is much smaller than the total number of possible sequences, the probability of a generated sequence being in the typical set is high.
Credits to Dr. Ramji Venkataramanan for using this example in a 3F7 Information Theory lecture.