Yesterday, I watched Peter Donnelly’s TED presentation on statistical mind-benders. Among other things (statistician jokes!), Peter observes that humans don’t have good intuition for some kinds of statistical thinking. In the presentation, Donelly posses a coin toss problem to demonstrate his point.  He chooses one that is easy to get wrong.

Consider a series of fair coin tosses. For example, one possible sequence of coin tosses is THTTHHTHTTH. How many tosses are required to get a particular pattern? How does this depend on the length of the pattern?

Peter poses a concrete question as follows.  Consider the pattern HTH. If we do the experiment of tossing a coin repeatedly and counting the number of tosses, we find that the first occurrence of HTH arises in some average number of coin tosses . For a different pattern, say TTH, we can repeat the experiment and find that the first occurrence of this pattern arises in some average number of coin tosses .

One of the following statements must be true:

(a) (b) (c) Which statement is correct?

My reflex reaction was (a). The heuristic leading to this conclusion is that the probability of getting TTH is the same as the probability of getting HTH in any 3 coin tosses, i.e., .

On the other hand, if the pattern was HHH the probability of getting this pattern on any three coin tosses is the same. But intuitively, I expect to have to make more coin tosses on average to get this pattern.

It is difficult to get to the correct answer with this kind of reasoning.

A little counter intuitively, (b) is the answer. It takes more tosses on average to get the pattern HTH than the pattern TTH. Peter spends some time arguing that this is plausible–watch his presentation to get those arguments. Below, I pursue calculating this for myself…

Remember that the average or expected value of n is calculated by, The probability of any single coin toss sequence is where n is the length of the sequence.

An approach to calculating P(n) is to count the number of permutations of coin-toss sequences of length 0 to n that contain the target pattern only in the last position. Each of these sequences has the length-dependent probability above, so that .

To take into account that some target patterns can overlap themselves, I force the last (l-1) places in the sequence to match the first (l-1) places of the pattern.  The chance of getting the pattern on the next toss is1/2.  So multiply by 1/2 to get the formula above.

The table below shows the function N(n-1) for the two patterns HTH and TTH. It is easy to see that for longer sequences, there are many more possible sequences not containing HTH than HHT. Since the mean is calculated as a sum of N(n-1) multiplied by the probability which only depends on the length of the sequence, HTH is going to have the greater expected n.

You can use a pad a paper to count sequences, but this gets out of hand quickly. I wrote a Python script to help keep things sorted out. This script makes all the calculations shown for the rest of this entry and can be downloaded here.

 Tosses Coin Toss SequencePermutations W/OPattern (TTH) Coin Toss SequencePermutations W/OPattern (HTH) 4 2 2 5 4 3 6 7 5 7 12 9 8 20 16 9 33 28 10 54 49 11 88 86 12 143 151 13 232 265 14 376 465 15 609 816 16 986 1432 17 1596 2513 18 2583 4410 19 4180 7739 20 6764 13581 21 10945 23833 22 17710 41824 23 28656 73396 24 46367 128801 25 75024 226030

It is easy to use these counts and the formula above to calculate the expected value of n for short patterns. Here is an example for a pattern of length 2. This calculation applies to either TH or HT and shows that the expected value of n is 4.00.

 CoinTosses Coin Toss Sequenceswithout Pattern (TH or HT) Average Coin Tosses(successive approximations) 3 2 1.250 4 3 2.000 5 4 2.625 6 5 3.094 7 6 3.422 8 7 3.641 9 8 3.781 10 9 3.869 11 10 3.923 12 11 3.955 13 12 3.974 14 13 3.985 15 14 3.992 16 15 3.995 17 16 3.997 18 17 3.999 19 18 3.999 20 19 4.000

This series converges slowly and the numbre of permutations becomes very large for longer patterns. So, although this brute-force summing is straight forward, it runs up against practical limits surprisingly quickly.

Another approach to calculated the average n for a pattern is to use a Monte Carlo simulation of the coin toss events. The Python script above also performs this calculation. The table below shows the results for patterns up to length 4.

 Pattern (Est.) E Patterns of length 1 T 2.00 H 1.98 Patterns of length 2 TT 5.99 HH 6.00 HT 4.02 TH 4.01 Patterns of length 3 TTT 13.95 HHH 14.07 HTT 8.03 THH 8.02 THT 9.95 HTH 10.04 HHT 8.00 TTH 8.06 Patterns of length 4 TTTT 30.07 HHHH 30.15 HTTT 16.10 THHH 16.07 THTT 17.96 HTHH 17.98 HHTT 15.92 TTHH 16.16 TTHT 18.09 HHTH 18.00 HTHT 20.14 THTH 19.94 THHT 18.12 HTTH 18.20 HHHT 16.13 TTTH 16.0

This table is useful to compare the different results for various patterns of the same length.

Finally, the Monte Carlo method gives an estimate of the distribution of n for the patterns. (The numbers above are based on 20,000 sequences.) The histogram below shows the Distribution of Sequence length for the pattern TTTH.  The average sequence length is estimated to be 16 (from the table above). Because the distribution is asymmetric and has a very long tail to the right, the average doesn’t appear near any interesting features (the peak, for example) of the distribution. 2 Comments leave one →