Bayesian Inference and Graphical Models
Introduction
Consider the following two scenarios.
- You pull a coin out of your change purse and flip it five times. It comes up heads all five times.
- You meet a magician who flips a coin five times and shows you that it came up heads all five times.
In which situation would you be more inclined to be skeptical of the null hypothesis that the coin being flipped is a fair coin?
Solution. We'd be more inclined to be skeptical in the magician scenario, since it isn't unusual for a magician to have a trick coin or card deck. Given a random coin from our change purse, it is extraordinarily unlikely that the coin is actually significantly biased towards heads. Although it's also unlikely for a fair coin to turn up heads five times in a row, that isn't going to be enough evidence to be persuasive.
This example illustrates one substantial shortcoming of the statistical framework—called frequentism—used in our statistics course. Frequentism treats parameters as fixed constants rather than random variables, and as a result it does not allow for the incorporation of information we might have about the parameters beyond the data observed in the random experiment (such as the real-world knowledge that a magician is not so unlikely to have a double-headed coin).
Bayesian statistics is an alternative framework in which we do treat model parameters as random variables. We specify a prior distribution for a model's parameters, and this distribution is meant to represent what we believe about the parameters before we observe the results of the random experiment. Then the results of the experiment serve to update our beliefs, yielding a posterior distribution.
The theorem in probability which specifies how probability distributions update in light of new evidence is called Bayes' theorem.
For example, if your prior assessment of the probability that the magician's coin is double-headed is 5% (and the coin is otherwise fair), then your posterior estimate of that probability after observing five heads would shoot up to
$$\frac{(1)(0.05)}{(1)(0.05) + (1/2)^5 (0.95)} \approx 62.7\%.$$
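Meanwhile, if the prior for double-headedness for the coin in your coin purse is a much smaller value $p$, then the posterior is only $\frac{p}{p + (1/2)^5 (1 - p)}$, which is roughly $32p$ when $p$ is tiny; a prior of 0.01%, for instance, yields a posterior of only about 0.32%.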
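The quantity $\mathbb{P}(\text{data} \mid \text{hypothesis})$, the probability of the observed result under a given hypothesis (here $(1/2)^5$ for a fair coin and $1$ for a double-headed coin), is called the likelihood of the observed result. So we can summarize Bayes' theorem with the mnemonic: posterior is proportional to likelihood times prior.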
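As a rough numerical check of the two-hypothesis update described above, the following sketch applies "posterior is proportional to likelihood times prior" to the choice between a fair coin and a double-headed coin. The function name `posterior_double_headed` and its arguments are illustrative choices rather than anything defined in the text, and the sketch assumes these are the only two possibilities for the coin.

```python
def posterior_double_headed(prior, n_heads):
    """Posterior probability that the coin is double-headed after
    observing n_heads heads in a row.

    Assumes exactly two hypotheses: 'fair' and 'double-headed'.
    `prior` is the prior probability of 'double-headed'.
    """
    likelihood_double = 1.0           # a double-headed coin always shows heads
    likelihood_fair = 0.5 ** n_heads  # a fair coin shows n heads w.p. (1/2)^n
    numerator = likelihood_double * prior
    evidence = numerator + likelihood_fair * (1.0 - prior)
    return numerator / evidence


# Magician scenario: 5% prior, five heads observed.
print(round(posterior_double_headed(0.05, 5), 4))  # 0.6275
```

Trying a much smaller prior in the same function shows how weakly five heads move the posterior when double-headedness was very implausible to begin with.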
Bayes' rule takes an especially simple form when our distributions are supported on two values (for example, "fair" and "double-headed"), but we can apply the same idea to other probability mass functions as well as probability density functions.
Suppose that the heads probability of a coin is $\theta$. Consider a uniform prior distribution for $\theta$, and suppose that $n$ flips of the coin are observed. Express the posterior density in terms of the number of heads and tails in the observed sequence of flips.
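Solution. We calculate the posterior density as likelihood times prior. Let's call $X$ the random sequence of flips, and suppose $x$ is a possible value of $X$ containing $h$ heads and $t$ tails. We get
$$f(\theta \mid x) \propto f(x \mid \theta) f(\theta) = \theta^{h} (1 - \theta)^{t} \cdot 1 = \theta^{h} (1 - \theta)^{t}$$
for each value $\theta \in [0, 1]$ of the heads probability.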
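In this formula we are employing a common abuse of notation by using the same letter ($f$) for three different densities. For example, $f(\theta \mid x)$ refers to the conditional density of the heads probability given the flips; more precisely, writing $\Theta$ for the heads probability viewed as a random variable, it refers to the density of the conditional distribution of the random variable $\Theta$ given the event $\{X = x\}$, evaluated at the value $\theta$. It might be written more conventionally as $f_{\Theta \mid X = x}(\theta)$. Likewise, $f(\theta)$ refers to the density of the marginal distribution of $\Theta$, and so on.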
The continuous distribution on $[0, 1]$ whose density is proportional to $\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$ is called the Beta distribution with parameters $\alpha$ and $\beta$. So the coin flip posterior for a uniform prior is a Beta distribution.
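One way to sanity-check this claim numerically is to normalize likelihood times prior on a fine grid and compare the result with a Beta density. Matching exponents, $h$ heads and $t$ tails under a uniform prior should correspond to Beta parameters $h + 1$ and $t + 1$. The sketch below uses only the Python standard library; the helper names `beta_pdf` and `grid_posterior`, the grid size, and the example counts (7 heads, 3 tails) are all assumptions made for illustration.

```python
import math


def beta_pdf(theta, a, b):
    """Density of the Beta(a, b) distribution at theta in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(
        log_norm + (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
    )


def grid_posterior(h, t, n_grid=10_000):
    """Normalize likelihood * uniform prior on a midpoint grid over (0, 1)."""
    width = 1.0 / n_grid
    thetas = [(i + 0.5) * width for i in range(n_grid)]
    # The uniform prior density is 1, so only the likelihood appears here.
    unnormalized = [th ** h * (1 - th) ** t for th in thetas]
    total = sum(unnormalized) * width  # midpoint-rule estimate of the evidence
    return thetas, [u / total for u in unnormalized]


# Hypothetical data: 7 heads and 3 tails, compared against Beta(8, 4).
thetas, post = grid_posterior(h=7, t=3)
max_gap = max(abs(p - beta_pdf(th, 8, 4)) for th, p in zip(thetas, post))
print(max_gap < 1e-4)
```

The grid approach is not needed once the Beta form is recognized, but it offers an independent check that does not rely on knowing the normalizing constant in advance.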
Show that the coin flip posterior for a Beta prior is also a Beta distribution. How does the evidence alter the parameters of the Beta distribution?
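Solution. If the prior density is proportional to $\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$, then the posterior density is proportional to $\theta^{\alpha - 1 + h} (1 - \theta)^{\beta - 1 + t}$, where $h$ and $t$ are the numbers of heads and tails observed, following the same calculation as above. In other words, each head in the observed sequence increments the $\alpha$ parameter of the distribution, while each tail increments the $\beta$ parameter.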
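The parameter update in this solution is simple enough to express in a few lines. The sketch below is one possible way to write it; the function name `update_beta` and the encoding of flips as a string of `'H'` and `'T'` characters are assumptions made for illustration.

```python
def update_beta(alpha, beta, flips):
    """Return the Beta posterior parameters after observing `flips`.

    `flips` is a string of 'H' (heads) and 'T' (tails) characters.
    Each head adds 1 to alpha, and each tail adds 1 to beta.
    """
    heads = flips.count("H")
    tails = flips.count("T")
    return alpha + heads, beta + tails


# A uniform prior is Beta(1, 1); observe four heads and one tail.
print(update_beta(1, 1, "HHTHH"))  # (5, 2)

# Updating one flip at a time gives the same parameters as a batch update.
a, b = 1, 1
for flip in "HHTHH":
    a, b = update_beta(a, b, flip)
print((a, b))  # (5, 2)
```

Because the update only adds counts, the order of the flips does not matter, and the posterior from one batch of flips can serve as the prior for the next.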
When the posterior distribution has the same parametric form as the prior distribution, this property is called conjugacy. For the example above, we say that the Beta distribution is a conjugate family for the binomial likelihood.