Bayes' theorem is easy to prove, hard to understand


Bayes' theorem states

$${\color{Red}P(A | B)} = \cfrac{{\color{Green}P(B | A)}{\color{Blue}P(A)}}{{\color{Magenta}P(B)}}$$

The proof is not difficult to derive and can be done by using the definition of joint and conditional probability:

$$P(A, B) = {\color{Red}P(A | B)}{\color{Magenta}P(B)} = {\color{Green}P(B | A)}{\color{Blue}P(A)} = P(B, A)$$
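Spelled out, the only remaining step is to divide the two middle terms by ${\color{Magenta}P(B)}$, which assumes $P(B) > 0$:

$${\color{Red}P(A | B)}{\color{Magenta}P(B)} = {\color{Green}P(B | A)}{\color{Blue}P(A)} \quad\Longrightarrow\quad {\color{Red}P(A | B)} = \cfrac{{\color{Green}P(B | A)}{\color{Blue}P(A)}}{{\color{Magenta}P(B)}}$$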

In the following, instead of the inexpressive notations $A$ and $B$, we do ourselves a small favor by rewriting the formula as

$${\color{Red}P(\text{Parameter} | \text{Data})} = \cfrac{{\color{Green}P(\text{Data} | \text{Parameter})} {\color{Blue}P(\text{Parameter})}}{{\color{Magenta}P(\text{Data})}}$$

The new formula has the same meaning but is a bit easier to reason about. We also make the tour easier by explaining the Bayesian concepts with the example of an unfair coin. In this scenario, the term parameter denotes the probability that the coin lands on its tail.

How to interpret the Posterior?

The posterior ${\color{Red}P(\text{Parameter} | \text{Data})}$ assigns every possible parameter a probability by taking the given data into account. It expresses our updated belief about a parameter after we have seen the new data. For example:

  • For a certain coin, how plausible is the probability of $0.75$ of it landing on tail, given a dataset of 10 tosses and 9 heads of the same coin? Or formally written: $${\color{Red}P(\text{Probability landing on tail} = 0.75 | \text{Data = 10 tosses, 9 heads})} = ???$$

For comparison, classical statistics asks: what is the underlying parameter of the distribution? We try to answer that question by repeating the experiment and estimating the true parameter. The true parameter is fixed, so with enough experiments we can estimate it accurately enough.

The Bayesian side treats the parameter of a distribution as a quantity with a distribution of its own and assigns each possible parameter value a probability. Since every parameter value carries a probability, we no longer commit to a single true parameter. As new data come in, we update our belief and change the way we think about the problem. Instead of having only a single point estimate as in classical statistics, with Bayes' rule we can also say how plausible an estimate is.

How to interpret the Likelihood?

The likelihood ${\color{Green}P(\text{Data} | \text{Parameter})}$ gives the probability that the observed data was generated by a model with a certain parameter. For example:

  • What is the probability that 9 out of 10 tosses of a coin are heads if we know that the coin should land on tail 3 out of 4 times? Or formally $${\color{Green}P(\text{Data = 10 tosses, 9 heads} | \text{Probability landing on tail} = 0.75)} = ???$$ Since the number of heads in 10 independent tosses follows a binomial distribution, and the probability of a head is $1 - 0.75 = 0.25$, we can answer the question by applying $${\color{Green}P(\text{Data = 10 tosses, 9 heads} | \text{Probability landing on tail} = 0.75) = {10 \choose 9} \cdot 0.25^{9} \cdot 0.75^{1} \approx 0.00003}$$

In other words, the probability that the coin lands on its head $9$ out of $10$ times, given a $0.75$ bias towards tail, is extraordinarily small. As we can see in Bayes' rule, the likelihood term influences the posterior directly. But the likelihood alone is not enough to make a statement about the posterior.
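To make the arithmetic concrete, here is a minimal sketch that evaluates this likelihood with the binomial formula. The function name `binomial_likelihood` and its arguments are our own choices for illustration; the sketch assumes independent tosses and that the parameter is the probability of landing on tail.

```python
from math import comb


def binomial_likelihood(heads: int, tosses: int, p_tail: float) -> float:
    """Probability of `heads` heads in `tosses` independent tosses,
    when each toss lands on tail with probability `p_tail`."""
    p_head = 1.0 - p_tail
    tails = tosses - heads
    # Number of orderings times the probability of one particular ordering.
    return comb(tosses, heads) * p_head**heads * p_tail**tails


# The example from the text: 10 tosses, 9 heads, tail probability 0.75.
likelihood = binomial_likelihood(heads=9, tosses=10, p_tail=0.75)
print(f"{likelihood:.7f}")  # prints 0.0000286
```

Trying other values for `p_tail` shows how quickly the same data becomes more or less plausible under different parameters.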

How to interpret the Prior?

The prior is where Bayesian statistics becomes really useful. As its name indicates, the prior expresses our belief about a parameter before we collect the data. The likelihood of the observed data alone should not be enough to change our mind; we also have to consider what we already know about the coin.

Think about our coin tossing example: $9$ heads out of $10$ tosses is really improbable for a tail-biased coin. Should we therefore go on and declare the coin to have a head bias? Not without considering the prior knowledge, says Bayes. For example:

  • Assume the coin was tested once in the past, where we tossed it $1000$ times and got $750$ tails. Would we still be confident in declaring the coin to be head-biased just because the new experiment of 10 tosses says so? Suddenly we cannot be so sure anymore; maybe the new data was just a fluke?

The question remains: how can we express our belief as a probability ${\color{Blue}P(\text{Parameter})}$? And if we can express this prior belief, how strongly do we believe in it?

For the coin, the prior can be expressed in closed form by using the Beta distribution. In particular, if we have done an experiment in the past and would like to use it as prior knowledge for the new experiment, we can define a probability density function like the following

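$${\color{blue}f(\text{Parameter}) = \cfrac{\text{Parameter}^{\text{tails} - 1} (1 - \text{Parameter})^{\text{heads} - 1}}{B(\text{tails}, \text{heads})}}$$

where $B(\text{tails}, \text{heads})$ is the Beta function, which normalizes the numerator so that the density integrates to $1$. Because the parameter is the probability of landing on tail, the tails count belongs to the exponent of $\text{Parameter}$ and the heads count to the exponent of $1 - \text{Parameter}$. The term ${\color{blue}f(\text{Parameter})}$ in this case is the desired prior knowledge, packed compactly into an easy-to-evaluate function.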
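As an illustration, the following sketch evaluates such a Beta density for the earlier experiment of $1000$ tosses with $750$ tails. It assumes that the parameter is the probability of landing on tail, so the tails count is the exponent of the parameter, and that the past counts are plugged in directly as the two Beta parameters. The names `log_beta_function` and `beta_pdf` are ours, and the computation runs in log space only to avoid overflow with large counts.

```python
from math import exp, lgamma, log


def log_beta_function(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b), computed via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_pdf(p_tail: float, tails: float, heads: float) -> float:
    """Density of Beta(tails, heads) at p_tail, for 0 < p_tail < 1."""
    log_density = (
        (tails - 1) * log(p_tail)
        + (heads - 1) * log(1.0 - p_tail)
        - log_beta_function(tails, heads)
    )
    return exp(log_density)


# Prior built from the past experiment: 750 tails and 250 heads.
for p in (0.5, 0.7, 0.75, 0.8):
    print(p, beta_pdf(p, tails=750, heads=250))
```

The printed densities should peak near $0.75$ and fall off quickly on both sides, which is how a strong prior looks.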

In more complex cases, expressing prior knowledge is not always that easy.

How to interpret Evidence?

The denominator of Bayes' rule, the evidence, is also known as the marginal likelihood. Formally, it is the expected value of the likelihood under the prior. In the case of the coin, a small integration over all possible parameters can be used to evaluate the evidence

$${\color{magenta}P(\text{Data}) = \int P(\text{Data} | \text{Parameter}) P(\text{Parameter}) \, d\text{Parameter}}$$

Intuitively speaking, the marginal likelihood is nothing more than the general probability of the data, without considering any particular parameter.
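For the coin, this integral can be approximated numerically. The sketch below uses a simple midpoint rule and, purely as an assumption for illustration, a uniform prior over the parameter; the function names are ours. With a uniform prior, the evidence for 9 heads in 10 tosses has the known closed-form value $1/11$, which gives us something to check against.

```python
from math import comb


def likelihood(p_tail: float, heads: int = 9, tosses: int = 10) -> float:
    """Binomial probability of the data for a given tail probability."""
    tails = tosses - heads
    return comb(tosses, heads) * (1.0 - p_tail) ** heads * p_tail**tails


def uniform_prior(p_tail: float) -> float:
    """Assumed prior: every parameter in [0, 1] is equally plausible."""
    return 1.0


def evidence(likelihood_fn, prior_fn, steps: int = 10_000) -> float:
    """Midpoint-rule approximation of the integral of likelihood * prior over [0, 1]."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * width  # midpoint of the i-th slice
        total += likelihood_fn(p) * prior_fn(p) * width
    return total


marginal = evidence(likelihood, uniform_prior)
print(marginal, 1 / 11)
```

Swapping `uniform_prior` for another density over $[0, 1]$ changes the evidence, since the average is then weighted towards the parameters that prior favors.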

How do the data likelihood and the prior belief influence the posterior belief?

Intuitively, when we have no knowledge about a subject, we tend to believe everything we are told, because we just do not know better.

A flat prior, drawn as a straight line, indicates weak prior knowledge about the coin: every parameter is equally plausible. The posterior is hence very easy to influence.

However, if we consider ourselves experts on the same subject, we can sometimes be very closed-minded, and it would take a lot of data to convince us to give up our belief.

With a strong prior about the coin, even if the data suggests the coin should be head-biased, our posterior belief does not change very much, because there is so little data.
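The two situations can be compared with a small sketch. It relies on a standard property that the text above does not spell out: for coin tosses, a Beta prior combined with the observed counts gives a Beta posterior whose parameters are the prior parameters plus the new counts. The helper names and the choice of Beta(1, 1) as the weak prior are our assumptions; the strong prior reuses the past experiment with $750$ tails and $250$ heads.

```python
def update_beta(prior_tails, prior_heads, new_tails, new_heads):
    """Conjugate update: add the observed counts to the prior's parameters."""
    return prior_tails + new_tails, prior_heads + new_heads


def beta_mean(tails, heads):
    """Mean of Beta(tails, heads), i.e. the expected probability of tail."""
    return tails / (tails + heads)


# New data from the text: 10 tosses, 9 heads, hence 1 tail.
new_tails, new_heads = 1, 9

weak_posterior = update_beta(1, 1, new_tails, new_heads)        # flat prior
strong_posterior = update_beta(750, 250, new_tails, new_heads)  # past experiment

print("weak prior:   mean moves from", beta_mean(1, 1), "to", beta_mean(*weak_posterior))
print("strong prior: mean moves from", beta_mean(750, 250), "to", beta_mean(*strong_posterior))
```

Under the flat prior the expected tail probability drops from $0.5$ to about $0.17$, while under the strong prior it barely moves, from $0.75$ to about $0.74$.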


Bayes may come across as difficult to reason about. Surprisingly, the working principle is quite intuitive once we hit the right spot with the right example.