STA 247 - Week 11 lecture summary

Bayes' Rule for continuous random variables

Consider last week's mixture model for the height of an adult, H, in which we specified H|M=0 as N(155,15^2) and H|M=1 as N(175,17^2), where M=1 indicates male and M=0 indicates female, with P(M=1)=P(M=0)=1/2. Suppose we measure that some adult's height is 180cm. How likely is it that they are male?

We can answer this by finding the conditional probability that M=1 given the measured height, using Bayes' Rule. But we have a problem. P(M=1|H=180) is a conditional probability in which the event that we condition on, H=180, has zero probability. We had previously said that such a conditional probability is undefined, since its definition involves a division by zero.

However, our measurement doesn't actually tell us that H=180. It has some finite precision, and so it tells us only that H is in some small interval, such as (179.9,180.1). So what we actually need to find is

P(M=1 | H in (179.9,180.1)) = P(M=1) P(H in (179.9,180.1) | M=1) / P(H in (179.9,180.1))
which is well defined. And we could actually compute it, using integrals over the normal probability density function.

However, when the precision of a measurement is high compared to the standard deviation of the quantity measured, we will find that these integrals are over intervals where the probability density is almost constant. So the integral is approximately equal to the probability density at the centre of the interval times the width of the interval. For example,

P(H in (179.9,180.1) | M=1) is approximately 0.2 f1,H(180)
where f1,H is the probability density function for the N(175,17^2) distribution, which is the distribution for H when M=1.
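
As a numerical check, here is a small Python sketch (with norm_cdf and norm_pdf written out just for this illustration) comparing the exact interval probability with the approximation above:

from math import erf, exp, pi, sqrt

def norm_cdf(x, mu, sigma):
    # Normal CDF written in terms of the error function.
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def norm_pdf(x, mu, sigma):
    # Normal probability density function.
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Exact probability that H is in (179.9,180.1) when H ~ N(175, 17^2).
exact = norm_cdf(180.1, 175, 17) - norm_cdf(179.9, 175, 17)

# Approximation: width of the interval times the density at its centre.
approx = 0.2 * norm_pdf(180, 175, 17)

print(exact, approx)   # both come out to about 0.0045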

When we substitute this into Bayes' Rule, we find that the width of the interval (0.2 in this example) cancels out. We then have something that looks just like Bayes' Rule, but with probability densities for H instead of probabilities.

We can often get away with this trick, treating probability densities almost like probabilities. Note, however, that there certainly are differences - for example, probabilities can't be greater than one, but probability densities can be greater than one.
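
For instance, here is a small Python sketch of the height example, applying Bayes' Rule with densities in place of interval probabilities:

from math import exp, pi, sqrt

def norm_pdf(x, mu, sigma):
    # Normal probability density function.
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Density of the measured height under each value of M.
f1 = norm_pdf(180, 175, 17)   # H | M=1 is N(175, 17^2)
f0 = norm_pdf(180, 155, 15)   # H | M=0 is N(155, 15^2)

# Bayes' Rule with densities replacing interval probabilities;
# the interval width has cancelled from numerator and denominator.
posterior = 0.5 * f1 / (0.5 * f1 + 0.5 * f0)

print(posterior)   # about 0.77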

Markov chains

One simple way of modeling dependence between random variables is to use the following directed graphical model:
X0 ---> X1 ---> X2 ---> X3 ---> ...
In this model, Xi is conditionally independent of Xk given Xj whenever i < j < k. This is called the "Markov property", and the model above is called a "Markov model" or a "Markov chain". The variables X0, X1, X2, ... are often seen as being ordered by "time", measured by integers, so we may use t as the index, writing Xt for one of these variables, although in some applications the variables may be ordered in some other way, such as in space, or by position in a file. We sometimes call the value of Xt the "state" at time t.

Example applications: Markov models arise in many application areas. Some examples: the weather on successive days, the letters at successive positions of a text file, the successive positions of a particle moving at random, and the sequence of bases along a DNA strand.

In all these applications, a Markov chain might or might not be a good model - ie, the Markov property might or might not hold. But even if a Markov chain isn't a perfect model, it might be a much better model than assuming that the Xt are independent. We might prefer a Markov model to a more complex model because, as we'll see, it is relatively easy to compute probabilities relating to a Markov model (as long as the number of possible values for Xt isn't too large).

Specifying a Markov chain

Let's suppose that the random variables making up a Markov chain have some finite range, such as { 1, 2, ..., K }. To specify the joint distribution of all the Xt, we will need to specify

The initial probabilities for the state - in other words, P(X0=x) for all x in the range of X0.

The transition probabilities for moving from a state at time t to a state at time t+1 - in other words, P(Xt+1=x' | Xt=x) for all x and x'.

If the transition probabilities are the same for all t, we say the Markov chain is "homogeneous".

For a homogeneous Markov chain, we will write

P0(x) for P(X0=x).

P(1)(x --> x') for P(Xt+1=x' | Xt=x).

Note that specifying a Markov chain with K possible states requires K-1 numbers for the initial probabilities (the last probability is determined from the others by the requirement that they sum to one) and K(K-1) numbers for the transition probabilities (since the transition probabilities out of each of the K states must also sum to one).
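
As an illustration, such a specification could be represented in Python as follows, for a hypothetical two-state chain whose numbers are made up for this sketch (states are numbered from 0 here, rather than from 1):

# P0[x] is the initial probability P(X0 = x).
P0 = [0.6, 0.4]

# T[x][xp] is the transition probability P(1)(x --> x').
T = [[0.9, 0.1],
     [0.3, 0.7]]

# Sanity checks: the initial probabilities sum to one, and so does
# each row of transition probabilities.
assert abs(sum(P0) - 1) < 1e-12
for row in T:
    assert abs(sum(row) - 1) < 1e-12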

Finding the marginal distribution for the state at time t

Suppose we want to find Pn(x) = P(Xn = x).

We know P0(x). We can find P1(x) as follows:

P1(x1) = P(X1 = x1)
     = SUM(over x0) P(X1 = x1, X0 = x0)
     = SUM(over x0) P(X1 = x1 | X0 = x0) P(X0 = x0)
     = SUM(over x0) P(1)(x0 --> x1) P0(x0)

Similarly, we could find P4(x) as

P4(x4) = P(X4 = x4)
     = SUM(over x0, x1, x2, x3) P0(x0) P(1)(x0 --> x1) P(1)(x1 --> x2) P(1)(x2 --> x3) P(1)(x3 --> x4)
But the summation here is over K^4 terms, and in general using this method to compute Pn(x) would take time that grows exponentially with n.
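
Here is a Python sketch of this brute-force method (marginal_bruteforce is just a name for this illustration, and P0 and T are an initial distribution and transition matrix represented as in the earlier sketch):

import itertools

def marginal_bruteforce(n, x, P0, T):
    # Compute Pn(x) = P(Xn = x) by summing the probabilities of
    # all K^n paths (x0, ..., xn-1) that lead to state x at time n.
    K = len(P0)
    total = 0.0
    for path in itertools.product(range(K), repeat=n):
        states = list(path) + [x]
        prob = P0[states[0]]
        for a, b in zip(states, states[1:]):
            prob *= T[a][b]          # multiply in P(1)(a --> b)
        total += prob
    return total

With K = 20 states and n = 15, this sum already has about 3 x 10^19 terms.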

Fortunately, we can instead proceed sequentially, computing P1, P2, P3, ... in turn (of course, we know P0 before we start). At each stage, we build a table of values for Pn, which we can use when computing the next table. To compute Pn when we already have a table of values for Pn-1, we just need to write it as follows:

Pn(xn) = SUM(over x0, ..., xn-1) P(Xn = xn | X0 = x0, ..., Xn-1 = xn-1) P(X0 = x0, ..., Xn-1 = xn-1)
     = SUM(over x0, ..., xn-1) P(Xn = xn | Xn-1 = xn-1) P(X0 = x0, ..., Xn-1 = xn-1)
     = SUM(over xn-1) P(Xn = xn | Xn-1 = xn-1) SUM(over x0, ..., xn-2) P(X0 = x0, ..., Xn-1 = xn-1)
     = SUM(over xn-1) P(Xn = xn | Xn-1 = xn-1) P(Xn-1 = xn-1)
     = SUM(over xn-1) P(1)(xn-1 --> xn) Pn-1(xn-1)
Note how the Markov property is crucial in simplifying P(Xn = xn | X0 = x0, ..., Xn-1 = xn-1) to P(Xn = xn | Xn-1 = xn-1). The end result is that computing each of the K entries in the table for Pn when we already know Pn-1 requires a sum of only K terms, so computing Pn takes time proportional to nK^2 rather than K^n.
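
In Python, the sequential computation might look as follows (again a sketch, with P0 and T represented as before):

def marginal_sequential(n, P0, T):
    # Compute the table of values Pn(x) for all x, by updating the
    # tables for P0, P1, ..., Pn in turn; each update costs K^2 operations.
    K = len(P0)
    p = list(P0)                     # table of values for P0
    for _ in range(n):
        # Pn(xn) = SUM(over xn-1) P(1)(xn-1 --> xn) Pn-1(xn-1)
        p = [sum(p[x] * T[x][xp] for x in range(K)) for xp in range(K)]
    return p

For the hypothetical two-state chain above, marginal_sequential(4, P0, T)[1] agrees with marginal_bruteforce(4, 1, P0, T), but the time needed grows only linearly with n.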