STA 247 - Week 10 lecture summary

Note that there was no lecture in week 9.

The family of exponential distributions

This is a family of distributions with one parameter, b. The range for an exponential distribution is the positive real numbers. If X has the exp(b) distribution, then the probability density function (PDF) for X is

fX(x) = b exp(-bx), for x > 0

We can find the cumulative distribution function (CDF) for X as follows:

FX(x) = INTEGRAL(0 to x) b exp(-bt) dt
     = [ -exp(-bt) ] evaluated from t=0 to t=x
     = (-exp(-bx)) - (-1)
     = 1 - exp(-bx)
We can use the CDF to find the probability that X lies in some interval (a,b), with a < b (here a and b are just the endpoints of the interval, not the parameter of the distribution):
P(X in (a,b]) = P(X in [a,b)) = P(X in [a,b]) = P(X in (a,b)) = FX(b) - FX(a)
Whether the end-points are included in the interval doesn't matter, since the probability that X is exactly equal to any single number is zero.
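
As a quick numerical check of these formulas, here is a small Python sketch (using numpy and scipy; the rate b = 2 and the endpoints 0.5 and 1.5 are just illustrative choices) that compares the closed-form CDF with a numerical integral of the PDF, and uses the CDF to compute an interval probability.

    import numpy as np
    from scipy import integrate

    b = 2.0                                 # rate parameter of the exp(b) distribution (illustrative)

    def pdf(x):
        return b * np.exp(-b * x)           # fX(x) = b exp(-bx) for x > 0

    def cdf(x):
        return 1.0 - np.exp(-b * x)         # FX(x) = 1 - exp(-bx)

    # Integrating the density numerically should reproduce the CDF.
    x = 1.5
    print(integrate.quad(pdf, 0, x)[0], cdf(x))     # both approximately 0.9502

    # P(X in (a,c]) = FX(c) - FX(a); endpoints don't matter for a continuous X.
    a, c = 0.5, 1.5
    print(cdf(c) - cdf(a))                          # approximately 0.3181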

Modeling waiting times with an exponential distribution

One common use of the family of exponential distributions is in modelling how long you have to wait for some event - eg, the time until the next failure of a computer system. An exponential distribution is an appropriate model if the event has an equal probability of occurring in each tiny time interval, and whether it occurs in one such interval is independent of whether it occurs in another such interval.

We can derive this by considering the geometric distribution for how many small intervals of time will pass until the event occurs. Suppose that the length of such a small interval is d, and assume that the probability of the event occurring in one such small interval is bd for some constant b. (This obviously can't be true when d is large, since it might give a probability greater than one, but it can be approximately true for small d.) The distribution of the number of intervals, N, before the event occurs will then be geometric(bd). Denoting the time that the event (first) occurs as the random variable X, we see that the probability that the event occurs at or before time x will equal the probability that it occurs within the first x/d intervals - that is, that N <= x/d. So

P(X <= x) = P(N <= x/d) = 1 - P(N > x/d) = 1 - (1 - bd)^(x/d)
In the limit as d goes to zero, this will approach 1 - exp(-bx): letting n = 1/(bd), we can write (1 - bd)^(x/d) = ((1 - 1/n)^n)^(bx), and as is well known from calculus, the limit of (1 - 1/n)^n as n goes to infinity is 1/e. So we see that the cumulative distribution function for X is that of the exponential(b) distribution, and since the CDF uniquely determines the distribution, X has the exponential(b) distribution.
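
Here is a small numerical sketch of this limit (the values b = 2 and x = 1 are illustrative), showing how the geometric probability 1 - (1-bd)^(x/d) approaches the exponential CDF 1 - exp(-bx) as d shrinks.

    import numpy as np

    b, x = 2.0, 1.0                          # illustrative rate and time

    for d in [0.1, 0.01, 0.001, 0.0001]:
        print(d, 1 - (1 - b*d) ** (x/d))     # P(N <= x/d) for interval length d

    print("limit:", 1 - np.exp(-b * x))      # exponential CDF, about 0.8647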

Expectation of a continuous random variable

Generalizing the definition of expectations to continuous random variables, we let
E(X) = INTEGRAL(range of X) x fX(x) dx
Example: If X has the exponential(b) distribution, we get
E(X) = INTEGRAL(0 to infinity) x b exp(-bx) dx
     = [ -(x+1/b) exp(-bx) ] evaluated from x=0 to x=infinity
     = 1/b
Some people parameterize exponential distributions in terms of this mean, so you have to be careful to check what someone means by an exp(2) distribution (it might be exp(1/2) according to the convention used here).

One can also show that the variance of an exponential(b) distribution, E((X-1/b)^2), is 1/b^2.
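
A small Monte Carlo sketch (with an illustrative rate b = 2) checking that the mean is 1/b and the variance is 1/b^2. Note that numpy's exponential generator uses the mean (scale = 1/b) as its parameter - the alternative convention just mentioned.

    import numpy as np

    rng = np.random.default_rng(1)
    b = 2.0                                  # illustrative rate parameter

    # Note: numpy's exponential is parameterized by its mean, so scale = 1/b.
    x = rng.exponential(scale=1/b, size=1_000_000)

    print(x.mean(), 1/b)                     # both approximately 0.5
    print(x.var(), 1/b**2)                   # both approximately 0.25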

Probability density of a function of a random variable

Suppose that Y = aX, for some positive constant a. What is the CDF for Y, the function FY(y), in terms of the CDF for X, the function FX(x)?

FY(y) = P(Y <= y) = P(aX <= y) = P(X <= y/a) = FX(y/a)

Similarly, what is the PDF for Y, the function fY(y), in terms of the PDF for X, the function fX(x)? We can differentiate the CDF to find that

fY(y) = F'Y(y) = F'X(y/a) / a = fX(y/a) / a

We can use these results to see how we can get all the exp(b) distributions by rescaling a random variable with the exp(1) distribution. If X ~ exp(1), then Y = X/b will have density function

fY(y) = fX(y/(1/b)) / (1/b) = b fX(by) = b exp(-by)
which is the PDF for the exp(b) distribution.
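
A small simulation sketch of this rescaling (with an illustrative rate b = 3): dividing exp(1) samples by b should give samples whose empirical CDF matches 1 - exp(-by).

    import numpy as np

    rng = np.random.default_rng(2)
    b = 3.0                                  # illustrative rate

    x = rng.exponential(scale=1.0, size=1_000_000)   # X ~ exp(1)
    y = x / b                                        # Y = X/b should have the exp(b) distribution

    # Compare the empirical CDF of Y at a few points with 1 - exp(-by).
    for t in [0.1, 0.3, 0.5]:
        print(t, (y <= t).mean(), 1 - np.exp(-b * t))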

More generally, suppose that Y = g(X) for some monotonically increasing and differentiable function g, with inverse g^(-1). Then

FY(y) = P(Y <= y) = P(g(X) <= y) = P(X <= g^(-1)(y)) = FX(g^(-1)(y))
and from this
fY(y) = F'Y(y) = F'X(g^(-1)(y)) / g'(g^(-1)(y)) = fX(g^(-1)(y)) / g'(g^(-1)(y))
If g may be either monotonically increasing or monotonically decreasing, then fY(y) = fX(g^(-1)(y)) / |g'(g^(-1)(y))|.
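
As an illustration of this formula, the following sketch uses the (arbitrarily chosen) transformation g(x) = x^2 with X ~ exp(1), and checks that an empirical probability for Y = g(X) matches both FX(g^(-1)(y)) and the integral of the transformed density.

    import numpy as np
    from scipy import integrate

    rng = np.random.default_rng(3)
    x = rng.exponential(scale=1.0, size=1_000_000)   # X ~ exp(1)
    y = x ** 2                                       # Y = g(X), with g(x) = x^2 increasing for x > 0

    def f_Y(t):
        # fY(t) = fX(g^(-1)(t)) / g'(g^(-1)(t)), with g^(-1)(t) = sqrt(t) and g'(x) = 2x
        return np.exp(-np.sqrt(t)) / (2 * np.sqrt(t))

    t = 2.0
    print((y <= t).mean())                   # empirical P(Y <= 2)
    print(1 - np.exp(-np.sqrt(t)))           # FX(g^(-1)(2)) = 1 - exp(-sqrt(2)), about 0.757
    print(integrate.quad(f_Y, 0, t)[0])      # integrating the transformed density gives the same value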

The "normal" or "Gaussian" family of distributions

The family of "normal" or "Gaussian" distributions has parameters m (which will turn out to be the mean) and s (which will turn out to be the standard deviation). A distribution in this family is usually denoted as N(m,s^2) - note that the second parameter is usually the square of s (ie, the variance).

If X ~ N(m,s^2), its density function (over the whole range of real numbers) is

fX(x) = [1/(Cs)] exp(-(1/2)((x-m)/s)^2)
where C is the square root of 2 pi.

One can get all these density functions as the densities for linear transformations of a random variable Z with the "standard normal" distribution, N(0,1), for which the PDF is

fZ(z) = [1/C] exp(-z^2/2)
Then X = sZ + m will have the N(m,s^2) distribution.
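
A small simulation sketch of this transformation (the values m = 10 and s = 2 are just illustrative choices).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    m, s = 10.0, 2.0                         # illustrative mean and standard deviation

    z = rng.normal(0.0, 1.0, size=1_000_000) # Z ~ N(0,1)
    x = s * z + m                            # X = sZ + m should be N(m, s^2)

    print(x.mean(), x.std())                 # approximately 10 and 2

    # The N(m, s^2) CDF at t equals the N(0,1) CDF at (t - m)/s.
    t = 12.0
    print((x <= t).mean(), norm.cdf((t - m) / s))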

A normal distribution is appropriate for a quantity that is the sum of many small, mostly independent influences. It may not be appropriate for a quantity that is influenced by a single large factor - eg, heights of people might not be normally distributed, because we know that whether someone is a man or a woman has a big effect on height. But looking only at men (or only at women), height is perhaps approximately normal, since there may be no other single large influence. (Of course, height cannot be exactly normal, because the normal distribution has a range from -infinity to +infinity, and height can't be negative, but heights of men might nevertheless be close to normally distributed.)

The Central Limit Theorem

The idea that a normal distribution is appropriate for a quantity that is the sum of many small influences is partly justified by the Central Limit Theorem (sometimes abbreviated to CLT):
Suppose X1, X2, ... are independent, identically-distributed random variables, all with the distribution of X, where E(X) exists and Var(X) is finite. Define Sn = X1 + X2 + ... + Xn and Tn = (Sn - nE(X))/sqrt(n Var(X)). Then Tn approaches the N(0,1) distribution as n goes to infinity - that is, the cumulative distribution function for Tn approaches the cumulative distribution function for the N(0,1) distribution.
If the distribution of X is continuous, the density function for Tn will also approach the N(0,1) density function, but if X is discrete, Tn won't have a density function.
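
A small simulation sketch of the CLT, using exp(1) random variables (so E(X) = 1 and Var(X) = 1); the values of n and the number of replications are illustrative.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(5)

    # X_i ~ exp(1), so E(X) = 1 and Var(X) = 1 (any distribution with finite variance would do).
    n, reps = 100, 100_000
    s_n = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
    t_n = (s_n - n * 1.0) / np.sqrt(n * 1.0)         # standardized sums

    # The empirical CDF of T_n should be close to the N(0,1) CDF for large n.
    for t in [-1.0, 0.0, 1.0]:
        print(t, (t_n <= t).mean(), norm.cdf(t))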

One application of the Central Limit Theorem is to approximating the binomial distribution. If X ~ binomial(n,p), then we can view X as a sum of n independent Bernoulli(p) random variables, each with expectation p and variance p(1-p). The CLT then says that (X/n - p)/sqrt(p(1-p)/n) approaches the N(0,1) distribution as n goes to infinity. Informally, we might say that X approaches the N(np, np(1-p)) distribution.
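
A small sketch comparing the exact binomial CDF with this normal approximation (the values n = 100, p = 0.3, and the point 35 are illustrative choices).

    import numpy as np
    from scipy.stats import binom, norm

    n, p = 100, 0.3                          # illustrative values
    k = 35

    # Exact binomial probability P(X <= k), and the N(np, np(1-p)) approximation.
    print(binom.cdf(k, n, p))
    print(norm.cdf((k - n*p) / np.sqrt(n * p * (1 - p))))   # close, though not identical, for large n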

Mixed continuous/discrete distributions

Some distributions are neither discrete nor continuous, but rather a mixture of the two. Such distributions can still be described by their cumulative distribution function.

Example: A bus arrives regularly at a particular stop every 30 minutes, and waits at the stop for 5 minutes before leaving. Suppose you arrive at the stop at a time that is uniformly distributed over the day. What is the distribution of the time, T, you have to wait until you can get on the bus?

The range of T is 0 to 25 minutes. The cumulative distribution function for T is FT(t) = 0 for t < 0, FT(0) = 1/6, FT(t) = (1/6)+(t/30) for 0 < t < 25, and FT(t) = 1 for t >= 25.
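
A small simulation sketch of this example, checking the point mass at zero and the formula for FT(t). It simulates your arrival time within a single 30-minute cycle, taking the bus to be at the stop during the first 5 minutes of the cycle.

    import numpy as np

    rng = np.random.default_rng(6)

    # Within each 30-minute cycle, the bus is at the stop for the first 5 minutes.
    u = rng.uniform(0.0, 30.0, size=1_000_000)       # your arrival time within a cycle
    wait = np.where(u < 5.0, 0.0, 30.0 - u)          # T = 0 if the bus is there, else wait for the next bus

    print((wait == 0).mean())                        # approximately 1/6
    for t in [5.0, 10.0, 20.0]:
        print(t, (wait <= t).mean(), 1/6 + t/30)     # matches FT(t) = 1/6 + t/30 for 0 < t < 25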

Joint distributions of continuous and discrete random variables

Another way that discrete and continuous random variables are combined is when we have a joint distribution for a discrete variable and a continuous variable. Suppose X is discrete (with range { 1, 2, 3 }, say) and Y is continuous (with range (0,1), say). We need to specify joint probabilities like

P(X=x, Y in (a,b))
We can use the multiplication rule to write this as
P(X=x) P(Y in (a,b) | X=x)
The first factor above can be specified as usual with a probability mass function for X. The second factor can be specified by giving a conditional probability density function for Y for each possible x.
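
A small sketch of such a joint specification. The PMF for X and the conditional density for Y given X = x (uniform on (0, 1/x)) are made-up illustrative choices, used to check one joint probability against the multiplication rule.

    import numpy as np

    rng = np.random.default_rng(7)

    # Illustrative PMF for X, and conditional density for Y given X = x: uniform on (0, 1/x).
    values, probs = [1, 2, 3], [0.5, 0.3, 0.2]
    xs = rng.choice(values, p=probs, size=1_000_000)
    ys = rng.uniform(0.0, 1.0 / xs)

    # P(X=2, Y in (0,0.25)) = P(X=2) P(Y in (0,0.25) | X=2) = 0.3 * 0.5
    print(((xs == 2) & (ys < 0.25)).mean(), 0.3 * 0.5)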

Mixture distributions

One important use for having a discrete random variable together with a continuous random variable is to specify a mixture distribution.

As an example, consider modelling the heights (in cm) of adults. We know that height, H, depends a lot on sex, so we might introduce a random variable M that is 0 for females and 1 for males. We might assume that P(M=0) = P(M=1) = 1/2. We might also model the heights of males as normal with mean 175 and standard deviation 17, and the heights of females as normal with mean 155 and standard deviation 15. (Normal distributions can't be exactly right, since they give a small positive probability to negative heights, but they may be close enough to be useful.)

This mixture model might be better than assuming that the distribution of H for all adults is normal. Even if we don't know which people are male and which are female, we might model height in this way. M would then be a "latent variable", which is not observed, but which helps model the distribution of what is observed.

We can also write the probability density function for a mixture model without mentioning any latent variable. If Y is modelled by a mixture of distributions with density functions f0,Y and f1,Y, with probabilities p0 and p1 = 1 - p0, then the density function for Y can be written as

fY(y) = p0 f0,Y(y) + p1 f1,Y(y)
There is a corresponding formula for the cumulative distribution function:
FY(y) = p0 F0,Y(y) + p1 F1,Y(y)
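
A small sketch evaluating these two formulas for the height example above, using the illustrative component distributions and probabilities given earlier.

    from scipy.stats import norm

    # Heights: N(155, 15^2) for females and N(175, 17^2) for males, each with probability 1/2.
    p0, p1 = 0.5, 0.5
    f0 = norm(loc=155, scale=15)
    f1 = norm(loc=175, scale=17)

    h = 170.0
    print(p0 * f0.pdf(h) + p1 * f1.pdf(h))   # mixture density fH(170)
    print(p0 * f0.cdf(h) + p1 * f1.cdf(h))   # mixture CDF FH(170)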

A formula for expectation from conditional expectation

The conditional expectation of Y given X, written E(Y|X), is a random variable (not a single number, like an ordinary expectation). The conditional expectation varies with the value of the random variable conditioned on, so when X=x the value of E(Y|X) is E(Y|X=x).

Since E(Y|X) is a random variable, we can ask what its expectation is. It turns out to be as follows:

Theorem: E(Y) = E(E(Y|X)).

Proof (for the case where X and Y are both discrete; the continuous case is similar, with sums replaced by integrals):

E(Y) = SUM(over y) y P(Y=y)
     = SUM(over y) y SUM(over x) P(Y=y, X=x)
     = SUM(over y) y SUM(over x) P(Y=y | X=x) P(X=x)
     = SUM(over x) P(X=x) SUM(over y) y P(Y=y | X=x)
     = SUM(over x) P(X=x) E(Y|X=x)
     = E(E(Y|X))
As an example, consider the model for heights of adult males and females above. From our normal models for height given sex, we get that E(H|M=0) = 155 and E(H|M=1) = 175. So the above theorem tells us that
E(H) = E(E(H|M)) = P(M=0)E(H|M=0) + P(M=1)E(H|M=1) = (1/2) 155 + (1/2) 175 = 165
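
A small simulation sketch of this calculation, using the same illustrative normal models for height as above: the overall mean of the simulated heights should match E(E(H|M)) = 165.

    import numpy as np

    rng = np.random.default_rng(8)
    n = 1_000_000

    m = rng.integers(0, 2, size=n)                   # M = 0 (female) or 1 (male), each with probability 1/2
    h = np.where(m == 0, rng.normal(155, 15, size=n),
                         rng.normal(175, 17, size=n))

    print(h.mean())                                  # approximately 165
    print(0.5 * 155 + 0.5 * 175)                     # E(E(H|M)) = 165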