STA 247 - Week 4 lecture summary

Random variables

The concept of a random variable will provide an easier way to specify events that involve numbers, and lead to the concept of the "expectation" (also called "average" or "mean") of a random numerical quantity.

Example: We flip a coin four times. The possible outcomes are as follows:

    H H H H
    H H H T
    H H T H
    H H T T
    H T H H
    H T H T
    H T T H
    H T T T
    T H H H
    T H H T
    T H T H
    T H T T
    T T H H
    T T H T
    T T T H
    T T T T

Consider the following events:

A = Exactly two of the four flips are heads
B = At least two of the four flips are heads
C = The number of initial heads (before the first tail, or end) is three

We could define these events as subsets of the sample space, S, by explicitly listing the outcomes that make up each event, but it is easier to define these events using two "random variables".

Definition: A random variable is a function that assigns a number to each outcome in the sample space. In other words, it is a map from S to the set of real numbers.

We denote random variables by upper case letters (usually near the end of the alphabet) like X, Y, and Z. We also use upper case letters for events, so you have to tell them apart from context. If X is a random variable, and w is an element of S, then X(w) is the value of the random variable X when the outcome is w.

For the four flip example above, let's define two random variables:

X = The number of heads in the four flips
Y = The number of initial heads
These random variables are functions that can be defined by tables of their values for every possible outcome, as follows:
       w       X(w)    Y(w)
    H H H H     4       4
    H H H T     3       3
    H H T H     3       2
    H H T T     2       2
    H T H H     3       1
    H T H T     2       1
    H T T H     2       1
    H T T T     1       1
    T H H H     3       0
    T H H T     2       0
    T H T H     2       0
    T H T T     1       0
    T T H H     2       0
    T T H T     1       0
    T T T H     1       0
    T T T T     0       0
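
The table above can also be generated mechanically. Here is a minimal Python sketch that treats X and Y literally as functions on the sample space; the representation of an outcome as a tuple of "H" and "T" characters is an assumption made for this illustration.

```python
from itertools import product

# Sample space S: all 16 sequences of four flips, in the same order as the table.
S = list(product("HT", repeat=4))

def X(w):
    """Number of heads in the four flips."""
    return w.count("H")

def Y(w):
    """Number of initial heads (before the first tail, or the end)."""
    return w.index("T") if "T" in w else len(w)

# Print X(w) and Y(w) for every outcome w.
for w in S:
    print(" ".join(w), X(w), Y(w), sep="    ")
```

Each printed row should match the corresponding row of the table.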

We can now use the random variables X and Y to define the events A, B, and C described above:

A = (X=2)
B = (X>1)
C = (Y=3)
Here, (X=2) means the set of all w in S such that X(w) equals 2, and similarly for the others above. We drop the parentheses when they are not necessary.

We can compute the probabilities of these events:

P(A) = P(X=2) = C(4,2)/16 = 6/16
P(B) = P(X>1) = (C(4,2)+C(4,3)+C(4,4))/16 = 11/16
P(C) = P(Y=3) = 1/16
One can find these probabilities as above, or by counting outcomes in the table above that shows the values of X and Y for each outcome.

We can also look at things like P(X=Y), which is the probability that the total number of heads is equal to the number of initial heads (that is, that no head comes after a tail), which equals 5/16. This also can be found from the table above.
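
As a check on these numbers, here is a small Python sketch that finds each probability by counting outcomes, assuming all 16 outcomes are equally likely. The helper name prob is made up for this illustration.

```python
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=4))

def X(w):
    return w.count("H")

def Y(w):
    # The position of the first tail equals the number of initial heads.
    return w.index("T") if "T" in w else len(w)

def prob(event):
    """Probability of the set of outcomes w for which event(w) is true."""
    return Fraction(sum(1 for w in S if event(w)), len(S))

p_A = prob(lambda w: X(w) == 2)        # A = (X=2)
p_B = prob(lambda w: X(w) > 1)         # B = (X>1)
p_C = prob(lambda w: Y(w) == 3)        # C = (Y=3)
p_XeqY = prob(lambda w: X(w) == Y(w))  # (X=Y)
print(p_A, p_B, p_C, p_XeqY)
```

Note that Fraction reduces to lowest terms, so 6/16 is displayed as 3/8.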

Probability mass functions, and distributions

We could more easily compute probabilities like P(X>1), that are defined only in terms of the random variable X, if we had a table that for each possible value of X (the range of X) gives the total probability of all outcomes that map to that value for X. Such a table is called a probability mass function. The probability mass functions for X and Y are as follows:

     x    P(X=x)        y    P(Y=y)
     0     1/16         0     1/2
     1     4/16         1     1/4
     2     6/16         2     1/8
     3     4/16         3     1/16
     4     1/16         4     1/16
Here, I have followed the common convention of referring to particular values of a random variable with the corresponding lower case letter. With this convention, P(X=x) is sometimes abbreviated to P(x), with it being assumed that x is a possible value of X. But note that P(3) is meaningless, since you can't tell what random variable 3 might be a value of.

The probability mass function for X allows us to find the probability of any event that involves only X without having to refer to the sample space or the map from outcomes to numbers that defines X. For example, we can find P(X>1) by just summing the last three entries in the table above: P(X>1)=6/16+4/16+1/16=11/16.
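
Here is a sketch of how the probability mass functions could be built by totalling outcome probabilities, and then used on their own to find P(X>1). The function name pmf and the dictionary representation are assumptions for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=4))

def pmf(rv):
    """Map each value in the range of rv to the total probability of the
    outcomes giving that value (all outcomes equally likely)."""
    counts = Counter(rv(w) for w in S)
    return {v: Fraction(c, len(S)) for v, c in sorted(counts.items())}

pmf_X = pmf(lambda w: w.count("H"))
pmf_Y = pmf(lambda w: w.index("T") if "T" in w else len(w))

# Once we have pmf_X, the sample space is no longer needed to find P(X>1).
p_X_gt_1 = sum(p for x, p in pmf_X.items() if x > 1)
print(pmf_X, pmf_Y, p_X_gt_1)
```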

For a random variable with a finite range, a table of its probability mass function is one way of specifying its distribution. For a random variable with an infinite, especially continuous, range, we need other ways of specifying the distribution (which gives the probability of any event involving the random variable).

Example: Suppose we draw 3 balls without replacement from an urn with 2 red, 2 green, and 2 blue balls. Define the following random variables:

R = Number of red balls drawn
G = Number of green balls drawn
B = Number of blue balls drawn
What is the probability mass function of R?

Here's the answer:

     r    P(R=r)
     0    24/120
     1    72/120
     2    24/120

What is the probability mass function of G?

Answer: It's the same as the probability mass function of R. The random variables R and G have the same distribution, even though they are not the same random variable. We can define two other random variables in terms of R, G, and B:

X = R + G, which means that X(w)= R(w)+G(w)
Y = 3 - B, which means that Y(w)=3-B(w)
Here, one can figure out that X and Y are the same random variable (that is, they define the same map from a w in S to a number). They therefore must also have the same distribution.
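
The distinction between "same distribution" and "same random variable" can be checked by brute force. In this sketch the sample space is taken to be the 120 equally likely ordered draws (which matches the denominators in the table above); the ball labels and helper names are invented for the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

balls = ["r1", "r2", "g1", "g2", "b1", "b2"]   # labels are arbitrary
S = list(permutations(balls, 3))               # 6*5*4 = 120 ordered draws

def count_colour(c):
    """Random variable giving the number of drawn balls of colour c."""
    return lambda w: sum(1 for ball in w if ball[0] == c)

R, G, B = count_colour("r"), count_colour("g"), count_colour("b")
X = lambda w: R(w) + G(w)
Y = lambda w: 3 - B(w)

def pmf(rv):
    counts = Counter(rv(w) for w in S)
    return {v: Fraction(c, len(S)) for v, c in sorted(counts.items())}

same_distribution_RG = pmf(R) == pmf(G)                # same distribution?
same_variable_RG = all(R(w) == G(w) for w in S)        # same map from S?
same_variable_XY = all(X(w) == Y(w) for w in S)
print(pmf(R), same_distribution_RG, same_variable_RG, same_variable_XY)
```

R and G should come out with equal probability mass functions while differing on some outcomes, whereas X and Y agree on every outcome.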

Example: We use some sort program to sort files of 100 names into alphabetical order. Suppose that the names are drawn (with replacement) from a set of 1000 names, with all possible files drawn this way being equally likely. We can define random variables such as

C = Number of comparisons of names that the sort program makes
T = Time in seconds that the sort program takes to run
Note that the sample space for this problem is enormous (1000^100 = 10^300 possible input files), so we wouldn't want to define these random variables by giving a table of their values. Also, we may not fully understand the behaviour of our sort program, so we don't really understand the mapping from input file to C or T. We can nevertheless define random variables as above, and then use them in figuring out how we can investigate the performance of this sort program empirically.

Expected value of a random variable

Definition: The expected value of a random variable, X, which is also called its expectation, mean, or average value, is defined to be

E(X) = SUM(over x) x P(X=x)
Here, the sum is over all possible values of the random variable X (that is, over its range). We'll assume at the moment that the range is finite. (If the range is countably infinite (eg, the integers), then this sum will have infinitely many terms, and might or might not converge; if the range is continuous (eg, the reals), then the sum will have to be replaced by an integral. But we won't deal with these cases now.)

One can visualize the expected value of a random variable by imagining a graph in which above each value for x (on the horizontal axis) we put a bar with height proportional to the probability for that value (ie, this is a graph of the probability mass function). Then the expected value of X is the point on the horizontal axis where the bars would balance if we supported the graph at that point.

Example: For the example above of flipping four coins, the expected values of X and Y are

E(X) = 0 x (1/16) + 1 x (4/16) + 2 x (6/16) + 3 x (4/16) + 4 x (1/16) = 2
E(Y) = 0 x (1/2) + 1 x (1/4) + 2 x (1/8) + 3 x (1/16) + 4 x (1/16) = 15/16
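
The same sums can be written as a short Python sketch, with each probability mass function stored as a dictionary from values to probabilities (a representation assumed here for illustration).

```python
from fractions import Fraction

pmf_X = {0: Fraction(1, 16), 1: Fraction(4, 16), 2: Fraction(6, 16),
         3: Fraction(4, 16), 4: Fraction(1, 16)}
pmf_Y = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8),
         3: Fraction(1, 16), 4: Fraction(1, 16)}

def expectation(pmf):
    """E(X) = SUM(over x) x P(X=x), for a finite range."""
    return sum(x * p for x, p in pmf.items())

E_X = expectation(pmf_X)
E_Y = expectation(pmf_Y)
print(E_X, E_Y)
```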

Example: For the example with the sort program above, E(C) is the average number of comparisons done by the sort program, when run on a random input file of 100 names, and E(T) is its average run time on such input files. These might be quantities that we are interested in estimating.
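
One way such an estimate might be obtained is by simulation. The sketch below is only an illustration under several assumptions: Python's built-in sorted() stands in for "the sort program", integers from 0 to 999 stand in for the 1000 names, and the number of simulated files (500) and the seed are arbitrary choices.

```python
import random
from functools import cmp_to_key

def comparisons_to_sort(names):
    """Value of C for one input file: comparisons made by sorted()."""
    count = 0

    def compare(a, b):
        nonlocal count
        count += 1                 # one call per comparison of two names
        return (a > b) - (a < b)

    sorted(names, key=cmp_to_key(compare))
    return count

def estimate_expected_comparisons(n_files=500, seed=0):
    """Average of C over n_files randomly generated input files."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_files):
        # 100 names drawn with replacement from a set of 1000.
        names = [rng.randrange(1000) for _ in range(100)]
        total += comparisons_to_sort(names)
    return total / n_files

estimate = estimate_expected_comparisons()
print(estimate)
```

The printed number is an estimate of E(C) for this stand-in sort, not its exact value; using more files should give a more reliable estimate.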

Theorem: If Z=g(X) for some function g, then

E(Z) = SUM(over x) g(x) P(X=x)
Proof:
E(Z) = SUM(over z) z P(Z=z)
  = SUM(over z) z SUM(over x such that g(x)=z) P(X=x)
  = SUM(over z) SUM(over x such that g(x)=z) z P(X=x)
  = SUM(over z) SUM(over x such that g(x)=z) g(x) P(X=x)
  = SUM(over x) g(x) P(X=x)

Example: For the four coin flip example, suppose we define Z=|X-2|. We could find E(Z) by first finding the probability mass function of Z, which is

    z    P(Z=z)
    0     6/16
    1     8/16
    2     2/16
and then finding E(Z) from the definition of expectation, as
E(Z) = 0 x (6/16) + 1 x (8/16) + 2 x (2/16) = 12/16
Alternatively, we can use the theorem above, and find E(Z) as
E(Z) = |0-2| x (1/16) + |1-2| x (4/16) + |2-2| x (6/16) + |3-2| x (4/16) + |4-2| x (1/16)
  = 2 x (1/16) + 1 x (4/16) + 0 x (6/16) + 1 x (4/16) + 2 x (1/16) = 12/16
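
Both routes to E(Z) can be compared in a short Python sketch. The first builds the probability mass function of Z by grouping the values x with the same g(x), which mirrors the first steps of the proof; the second applies the theorem directly. Variable names are chosen for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

pmf_X = {0: Fraction(1, 16), 1: Fraction(4, 16), 2: Fraction(6, 16),
         3: Fraction(4, 16), 4: Fraction(1, 16)}

def g(x):
    return abs(x - 2)

# Route 1: P(Z=z) = SUM(over x such that g(x)=z) P(X=x), then the definition.
pmf_Z = defaultdict(Fraction)
for x, p in pmf_X.items():
    pmf_Z[g(x)] += p
E_Z_from_pmf = sum(z * p for z, p in pmf_Z.items())

# Route 2: the theorem, E(Z) = SUM(over x) g(x) P(X=x).
E_Z_from_theorem = sum(g(x) * p for x, p in pmf_X.items())

print(dict(pmf_Z), E_Z_from_pmf, E_Z_from_theorem)
```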

Joint probabilities

We can take unions and intersections of events defined using random variables, just as for any events. For the four coin flip example from above, we could look at these probabilities:
P( (X=3) U (Y=2) )
P( (X=3) intersection (Y=2) )
Intersections of such events are used so often that we use an abbreviated notation, in which the probability of the intersection above is written as
P(X=3, Y=2)
So comma means "and" in this context.

We can figure out these probabilities from the earlier tables:

P(X=3, Y=2) = P( { HHTH } ) = 1/16
P( (X=3) U (Y=2) ) = P(X=3) + P(Y=2) - P(X=3,Y=2) = 4/16 + 1/8 - 1/16 = 5/16

The joint distribution for two random variables specifies the probabilities for all events involving only those two random variables. When the random variables have finite range, the distribution can be specified as the joint probability mass function.

For the four coin flip example, the joint distribution of X and Y will give P(X=x,Y=y) for all x and y. Here is a table of these probabilities:

      y=   0      1      2      3      4
      -------------------------------------
 x=   |                                    |
    0 |   1/16    0      0      0      0   | 1/16
      |                                    |
    1 |   3/16   1/16    0      0      0   | 4/16
      |                                    |
    2 |   3/16   2/16   1/16    0      0   | 6/16
      |                                    |
    3 |   1/16   1/16   1/16   1/16    0   | 4/16
      |                                    |
    4 |    0      0      0      0     1/16 | 1/16
      |                                    |
      -------------------------------------

          8/16   4/16   2/16   1/16   1/16
The joint probabilities are in the main area. I have written marginal probabilities in the right and bottom margins, which are the sums of the rows and columns. These are the probabilities P(X=x) and P(Y=y). The word "marginal" is redundant (these are just probabilities), but indicates that the probabilities were obtained from the table of joint probabilities.
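
Here is a sketch of how the joint table and its margins could be computed from the sample space. The dictionary keyed by (x, y) pairs is an assumed representation; pairs that never occur are simply absent, corresponding to the zeros in the table.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=4))
X = lambda w: w.count("H")
Y = lambda w: w.index("T") if "T" in w else len(w)

# Joint pmf: P(X=x, Y=y) for each pair (x, y) that occurs.
counts = Counter((X(w), Y(w)) for w in S)
joint = {xy: Fraction(c, len(S)) for xy, c in counts.items()}

# Marginals: sum the joint probabilities along rows (X) and columns (Y).
marginal_X = defaultdict(Fraction)
marginal_Y = defaultdict(Fraction)
for (x, y), p in joint.items():
    marginal_X[x] += p
    marginal_Y[y] += p

# Print each row of the table, with its row sum in the right margin.
for x in range(5):
    row = [str(joint.get((x, y), 0)) for y in range(5)]
    print(x, row, marginal_X[x])
print([str(marginal_Y[y]) for y in range(5)])
```

Fractions are printed in lowest terms, so for example 2/16 appears as 1/8.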

Conditional distributions

We can find the conditional probability for a random variable to have some value given that another random variable has some value by applying the definition of conditional probability.

For the four coin flip example, we can find

P(Y=1|X=2) = P(Y=1, X=2) / P(X=2) = (2/16) / (6/16) = 1/3

A table of conditional probabilities given some event specifies the conditional distribution given that event.

For the four coin flip example, the conditional distribution of Y given X=2 is as follows:

    y   P(Y=y|X=2)
    0      1/2
    1      1/3
    2      1/6
    3       0
    4       0
We can get this from the joint probability table by taking the numbers in the row for X=2 and dividing them by their sum (the marginal probability of X=2).
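
The "take a row and divide by its sum" recipe can be sketched as follows. The function name conditional_Y_given_X is made up for this illustration, and the joint probabilities are recomputed from the sample space so that the block stands on its own.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=4))
X = lambda w: w.count("H")
Y = lambda w: w.index("T") if "T" in w else len(w)

counts = Counter((X(w), Y(w)) for w in S)
joint = {xy: Fraction(c, len(S)) for xy, c in counts.items()}

def conditional_Y_given_X(x0):
    """P(Y=y | X=x0) for y = 0..4: the row for x0 divided by its sum."""
    row = [joint.get((x0, y), Fraction(0)) for y in range(5)]
    p_x0 = sum(row)   # marginal probability P(X=x0), assumed positive
    return {y: p / p_x0 for y, p in enumerate(row)}

cond_given_2 = conditional_Y_given_X(2)
print(cond_given_2)
```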

Independence of random variables

Random variables X and Y are independent if

P(X=x, Y=y) = P(X=x) P(Y=y)
for all x in the range of X and y in the range of Y. Similarly, three random variables X, Y, and Z are mutually independent if
P(X=x, Y=y, Z=z) = P(X=x) P(Y=y) P(Z=z)
for all x, y, z.

Mutual independence of random variables X, Y, and Z implies mutual independence of events A, B, and C if A is defined only in terms of X, B is defined only in terms of Y, and C is defined only in terms of Z (eg, if A is (X<4), B is (Y=2), and C is (2 < Z < 7)).

Also, mutual independence of X, Y, and Z implies mutual independence of U=f(X), V=g(Y), and W=h(Z) for any functions f, g, and h.
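
The definition of independence for two random variables can be tested directly on the four coin flip example. In the sketch below, F1 and F2 are hypothetical random variables introduced only for this illustration (indicators that the first and second flips are heads); X and Y are the total and initial numbers of heads as before.

```python
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=4))

def prob(event):
    return Fraction(sum(1 for w in S if event(w)), len(S))

def independent(U, V):
    """Check P(U=u, V=v) == P(U=u) P(V=v) for all u, v in the ranges."""
    for u in {U(w) for w in S}:
        for v in {V(w) for w in S}:
            p_joint = prob(lambda w: U(w) == u and V(w) == v)
            p_product = prob(lambda w: U(w) == u) * prob(lambda w: V(w) == v)
            if p_joint != p_product:
                return False
    return True

X = lambda w: w.count("H")
Y = lambda w: w.index("T") if "T" in w else len(w)
F1 = lambda w: int(w[0] == "H")   # hypothetical: 1 if the first flip is heads
F2 = lambda w: int(w[1] == "H")   # hypothetical: 1 if the second flip is heads

x_y_independent = independent(X, Y)
f1_f2_independent = independent(F1, F2)
print(x_y_independent, f1_f2_independent)
```

X and Y fail the test (for instance, P(X=0, Y=1) is zero while P(X=0) P(Y=1) is not), whereas F1 and F2 pass it.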

Multiplication rule for random variables

Applying the multiplication rule for events to events defined by random variables having certain values, we see that, for instance,
P(X=x, Y=y, Z=z) = P(X=x) P(Y=y | X=x) P(Z=z | X=x, Y=y)
or looking at the variables in one of the other possible orders,
P(X=x, Y=y, Z=z) = P(Z=z) P(Y=y | Z=z) P(X=x | Y=y, Z=z)

One use of this is as a way of specifying joint probabilities for a problem where what we naturally know is some marginal and conditional probabilities.
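
As a sanity check, the rule can be verified numerically in both orders on the four coin flip example. This is a sketch only: here Z is a hypothetical third random variable (1 if the last flip is heads), introduced just for this illustration, and the helper names are made up.

```python
from fractions import Fraction
from itertools import product

S = list(product("HT", repeat=4))

def prob(event):
    return Fraction(sum(1 for w in S if event(w)), len(S))

def cond(event, given):
    """P(event | given), by the definition of conditional probability."""
    return prob(lambda w: event(w) and given(w)) / prob(given)

X = lambda w: w.count("H")
Y = lambda w: w.index("T") if "T" in w else len(w)
Z = lambda w: int(w[3] == "H")   # hypothetical: 1 if the last flip is heads

rule_holds = True
for x, y, z in product(range(5), range(5), range(2)):
    is_x = lambda w: X(w) == x
    is_y = lambda w: Y(w) == y
    is_z = lambda w: Z(w) == z
    joint = prob(lambda w: is_x(w) and is_y(w) and is_z(w))
    if joint == 0:
        continue   # skip cases where a conditional probability may be undefined
    order_xyz = (prob(is_x) * cond(is_y, is_x)
                 * cond(is_z, lambda w: is_x(w) and is_y(w)))
    order_zyx = (prob(is_z) * cond(is_y, is_z)
                 * cond(is_x, lambda w: is_y(w) and is_z(w)))
    if not (joint == order_xyz == order_zyx):
        rule_holds = False
print(rule_holds)
```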

Conditional independence

We say that random variables X and Y are independent given Z if
P(X=x, Y=y | Z=z) = P(X=x | Z=z) P(Y=y | Z=z)
for all x, y, z in the ranges of the random variables.

Theorem: If random variables X and Y are independent given Z, then for any x, y, z with P(Y=y, Z=z) > 0 (so that the conditional probabilities below are defined)

P(X=x | Y=y, Z=z) = P(X=x | Z=z)

Proof:

P(X=x | Y=y, Z=z) = P(X=x, Y=y, Z=z) / P(Y=y, Z=z)
  = [ P(X=x, Y=y | Z=z) P(Z=z) ] / [ P(Y=y | Z=z) P(Z=z) ]
  = P(X=x, Y=y | Z=z) / P(Y=y | Z=z)
  = [ P(X=x | Z=z) P(Y=y | Z=z) ] / P(Y=y | Z=z)
  = P(X=x | Z=z)

This theorem may let us simplify some factors in the multiplication rule when we know that some random variables are conditionally independent given some other random variables.
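
To illustrate, here is a sketch with made-up numbers. A joint distribution for three binary random variables is specified through P(Z=z), P(X=x | Z=z), and P(Y=y | Z=z), so that X and Y are independent given Z by construction. The code then checks the conclusion of the theorem, and also computes P(X=0, Y=0) to show that X and Y need not be independent once Z is ignored. All probabilities here are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical numbers, chosen only for illustration.
p_Z = {0: F(1, 2), 1: F(1, 2)}
p_X_given_Z = {0: {0: F(3, 4), 1: F(1, 4)}, 1: {0: F(1, 4), 1: F(3, 4)}}
p_Y_given_Z = {0: {0: F(2, 3), 1: F(1, 3)}, 1: {0: F(1, 3), 1: F(2, 3)}}

# Joint pmf from the multiplication rule, using P(Y=y | X=x, Z=z) = P(Y=y | Z=z).
joint = {(x, y, z): p_Z[z] * p_X_given_Z[z][x] * p_Y_given_Z[z][y]
         for x, y, z in product((0, 1), repeat=3)}

def prob(pred):
    """Total probability of the (x, y, z) triples satisfying pred."""
    return sum(p for (x, y, z), p in joint.items() if pred(x, y, z))

# Check P(X=x | Y=y, Z=z) == P(X=x | Z=z) for every x, y, z.
theorem_holds = all(
    prob(lambda a, b, c: (a, b, c) == (x, y, z)) / prob(lambda a, b, c: (b, c) == (y, z))
    == prob(lambda a, b, c: (a, c) == (x, z)) / prob(lambda a, b, c: c == z)
    for x, y, z in product((0, 1), repeat=3)
)

# Without conditioning on Z, the product rule for independence fails here.
p_X0_Y0 = prob(lambda a, b, c: a == 0 and b == 0)
p_X0_times_p_Y0 = prob(lambda a, b, c: a == 0) * prob(lambda a, b, c: b == 0)
print(theorem_holds, p_X0_Y0, p_X0_times_p_Y0)
```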