The concept of a *random variable* will provide an easier
way to specify events that involve numbers, and lead to the
concept of the "expectation" (also called "average" or "mean")
of a random numerical quantity.

**Example:** We flip a coin four times. The possible outcomes
are as follows:

HHHH  HHHT  HHTH  HHTT
HTHH  HTHT  HTTH  HTTT
THHH  THHT  THTH  THTT
TTHH  TTHT  TTTH  TTTT

Consider the following events:

A = Exactly two of the four flips are heads

B = At least two of the four flips are heads

C = The number of initial heads (before the first tail, or end) is three

We could define these events as subsets of the sample space, *S*,
by explicitly listing the outcomes that make up each event, but it is
easier to define these events using two "random variables".

**Definition:** A *random variable* is a function that
assigns a number to each outcome in the sample space. In other words,
it is a map from S to the set of real numbers.

We denote random variables by upper case letters (usually near the
end of the alphabet) like *X*, *Y*, and *Z*. We also
use upper case letters for events, so you have to tell them apart
from context. If *X* is a random variable, and *w* is an
element of *S*, then *X*(*w*) is the value of the
random variable *X* when the outcome is *w*.

For the four flip example above, let's define two random variables:

X = The number of heads in the four flips

Y = The number of initial heads

These random variables are functions that can be defined by tables of their values for every possible outcome, as follows:

  w     X(w)   Y(w)
 HHHH    4      4
 HHHT    3      3
 HHTH    3      2
 HHTT    2      2
 HTHH    3      1
 HTHT    2      1
 HTTH    2      1
 HTTT    1      1
 THHH    3      0
 THHT    2      0
 THTH    2      0
 THTT    1      0
 TTHH    2      0
 TTHT    1      0
 TTTH    1      0
 TTTT    0      0
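The table above can be generated mechanically. Here is a minimal Python sketch (my own, not part of the notes) that enumerates the sample space and defines X and Y as functions on it:

```python
# Sketch of the four-flip sample space and the random variables X and Y.
from itertools import product

outcomes = ["".join(w) for w in product("HT", repeat=4)]  # the 16 outcomes

def X(w):
    """Number of heads in the four flips."""
    return w.count("H")

def Y(w):
    """Number of initial heads (before the first tail, or end)."""
    return len(w) - len(w.lstrip("H"))   # length of the leading run of H's

print(len(outcomes))         # 16
print(X("HHTH"), Y("HHTH"))  # 3 2
```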

We can now use the random variables *X* and *Y* to define
the events *A*, *B*, and *C* described above:

A = (X=2)

B = (X>1)

C = (Y=3)

Here, (X=2) is an abbreviation for the event consisting of all outcomes w for which X(w)=2, and similarly for the other events.

We can compute the probabilities of these events:

P(A) = P(X=2) = C(4,2)/16 = 6/16

P(B) = P(X>1) = (C(4,2)+C(4,3)+C(4,4))/16 = 11/16

P(C) = P(Y=3) = 1/16

One can find these probabilities as above, or by counting outcomes in the table above that shows the values of X and Y.

We can also look at things like P(*X*=*Y*), which is
the probability that the total number of heads is equal to
the number of initial heads (that is, that no head comes after a tail),
which equals 5/16. This also can be found from the table above.
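All of these probabilities can be checked by brute-force counting over the sample space. A short Python sketch (my own check, not part of the notes), using exact fractions:

```python
# Count outcomes in the 16-element sample space to verify the probabilities.
from fractions import Fraction
from itertools import product

outcomes = ["".join(w) for w in product("HT", repeat=4)]

def X(w): return w.count("H")                  # total number of heads
def Y(w): return len(w) - len(w.lstrip("H"))   # number of initial heads

def prob(event):
    """Probability of an event: the fraction of outcomes satisfying it."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))

print(prob(lambda w: X(w) == 2))     # P(A) = 6/16 = 3/8
print(prob(lambda w: X(w) > 1))      # P(B) = 11/16
print(prob(lambda w: Y(w) == 3))     # P(C) = 1/16
print(prob(lambda w: X(w) == Y(w)))  # P(X=Y) = 5/16
```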

We could more easily compute probabilities like P(*X*>1), that are
defined only in terms of the random variable *X*, if we had
a table that for each possible value of *X* (the *range* of
*X*) gives the total probability of all outcomes that map
to that value for *X*. Such a table is called a *probability
mass function*. The probability mass functions for *X*
and *Y* are as follows:

 x   P(X=x)        y   P(Y=y)
 0    1/16         0    1/2
 1    4/16         1    1/4
 2    6/16         2    1/8
 3    4/16         3    1/16
 4    1/16         4    1/16

Here, I have followed the common convention of referring to particular values of a random variable with the corresponding lower case letter. With this convention, P(X=x) is the probability that the random variable X takes on the particular value x.

The probability mass function for *X* allows us to find the
probability of any event that involves only *X* without having to
refer to the sample space or the map from outcomes to numbers that
defines *X*. For example, we can find P(*X*>1) by just
summing the last three entries in the table above:
P(*X*>1)=6/16+4/16+1/16=11/16.
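The probability mass function itself can be built by tabulating X over the sample space. A minimal Python sketch (my own, not part of the notes):

```python
# Build the pmf of X by counting how many outcomes map to each value.
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = ["".join(w) for w in product("HT", repeat=4)]
counts = Counter(w.count("H") for w in outcomes)   # value of X -> # of outcomes
pmf_X = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in pmf_X.items():
    print(x, p)                                    # e.g. 2 3/8  (= 6/16)
print(sum(p for x, p in pmf_X.items() if x > 1))   # P(X>1) = 11/16
```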

For a random variable with a finite range, a table of its probability
mass function is one way of specifying its *distribution*. For a
random variable with an infinite, especially continuous, range, we
need other ways of specifying the distribution (which gives the
probability of any event involving the random variable).

**Example:** Suppose we draw 3 balls without replacement from
an urn with 2 red, 2 green, and 2 blue balls. Define the following
random variables:

R = Number of red balls drawn

G = Number of green balls drawn

B = Number of blue balls drawn

What is the probability mass function of R?

Here's the answer:

 r   P(R=r)
 0   24/120
 1   72/120
 2   24/120
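These probabilities can be recovered by enumerating all 120 equally likely ordered draws of 3 balls from the 6 distinguishable balls. A Python sketch (my own check, with ball labels like "R1" as an illustrative assumption):

```python
# Enumerate all ordered draws of 3 balls without replacement from the urn.
from collections import Counter
from fractions import Fraction
from itertools import permutations

balls = ["R1", "R2", "G1", "G2", "B1", "B2"]   # 2 red, 2 green, 2 blue
draws = list(permutations(balls, 3))           # 6*5*4 = 120 ordered draws

# R = number of red balls in a draw; count draws for each value of R.
counts = Counter(sum(1 for b in d if b.startswith("R")) for d in draws)
pmf_R = {r: Fraction(c, len(draws)) for r, c in sorted(counts.items())}

for r, p in pmf_R.items():
    print(r, p)    # 0 1/5, 1 3/5, 2 1/5  (i.e. 24/120, 72/120, 24/120)
```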

What is the probability mass function of *G*?

Answer: It's the *same* as the probability mass function of *R*.
The random variables *R* and *G* have the same distribution, even
though they are not the same random variable.
We can define two other random variables in terms of *R*, *G*,
and *B*:

X = R + G, which means that X(w) = R(w) + G(w)

Y = 3 - B, which means that Y(w) = 3 - B(w)

Here, one can figure out that X and Y are actually the same random variable, since R(w) + G(w) + B(w) = 3 for every outcome w.

**Example:** We use some sort program to sort files of
100 names into alphabetical order. Suppose that the names are drawn
(with replacement) from a set of 1000 names, with all possible files
drawn this way being equally likely. We can define random variables
such as

C = Number of comparisons of names that the sort program makes

T = Time in seconds that the sort program takes to run

Note that the sample space for this problem is enormous (1000^100 possible files), so explicitly tabulating the values of these random variables is out of the question.

**Definition:** The *expected value* of a random variable,
*X*, which is also called its *expectation*, *mean*, or
*average value*, is defined to be

E(X) = SUM(over x) x P(X=x)

Here, the sum is over all possible values of the random variable X.

One can visualize the expected value of a random variable by imagining
a graph in which above each value for *x* (on the horizontal axis) we
put a bar with height proportional to the probability for that value
(ie, this is a graph of the probability mass function). Then the
expected value of *X* is the point on the horizontal axis where
the bars would balance if we supported the graph at that point.

**Example:** For the example above of flipping four coins, the
expected values of *X* and *Y* are

E(X) = 0 x (1/16) + 1 x (4/16) + 2 x (6/16) + 3 x (4/16) + 4 x (1/16) = 2

E(Y) = 0 x (1/2) + 1 x (1/4) + 2 x (1/8) + 3 x (1/16) + 4 x (1/16) = 15/16
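As a check, a short Python sketch (my own, not part of the notes) computes these expectations from the probability mass function tables above, using exact fractions:

```python
# Compute E(X) and E(Y) from their pmf tables.
from fractions import Fraction as F

pmf_X = {0: F(1, 16), 1: F(4, 16), 2: F(6, 16), 3: F(4, 16), 4: F(1, 16)}
pmf_Y = {0: F(1, 2),  1: F(1, 4),  2: F(1, 8),  3: F(1, 16), 4: F(1, 16)}

def E(pmf):
    """Expected value: SUM(over x) x P(X=x)."""
    return sum(x * p for x, p in pmf.items())

print(E(pmf_X))   # 2
print(E(pmf_Y))   # 15/16
```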

**Example:** For the example with the sort program above, E(*C*)
is the average number of comparisons done by the sort program, when run
on a random input file of 100 names, and E(*T*) is its average run
time on such input files. These might be quantities that we are interested
in estimating.

**Theorem:** If *Z*=*g*(*X*) for some function *g*,
then

E(Z) = SUM(over x) g(x) P(X=x)

Proof:

E(Z) = SUM(over z) z P(Z=z)

     = SUM(over z) z SUM(over x such that g(x)=z) P(X=x)

     = SUM(over z) SUM(over x such that g(x)=z) z P(X=x)

     = SUM(over z) SUM(over x such that g(x)=z) g(x) P(X=x)

     = SUM(over x) g(x) P(X=x)

**Example:** For the four coin flip example, suppose we define
*Z*=|*X*-2|. We could find E(*Z*) by first finding the
probability mass function of *Z*, which is

 z   P(Z=z)
 0    6/16
 1    8/16
 2    2/16

and then finding E(Z) from this probability mass function:

E(Z) = 0 x (6/16) + 1 x (8/16) + 2 x (2/16) = 12/16

Alternatively, we can use the theorem above, and find E(Z) directly from the probability mass function of X:

E(Z) = |0-2| x (1/16) + |1-2| x (4/16) + |2-2| x (6/16) + |3-2| x (4/16) + |4-2| x (1/16)

= 2 x (1/16) + 1 x (4/16) + 0 x (6/16) + 1 x (4/16) + 2 x (1/16) = 12/16
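Both routes can be checked in a few lines of Python (my own sketch, not part of the notes):

```python
# Verify E(Z) for Z = |X-2| two ways: via the pmf of Z, and via the
# theorem E(g(X)) = SUM(over x) g(x) P(X=x).
from fractions import Fraction as F

pmf_X = {0: F(1, 16), 1: F(4, 16), 2: F(6, 16), 3: F(4, 16), 4: F(1, 16)}
g = lambda x: abs(x - 2)

# Via the theorem, summing over values of X:
E_Z_theorem = sum(g(x) * p for x, p in pmf_X.items())

# Via the pmf of Z, built by collecting the probability on each value of g(X):
pmf_Z = {}
for x, p in pmf_X.items():
    pmf_Z[g(x)] = pmf_Z.get(g(x), 0) + p
E_Z_direct = sum(z * p for z, p in pmf_Z.items())

print(E_Z_theorem, E_Z_direct)   # 3/4 3/4  (= 12/16)
```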

We can also look at events that involve more than one random variable, such as

P( (X=3) U (Y=2) )

P( (X=3) intersection (Y=2) )

Intersections of such events are used so often that we use an abbreviated notation, in which the probability of the intersection above is written as

P(X=3, Y=2)

So comma means "and" in this context.

We can figure out these probabilities from the earlier tables:

P(X=3,Y=2) = P( { HHTH } ) = 1/16

P( (X=3) U (Y=2) ) = P(X=3) + P(Y=2) - P(X=3,Y=2) = 4/16 + 1/8 - 1/16 = 5/16

The *joint distribution* for two random variables specifies
the probabilities for all events involving only those two
random variables. When the random variables have finite
range, the distribution can be specified as the joint
probability mass function.

For the four coin flip example, the joint distribution of *X*
and *Y* will give P(*X*=*x*,*Y*=*y*) for all
*x* and *y*. Here is a table of these probabilities:

          y=0    y=1    y=2    y=3    y=4
       -------------------------------------
 x=0  |  1/16    0      0      0      0    |  1/16
 x=1  |  3/16   1/16    0      0      0    |  4/16
 x=2  |  3/16   2/16   1/16    0      0    |  6/16
 x=3  |  1/16   1/16   1/16   1/16    0    |  4/16
 x=4  |   0      0      0      0     1/16  |  1/16
       -------------------------------------
         8/16   4/16   2/16   1/16   1/16

The joint probabilities are in the main area. I have written the marginal probability mass function for X in the right margin (the row sums), and the marginal probability mass function for Y in the bottom margin (the column sums).
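The joint probability mass function can be built by counting outcomes, and the marginals recovered as row and column sums. A Python sketch (my own check, not part of the notes):

```python
# Build the joint pmf of X and Y over the four-flip sample space.
from collections import Counter
from fractions import Fraction
from itertools import product

outcomes = ["".join(w) for w in product("HT", repeat=4)]
X = lambda w: w.count("H")                  # total heads
Y = lambda w: len(w) - len(w.lstrip("H"))   # initial heads

counts = Counter((X(w), Y(w)) for w in outcomes)    # (x, y) -> # of outcomes
P = {xy: Fraction(c, 16) for xy, c in counts.items()}

print(P[(3, 2)])                                    # P(X=3, Y=2) = 1/16
marg_X2 = sum(p for (x, y), p in P.items() if x == 2)
print(marg_X2)                                      # P(X=2) = 6/16 = 3/8
```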

We can find the conditional probability for a random variable to have some value given that another random variable has some value by applying the definition of conditional probability.

For the four coin flip example, we can find

P(Y=1|X=2) = P(Y=1,X=2) / P(X=2) = (2/16) / (6/16) = 1/3

A table of conditional probabilities given some event specifies
the *conditional distribution* given that event.

For the four coin flip example, the conditional distribution
of *Y* given *X*=2 is as follows:

 y   P(Y=y|X=2)
 0      1/2
 1      1/3
 2      1/6
 3       0
 4       0

We can get this from the joint probability table by taking the numbers in the row for x=2 and dividing each of them by their sum, P(X=2) = 6/16.
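That row-normalization step can be written out in a few lines (my own sketch, not part of the notes), starting from the x=2 row of the joint table:

```python
# Conditional pmf of Y given X=2: divide the x=2 row of the joint table
# by its sum, P(X=2).
from fractions import Fraction as F

row_x2 = {0: F(3, 16), 1: F(2, 16), 2: F(1, 16), 3: F(0, 16), 4: F(0, 16)}
p_x2 = sum(row_x2.values())                 # P(X=2) = 6/16

cond = {y: p / p_x2 for y, p in row_x2.items()}
for y, p in cond.items():
    print(y, p)    # 0 1/2, 1 1/3, 2 1/6, 3 0, 4 0
```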

Random variables *X* and *Y* are *independent* if

P(X=x, Y=y) = P(X=x) P(Y=y)   for all x and y

Similarly, random variables *X*, *Y*, and *Z* are *mutually independent* if

P(X=x, Y=y, Z=z) = P(X=x) P(Y=y) P(Z=z)   for all x, y, and z

Mutual independence of random variables *X*, *Y*, and *Z*
implies mutual independence of events *A*, *B*, and *C*
if *A* is defined only in terms of *X*, *B* is defined
only in terms of *Y*, and *C* is defined only in terms of *Z*
(eg, if *A* is (*X*<4), *B* is (*Y*=2), and
*C* is (2 < *Z* < 7)).

Also, mutual independence of *X*, *Y*, and *Z*
implies mutual independence of *U*=*f*(*X*),
*V*=*g*(*Y*), and *W*=*h*(*Z*)
for any functions *f*, *g*, and *h*.

Joint probabilities can always be factored into marginal and conditional probabilities using the multiplication rule:

P(X=x, Y=y, Z=z) = P(X=x) P(Y=y|X=x) P(Z=z|X=x, Y=y)

or, looking at the variables in one of the other possible orders,

P(X=x, Y=y, Z=z) = P(Z=z) P(Y=y|Z=z) P(X=x|Y=y, Z=z)

One use of this is as a way of specifying joint probabilities for a problem where what we naturally know is some marginal and conditional probabilities.

Random variables *X* and *Y* are *conditionally independent* given *Z* if

P(X=x, Y=y | Z=z) = P(X=x|Z=z) P(Y=y|Z=z)   for all x, y, and z with P(Z=z) > 0

**Theorem:** If random variables *X* and *Y* are independent
given *Z*, then for any *x*, *y*, *z* with
P(*Z*=*z*) > 0

P(X=x|Y=y,Z=z) = P(X=x|Z=z)

Proof:

P(X=x|Y=y,Z=z) = P(X=x,Y=y,Z=z) / P(Y=y,Z=z)

= [ P(X=x,Y=y|Z=z) P(Z=z) ] / [ P(Y=y|Z=z) P(Z=z) ]

= P(X=x,Y=y|Z=z) / P(Y=y|Z=z)

= [ P(X=x|Z=z) P(Y=y|Z=z) ] / P(Y=y|Z=z)

= P(X=x|Z=z)

This theorem may let us simplify some factors in the multiplication rule when we know that some random variables are conditionally independent given some other random variables.
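The theorem can also be checked numerically. Here is a small Python sketch (my own construction, with made-up conditional probabilities, not from the notes): build a joint pmf in which X and Y are independent given Z, then verify that P(X=x | Y=y, Z=z) = P(X=x | Z=z) for every x, y, z.

```python
# Numerical check of the conditional-independence theorem on a toy example.
from fractions import Fraction as F

# Arbitrary (assumed) marginal for Z and conditionals for X and Y given Z:
pZ = {0: F(1, 3), 1: F(2, 3)}
pX_given_Z = {0: {0: F(1, 4), 1: F(3, 4)}, 1: {0: F(1, 2), 1: F(1, 2)}}
pY_given_Z = {0: {0: F(2, 5), 1: F(3, 5)}, 1: {0: F(1, 5), 1: F(4, 5)}}

# Joint pmf via the multiplication rule, using the conditional-independence
# assumption P(X=x, Y=y | Z=z) = P(X=x|Z=z) P(Y=y|Z=z):
joint = {(x, y, z): pZ[z] * pX_given_Z[z][x] * pY_given_Z[z][y]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

for x in (0, 1):
    for y in (0, 1):
        for z in (0, 1):
            p_yz = sum(joint[(xx, y, z)] for xx in (0, 1))  # P(Y=y, Z=z)
            lhs = joint[(x, y, z)] / p_yz                   # P(X=x | Y=y, Z=z)
            rhs = pX_given_Z[z][x]                          # P(X=x | Z=z)
            assert lhs == rhs

print("P(X=x|Y=y,Z=z) = P(X=x|Z=z) holds for all x, y, z in this example")
```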