We say that random variables *X*_{1},
*X*_{2}, ..., *X*_{n} are
*independent and identically distributed* (abbreviated as i.i.d.)
if all the *X*_{i} are mutually independent, and
they all have the same distribution.

**Examples**: Put *m* balls with numbers written on them in
an urn. Draw *n* balls from the urn *with replacement*, and
let *X*_{i} be the number on the *i*th ball.
Then *X*_{1}, *X*_{2}, ...,
*X*_{n} will be i.i.d.

But if we draw the balls *without replacement*,
*X*_{1}, *X*_{2}, ...,
*X*_{n} will not be i.i.d. - they will all have
the same distribution, but will not be independent.

If we draw the balls with replacement, but let
*X*_{i} be *i* times the number on the
*i*th ball, then *X*_{1}, *X*_{2}, ...,
*X*_{n} will not be i.i.d. - they will be
independent, but they will have different distributions.
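The two sampling schemes are easy to sketch in Python (the urn contents and number of draws below are my own illustrative choices):

```python
import random

random.seed(0)
balls = [1, 2, 3, 4, 5]   # numbers on the m = 5 balls in the urn
n = 3                     # number of draws

# With replacement: each draw is an independent pick from the full urn,
# so the draws are i.i.d.
with_repl = [random.choice(balls) for _ in range(n)]

# Without replacement: each draw still has the same marginal distribution,
# but the draws are dependent - a drawn ball cannot appear again.
without_repl = random.sample(balls, n)

print(with_repl, without_repl)
```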

**Theorem:** If *X*_{1}, *X*_{2}, ...,
*X*_{n} are i.i.d. with the same distribution as a
random variable *X*, then

E(X_{1}+X_{2}+ ... +X_{n}) = n E(X)

Var(X_{1}+X_{2}+ ... +X_{n}) = n Var(X)

SD(X_{1}+X_{2}+ ... +X_{n}) = sqrt(n) SD(X)
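The theorem can be checked by simulation; a sketch, using a fair die (so E(X) = 3.5 and Var(X) = 35/12) with my own choices of *n* and the number of trials:

```python
import random
import statistics

random.seed(1)
values = [1, 2, 3, 4, 5, 6]   # a fair die: E(X) = 3.5, Var(X) = 35/12
n, trials = 10, 100_000

# Simulate many sums of n i.i.d. rolls
sums = [sum(random.choice(values) for _ in range(n)) for _ in range(trials)]

mean_sum = statistics.fmean(sums)      # should be near n * E(X) = 35
var_sum = statistics.pvariance(sums)   # should be near n * Var(X) = 350/12
print(mean_sum, var_sum)
```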

We can view a random variable *Y* that has the
binomial(*n*,*p*) distribution as the sum of
*n* i.i.d. random variables that all have the
Bernoulli(*p*) distribution. That is, we let

Y = X_{1}+X_{2}+ ... +X_{n}

where each *X*_{i} has the Bernoulli(*p*) distribution.

Recall that the mean of a Bernoulli(*p*) random variable is *p*.
We can also compute that the variance of a Bernoulli(*p*) random
variable is

p(1-p)^{2} + (1-p)(0-p)^{2} = p(1-p)

From this and the theorem above about sums of i.i.d. random
variables, we can conclude that if *Y* has the
binomial(*n*,*p*) distribution, then E(*Y*) = *np*
and Var(*Y*) = *np*(1-*p*).
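These formulas can be confirmed numerically from the binomial pmf itself; a sketch with arbitrary choices n = 10 and p = 0.3:

```python
from math import comb

n, p = 10, 0.3

# Binomial(n, p) probability mass function, computed directly
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]

mean = sum(k * pk for k, pk in enumerate(pmf))              # equals n*p = 3.0
var = sum((k - mean)**2 * pk for k, pk in enumerate(pmf))   # equals n*p*(1-p) = 2.1
print(mean, var)
```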

If *X*_{1}, *X*_{2}, ...,
*X*_{n} are i.i.d., all having the same
distribution as the random variable *X*, we can estimate E(*X*) by
their sample mean, X̄ = (X_{1}+X_{2}+ ... +X_{n}) / n. Then

E(X̄) = (1/n) E(X_{1}+X_{2}+ ... +X_{n}) = (1/n) n E(X) = E(X)

We say that X̄ is an *unbiased* estimate of E(*X*).

But being unbiased doesn't guarantee that X̄ is usually
close to E(*X*). To see how close it is likely to be, we need
to look at the standard deviation of X̄:

SD(X̄) = (1/n) SD(X_{1}+X_{2}+ ... +X_{n}) = (1/n) sqrt(n) SD(X) = (1/sqrt(n)) SD(X)

So as *n* increases, SD(X̄) shrinks in proportion to 1/sqrt(n), and
X̄ becomes more likely to be close to E(*X*).
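A simulation sketch of this 1/sqrt(n) behaviour, using Bernoulli(1/2) variables (so SD(X) = 1/2); the sample sizes and trial count are arbitrary:

```python
import random
import statistics

random.seed(2)
trials = 20_000

# SD of the sample mean of n Bernoulli(1/2) variables should be
# close to SD(X)/sqrt(n) = 0.5/sqrt(n).
sds = {}
for n in (4, 16, 64):
    means = [statistics.fmean(random.randint(0, 1) for _ in range(n))
             for _ in range(trials)]
    sds[n] = statistics.pstdev(means)

for n, sd in sds.items():
    print(n, sd, 0.5 / n**0.5)
```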

The mean of a sequence of *n* i.i.d. random variables can be
proved to approach (in a certain sense) the expected value of (any)
one of the variables, as *n* goes to infinity. This is called
the *Law of Large Numbers*. There are several forms of this law.
The one I prove below is the "weak" form, and is restricted to random
variables with finite variance (though the law can be proved without
assuming finite variance using more sophisticated techniques).

**Theorem:** Let *X* be a random variable for which the
expectation, E(X), exists, and for which the variance, Var(*X*),
is finite. If the infinite sequence of random variables
*X*_{1}, *X*_{2}, ... are i.i.d., all
having the same distribution as the random variable *X*, then
for any *e* > 0,

lim_{n -> oo} P(|X̄_{n} - E(X)| > e) = 0

where X̄_{n} = (X_{1}+X_{2}+ ... +X_{n}) / n.

**Proof:** We need to show that for any *e* > 0 and any
*d* > 0, there exists some integer *n*^{*} such that
for all *n* > *n*^{*},

P(|X̄_{n} - E(X)| > e) < d

This will hold if we set *n*^{*} = Var(X) / (d e^{2}), as follows.
Noting that SD(X̄_{n}) = SD(X) / sqrt(n), Chebyshev's Inequality gives

P(|X̄_{n} - E(X)| >= e) = P(|X̄_{n} - E(X)| >= (e / SD(X̄_{n})) SD(X̄_{n})) <= (SD(X̄_{n}) / e)^{2}

So for any *n* > *n*^{*},

P(|X̄_{n} - E(X)| > e) <= (SD(X̄_{n}) / e)^{2} < ((SD(X) / sqrt(n^{*})) / e)^{2} = Var(X) / (n^{*} e^{2}) = d

which is what we needed to prove.
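The bound in the proof can be tried out numerically; a sketch using a fair coin (Var(X) = 1/4) with my own choices of *e* and *d*:

```python
import random
import statistics

random.seed(5)

# For a fair coin, Var(X) = 1/4.  With e = 0.1 and d = 0.1,
# the proof's choice is n* = Var(X)/(d e^2) = 250.
e, d = 0.1, 0.1
var_x = 0.25
n_star = var_x / (d * e * e)
n = 2 * int(n_star)   # any n > n* will do

# Estimate P(|mean - 0.5| > e) over many independent experiments
trials = 2000
bad = sum(abs(statistics.fmean(random.randint(0, 1) for _ in range(n)) - 0.5) > e
          for _ in range(trials))
print(bad / trials)   # should be below d = 0.1
```

(The empirical probability is typically far below *d*, since Chebyshev's bound is loose.)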

Note that the Law of Large Numbers implies that the fraction of
times an event occurs in many repetitions of a situation converges
to the probability of that event, since we can define a random
variable *X* to be 1 if event *A* occurs and 0 otherwise,
in which case P(*A*)=E(*X*).
The Law of Large Numbers therefore provides a justification of sorts
for the interpretation of probability as relative frequency in
many repetitions.

The Law of Large Numbers is also a justification for using computer simulation to estimate probabilities and expectations.
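For instance, here is a Monte Carlo sketch estimating P(two dice sum to 7) = 1/6 (the event and trial count are my own choices):

```python
import random

random.seed(3)
trials = 200_000

# Define X = 1 if the event occurs and 0 otherwise; by the Law of
# Large Numbers the average of many i.i.d. copies converges to P(event).
hits = sum(random.randint(1, 6) + random.randint(1, 6) == 7
           for _ in range(trials))
estimate = hits / trials
print(estimate)   # close to 1/6 ~ 0.1667
```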

A *directed graphical model* (also called a Bayesian network or
a belief network) is a way of specifying a joint probability
distribution for several random variables via a simplification
of the multiplication rule.

Recall that any joint probability mass function, say for
*X*, *Y*, and *Z* can be written (in notation
abbreviating *X*=*x* to just *x*) using the
multiplication rule as

P(x,y,z) = P(x) P(y|x) P(z|x,y)
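The multiplication rule can be checked mechanically on a small joint pmf; a sketch using a randomly generated distribution over three binary variables:

```python
import itertools
import random

random.seed(4)

# An arbitrary joint pmf for three binary variables X, Y, Z
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

def marginal(fixed):
    """P that the coordinates in `fixed` (index -> value) take those values."""
    return sum(p for o, p in joint.items()
               if all(o[i] == v for i, v in fixed.items()))

# Verify P(x,y,z) = P(x) P(y|x) P(z|x,y) at every outcome
for (x, y, z), p in joint.items():
    px = marginal({0: x})
    py_x = marginal({0: x, 1: y}) / px
    pz_xy = p / marginal({0: x, 1: y})
    assert abs(p - px * py_x * pz_xy) < 1e-12
```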

A directed graphical model has a node for every random variable,
with arrows from some nodes to other nodes, with there being no
directed cycles. Since there are no cycles, there is always at least
one way to order the variables so that all the arrows point forward.
If we write the joint probability as above using the multiplication
rule in this order, the *absence* of an arrow from *X* to a
later node *Y* indicates that *X* is C.I. of *Y* given
the parents of *Y*, so we can omit *x* from the factor
P(y|...,x,...), thereby simplifying it.

**Example:** At some ski resort, let

X = # of skiers on the hill one day

Y = # of skiers who break a leg that day

Z = total cost of medical treatment for broken legs that day

It may be reasonable to assume that the total medical cost depends only
on the number of skiers who break their leg, and not on the number of
skiers, except that the number of skiers affects how many skiers break
their leg. In other words, we might assume that *Z* is C.I. of *X*
given *Y*, which corresponds to the directed graphical model below:

X ---> Y ---> Z

**Example:** We roll a red die and a green die, and then
flip a coin as many times as the sum of the numbers showing on
the red and green dice. We define

R = number showing on the red die

G = number showing on the green die

X = R + G

H = number of heads when the coin is flipped X times.

This situation is described by the directed graphical model below:

R --\
     \
      --> X ---> H
     /
G --/

This means that we can write the joint probability mass function for
these variables as

P(r,g,x,h) = P(r) P(g) P(x|r,g) P(h|x)
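Since P(x|r,g) puts all its mass on x = r+g and P(h|x) is binomial(x, 1/2) for a fair coin, this factorization lets us compute the distribution of H exactly; a sketch in exact rational arithmetic:

```python
from fractions import Fraction
from math import comb

# P(r) = P(g) = 1/6; P(x|r,g) = 1 if x = r+g, else 0;
# P(h|x) = C(x,h) / 2^x for a fair coin.
sixth = Fraction(1, 6)
p_h = {}
for r in range(1, 7):
    for g in range(1, 7):
        x = r + g
        for h in range(x + 1):
            p_h[h] = p_h.get(h, Fraction(0)) + sixth * sixth * Fraction(comb(x, h), 2**x)

# Probabilities sum to 1; also E(H) = E(X)/2 = 7/2 by conditioning on X
assert sum(p_h.values()) == 1
print({h: float(p) for h, p in sorted(p_h.items())})
```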

If we assume that some directed graphical model is valid, we can deduce various independence and conditional independence relationships from it.

A theorem (which I won't prove here) says that *X* is
C.I. of *Y* given *Z* if *X* is "d-separated"
from *Y* given *Z*, which means that all paths from
*X* to *Y* in the graph are "blocked" given *Z*.
A path is any route from *X* to *Y*, travelling along
arrows in either direction. A path is blocked if any node along
it blocks the path. There are three possible situations for
a node *W* along a path:

- ... ---> *W* ---> ...

  This is blocked if *W* is the variable *Z* that is conditioned on.

- ... <--- *W* ---> ...

  This is also blocked if *W* is the variable *Z* that is conditioned on.

- ... ---> *W* <--- ...

  This is blocked if *W* is *not* the variable *Z* that is conditioned on,
  and *Z* is also not a descendant of *W*.

Note that even if *X* is not d-separated from *Y* given
*Z*, it could still be the case that *X* is C.I. of
*Y* given *Z*, but that will happen only if the exact
probabilities in the distribution lead to their being conditionally
independent - the structure of the graph doesn't imply that they are
C.I.

We can also define d-separation given more than one variable - a
node in a path of the form ---> *W* ---> or <--- *W* --->
blocks the path if *W* is any of the variables conditioned on, and
a node of the form ---> *W* <--- blocks the path only if neither
*W* nor any of its descendants is conditioned on.

We can find unconditional independence relationships in terms of
d-separation with no variables conditioned on. Then nodes of the
form ---> *W* ---> and <--- *W* ---> never block a path,
and a node of the form ---> *W* <--- always blocks a path.
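The collider case can be illustrated numerically; a sketch (my own example, not from the text) with two independent fair coins X, Y and W = X XOR Y, giving the graph X ---> W <--- Y:

```python
import itertools

# X and Y are independent fair coins; W = X XOR Y is a collider.
joint = {}
for x, y in itertools.product([0, 1], repeat=2):
    joint[(x, y, x ^ y)] = 0.25

def prob(cond):
    """Probability that the named variables take the given values."""
    return sum(p for (x, y, w), p in joint.items()
               if all({'x': x, 'y': y, 'w': w}[k] == v for k, v in cond.items()))

# With nothing conditioned on, the path through the collider is blocked:
# X and Y are independent.
assert prob({'x': 1, 'y': 1}) == prob({'x': 1}) * prob({'y': 1})

# Conditioning on W unblocks the path: given W = 0, learning Y changes X.
p_x1_given_w0 = prob({'x': 1, 'w': 0}) / prob({'w': 0})
p_x1_given_y1_w0 = prob({'x': 1, 'y': 1, 'w': 0}) / prob({'y': 1, 'w': 0})
print(p_x1_given_w0, p_x1_given_y1_w0)   # 0.5 vs 1.0
```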

For further information, you might want to read this tutorial by Richard Scheines.