We say that random variables X1, X2, ..., Xn are independent and identically distributed (abbreviated as i.i.d.) if all the Xi are mutually independent, and they all have the same distribution.
Examples: Put m balls with numbers written on them in an urn. Draw n balls from the urn with replacement, and let Xi be the number on the ith ball. Then X1, X2, ..., Xn will be i.i.d.
But if we draw the balls without replacement, X1, X2, ..., Xn will not be i.i.d. - they will all have the same distribution, but will not be independent.
If we draw the balls with replacement, but let Xi be i times the number on the ith ball, then X1, X2, ..., Xn will not be i.i.d. - they will be independent, but they will have different distributions.
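A quick simulation sketch of the first two examples (the urn size and ball numbers here are made up for illustration): with replacement, the frequency of the pair (X1=1, X2=1) matches the product of the marginal frequencies, as independence requires; without replacement, it does not, since the same ball cannot be drawn twice.

```python
import random

random.seed(1)
balls = [1, 2, 3, 4, 5]          # m = 5 numbered balls (example values)
trials = 100_000

# Draw two balls with and without replacement many times.
with_rep = [(random.choice(balls), random.choice(balls)) for _ in range(trials)]
without_rep = [tuple(random.sample(balls, 2)) for _ in range(trials)]

def freq(pairs, cond):
    """Fraction of drawn pairs (x1, x2) satisfying cond."""
    return sum(cond(x1, x2) for x1, x2 in pairs) / len(pairs)

for name, pairs in [("with replacement", with_rep),
                    ("without replacement", without_rep)]:
    p_both = freq(pairs, lambda x1, x2: x1 == 1 and x2 == 1)
    p1 = freq(pairs, lambda x1, x2: x1 == 1)
    p2 = freq(pairs, lambda x1, x2: x2 == 1)
    # Independence would make p_both close to p1 * p2.
    print(name, round(p_both, 4), round(p1 * p2, 4))
```

With replacement both printed numbers are near 1/25 = 0.04; without replacement the first is exactly 0 while the second stays near 0.04.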
Theorem: If X1, X2, ..., Xn are i.i.d. with the same distribution as a random variable X, then
E(X1 + X2 + ... + Xn) = n E(X)
Var(X1 + X2 + ... + Xn) = n Var(X)
SD(X1 + X2 + ... + Xn) = sqrt(n) SD(X)
We can view a random variable Y that has the binomial(n,p) distribution as the sum of n i.i.d. random variables that all have the Bernoulli(p) distribution. That is, we let
Y = X1 + X2 + ... + Xn
where X1, X2, ..., Xn are i.i.d. with the Bernoulli(p) distribution.
Recall that the mean of a Bernoulli(p) random variable is p. We can also compute that the variance of a Bernoulli(p) random variable is
p (1-p)^2 + (1-p) (0-p)^2 = p (1-p)
From this and the theorem above about sums of i.i.d. random variables, we can conclude that if Y has the binomial(n,p) distribution, then E(Y) = np and Var(Y) = np(1-p).
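A simulation sketch of this conclusion (the particular values n = 20 and p = 0.3 are just examples): generate binomial values as sums of n i.i.d. Bernoulli(p) variables and compare the sample mean and variance with np and np(1-p).

```python
import random, statistics

random.seed(2)
n, p = 20, 0.3
trials = 100_000

# Each Y is the number of successes among n independent Bernoulli(p) trials.
ys = [sum(1 for _ in range(n) if random.random() < p) for _ in range(trials)]

print(statistics.mean(ys))       # close to n*p = 6
print(statistics.variance(ys))   # close to n*p*(1-p) = 4.2
```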
If X1, X2, ..., Xn are i.i.d., all having the same distribution as the random variable X, then the sample mean, BAR-X = (X1 + X2 + ... + Xn) / n, satisfies
E(BAR-X) = (1/n) E(X1 + X2 + ... + Xn) = (1/n) n E(X) = E(X)
We say that BAR-X is an "unbiased" estimate of E(X), meaning that it's "right on average".
But being unbiased doesn't guarantee that BAR-X is usually close to E(X). To see how close it is likely to be, we need to look at the standard deviation of BAR-X:
SD(BAR-X) = (1/n) SD(X1 + X2 + ... + Xn) = (1/n) sqrt(n) SD(X) = (1/sqrt(n)) SD(X)
So as n gets bigger, the standard deviation of BAR-X gets smaller, and approaches zero, though because of the square root, perhaps not as fast as you might have hoped.
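The 1/sqrt(n) rate can be seen directly by simulation. The sketch below again uses die rolls, where SD(X) = sqrt(35/12) ~ 1.708, and estimates SD(BAR-X) for several values of n:

```python
import random, statistics

random.seed(3)

# For each n, estimate SD(BAR-X) from 20,000 simulated sample means,
# and compare with the theoretical value SD(X)/sqrt(n).
results = {}
for n in [1, 4, 16, 64]:
    means = [statistics.mean(random.randint(1, 6) for _ in range(n))
             for _ in range(20_000)]
    results[n] = statistics.stdev(means)
    print(n, round(results[n], 3), round(1.708 / n**0.5, 3))
```

Quadrupling n only halves the standard deviation of BAR-X.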
The mean of a sequence of n i.i.d. random variables can be proved to approach (in a certain sense) the expected value of (any) one of the variables, as n goes to infinity. This is called the Law of Large Numbers. There are several forms of this law. The one I prove below is the "weak" form, and is restricted to random variables with finite variance (though the law can be proved without assuming finite variance using more sophisticated techniques).
Theorem: Let X be a random variable for which the expectation, E(X), exists, and for which the variance, Var(X), is finite. If the infinite sequence of random variables X1, X2, ... are i.i.d., all having the same distribution as the random variable X, then for any e > 0,
LIMIT(n->oo) P(|BAR-Xn - E(X)| > e) = 0
where BAR-Xn = (X1 + X2 + ... + Xn) / n.
Proof: We need to show that for any e > 0 and any d > 0, there exists some integer n* such that for all n > n*,
P(|BAR-Xn - E(X)| > e) < d
This will hold if we set n* to Var(X)/(d e^2). Chebyshev's inequality says that
P(|BAR-Xn - E(X)| >= e) = P(|BAR-Xn - E(X)| >= (e / SD(BAR-Xn)) SD(BAR-Xn)) <= (SD(BAR-Xn) / e)^2
Noting that SD(BAR-Xn) = SD(X)/sqrt(n), we see that for any n > n*,
P(|BAR-Xn - E(X)| > e) <= (SD(BAR-Xn) / e)^2 < ((SD(X)/sqrt(n*)) / e)^2 = Var(X) / (n* e^2) = d
which is what we needed to prove.
Note that the Law of Large Numbers implies that the fraction of times an event occurs in many repetitions of a situation converges to the probability of that event, since we can define a random variable X to be 1 if event A occurs and 0 otherwise, in which case P(A)=E(X). The Law of Large Numbers therefore provides a justification of sorts for the interpretation of probability as relative frequency in many repetitions.
The Law of Large Numbers is also a justification for using computer simulation to estimate probabilities and expectations.
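A minimal example of such a simulation: estimating P(sum of two dice >= 10) by relative frequency, which the Law of Large Numbers says will converge to the exact probability 6/36 = 1/6.

```python
import random

random.seed(4)
trials = 1_000_000

# Count how often the sum of two fair dice is at least 10.
hits = sum(random.randint(1, 6) + random.randint(1, 6) >= 10
           for _ in range(trials))

print(hits / trials)    # close to 1/6 ~ 0.1667
```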
A directed graphical model (also called a Bayesian network or a belief network) is a way of specifying a joint probability distribution for several random variables via a simplification of the multiplication rule.
Recall that any joint probability mass function, say for X, Y, and Z, can be written (in notation abbreviating X=x to just x) using the multiplication rule as
P(x,y,z) = P(x) P(y|x) P(z|x,y)
If Z is conditionally independent (C.I.) of X given Y, we can simplify P(z|x,y) to P(z|y). When we have many random variables, such simplifications can make the task of specifying a probability distribution much easier.
A directed graphical model has a node for every random variable, with arrows from some nodes to other nodes, with there being no directed cycles. Since there are no cycles, there is always at least one way to order the variables so that all the arrows point forward. If we write the joint probability as above using the multiplication rule in this order, the absence of an arrow from X to a later node Y indicates that X is C.I. of Y given the parents of Y, so we can omit x from the factor P(y|...,x,...), thereby simplifying it.
Example: At some ski resort,
X = # of skiers on the hill one day
Y = # of skiers who break a leg that day
Z = total cost of medical treatment for broken legs that day
It may be reasonable to assume that the total medical cost depends only on the number of skiers who break their leg, and not on the number of skiers, except that the number of skiers affects how many skiers break their leg. In other words, we might assume that X is C.I. of Z given Y. This is expressed by the directed graphical model below:
X ---> Y ---> Z
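This kind of chain can be checked exactly with small probability tables. In the sketch below all variables are binary and the tables are made up; the joint is built as P(x) P(y|x) P(z|y), and we then verify from the joint that P(z | x, y) does not depend on x.

```python
# Made-up conditional probability tables for the chain X ---> Y ---> Z.
p_x = {0: 0.7, 1: 0.3}
p_y_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}   # p_y_given_x[x][y]
p_z_given_y = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}   # p_z_given_y[y][z]

# Joint pmf from the factorization P(x) P(y|x) P(z|y).
joint = {(x, y, z): p_x[x] * p_y_given_x[x][y] * p_z_given_y[y][z]
         for x in (0, 1) for y in (0, 1) for z in (0, 1)}

def cond_z(x, y, z):
    """P(Z=z | X=x, Y=y), computed from the joint pmf."""
    total = joint[x, y, 0] + joint[x, y, 1]
    return joint[x, y, z] / total

for y in (0, 1):
    for z in (0, 1):
        # The two printed values agree: Z is C.I. of X given Y.
        print(y, z, cond_z(0, y, z), cond_z(1, y, z))
```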
Example: We roll a red die and a green die, and then flip a coin as many times as the sum of the numbers showing on the red and green dice. We define
R = number showing on the red die
G = number showing on the green die
X = R + G
H = number of heads when the coin is flipped X times
This situation is described by the directed graphical model below:

R --\
     --> X ---> H
G --/

This means that we can write the joint probability mass function for R, G, X, and H as
P(r,g,x,h) = P(r) P(g) P(x|r,g) P(h|x)
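Because each factor conditions only on a variable's parents, we can sample from this model by sampling parents before children (often called ancestral sampling). A simulation sketch, assuming a fair coin, which also checks that E(H) = E(X)/2 = 3.5:

```python
import random, statistics

random.seed(5)
trials = 200_000

# Ancestral sampling: sample R and G, then X = R + G, then H given X.
hs = []
for _ in range(trials):
    r = random.randint(1, 6)                          # P(r)
    g = random.randint(1, 6)                          # P(g)
    x = r + g                                         # P(x|r,g) is degenerate
    h = sum(random.randint(0, 1) for _ in range(x))   # P(h|x): x fair flips
    hs.append(h)

print(statistics.mean(hs))   # close to E(X)/2 = 3.5
```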
If we assume that some directed graphical model is valid, we can deduce various independence and conditional independence relationships from it.
A theorem (which I won't prove here) says that X is C.I. of Y given Z if X is "d-separated" from Y given Z, which means that all paths from X to Y in the graph are "blocked" given Z. A path is any route from X to Y, travelling along arrows in either direction. A path is blocked if any node along it blocks the path. There are three possible situations for a node W along a path:
---> W --->   This is blocked if W is the variable Z that is conditioned on.
<--- W --->   This is also blocked if W is the variable Z that is conditioned on.
---> W <---   This is blocked if W is not the variable Z that is conditioned on, and Z is also not a descendant of W.
Note that even if X is not d-separated from Y given Z, it could still be the case that X is C.I. of Y given Z, but that will happen only if the exact probabilities in the distribution lead to their being conditionally independent - the structure of the graph doesn't imply that they are C.I.
We can also define d-separation given more than one variable - a node in a path of the form ---> W ---> or <--- W ---> blocks the path if W is any of the variables conditioned on, and a node of the form ---> W <--- blocks the path only if neither W nor any of its descendants is conditioned on.
We can find unconditional independence relationships in terms of d-separation with no variables conditioned on. Then nodes of the form ---> W ---> and <--- W ---> never block a path, and a node of the form ---> W <--- always blocks a path.
For further information, you might want to read this tutorial by Richard Scheines.