STA 437 / 1005 - Methods for Multivariate Data (Sep-Dec 2009)

Notes: You can come by to pick up the marked assignment 3 on Monday, Jan 18, from 3:10 to 4:20. The solution is below. Note that my office has moved to SS6026A.

STA 437 is the undergraduate version of this course. STA 1005 is the graduate version, which may be taken for credit only by graduate students who are not in Statistics.

Instructor:

Radford Neal
Phone: (416) 978-4970
Office: SS6026A
Email: radford@stat.utoronto.ca

Office hours: Thursdays, 4:40pm to 5:30pm, in SS6016A.

Lectures:

Mondays 6:10pm to 9:00pm, from September 14 to November 30, except for October 12 (Thanksgiving), plus Wednesday November 11 from 6:10pm to 9:00pm, which makes up for the lecture missed on Thanksgiving.

Lectures are in Sidney Smith Hall, 100 St. George Street, room 2110.

Textbook:

R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 6th edition.

You can get the datasets used as examples in the text, plus some proofs omitted from the book, from this web page. Click on "Take a closer look".

Computing:

Some assignment questions will require use of the R statistics package. You can use this package on the CQUEST computer system, or install it for free on your own computer (MS Windows, Macintosh, or Linux).

You'll be able to get a CQUEST account once classes start at www.cquest.utoronto.ca.

The R package and documentation are at www.r-project.org.

Evaluation:

30% Three assignments, worth 6%, 12%, 12%.
25% Mid-term test, scheduled for Oct. 26, BA 1170, 6:10-8:30.
45% Final exam, scheduled by the Faculty during the exam period.

The first assignment has pen-and-paper exercises, and is due Oct. 8.

The second and third assignments will involve substantial data analysis using R. The third will be submitted in two parts. The first part will be your solution. I will then release a model solution. A week later, you will hand in a critique of your solution, identifying what you think you did right and wrong. The grade for the assignment will be based on both parts.

Assignments:

NOTE: The assignments are worth 6%, 12%, and 12% of the course grade, as said above. Ignore any contrary information on the assignment handouts.

Assignment 1: handout, solutions.

Assignment 2: handout, data set 1, data set 1 description, data set 2, data set 2 description, hints on using R.
Note: There's a typo in the assignment. Where it says "height to the power p divided by weight to the power q", it should read "weight to the power p divided by height to the power q". Also, you may find it useful to use the "apply" function with a second argument of 1 in order to find means or medians of a set of variables. And you may find it useful to select a subset of observations in a data frame with something like d[d$class==2,].
Model solutions: data set 1, data set 2.
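
The two R hints in the Assignment 2 note can be illustrated with a small sketch. The data frame d and its columns (class, x1, x2, x3) are made up for illustration and are not the assignment data; only "apply" with a second argument of 1 and the d[d$class==2,] idiom come from the note above.

```r
# Hypothetical data frame; the names d, class, x1, x2, x3 are illustrative only.
d <- data.frame(class = c(1, 2, 2, 1),
                x1 = c(1, 2, 3, 4),
                x2 = c(2, 4, 6, 8),
                x3 = c(3, 9, 6, 12))

# The set of variables to summarize for each observation.
vars <- c("x1", "x2", "x3")

# A second argument of 1 tells apply to work across rows, so each
# observation gets one mean (or median) over the chosen variables.
row_means <- apply(d[, vars], 1, mean)
row_medians <- apply(d[, vars], 1, median)

# Keep only the observations whose class is 2 (all columns retained).
d2 <- d[d$class == 2, ]
```

Note that a second argument of 2 would instead summarize each variable over all observations, which is a different quantity.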

Assignment 3: handout, data set 1 (as modified), data set 1 description, data set 2, data set 2 description, hints on using R.
Model solutions: data set 1, data set 2.

Test:

Held October 26, in Bahen room 1170, from 6:10 to 8:30.

Here are the questions from last year's midterm test. Note that the last question is on material that won't be covered on this year's mid-term test.

The test will cover all material from lectures so far (and related material from the book). It will be closed book: no books or notes. Calculators will not be needed. I will provide any really complicated formulas needed, but you should remember the simple ones.

Here is the test paper and the answers.

Final exam:

Held Wednesday, December 9, 7-10pm, in BN3 (Benson Building).

The final exam will cover the whole course, but with more emphasis on material covered since the mid-term test.

Here is the front page of the final exam, which includes some formulas for reference. Formulas substantially more complex will not be needed. You are expected to remember formulas that are simpler and of central importance.

Here is last year's final exam. Note: In question 2(a), "Give two other" should be "Give one other".

Lecture topics:

We covered most of Chapters 1 to 5, part of Chapters 6 and 7, most of Chapters 8 and 9, and part of Chapter 11.

You should now have read Chapters 1, 2, 3, 4, 5, 6, 8, 9, and 11 of the text, and we've also briefly looked at Chapter 7.

Here are the topics and sections covered each week (this list may not be complete).

Sep. 14: Topics and applications of multivariate analysis, Data organization, Sample statistics, Scatterplots, Demonstration of R and of plots for data analysis. R scripts used are here and here. Text: 1.1-1.4.

Sep. 21: Review of sample statistics; Meaning of a random sample; Means, covariances, correlations for random vectors; Estimation of mean, covariance, etc. from sample statistics; Effects of linear transformations; Start of discussion of normal distribution. Text: 2.5-2.6, 3.3, 3.6, 4.1-4.2.

Sep. 28: Multivariate normal distributions, MVN density function, positive definite matrices, properties of multivariate normal; Eigenvalues and eigenvectors, especially of covariance matrices; Distribution of sample mean and covariance. Text: 2.3, 4.1-4.2, 4.4.

Oct. 5: Central Limit Theorem; Maximum likelihood estimation; Statistical distance; Assessing normality and finding outliers, QQ plots; Transformations to make data closer to being normally-distributed. Text: 1.5, 4.3, 4.5-4.8.

Oct. 12: THANKSGIVING. No lecture.

Oct. 19: Review of testing hypotheses about mean of univariate normal distribution with t statistic; Introduction to testing hypotheses about mean of multivariate normal distribution with T² statistic. Text: 5.1-5.2.

Oct. 26: MIDTERM TEST. No lecture.

Nov. 2: Answers to the midterm test questions; T² test as a likelihood ratio test; confidence regions from T² test; simultaneous confidence intervals from T² confidence region and from Bonferroni correction. Text: 5.3-5.4.

Nov. 9: Principal component analysis; Introduction to factor analysis. Text: 8.1-8.4, 9.1-9.2.

Nov. 11: More on factor analysis; Demo of PCA and factor analysis in R. Text: 9.3-9.5.

Nov. 16: Comparing means: paired data, repeated measures designs, two samples with equal and unequal covariance. Text: 6.1-6.3.

Nov. 23: Multivariate Analysis of Variance (MANOVA): one-way, brief look at two-way; brief look at random effects models (not in textbook, not on exam); Brief look at multivariate regression. Text: 6.4-6.7, brief look at material from Chapter 7.

Nov. 30: Classification: making decisions based on Bayes Rule and mis-classification costs, classification when the classes have multivariate normal distributions, brief mention of logistic regression. Brief mention of control charts. Text: 11.1-11.3, 11.7, 5.6.

Web page for previous version of this course: