STA 414/2104: Statistical Methods for Machine Learning and Data Mining (Jan-Apr 2007)

Instructor: Radford Neal, Office: SS6016A, Phone: (416) 978-4970, Email: radford@stat.utoronto.ca
Office Hours: Thursdays 2:30 to 3:30, in SS 6016A.

Lectures:

Tuesdays, Thursdays, and Fridays, 1:10pm to 2:00pm, from January 9 to April 13, except February 20, 22, and 23 (Reading Week) and April 6 (Good Friday). Classes are in the Galbraith Building, on the east side of St. George Street just north of College Street. On Tuesdays and Thursdays, lectures are in GB221, and on Fridays, lectures are in GB244.

Textbook:

Christopher M. Bishop (2006) Pattern Recognition and Machine Learning, Springer. There's a webpage for the book here.

One important typo in the book: m1 and m2 should be swapped in equation (4.30) and in similar places; alternatively, the inequality for classifying to class 1, stated below equation (4.20), should be reversed.

Assessment:

Three tests: 10% each, tentatively February 6, March 13, and April 13.
Four assignments: 15%, 15%, 15%, 25%
For graduate students in STA 2104, the final assignment will be an individual project.

The assignments are to be done by each student individually. Any discussion of the assignments with other students should be about general issues only, and should not involve giving or receiving written or typed notes.

Projects are also to be done individually, unless you can make a compelling case for doing a group project (no more than two people).

Computing:

Assignments will be done in R. Graduate students will use the Statistics/Biostatistics computer system. Undergraduates will use CQUEST. You can request an account on CQUEST if you're an undergraduate student in this course. You can also use R on your home computer by downloading it for free from http://lib.stat.cmu.edu/R/CRAN. From that site, here is the Introduction to R.
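Once R is installed, a short session like the one below is a quick check that everything works: it simulates a small dataset and fits a linear model by least squares. This is purely illustrative and not part of any assignment.

```r
# Quick R check: simulate a small regression dataset and fit a linear
# model with lm().  Illustrative only -- not assignment code.
set.seed(1)
x <- runif(50)                      # 50 predictor values in [0,1]
t <- 2 + 3*x + rnorm(50, sd=0.1)    # targets: a line plus Gaussian noise
fit <- lm(t ~ x)                    # least-squares linear fit
coef(fit)                           # estimated intercept and slope
```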

What to read in the textbook:

Chapter 1, except you can skip 1.6 if you like.

Chapter 2, whatever parts you don't already know (except 2.3.8, which is irrelevant), but be sure to read 2.5.

Chapter 3.

Chapter 4.

Chapter 5, except you can skip 5.4 and 5.6 if you like.

Chapter 6.

Chapter 7, except you can skip 7.2.

Tests:

The first test, in class on February 6, covered material from Chapters 1, 2, and 3 of the text, as listed above. Here are the answers: Postscript, PDF.

The second test, in class on March 13, covered material from Chapters 4 and 5 of the text, as listed above, plus whatever earlier material is related to this new material. Here are the answers: Postscript, PDF.

The third test, in class on April 13, covered material from Chapter 6. Here are the answers: Postscript, PDF.

Some students have asked for the tests from last year. They do not necessarily have much to do with this year's tests, but if you want to see them, here they are: Test 1: Postscript, PDF, Test 2: Postscript, PDF.

Assignments:

Assignment 1: Postscript, PDF
Data files: a1-x-train, a1-t-train, a1-x-test, a1-t-test
Here are some notes on R for this assignment.
Here is the solution: R functions, R script, output, plots (postscript), plots (PDF), discussion.

Assignment 2: Postscript, PDF
Data files: a2a-x-train, a2a-t-train, a2a-x-test, a2a-t-test, a2b-x-train, a2b-t-train, a2b-x-test, a2b-t-test
Here are some notes on R for this assignment.
Here is the solution: Linear and quadratic discriminant functions, R script, output, plots for dataset A (postscript), plots for dataset A (PDF), plots for dataset B (postscript), plots for dataset B (PDF), discussion.

Assignment 3: Postscript, PDF
Data files: a3-x-train, a3-t-train, a3-x-test, a3-t-test
Here's an R function to plot a digit.
Here is the R function for binary classification with an MLP. Use this version rather than the one on the CSC 411 web page, since a bug has been fixed here.
Here is a solution: modified MLP functions, R script, output, plots. Note that your answers may have varied quite a bit depending on the exact choice of learning rate, etc.
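For intuition about what the MLP functions compute, the following is a minimal one-hidden-layer network for binary classification, trained by batch gradient descent on the cross-entropy error. It is a generic sketch with its own toy data, not the posted course functions, and (as the note above says) the result depends on the learning rate and initialization.

```r
# Minimal MLP sketch for binary classification -- illustrative only,
# not the posted course functions.
sigmoid <- function(z) 1/(1+exp(-z))

mlp_train <- function(X, t, H=3, eta=0.5, iters=5000) {
  p <- ncol(X)
  W1 <- matrix(rnorm(p*H, sd=0.1), p, H); b1 <- rep(0, H)
  W2 <- rnorm(H, sd=0.1); b2 <- 0
  for (i in 1:iters) {
    A  <- sigmoid(sweep(X %*% W1, 2, b1, "+"))  # hidden unit values
    y  <- sigmoid(drop(A %*% W2) + b2)          # output probabilities
    d2 <- y - t                                 # output-layer "delta"
    d1 <- outer(d2, W2) * A * (1-A)             # hidden-layer "deltas"
    W2 <- W2 - eta * drop(t(A) %*% d2) / nrow(X)
    b2 <- b2 - eta * mean(d2)
    W1 <- W1 - eta * t(X) %*% d1 / nrow(X)
    b1 <- b1 - eta * colMeans(d1)
  }
  list(W1=W1, b1=b1, W2=W2, b2=b2)
}

# Toy data: class 1 when x1 + x2 > 1.
set.seed(3)
X <- matrix(runif(200), 100, 2)
t <- as.numeric(X[,1] + X[,2] > 1)
net <- mlp_train(X, t)
A <- sigmoid(sweep(X %*% net$W1, 2, net$b1, "+"))
y <- sigmoid(drop(A %*% net$W2) + net$b2)
acc <- mean((y > 0.5) == t)   # training accuracy
```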

Assignment 4: Postscript, PDF
Here are the Gaussian process regression functions. Correction: in the documentation before gp.cov.matrix, 'residual variance' should actually read 'residual standard deviation'.
The gene expression data is associated with this paper by Gasch et al.
Data files: a4a-x-train, a4a-x-test, a4a-y1-train, a4a-y1-test, a4a-y2-train, a4a-y2-test, a4b-x-train, a4b-x-test, a4b-y-train, a4b-y-test.
Note that if you're a grad student in STA 2104, you don't do this assignment. You do a project instead.
Here is the solution: R function to do tests, R script to do tests on all data sets, output of tests, discussion.
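To make the corrected wording about the residual parameter concrete: in a GP regression covariance matrix, the residual *standard deviation* is squared before being added to the diagonal. The sketch below assumes a squared-exponential covariance and is a generic illustration, not the course's gp.cov.matrix function.

```r
# Illustrative sketch only -- not the course's gp.cov.matrix function.
# Builds a GP covariance matrix with a squared-exponential kernel; note
# that the residual *standard deviation* res_sd is squared before being
# added to the diagonal, which is the point of the correction above.
gp_cov <- function(x, scale=1, lengthscale=1, res_sd=0.1) {
  d <- outer(x, x, "-")                        # pairwise differences
  K <- scale^2 * exp(-(d/lengthscale)^2 / 2)   # squared-exponential term
  K + diag(res_sd^2, length(x))                # residual variance on diagonal
}
K <- gp_cov(c(0, 0.5, 1))
```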

Some useful on-line references:

Proceedings of the annual conference on Neural Information Processing Systems (NIPS)

Information Theory, Inference, and Learning Algorithms, by David MacKay

My tutorial on Bayesian methods for machine learning: Postscript or PDF.

UCI repository of machine learning datasets

Web pages for past related courses:

CSC 411 (Fall 2006)
STA 414 (Spring 2006)
CSC 321 (Spring 2006, Geoffrey Hinton)
CSC 411 (Fall 2005, Anthony Bonner)
CSC 411 (Fall 2004, Richard Zemel)
STA 410 (Spring 2004) - has many examples of R programs