STA414S/2104S: Statistical Methods for Data Mining and Machine Learning

January - April, 2010

Meets Tuesday 12-2, Thursday 12-1.
SS 2105

Course Information
This course will consider topics in statistics that have played a role in the development of techniques for data mining and machine learning. We will cover linear methods for regression and classification, nonparametric regression and classification methods, generalized additive models, aspects of model inference and model selection, model averaging and tree based methods.

Prerequisite: Either STA 302H (regression) or CSC 411H (machine learning). CSC108H was recently added as a prerequisite; it is not essential, but you must be willing to use a statistical computing environment such as R or Matlab.

Office Hours: Tuesdays, 3-4; Thursdays, 2-3; or by appointment.

Textbook: Hastie, Tibshirani and Friedman. The Elements of Statistical Learning. Springer-Verlag.

Book web page, including a link to the online pdf of the book.

Course evaluation:

  • Homework 1 due February 11: 20%,
  • Homework 2 due March 4: 20%,
  • Midterm exam, March 16: 20%,
  • Final project due April 16: 40% (See January 5 handout for Project information).

Final Project: Due April 16, before 2 pm. You may email your project to me, but make sure you get an email reply before the deadline. If you do not hear back from me, I didn't get your project, and you'll need to bring a hard copy.

I will not hold office hours April 5 to 9, but will be available by email.

Midterm: to be returned on April 6, 12 pm; room SS 6004

Homework 2 : Some comments

Homework 1 : Finally!, Sketch of solutions

Material from lectures

Mar 30

  • Slides, March 30. Overview of the course and the Netflix Prize solutions
  • All the papers I used are available through the Netflix website, see 2nd slide

Mar 23, 25

Mar 16

  • No Class on March 18, but office hours at the usual time
  • Slides
  • Handout with R code
  • Random Forests page, maintained by Adele Cutler
  • Electronic Textbook on Classification and Regression Trees
  • I forgot to tell you about all the cool stuff happening soon at the Fields Institute
    • April 15 -- Robert C. Merton, Harvard, Nobel-Prize winning economist
    • April 21-23 -- Darrell Duffie, Stanford
    • April 29 -- Statistics Grad Students' Research Day
    • April 30, May 1 -- DASF III: Workshop: Data Analysis and Statistical Foundations
    • May 3, 4 -- Jianqing Fan, Princeton: Distinguished Lecture Series in Statistics

Mar 9, 11

Mar 2, 4

February 23, 25

February 9

February 2

  • Slides
  • February 4: Li Li has office hours in SS 2005, 1-2 pm, and in SS 6027a, 2-3 pm. My office hour is cancelled on Thursday.

January 26, 28

  • Wine data for HW 1
  • Slides for both Tuesday and Thursday lectures (amended on Thursday, Jan 28)
  • HW 1: mistake in Q1. The expression for W in part (b) should have $\hat\beta_{(0)}$ instead of 0 in the second log-likelihood term. The posted version below has been corrected.

January 19, 21

January 12

January 5, 7


I will refer to, and provide explanations for, the R computing environment. You are welcome to use some other package if you prefer. There are many online resources for R, including:

Download R to your laptop using

A menu driven interface is available called R Commander.

I recently found a very nice short introduction to R basics from Charlotte Wickham, UC Berkeley.

Questions and clarifications re Midterm

  • Mar 16: There is a typo in the expression for f-hat.
    • It has been fixed in the currently posted version.
  • Mar 17: In Question 1, is each of the x's a scalar? That is, should we treat x_1, ..., x_N as each being a single number or a vector? The issue I see is that if each is a single number, then the class of estimators would seem to have no intercept (since all of the terms are multiplied by y_i and we have no column of 1's involved).
    • Yes, each of the x_i is a scalar; the exercise in the text has more specific details. With a suitable choice of $\ell_i$, you can write $\hat f(x_0)$ as shown, for the case of linear regression (the only one I've checked); the bit that includes the constant term $\hat\beta_0$ becomes part of the weight.
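The point about the constant term being absorbed into the weights can be checked numerically. The following is a minimal sketch in R for simple linear regression with an intercept, using simulated data (the data, the value of x0 and all variable names are illustrative assumptions, not taken from the midterm). It uses the standard least-squares identity $\ell_i(x_0) = 1/N + (x_0 - \bar x)(x_i - \bar x)/S_{xx}$, where the $1/N$ part is what carries the intercept.

```r
# Simulated scalar predictor and response (illustrative data only)
set.seed(1)
N  <- 20
x  <- runif(N)
y  <- 1 + 2 * x + rnorm(N, sd = 0.3)
x0 <- 0.4   # hypothetical prediction point

# Weights ell_i(x0) for simple linear regression with an intercept:
# the 1/N term comes from the intercept, the second term from the slope
xbar <- mean(x)
Sxx  <- sum((x - xbar)^2)
ell  <- 1 / N + (x0 - xbar) * (x - xbar) / Sxx

# Prediction written as a weighted sum of the y_i
fhat_weights <- sum(ell * y)

# The same prediction from lm(), for comparison
fit     <- lm(y ~ x)
fhat_lm <- unname(predict(fit, newdata = data.frame(x = x0)))

all.equal(fhat_weights, fhat_lm)
```

Even though every term is multiplied by a y_i, the weights sum to one, which is how the intercept enters without an explicit column of 1's in the final expression.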

Some questions/answers re HW 1

  • Do you have any hints for Q2?
    • As pointed out by Hala, two matrices span the same subspace if one is a (full rank) linear transformation of the other. This should help...
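A quick numerical illustration of this hint (not a proof, and not the homework solution): if Z = XA with A square and full rank, the projection ("hat") matrices onto the column spaces of X and Z should agree. The matrices below are made up for the example, and `proj_matrix` is a hypothetical helper name.

```r
set.seed(2)

# An arbitrary 10 x 3 matrix, which is full column rank with probability one
X <- matrix(rnorm(30), nrow = 10, ncol = 3)

# A full-rank 3 x 3 transformation (its determinant is 5)
A <- matrix(c(2, 0, 1,
              1, 1, 0,
              0, 3, 1), nrow = 3, byrow = TRUE)

Z <- X %*% A

# Orthogonal projection onto the column space of M (M must have full column rank)
proj_matrix <- function(M) M %*% solve(crossprod(M)) %*% t(M)

H_X <- proj_matrix(X)
H_Z <- proj_matrix(Z)

# Largest entrywise difference; expected to be at rounding-error level
max(abs(H_X - H_Z))
```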

  • How do I create a training data set using the sample function in R?
    • Try ?sample; this will show you how to get a random sample of integers. You use these as row indices.

      myrows = sample(... ) ## I'll let you figure this part out

      mytrain = winedata[myrows,] ## choose these rows of the data.frame

      mytest = winedata[-myrows,] ## choose all the other rows of the data.frame
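For reference, here is one possible way the pattern above can look end to end. This is only a sketch: it uses the built-in `mtcars` data as a stand-in for the wine data, and the two-thirds training fraction and the seed are arbitrary assumptions, not a recommendation from the course.

```r
set.seed(414)                 # arbitrary seed, only for reproducibility

dat <- mtcars                 # stand-in data set; replace with your own data.frame
n   <- nrow(dat)

# Assumed split: roughly two thirds of the rows for training
n_train <- floor(2 * n / 3)

# sample() draws without replacement by default, so no row index repeats
myrows <- sample(n, size = n_train)

mytrain <- dat[myrows, ]      # the sampled rows
mytest  <- dat[-myrows, ]     # all remaining rows

c(train = nrow(mytrain), test = nrow(mytest))
```

Negative indexing with `-myrows` only works as intended because `myrows` holds integer positions rather than row names.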

  • I have a specific question here: In the course slides there are lines like "> library(ElemStatLearn)". I'm just wondering where I can obtain this library package
    • Within R there is an option to install packages from CRAN. On my Mac it's a menu item and you highlight "Package Installer". A new window opens, with "Get List". Once you have the list (you need to be online), you search for "ElemStatLearn", and then click install selected. Once all that is done, from within R you can load the package by typing "library(ElemStatLearn)". Sometimes the download and install doesn't work the first (or second...) time, so keep trying.

      But you can also get any of the book data sets into R 'by hand', by going to the book website and downloading the data file to your computer and using read.table or read.csv; you don't have to use the package ElemStatLearn. Note that the 'manual' for ElemStatLearn (on my web page) can be used to duplicate many of the analyses in the book.

  • Specifically, the way I understand it is using lm.ridge, 1. the "smallest GCV value" (output from select(ridge_regression)) is the optimal value of lambda. And that 2. for different sequences of lambdas passed to lm.ridge(model, data, lambda=sequence), it will produce different values for the optimal values of lambda (or smallest GCV) depending on (obviously) the start and end of the sequence and (not so obvious to me....) how dense you make the sequence...
    • Yes, both of these are correct, although one would hope that the values don't depend too heavily on how dense the sequence is. I don't know if the difficulty is numerical, in the way things are implemented, or inherent in the GCV criterion. I haven't seen any discussion of this...
  • So, in particular to the wine quality problem, how does one choose the optimal lambda value to calculate the regression coefficients with?
    • hmmm
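One way to explore the grid-sensitivity question above is to fit `lm.ridge` from the MASS package on a coarse and a fine grid of lambda values and compare the GCV minimizers. The sketch below uses the built-in `mtcars` data as a stand-in for the wine data, and the grid range of 0 to 20 is an arbitrary assumption. Taking the lambda with the smallest GCV on a reasonably fine grid is one common convention, not necessarily the answer intended for the homework.

```r
library(MASS)

# Two lambda grids over the same (assumed) range, differing only in density
coarse <- seq(0, 20, by = 1)
fine   <- seq(0, 20, by = 0.01)

fit_coarse <- lm.ridge(mpg ~ ., data = mtcars, lambda = coarse)
fit_fine   <- lm.ridge(mpg ~ ., data = mtcars, lambda = fine)

# The lambda with the smallest GCV on each grid
best_coarse <- fit_coarse$lambda[which.min(fit_coarse$GCV)]
best_fine   <- fit_fine$lambda[which.min(fit_fine$GCV)]
c(coarse = best_coarse, fine = best_fine)

# select() reports the same GCV minimizer, along with the HKB and L-W estimates
select(fit_fine)

# Coefficients on the original scale at the GCV-minimizing lambda (fine grid)
coef_best <- coef(fit_fine)[which.min(fit_fine$GCV), ]
coef_best
```

If the minimizer lands on the edge of the grid, the range is probably too narrow and should be widened before reading anything into the value. Plotting `fit_fine$GCV` against `fit_fine$lambda` shows how flat the criterion is near its minimum, which gives a sense of how much the exact choice matters.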