
STA450S/4000S: Topics in statistics
Statistical Aspects of Data Mining
Spring, 2004
Textbook: Hastie, Tibshirani and Friedman. The Elements of Statistical Learning.
Springer-Verlag.
Book web page
April 7, 2004
 Project due before Friday April 16 5.00 pm
 Slides
March 31, 2004
 No class on Friday April 2
 K-means clustering on the wine data works fine if the data are
standardized first; see the slides
 Slides
 See help files for various R programs for details on cluster methods.
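To illustrate the remark above about standardizing before K-means, here is a minimal sketch. The wine data are not reproduced on this page, so the built-in USArrests data frame is used as a stand-in (an assumption), and the choice of 3 clusters is arbitrary.

```r
# Stand-in data: built-in USArrests (NOT the wine data from the slides).
x <- USArrests

set.seed(1)  # kmeans() starts from randomly chosen centres

# K-means on the raw columns: variables with large variance dominate.
km_raw <- kmeans(x, centers = 3, nstart = 20)

# K-means after standardizing each column to mean 0, sd 1.
x_std <- scale(x)
km_std <- kmeans(x_std, centers = 3, nstart = 20)

# Cross-tabulate the two partitions to see how much they differ.
table(raw = km_raw$cluster, standardized = km_std$cluster)
```

See help(kmeans) and help(scale) for the arguments used here.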
March 24, 2004
 Homework 3 not due until Friday Mar 26
 No tutorial on Friday, but I will be available to answer questions
 Slides
 Pictures to illustrate clustering are Figures 14.6, 14.12, 14.13, 14.14
March 17, 2004
 Homework 3 not due until Friday, Mar 26
 Slides
(thanks to Ana-Maria Staicu)
 Nice picture of regression tree
(also thanks to Ana-Maria)
March 10, 2004
 On Friday March 12 I will answer questions about the homework.
 Next Friday (March 18) there will be no class.
 Lecture notes (hand).
 Pictures.
March 3, 2004
 Lecture notes (hand).
 Pictures.
 Handout on fitting GAMs in R.
 Plot from GAM in R on the heart data, showing the smooth curves for each
covariate, smoothing chosen automatically.
 Plot from GAM in R on the heart data, showing the smooth curves for each
covariate, smoothing set to be equivalent to 4 df for each covariate. (This
is closer to the plot in the book; see Figures 5.4 and 6.12).
 Homework 3 due March 25, 2004.
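The two GAM plots described above (smoothing chosen automatically versus fixed at 4 df per covariate) can be sketched roughly as follows. This is only a sketch under assumptions: it uses the mgcv package, which may not be the package used in the handout, and simulated data with made-up covariates x1 and x2 in place of the heart data.

```r
library(mgcv)  # assumption: mgcv; the handout may use a different GAM package

# Simulated stand-in for the heart data: binary response, two covariates.
set.seed(1)
n <- 300
x1 <- runif(n)
x2 <- runif(n)
eta <- sin(2 * pi * x1) + 4 * (x2 - 0.5)^2 - 0.5
y <- rbinom(n, 1, plogis(eta))

# Amount of smoothing chosen automatically for each covariate.
fit_auto <- gam(y ~ s(x1) + s(x2), family = binomial)

# Smoothing fixed at 4 df per covariate: fx = TRUE turns off the penalty,
# and a basis of dimension k = 5 loses one df to the centring constraint.
fit_4df <- gam(y ~ s(x1, k = 5, fx = TRUE) + s(x2, k = 5, fx = TRUE),
               family = binomial)

# One panel per covariate, showing the fitted smooth curves.
par(mfrow = c(1, 2))
plot(fit_4df)
```

Replacing fit_4df by fit_auto in the last line gives the automatically smoothed version.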
February 25, 2004
 Slides from today.
 On p.9 there are 4 plots, and the code on the preceding page is missing
two lines. First, the 4 plots to a page were obtained by
par(mfrow=c(2,2))
and then the top left plot is just the underlying function. The top right
shows the results from loess. The bottom left (I described this incorrectly
in class) compares loess with the kernel smooth on p.6, and you can see
the more 'linear' behaviour of loess at the endpoints. Finally, the bottom
right shows the loess fit using local linear regression (red) compared to
the loess fit using local linear regression with span 0.4 (green) and local
quadratic regression with span 0.4 (purple).
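A four-panel layout of this kind can be sketched as follows. This is not the code from the slides: the underlying function f, the noise level, and the kernel bandwidth are all made up for illustration, and the colours only loosely follow the description above.

```r
# Simulated data; f is a hypothetical underlying function.
set.seed(1)
x <- sort(runif(100))
f <- function(x) sin(4 * x)
y <- f(x) + rnorm(100, sd = 0.3)

par(mfrow = c(2, 2))  # four plots to a page

# Top left: the underlying function only.
plot(x, f(x), type = "l", main = "underlying function")

# Top right: loess with its defaults (span = 0.75, degree = 2).
fit_default <- loess(y ~ x)
plot(x, y, main = "loess")
lines(x, predict(fit_default), col = "red")

# Bottom left: loess compared with a kernel smooth.
ks <- ksmooth(x, y, kernel = "normal", bandwidth = 0.2)
plot(x, y, main = "loess vs kernel smooth")
lines(x, predict(fit_default), col = "red")
lines(ks, col = "blue")

# Bottom right: local linear vs local quadratic fits, both with span 0.4.
fit_lin <- loess(y ~ x, span = 0.4, degree = 1)
fit_quad <- loess(y ~ x, span = 0.4, degree = 2)
plot(x, y, main = "span 0.4")
lines(x, predict(fit_lin), col = "green")
lines(x, predict(fit_quad), col = "purple")
```

See help(loess) for the span and degree arguments.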
February 11, 2004
 No tutorial class on February 13.
 Homework #1 is due on February 13 at 4 pm, in SS 6002a or SS 6018.
 Homework 2 is due on March 3.
 Slides from today.
 Have a nice midterm break!
February 6, 2004
February 5, 2004
Typo on question two of HW 1 (in the expression for the prior density for
beta, sigma should be replaced by tau). The online version has been corrected.
(See below for link.)
February 4, 2004
January 30, 2004
 Tutorial cancelled today. You can now run R directly on any of the Cquest
workstations by typing "R" in a command window. It may be on the
menu soon. If you are logging in remotely, it is still easier to use "/u/radford/R".
January 28, 2004: Classification
 News: No office hour Friday at 2 (January 30). Guest lecturer (Rafal
Kustra) February 4. (Chapter 5.1ff)
 Re Homework 1: All computer code used to reach conclusions should be
submitted, but as *appendices* only. The answers to the questions should not
include computer code, although relevant graphs may be included.
 I meant to mention: Logistic regression is discussed in Chapter 14 of the
302 textbook. Also Section 4.2 of the book discusses linear regression of y
on x, when y is a 0-1 variable. It is essentially the same as LDA, but
doesn't generalize well to more than two classes.
 Slides
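The remark above that linear regression of a 0-1 response is essentially the same as LDA can be checked numerically: with two classes, the least squares slope vector is proportional to the LDA direction. A small sketch on simulated data (the variables g, x1 and x2 are made up for illustration), using lda() from the MASS package:

```r
library(MASS)  # for lda()

# Two classes coded 0/1, two covariates whose means depend on the class.
set.seed(1)
n <- 200
g <- rep(0:1, each = n / 2)
x1 <- rnorm(n, mean = g)
x2 <- rnorm(n, mean = -0.5 * g)

fit_lm <- lm(g ~ x1 + x2)            # linear regression of the 0-1 response
fit_lda <- lda(factor(g) ~ x1 + x2)  # linear discriminant analysis

b_lm <- unname(coef(fit_lm)[-1])       # slopes, dropping the intercept
b_lda <- unname(fit_lda$scaling[, 1])  # LDA direction

# The two ratios agree, so the directions are the same up to scale.
b_lm / b_lda
```

The intercepts (and hence the cut-points) need not agree, so the two classification rules can differ when the class sizes are unequal.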
January 23, 2004: Ridge regression in R
 Here is a copy of an R session I did on cquest. It shows ridge regression. (It's
pretty raw!) You can also do all possible subsets by first typing library("leaps",lib.loc="/u/reid");
see help(leaps) for more info.
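For those not on cquest, here is a minimal sketch of ridge regression computed directly from its closed form, (X'X + lambda I)^{-1} X'y, on standardized predictors and a centred response. It is not the session linked above; the data are simulated and the function name ridge is made up for illustration.

```r
# Simulated data: 5 standardized predictors, centred response.
set.seed(1)
n <- 100
p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X %*% c(2, -1, 0, 0, 0.5) + rnorm(n))
yc <- y - mean(y)  # with centred X, the intercept is just mean(y)

# Ridge coefficients for a given penalty lambda (hypothetical helper).
ridge <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

# One column of coefficients per value of lambda; lambda = 0 is least squares.
lambdas <- c(0, 1, 10, 100)
betas <- sapply(lambdas, function(l) ridge(X, yc, l))
colnames(betas) <- lambdas
round(betas, 3)
```

The coefficients shrink toward zero as lambda grows. The MASS package also has lm.ridge(); see its help file for details.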
January 21, 2004: Model Selection; Intro to Classification
Sections 3.4.0-3.4.3, 2.2, 4.4.1
 copy of slides
 Homework 1 due February 11. The data ("abalone.data") is in "/u/reid"
on Cquest. A description of the data ("abalone.desc") is in the
same place, and provided here for those of you who are not using Cquest.
You can download the data from the UCI Machine Learning Site as well.
 The teaching assistant for the course is Ana-Maria Staicu. She has an
office hour on Thursday from 12 to 1 in the Stat Aid centre.
January 16, 2004: Using R for linear regression
 My code for fitting the linear regression model in the text.
 Help text for the "lm" routine.
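As a rough idea of what fitting a linear regression with lm looks like, here is a short sketch. It is not the linked code: the built-in mtcars data and the variables mpg, wt and hp are stand-ins chosen only so the example runs anywhere.

```r
# Stand-in data: built-in mtcars (NOT the data set used in the text).
fit <- lm(mpg ~ wt + hp, data = mtcars)

summary(fit)  # coefficient table, standard errors, R-squared
coef(fit)     # just the estimated coefficients

# The same estimates from the least squares formula (X'X)^{-1} X'y.
X <- model.matrix(fit)  # includes the column of 1s for the intercept
beta_hat <- drop(solve(crossprod(X), crossprod(X, mtcars$mpg)))
beta_hat
```

The last three lines are only a check on the formula; in practice lm is the better choice, since it uses a more stable QR decomposition.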
January 14, 2004: Linear Regression
We covered 3.1-3.3 and a little of 3.4. Next week we will finish Chapter 3 and
start Chapter 4.
January 7, 2004
If you missed the class and are interested in taking it please email me asap.
Undergraduates have the choice of downloading R to their own computer, or using
R or Splus on cquest. You are also welcome to use Matlab. To get an account on
Cquest go to the Cquest home page and
request an account.
There is a wealth of information about the NMMAPS study on Francesca
Dominici's web site.
January 9, 2004: Trying R
Once you have a Cquest account, or have downloaded R onto your computer, try
running R (/u/radford/R on Cquest) and having a look at the prostate
cancer data set. This is available on the book web page, but I have also
put it into /u/reid on Cquest, so the following *should* work:
pr <- read.table("/u/reid/prostate.data", header=TRUE)
Then you can try things like dim(pr) and names(pr) and so on.
Here is a set of annotated commands that you can try in R or Splus.
January 8, 2004: announcement
This announcement was also sent by email to everyone whose email address I
have. If you know someone not on this list who wants to get class emails,
please tell them to email me asap.
The hour from 1-2 each Friday will be used as a tutorial, particularly with
regard to computing. It is fine with me if you attend STA 410 at that
time instead; this will provide you with a lot of computing expertise and you
probably won't need my help. Our office staff is checking to see if ROSI will
let students enrol for both courses; any feedback you have for me on this would
be helpful. Note also that I have an office hour on Friday from 2-3, so if you
are attending 410 and have some particular computing/course issues to discuss
with me, you could ask me during that hour instead. I am trying to find a room
in SS for that hour, but for the moment my office will be the location.
This Friday, come to 2111 at 1.10 (if you are not attending 410). We
are still sorting out the details of the graduate portion of the class, so the
grad students will be there to discuss various possible meeting times with
Professor Kustra. I will accompany anyone from 450 who is interested to the
Cquest Lab in Sid Smith, to provide any needed help in getting an account, etc.
The web pages for STA 410/2102 and STA 2201 have introductory material on R.
The first has instructions for running on Cquest, and the latter assumes you
are running it on fisher/utstat.
