STA442/2101F: Applied Statistics I

Tuesday 1 - 4 pm, SS 2106

Exam Marks have been posted to Blackboard

The official marks need to come from ROSI, but this will give you a good guideline to what you can expect.

Solutions to Homework 3.
Problem sets should be ready for return by Monday, Dec. 14

Office Hours in December: Tuesday, Dec. 8 4-5 pm; Monday, Dec. 14 3-4 pm; Tuesday, Dec. 15, 12-2 pm.

Course description

December 1

Example Q
List of topics covered in the course
Here are some practice questions for the Exam.

November 24

Solutions to Homework 2, thanks to Wei Lin
overdispersion in Poisson and binomial regression (Sec. 10.6.2)
Kaplan-Meier estimator of survival distribution
In Sec. 5.4.3 this is derived by likelihood arguments, but more direct arguments are more intuitive
Cancerguide.org , for example, gives some intuition, as does Wikipedia
Cox proportional hazards regression (Sec. 10.8.2: with a very good explanation of time-dependent covariates on p.547)
Full paper by John Fox on using the survival package. (Only the first 6 pages were handed out in class.)

November 17

HW 3, due Dec 1
Journal Article for HW 3
News item for HW 3
Reading for next week; another view of the Redelmeier and Singh paper
CSV file of Oscar data; read.csv("oscar.csv",header=T,skip=2) seems to work
Scan of lecture notes
Note on Simpson's paradox
Note on survivor treatment selection bias

November 10

Re: HW 2: ("I don't understand why I'm not getting all the coefficients estimates") -- You don't get all the estimates because R imposes a constraint on the coefficients so that they will be estimable. One part of the question asks you to state the constraint. The output of glm includes a value called contrasts, which will tell you what constraint was used. In class in the discussion of one-way anova we referred to summation contrasts, and what the book calls 'corner-point contrasts', but R calls these "treatment contrasts". See Section 9.2.1 for the one-way layout, and p.429 for a two-way layout.
The barley data on the handout of October 13 on one-way anova is avaialable here; if you save this file as barley.dat in a local directory you should be able to load it with read.table(barley.dat), but you can also source the file from my web page.
Lecture Notes
Please read the Redelmeier and Singh article "Survival in Academy Award winning actors and actresses" for next week.
Partitioning sums of squares in linear regression is discussed in Sec. 8.5. There is as very brief and elegant description of contrasts in Sec. 9.3.2.
Sec. 10.2 considers a fairly general nonlinear model, and describes likelihood inference and fitting. Sec. 10.3 considers in more detail a special class of nonlinear regression models built on exponential families; generalized linear models. These are treated in detail in Applied Statistics II.

November 3

HW 2, Question 3: if you use phat as the response in glm, then you need to use weights = total, so it knows what the sample size is for each proportion
Notes
Example 10.18 R code
Some notes on likelihood inference.
Wikipedia has a good discussion of the Challenger disaster, with links to Edward Tufte's work. The data appears in Davison in Table 1.3, and the statistical (mis) analysis is discussed in Dalal et al. (1989, JASA)
You can have fun at the next seminar if you first read Tufte's rant on powerpoint; an excerpt is available on Tufte's web page

October 27

Solutions to HW 1
If you feel that the regrading of you homework is necessary, please write or type a full explanation of your request on a separate piece of paper and submit your request with your homework solution to the instructor. Your claim will then be examined, and if appropriate an adjustment to your mark will be made. But be aware that the regrading work will reexamine all your solutions, which might lead to higher or lower mark.
Next week the 3rd hour will be for HW 2 questions. The 1st hour will include material needed for HW 2 questions 3 and 4, but you should be able to do Questions 1 and 2 with the material we have covered, and with Chapter 9.2.
Lecture notes
Notes Wei Lin found on Type I,II,III SS

October 20

Homework 2 due November 10.
Scan of lecture notes.
Handout for this (and next) week: Example H

October 13

HW 1 Question 6 (b): this code will work. It isn't pretty, and you don't need to do it to answer the question.
> library(nlme) > rat.lme = lme ( ratvector ~ x, random = ~ 1 + x | factor(number)) > summary(rat.lme) where ratvector is the string of 75 values, x = week - 1 so takes values 0,1,2,3,4,0,1,2,3,4 ..., and number is the number of the rat, so takes values 1,1,1,1,1,2,2,2,2,2 ...
Handout for next week: Example H
Handout for next week: One way anova
Topics covered: sums of squares, sequential sums of squares, adjusting for terms in the model, anova, Anova and summary commands in R, factor variables, orthogonal polynomials, accelerated life testing and multiplicative models

October 6

Amendments to HW 1: Question 3(d) is not required: it is a bonus.
Question 6 (a) only is due on October 13; (b) and (c) not due until October 20
There is a missing line of data for Question 6. Rat 10 is missing. The 5 values for Rat 10 are: 134, 182, 220, 260, 296. The data set appears in the text on p.460.
NPR blog on the spanking and IQ story. This has a link to the published paper.
Anova is in the package car: try help.search("car"), or install the package car and then from within R ?car tells you about the package. ?Anova explains (briefly) Type II and III sums of squares. From "Applied Statistics and the SAS programming Language" by Cody and Smith:
TYPE I lists the sums of squares for each variable as if it were entered one at a time into the model, in the order they are specified in the MODEL statement. Hence they can be thought of as incremental sums of squares. If there is any variance that is common to two or more variables, the variance will be attributed to the variable that is entered first. This may or may not be desirable. The TYPE III sum of squares gives the sum of squares that would be obtained for each variable if it were entered last into the model. That is, the effect of each variable is evaluated after all other factors have been accounted for. In any given situation, whether you want to look at TYPE I or TYPE III, sum of squares will vary; ...
The Anova command refines this slightly by respecting marginalization.
The car package (and many others) are loaded along with Rcmdr.
If you construct the usual anova table, which gives the SS in the order in which they are entered, then the SS due to all the variables entered adds up to the "Regression SS" with p degrees of freedom. Type II/III SS do not have this property.
Topics covered: the above, plus Section 8.5: fitting successive models.

September 29

Example J handout
Gelman and Weakliem paper in American Scientist. (This link only works from within U of T.)
handout using step
Topics: Transformations, Residuals, Influence Diagnostics, Model Selection (Ch 8.6, 8.7; points (i) - (v) in Ex G preamble.)

September 22

Example G handout, annotated (after p.5)
Homework 1, due October 13. Note that Question 1(c) should read "Continue with Exercise 8.5.1 on p.385".
Cement data for HW 1
R data frame for HW 1, Qu. 2 (but you probably don't want to click on this link, rather source the data into your R session.
I lost the R session that I did in class, but here is my best reconstruction of what I did
Book Sections: 8.2.1, 8.2.2, 8.3.1, 8.3.2, 8.5.1, 8.5.2, 8.6.1, 8.6.3
Next Week: Example G and Sections 8.6, 8.7, 8.4 (and 8.5)

September 15

R handout on Venice data
Example G for next week
We reviewed simple linear regression, Chapter 5.1 of the text; next week we will consider Chapter 8.1-8.3 along with Example G.
Make sure you know how to duplicate the results on the Venice sea level analysis in the package of your choice.

Text

Statistical Models by A.C. Davison. We will emphasize Chapters 5, 8, 9 and 10.
October 13: New copies of the book have arrived in the bookstore. Also, a somewhat scrappy scan of Chapter 8 and parts of Chapter 9 is available on the course page in Blackboard.

Computing

You are welcome to use the statistical computing package of your choice, but I will refer exclusively to the R computing package. Statistics Dept graduate students can access R on the Statistics Dept computers; undergraduate students can access R on CQUEST. Alternatively, students can install R on the computer(s) of their choice, by downloading its "base" package (for free) from probability.ca/cran or www.r-project.org.
There are many helpful introductions to R, including:

Jeff Rosenthal's introduction for the course STA410/2102,
the documentation provided on the R project web site (especially the Introduction to R),
John Verzani's online book simpleR,
Radford Neal's notes for STA 410 has some demonstration of R data types and R programming