STA 2201S: Applied Statistics II Spring 2015

Final Project due April 15 11.59pm

The project report should be between three and five pages and should be a non-technical summary of your analysis. It should not include any code, but it may include tables and plots. Make sure to include an introduction, a detailed reference for the source of the data, a statement of the scientific problem(s) of interest, and your conclusions.

In a statistical appendix, describe the main statistical methods used and give a summary of the statistical results, including what models were considered, what models formed the basis for the report above, and why. In this appendix you can include code excerpts, additional plots, and tables, as needed.

Finally, an executable file (an R script, an R Markdown file, or a knitr file) is required that will enable me to reproduce the results used in your report. This file should include the data frame that you constructed from your dataset, so that I don't need to use read.table or read.csv.
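One way to include the data frame in the script itself is to print it as R code with dput() and paste the result into the file. The sketch below is only an illustration: the data frame `project_data` and its columns are hypothetical stand-ins, not part of any assignment data.

```r
# Hypothetical data frame standing in for the one built for the project.
project_data <- data.frame(
  island  = c("A", "B", "C"),
  species = c(12L, 30L, 7L),
  area    = c(1.5, 20.2, 0.4),
  stringsAsFactors = FALSE
)

# dput() prints R code that recreates the object; that printed text can be
# pasted into the submitted script in place of a read.csv()/read.table() call.
dput(project_data)

# Check that the printed code really rebuilds the same data frame.
code    <- deparse(project_data)
rebuilt <- eval(parse(text = code))
stopifnot(isTRUE(all.equal(rebuilt, project_data)))
```

For a large data set the pasted code can be long, so it may be tidier to put it at the end of the script or in a clearly marked chunk.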

Homework 3

Due April 1, 11.59 pm on Blackboard. On the Blackboard web page you can find the assignment under "Course Materials".

Questions (Updated Mar 20 to correct typos in Q4)
Latex source
Paper for Q3

Homework 2

Due March 6, 11.59 pm on Blackboard. On the Blackboard web page you can find the assignment under "Course Materials".

Here is a paper Archer found that discusses choosing between quasi-Poisson and negative binomial. If you use the ideas in this paper for your homework be sure to include a reference.
Q2(d). Q: can we choose between quasi-Poisson and negative binomial using AIC? A: I don't think you can use AIC for the quasi-Poisson, because there is not a genuine log-likelihood. I would rely on plots and on a study of the mean-variance relationship.
Q: If (ii) indicates that there is an association in one city but not in another, why would we be interested in (iii)? A: I *think* it could be the case in principle that you could have enough noise in the data that (iii) and (ii) could be compatible.
Archer found this resource, which is very clear. In particular, you might find it easier to think about the answers to the 3 parts by fitting sequences of Poisson GLMs of the form (D = disease; B = blood group; C = city):
D + B + C, DC + B, DB + C, D + BC, DB + BC, etc.
and figuring out how these sub-models link with the 3 parts of the question.
Q2(a): You will want to refer to the AOAS paper for answering this question. It is not a standard generalized linear model of the type I described in class, unless \(\nu\) is considered fixed. So you can assume this for putting it in the GLM form. It is however a two-parameter exponential family, so if you interpret \(\theta = (\log\lambda, \nu)\), then the question can be answered as stated. Either version is fine.
Q2(d): Thanks to Alex-Antoine for pointing out that the CMP model cannot be estimated using the Galapagos Island data. I've revised the question to suggest trying the negative binomial model instead, which can be fit.
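For reference, the two notes on Q2 above can be written out as follows. This is a sketch in standard notation, not taken from the homework sheet; the symbols \(Z\), \(\phi\) and \(k\) are my own labels. The Conway-Maxwell-Poisson probability function is

\[
P(Y = y; \lambda, \nu) = \frac{\lambda^{y}}{(y!)^{\nu} Z(\lambda, \nu)}
= \exp\{\, y \log\lambda - \nu \log(y!) - \log Z(\lambda, \nu) \,\},
\qquad Z(\lambda, \nu) = \sum_{s=0}^{\infty} \frac{\lambda^{s}}{(s!)^{\nu}},
\]

so with \(\nu\) fixed it is a one-parameter exponential family in \(\log\lambda\), and with both parameters free it is a two-parameter exponential family with \(\theta = (\log\lambda, \nu)\) and statistics \((y, -\log y!)\). For the comparison in Q2(d), the two models imply different mean-variance relationships:

\[
\text{quasi-Poisson: } \operatorname{Var}(Y) = \phi\,\mu,
\qquad
\text{negative binomial: } \operatorname{Var}(Y) = \mu + \mu^{2}/k .
\]

The first is linear in the mean and the second is quadratic, which is what a plot of estimated variance against estimated mean can help to distinguish. In the parametrization used by glm.nb in the MASS package, \(k\) is the parameter called theta.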

It's possible that a rate model is better for this data, if we think that the number of species might be proportional to the area of the island. Bonus marks for exploring this.
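One way to write such a rate model, as a sketch with \(A_i\) denoting the area of island \(i\) (my notation), is to put the log of the area in the linear predictor with its coefficient fixed at 1:

\[
\log E(Y_i) = \log(A_i) + x_i^{T}\beta
\quad\Longleftrightarrow\quad
\frac{E(Y_i)}{A_i} = \exp(x_i^{T}\beta).
\]

In R this corresponds to adding an offset(log(area)) term to the model formula. A simple check on the proportionality assumption is to replace the offset by a term \(\gamma \log(A_i)\) with \(\gamma\) estimated, and see whether \(\gamma = 1\) is plausible.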
In Q1, the notation \(\underline y\) means the vector of all the observations \((y_{111}, \dots, y_{JKL})\).
Homework Questions (Feb 18: Q2(d) changed; Feb 13: typos corrected)
Latex source
Jager & Leek, for Q3
Sellers & Shmueli, for Q2. This paper on generalized linear models with the Conway-Maxwell-Poisson distribution appeared in the Annals of Applied Statistics in 2010.
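The sequence of Poisson GLMs suggested above for the disease by blood group by city table can be sketched in R as follows. The counts here are invented purely so that the code runs, and I am assuming that a term such as DB means the D by B interaction together with its main effects; substitute the homework data and check this against the resource mentioned above.

```r
# Hypothetical 2 x 2 x 3 table in long format; the counts are made up.
tab <- expand.grid(
  D = c("case", "control"),
  B = c("A", "O"),
  C = c("city1", "city2", "city3")
)
tab$count <- c(20, 80, 35, 90, 15, 70, 30, 85, 25, 95, 40, 100)

# Sequence of log-linear models, from mutual independence upwards.
formulas <- list(
  "D + B + C" = count ~ D + B + C,
  "D*C + B"   = count ~ D * C + B,
  "D*B + C"   = count ~ D * B + C,
  "D + B*C"   = count ~ D + B * C,
  "D*B + B*C" = count ~ D * B + B * C
)
fits <- lapply(formulas, function(f) glm(f, family = poisson, data = tab))

# Residual deviance and degrees of freedom for each model.
comparison <- data.frame(
  model    = names(fits),
  deviance = sapply(fits, deviance),
  df       = sapply(fits, df.residual),
  row.names = NULL
)
print(comparison)
```

Differences in deviance between nested models in this table can then be compared with the corresponding differences in degrees of freedom.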

Homework 1

Marking Scheme
Corrections and clarifications:
- On Jan.27, Q2 (b) and (d) were updated.
- Q3: Several students have asked: "What is the meaning of the main analysis of this endpoint?"
  A: "Main analysis" means "what statistical analysis did they use to study this response". Often there is more than one, but one in particular leads to the result emphasized in the abstract and conclusions. If there is more than one, just say so.
- Q2(d): The HW sheet was changed on Jan. 27. As of today (Feb 3) you ONLY NEED to show the first part (with p's all equal).
- Q2(b): Use the result \(\sum y_i = \sum n_i \hat p_i\), which is true as long as the design matrix has a column of 1's.
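The result quoted for Q2(b) follows from the likelihood equations. As a sketch, assuming \(y_i \sim \text{Binomial}(n_i, p_i)\) with the logistic link \(\operatorname{logit}(p_i) = x_i^{T}\beta\) (the notation \(x_{ij}\) for the entries of the design matrix is mine), the maximum likelihood estimate satisfies

\[
\frac{\partial \ell}{\partial \beta_j}\bigg|_{\hat\beta} = \sum_{i} x_{ij}\,(y_i - n_i \hat p_i) = 0
\quad \text{for each } j .
\]

If the design matrix has a column of 1's, say \(x_{i1} = 1\) for all \(i\), then the equation for \(j = 1\) is

\[
\sum_{i} (y_i - n_i \hat p_i) = 0
\quad\Longrightarrow\quad
\sum_i y_i = \sum_i n_i \hat p_i .
\]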
Homework Questions Updated Jan 27
Latex for Homework Questions
Reference paper for Q1

April 1

Slides
Leslie Beck on Vitamin D, Globe & Mail March 29
Institute of Medicine's "explanation" of how the RDA for Vitamin D was determined
André Picard, Globe & Mail

March 25

March 18

Slides

March 11

Slides
R script
Jenny Bryan, again, this time with a Shiny App illustrating a catalogue of graphics and the R code to draw them

March 4

Slides Part 1
Slides Part 2
RMarkdown file for Part 2
Jenny Bryan's code to search cran for examples -- terrific!
Unreliable research picture from the Economist
The Cochrane Collaboration publishes reviews of the literature in health care and health policy

February 25

Slides
Just discovered these RStudio Cheatsheets -- Brilliant!

February 11

Slides (updated Feb 16, using photos of blackboard)
Measles web pages
- Royal Statistical Society's Significance Magazine
- National Health Service, UK, with links to published research
- Baird, et al. (2008) Case-control study finds no evidence of MMR vaccination link to autism
- The Lancet retracted the Wakefield 1998 paper in 2010.

February 4

Slides
iPad version
Data Scientist: "the sexiest job of the 21st century", Harvard Business Review
Yihui Xie's web page for knitr
More or Less podcasts on the BBC. "WS Global Wealth 24 Jan 15" discusses the Oxfam report. "WS Bad Luck and Cancer 10 Jan 15" reviews the Science article.

January 28

Slides, which include links to many data sources
Data Science and R

January 21

Slides
iPad annotations
Economist news article on sea-level rise
Nature paper referred to in the article
R-Bloggers

January 14

Slides
Paper on teaching evaluations
Cancer risk: Links to the articles in the NY Times, Economist, and Science, are given in the slides, but other posts include
- Science reporter's reflection on the original item in Science News
- David Spiegelhalter's explanation at the Understanding Uncertainty web page
- The Guardian's post led with "Please journalists, get a clue before you write about science."
- Similarly, this criticism is short and clear.
- A collection of links on this story has been published here.
- Of those, Thomas Lumley's is very clear, and helps sort out the log scale.

January 7

Slides
iPad slides annotated
Buzzfeed article
Information about knitr and Sweave
R code to reproduce analysis in slides
Report of the Presidential Commission on the Space Shuttle Challenger Accident. The O-ring data is about 1/3 of the way down this page.

Course Information

Text

Extending the Linear Model with R by J. Faraway.

Computing

You are welcome to use the statistical computing package of your choice, but I will refer exclusively to the R computing package. Some online resources that I've found helpful are:

RStudio, an IDE for R that has many useful features
Information about knitr and Sweave
Some tips and tricks for RStudio by Paul Chang
The official introduction, from CRAN
R Reference card
The R Cookbook
John Verzani's online book simpleR.
If you already know SAS/SPSS/Stata, you may find this Quick-R guide helpful.
Thomas Lumley's R course notes are often recommended on the LinkedIn R Project
Revolution Analytics has a list of several more R resources, including wikis and free online books.