# STA 2201S: Applied Statistics II Spring 2015

### Final Project due April 15, 11.59 pm

The project report should be between three and five pages and should be a non-technical summary of your analysis. It should not include any code, but it may include tables and plots. Make sure to include an introduction, a detailed reference for the source of the data, a statement of the scientific problem(s) of interest, and your conclusions.

In a statistical appendix, describe the main statistical methods used and give a summary of the statistical results, including which models were considered, which models formed the basis for the report above, and why. In this appendix you can include code excerpts, additional plots, and tables, as needed.

Finally, an executable file (an R script, an R Markdown file, or a knitr file) is required that will enable me to reproduce the results used in your report. This file should include the data frame that you constructed from your dataset, so that I don't need to use read.table or read.csv.
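One way to embed a data frame directly in a script is base R's `dput()`, which prints R code that recreates an object. The sketch below is only an illustration: the data frame `mydata` and its columns are made up, and any other approach that avoids reading an external file would serve equally well.

```r
# Hypothetical data frame standing in for the one built from your dataset.
mydata <- data.frame(
  group = c("a", "b", "c"),
  count = c(12L, 7L, 3L),
  size  = c(2.5, 1.0, 0.4),
  stringsAsFactors = FALSE
)

# Run once interactively: dput() prints R code that recreates the object.
dput(mydata)

# Paste the printed code into the submitted script, for example:
mydata2 <- structure(
  list(group = c("a", "b", "c"),
       count = c(12L, 7L, 3L),
       size  = c(2.5, 1.0, 0.4)),
  class = "data.frame", row.names = c(NA, -3L)
)

# Check that the pasted version reproduces the original.
all.equal(mydata, mydata2)
```

For a large data frame the `dput()` output can be long, so it may be tidier to place it at the end of the script or in a clearly marked section.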

### Homework 3

Due April 1, 11.59 pm on Blackboard. On the Blackboard web page you can find the assignment under "Course Materials".

### Homework 2

Due March 6, 11.59 pm on Blackboard. On the Blackboard web page you can find the assignment under "Course Materials".
• Here is a paper Archer found that discusses choosing between quasi-Poisson and negative binomial. If you use the ideas in this paper for your homework, be sure to include a reference.
• Q2(d). Q: can we choose between quasi-Poisson and negative binomial using AIC? A: I don't think you can use AIC for the quasi-Poisson, because there is not a genuine log-likelihood. I would rely on plots and on a study of the mean-variance relationship.
Q: If (ii) indicates that there is an association in one city but not in another, why would we be interested in (iii)? A: I *think* that, in principle, there could be enough noise in the data for (ii) and (iii) to be compatible.
Archer found this resource, which is very clear. In particular, you might find it easier to think about the answers to the 3 parts by fitting sequences of Poisson GLMs of the form (D = disease; B = blood group; C = city):
D + B + C, DC + B, DB + C, D + BC, DB + BC, etc.
and figuring out how these sub-models link with the 3 parts of the question.
• Q2(a): You will want to refer to the AOAS paper for answering this question. It is not a standard generalized linear model of the type I described in class, unless $$\nu$$ is considered fixed. So you can assume this for putting it in the GLM form. It is however a two-parameter exponential family, so if you interpret $$\theta = (\log\lambda, \nu)$$, then the question can be answered as stated. Either version is fine.
• Q2(d): Thanks to Alex-Antoine for pointing out that the CMP model cannot be estimated using the Galapagos Islands data. I've revised the question to suggest trying the negative binomial model instead (which can be fit).

It's possible that a rate model is better for this data, if we think that the number of species might be proportional to the area of the island. Bonus marks for exploring this.
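The Q2(d) remarks above (no AIC for quasi-Poisson, checking the mean-variance relationship, and a possible rate model) can be sketched in R as follows. This is only a sketch under assumptions: it uses the `gala` data from the `faraway` package as a stand-in for the homework data, the covariates are chosen purely for illustration, and the rate model is fitted as a Poisson GLM with `offset(log(Area))`.

```r
library(faraway)  # provides the gala data (Galapagos Islands species counts)
library(MASS)     # provides glm.nb()

data(gala, package = "faraway")

# Same linear predictor under three variance assumptions.
# The covariates are an illustrative choice, not the homework model.
form <- Species ~ log(Area) + log(Elevation)
fit_pois  <- glm(form, family = poisson,      data = gala)
fit_quasi <- glm(form, family = quasipoisson, data = gala)
fit_nb    <- glm.nb(form, data = gala)

# Poisson and negative binomial have genuine likelihoods, so AIC is defined.
AIC(fit_pois, fit_nb)

# Quasi-Poisson has no likelihood, and R reports its AIC as NA.
AIC(fit_quasi)

# Estimated dispersion from the quasi-Poisson fit.
summary(fit_quasi)$dispersion

# Mean-variance check: squared raw residuals against fitted means, log scale.
mu <- fitted(fit_pois)
plot(log(mu), log((gala$Species - mu)^2),
     xlab = "log(fitted mean)", ylab = "log(squared residual)")
abline(0, 1, lty = 2)  # reference line: variance equal to the mean

# Rate model: species count proportional to area, so log(Area) is an offset.
fit_rate <- glm(Species ~ log(Elevation) + offset(log(Area)),
                family = poisson, data = gala)

# If the rate model is reasonable, this coefficient should be close to 1.
coef(fit_pois)["log(Area)"]
```

An offset fixes the coefficient of `log(Area)` at 1, so comparing `fit_rate` with a fit in which that coefficient is estimated gives an informal check on proportionality.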

• In Q1, the notation $$\underline y$$ means the vector of all the observations $$(y_{111}, \dots, y_{JKL})$$
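For the three-way table in Q1, the sequence of Poisson GLMs listed above (D + B + C, DC + B, and so on) can be fitted as in the sketch below. The counts are made up purely for illustration and are not the homework data; each factor is given only two levels to keep the example short.

```r
# Made-up counts for illustration only (not the homework data).
# D = disease, B = blood group, C = city.
tab <- expand.grid(D = c("no", "yes"),
                   B = c("A", "O"),
                   C = c("city1", "city2"))
tab$y <- c(120, 30, 150, 60, 90, 20, 140, 50)

fits <- list(
  # mutual independence of D, B and C
  "D + B + C"       = glm(y ~ D + B + C,           family = poisson, data = tab),
  # B independent of (D, C) jointly
  "D*C + B"         = glm(y ~ D * C + B,           family = poisson, data = tab),
  # C independent of (D, B) jointly
  "D*B + C"         = glm(y ~ D * B + C,           family = poisson, data = tab),
  # D independent of (B, C) jointly
  "D + B*C"         = glm(y ~ D + B * C,           family = poisson, data = tab),
  # D and C conditionally independent given B
  "D*B + B*C"       = glm(y ~ D * B + B * C,       family = poisson, data = tab),
  # all two-way interactions, no three-way interaction
  "D*B + B*C + D*C" = glm(y ~ D * B + B * C + D * C, family = poisson, data = tab)
)

# Residual deviance and degrees of freedom for each model.
data.frame(deviance = sapply(fits, deviance),
           df       = sapply(fits, df.residual))

# Nested models can be compared with a deviance (likelihood ratio) test.
anova(fits[["D*B + B*C"]], fits[["D*B + B*C + D*C"]], test = "Chisq")
```

Which of these sub-models corresponds to each part of the question is left for you to work out.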

• Homework Questions (Feb 18: Q2(d) changed; Feb 13: typos corrected); LaTeX source
• Jager & Leek, for Q3
• Sellers & Shmueli, for Q2. This paper on generalized linear models with the Conway-Maxwell-Poisson distribution appeared in the Annals of Applied Statistics in 2010.
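As a reminder for Q2(a), the Conway-Maxwell-Poisson probability function can be written in exponential family form as sketched below. The notation $$Z(\lambda, \nu)$$ for the normalizing constant is an assumption here and may differ from the notation in the paper.

$$
P(Y = y; \lambda, \nu) = \frac{\lambda^y}{(y!)^{\nu} Z(\lambda, \nu)}, \quad y = 0, 1, 2, \dots, \qquad Z(\lambda, \nu) = \sum_{j=0}^{\infty} \frac{\lambda^j}{(j!)^{\nu}}
$$

$$
P(Y = y; \lambda, \nu) = \exp\{\, y \log\lambda - \nu \log(y!) - \log Z(\lambda, \nu) \,\}
$$

With $$\nu$$ fixed, this is a one-parameter exponential family in $$y$$ with canonical parameter $$\log\lambda$$, and $$\nu = 1$$ recovers the Poisson distribution. With both parameters free, it is a two-parameter exponential family with $$\theta = (\log\lambda, \nu)$$ and statistics $$(y, -\log y!)$$.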

### Homework 1

• Marking Scheme
• Corrections and clarifications:
• On Jan.27, Q2 (b) and (d) were updated.
• Q3: Several students have asked: "What is the meaning of the main analysis of this endpoint?"
A: "Main analysis" means "what statistical analysis did they use to study this response". Often there is more than one analysis, but one in particular leads to the result emphasized in the abstract and conclusions. If there is more than one, just say so.
• Q2(d): The HW sheet was changed on Jan. 27. As of today (Feb 3), you only need to show the first part (with the p's all equal).
• Q2(b): Use the result $$\sum y_i = \sum n_i \hat p_i$$, which is true as long as the design matrix has a column of 1's.
• Homework Questions Updated Jan 27
• LaTeX for Homework Questions
• Reference paper for Q1
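The result quoted for Q2(b), $$\sum y_i = \sum n_i \hat p_i$$, follows from the likelihood equations. The sketch below assumes a binomial model $$y_i \sim \text{Bin}(n_i, p_i)$$ with the logit (canonical) link, $$\text{logit}(p_i) = x_i^T \beta$$; for other links the identity need not hold.

$$
\frac{\partial \ell}{\partial \beta_j} = \sum_i x_{ij} (y_i - n_i p_i), \qquad \text{so at the maximum likelihood estimate} \quad \sum_i x_{ij} (y_i - n_i \hat p_i) = 0 \ \text{ for each } j.
$$

If the design matrix has a column of 1's, say $$x_{i1} = 1$$ for all $$i$$, then the equation for $$j = 1$$ reads

$$
\sum_i (y_i - n_i \hat p_i) = 0, \quad \text{that is,} \quad \sum_i y_i = \sum_i n_i \hat p_i .
$$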

### March 11

• Slides
• R script
• Jenny Bryan, again, this time with a Shiny App illustrating a catalogue of graphics and the R code to draw them

### February 4

• Slides
• "Data Scientist: The Sexiest Job of the 21st Century", Harvard Business Review
• Yihui Xie's web page for knitr
• More or Less podcasts on the BBC. "WS Global Wealth 24 Jan 15" discusses the Oxfam report. "WS Bad Luck and Cancer 10 Jan 15" reviews the Science article.

### Text

Extending the Linear Model with R by J. Faraway.

### Recommended

Statistical Models by A.C. Davison.
Principles of Applied Statistics by D.R. Cox and C.A. Donnelly.

### Computing

You are welcome to use the statistical computing package of your choice, but I will refer exclusively to the R computing package. Some online resources that I've found helpful are: