% Data Analysis with SAS: An Open Textbook %
\documentclass[12pt,openany]{book} % Default for books is openright: to start chapters on a new page -- maybe later
\usepackage{euscript} % for \EuScript
\usepackage{amsbsy} % for \boldsymbol and \pmb
\usepackage{amsmath} % For binom
\usepackage{amsfonts} % for \mathbb{R} The set of reals
\usepackage{amssymb} % for \blacksquare \mathbb
\usepackage{graphicx} % To include pdf figures
\usepackage{pdfpages} % Include entire pdf documents
\usepackage{lscape} % For the Landscape environment
\usepackage{color} % \textcolor{blue}{...}
\usepackage[colorlinks=true, pdfstartview=FitV, linkcolor=blue, citecolor=blue, urlcolor=blue]{hyperref} % pagebackref=true ?
\newtheorem{quest}{Sample Question}[section] % Numbered within section
\newtheorem{answ}{Answer to Sample Question}[section]
\newtheorem{hint}{Data Analysis Hint}
%\newcounter{exer}
\oddsidemargin=0in % Good for US Letter paper
\evensidemargin=0in
\textwidth=6.3in
\topmargin=-0.5in
\headheight=0.2in
\headsep=0.5in
\textheight=8.8in
%\textheight=8.4in
%\textheight=9.4in

\title{Data Analysis with SAS\footnote{And a little R near the end.}: An Open Textbook \\ Edition 0.9}
\author{Jerry Brunner \\ \\
\small{Department of Statistical Sciences, University of Toronto} \\
\small{\href{http://www.utstat.toronto.edu/~brunner} {http://www.utstat.toronto.edu/$^\sim$brunner} } }
\date{\today}

\begin{document}
\frontmatter
\maketitle
\bigskip
\begin{quote}
Copyright \copyright{} 2016 Jerry Brunner. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in Appendix~\ref{fdl}, entitled ``GNU Free Documentation License''.
\end{quote}
\bigskip
\pagebreak
\tableofcontents

\chapter{Preface to Edition 0.9}

\section*{This book is free and open source}

From the perspective of the student, possibly the most important thing about this textbook is that you don't have to pay for it. You can read it either online or in hard copy, and there are no restrictions on copying or printing. You may give a copy to anyone you wish; you may even sell it without paying royalties. The point is not so much that the book is free, but that \emph{you} are free.

The plan for publishing this book is deliberately modeled on open source software. The source code is \LaTeX. There are also some graphics files, most of which were produced with \texttt{R}. The \texttt{R} code appears as comment statements in the \LaTeX\ source. There are also some modifiable graphics files in the open \texttt{svg} format. Continuing the analogy to open source software, the compiled binary is a PDF or DjVu file. Everything is available at
\begin{center}
\href{http://www.utstat.toronto.edu/~brunner/DataAnalysisText}
{\texttt{www.utstat.toronto.edu/$^\sim$brunner/DataAnalysisText}}.
\end{center}

This document is distributed without any warranty. You are free to copy and distribute it in its present form or in modified form, under the terms of the GNU Free Documentation License as published by the \href{http://www.fsf.org}{Free Software Foundation}. A copy of the license is included in Appendix~\ref{fdl}. In case this appendix is missing, the Free Documentation License may be found at
\begin{center}
\href{http://www.gnu.org/copyleft/fdl.html}
{\texttt{http://www.gnu.org/copyleft/fdl.html}}.
\end{center}

\section*{The Approach of the Book}

This book is about using statistical methods to draw conclusions from real data. The methods are intermediate to advanced, and the student should have had at least one Statistics class at some time in the past. The course (or courses) can be at any level, mathematical or not. The important thing is that the student have some exposure to concepts like null hypothesis and $p$-value, or else the treatment in Chapter~\ref{rmethods} will go past too rapidly for comfort.

But while data analysis uses statistical methods, it's not just Statistics. The enterprise consists of research design, data processing, \emph{and} applications of statistical methods; you need to think about the parts pretty much at the same time in order to do a decent job. Research design is vital because the numbers that are collected and the way they are collected determine the information they contain. So research design places limits upon the conclusions that can be drawn from a given data set, regardless of what statistical technique is used. And while the details of how data are processed prior to the actual analysis may not have a lot of intellectual value in themselves, they can have a huge impact on the quality of the final result. So we will not hesitate to get our hands dirty.

Occupying a territory somewhere between descriptive statistics and data processing is data \emph{checking and cleaning}. Almost all real data sets contain errors, and some of them can be located and fixed during the data analysis. The practical importance of checking and cleaning the data can scarcely be exaggerated\footnote{For example, in one study designed to predict students' Calculus marks, one of the predictors was High School grade point average (GPA), a number from zero to 4.0. There were some zeros, but they meant that the students' actual GPAs were not recorded for some reason --- and nobody told the statistician. Consider the consequences of calculating means and regression coefficients and so on without first checking the data.}. As the old saying goes, ``Garbage in, garbage out."

A lot of the book is about statistical ideas. The presentation is deliberately non-mathematical\footnote{When I cannot resist the impulse to say something requiring a background in mathematical statistics, I'll try to put it in a footnote. Footnotes may contain other kinds of digression as well.}, relying on translations of statistical theory into English. For the most part, formulas are avoided. While this involves some loss of precision, it also makes the course accessible to students from non-statistical disciplines (particularly graduate students and advanced undergraduates on their way to graduate school) who need to use statistics in their research. Even for students with strong training in theoretical statistics, the use of plain English can help reveal the connections between theory and applications, while also suggesting a useful way to communicate with non-statisticians.

We will avoid mathematics, but we will not avoid computers. Learning to apply statistical methods to real data involves actually doing it, and the use of software is not optional. Furthermore, we will \emph{not} employ ``user-friendly" menu-driven statistical programs. Why?
\begin{itemize}
\item It's just too easy to poke around in the menus trying different things, produce some results that seem reasonable, and then two weeks later be unable to say exactly what one did.
\item Real data sets tend to be large and complex, and most statistical analyses involve a sizable number of operations. If you discover a tiny mistake after you produce your results, you don't want to go back and repeat two hours of menu selections and mouse clicks, with one tiny variation.
\item If you need to analyze a data set that is similar to one you have analyzed in the past, it's a lot easier to edit a program than to remember a collection of menu selections from last year.
\end{itemize}

To clarify, the word ``program" does \emph{not} mean we are going to write programs in some true programming language like C or Java. We'll use statistical software in which most of the actual statistical procedures have already been written by experts; usually, all we have to do is invoke them by using high-level commands. The statistical programs we will use are \texttt{SAS} and, to a \emph{much} lesser extent, \texttt{R}. These programs are command-oriented rather than menu-oriented, and are very powerful. They are industrial-strength tools.

\section*{Message to the Instructor}

Among commercial books I know, Ramsey and Schafer's \emph{The Statistical Sleuth}~\cite{sleuth} comes closest to this book in its goals and level. In my view, Ramsey and Schafer's text is much better than this one; their range of statistical methods is broader, and in particular their examples and sample data sets are wonderful. The advantage of the book you're reading is that it's free, and also (just from my personal perspective) I find Ramsey and Schafer's relentless model-building approach to data analysis a little tiring. Maybe in time this book will approach the \emph{Statistical Sleuth} in quality, especially if other people help clean it up and contribute some chapters. In the meantime, one could do worse than requiring students to use the present text, placing Ramsey and Schafer on reserve, and using some of their examples in lecture.

Earlier versions of this text presented SAS running in a unix/linux environment. This was convenient at the University of Toronto, where students can log in remotely to unix servers running SAS, and use the software without charge. All that has changed with the introduction of SAS University Edition, which is available free of charge to anyone with a university email address. It's really better and more convenient in most ways, so starting with Edition 0.9, all references to the operating system (including unix text editors, ssh access and so on) will be eliminated, and just the SAS programs, log files and output will be presented. Details of how to use SAS University Edition are best given in lecture.

\mainmatter

\chapter{Introduction}\label{rmethods}

\section{Vocabulary of data analysis} \label{vocab}

We start with a \textbf{data file}. Think of it as a rectangular array of numbers, with the rows representing \textbf{cases} (units of analysis, observations, subjects, replicates) and the columns representing \textbf{variables} (pieces of information available for each case). There are $n$ cases, where $n$ is the sample size.
\begin{itemize}
\item A physical data file might have several lines of data per case, but you can imagine them listed on a single long line.
\item Data that are \emph{not} available for a particular case (for example because a subject fails to answer a question, or because a piece of measuring equipment breaks down) will be represented by missing value codes.
Missing value codes allow observations with missing information to be automatically excluded from a computation. \item Variables can be \textbf{quantitative} (representing amount of something) or \textbf{categorical}. In the latter case the ``numbers" are codes representing category membership. Categories may be \textbf{ordered} (small vs. medium vs. large) or \textbf{unordered} (green vs. blue vs. yellow). When a quantitative variable reflects measurement on a scale capable of very fine gradation, it is sometimes described as \textbf{continuous}. Some statistical texts use the term \textbf{qualitative} to mean categorical. When an anthropologist uses the word ``qualitative," however, it usually refers to ethnographic or case study research in which data are not explicitly assembled into a data file. \end{itemize} Another very important way to classify variables is \begin{description} \item[Explanatory Variable:] Predictor = $X$ (actually $X_i, i = 1, \ldots, n$) \item[Response Variable:] Predicted = $Y$ (actually $Y_i, i = 1, \ldots, n$) \item[Example:] $X$ = weight of car in kilograms, $Y$ = fuel efficiency in litres per kilometer \end{description} \begin{quest} Why isn't it the other way around? \end{quest} \begin{answ} Since weight of a car is a factor that probably influences fuel efficiency, it's more natural to think of predicting fuel efficiency from weight. \end{answ} The general principle is that if it's more natural to think of predicting $A$ from $B$, then $A$ is the response variable and $B$ is the explanatory variable. This will usually be the case when $B$ is thought to cause or influence $A$. Sometimes it can go either way or it's not clear. Usually, it's easy to decide. \begin{quest} Is it possible for a variable to be both quantitative and categorical? Answer Yes or No, and either give an example or explain why not. \end{quest} \begin{answ} Yes. For example, the number of cars owned by a person or family. \end{answ} In some fields, you may hear about \textbf{nominal, ordinal, interval} and \textbf{ratio} variables, or variables measured using ``scales of measurement" with those names. Ratio means the scale of measurement has a true zero point, so that a value of 4 represents twice as much as 2. An interval scale means that the difference (interval) between 3 and 4 means the same thing as the difference between 9 and 10, but zero does not necessarily mean absence of the thing being measured. The usual examples are shoe size and ring size. In ordinal measurement, all you can tell is that 6 is less than 7, not how much more. Measurement on a nominal scale consists of the assignment of unordered categories. For example, citizenship is measured on a nominal scale. It is usually claimed that one should calculate means (and therefore, for example, do multiple regression) only with interval and ratio data; it's usually acknowledged that people do it all the time with ordinal data, but they really shouldn't. And it is obviously crazy to calculate a mean on numbers representing unordered categories. Or is it? \begin{quest} Give an example in which it's meaningful to calculate the mean of a variable measured on a nominal scale. \end{quest} \begin{answ} Code males as zero and females as one. The mean is the proportion of females. \end{answ} It's not obvious, but actually all this talk about what you should and shouldn't do with data measured on these scales does not have anything to do with \emph{statistical} assumptions. 
That is, it's not about the mathematical details of any statistical model. Rather, it's a set of guidelines for what statistical model one ought to adopt. Are the guidelines reasonable? It's better to postpone further discussion until after we have seen some details of multiple regression.

\section{Statistical significance}

We will often pretend that our data represent a \textbf{random sample} from some \textbf{population}. We will carry out formal procedures for making inferences about this (usually fictitious) population, and then use them as a basis for drawing conclusions from the data. Why do we do all this pretending? As a formal way of filtering out things that happen just by coincidence. The human brain is organized to find \emph{meaning} in what it perceives, and it will find apparent meaning even in a sequence of random numbers. The main purpose of testing for statistical significance is to protect Science against this. Even when the data do not fully satisfy the assumptions of the statistical procedure being used (for example, the data are not really a random sample) significance testing can be a useful way of restraining scientists from filling the scientific literature with random garbage. This is such an important goal that we will spend a substantial part of the course on significance testing.

\subsection{Definitions}\label{defs}

Numbers that can be calculated from sample data are called \textbf{statistics}. Numbers that could be calculated if we knew the whole population are called \textbf{parameters}. Usually parameters are represented by Greek letters such as $\alpha$, $\beta$ and $\gamma$, while statistics are represented by ordinary letters such as $a$, $b$ and $c$. Statistical inference consists of making decisions about parameters based on the values of statistics.

The \textbf{distribution} of a variable corresponds roughly to a relative frequency histogram of the values of the variable. In a large population for a variable taking on many values, such a histogram will be indistinguishable from a smooth curve\footnote{Since the area under such a curve equals one (remember, it's a \emph{relative} frequency histogram), the smooth curve is a probability density function.}. For each value $x$ of the explanatory variable $X$, in principle there is a separate distribution of the response variable $Y$. This is called the \textbf{conditional distribution} of $Y$ given $X = x$.

We will say that the explanatory and response variables are \textbf{unrelated} if the \emph{conditional distribution of the response variable is identical for each value of the explanatory variable}\footnote{As a technical note, suppose that $X$ and $Y$ are both continuous. Then the definition of ``unrelated" says $f(y|x) = f(y)$, which is equivalent to $f(x,y) = f(x)f(y)$. This is the definition of independence. So the proposed definition of ``unrelated" is a way of smuggling the idea of statistical independence into this non-technical discussion. I \emph{said} I was going to put the mathematical digressions in footnotes.}. That is, the relative frequency histogram of the response variable does not depend on the value of the explanatory variable. If the distribution of the response variable does depend on the value of the explanatory variable, we will describe the two variables as \textbf{related}. All this vocabulary applies to sample as well as population data sets\footnote{A population data set may be entirely hypothetical.
For example, if a collection of cancer-prone laboratory mice are given an anti-cancer vaccine, one might pretend that those mice are a random sample from a population of all cancer-prone mice receiving the vaccine -- but of course there is no such population.}.

Most research questions involve more than one explanatory variable. It is also common to have more than one response variable. When there is one response variable, the analysis is called \textbf{univariate}. When more than one response variable is being considered simultaneously, the analysis is called \textbf{multivariate}.

\begin{quest} Give an example of a study with two categorical explanatory variables, one quantitative explanatory variable, and two quantitative dependent variables. \end{quest}
\begin{answ} In a study of success in university, the subjects are first-year university students. The categorical explanatory variables are Sex and Immigration Status (Citizen, Permanent Resident or Visa), and the quantitative explanatory variable is family income. The dependent variables are cumulative Grade Point Average at the end of first year, and number of credits completed in first year. \end{answ}

Many problems in data analysis reduce to asking whether one or more variables are related -- not in the actual data, but in some hypothetical population from which the data are assumed to have been sampled. The reasoning goes like this. Suppose that the explanatory and response variables are actually unrelated \emph{in the population}. If this \textbf{null hypothesis} is true, what is the probability of obtaining a \emph{sample} relationship between the variables that is as strong or stronger than the one we have observed? If the probability is small (say, $p < 0.05$), then we describe the sample relationship as \textbf{statistically significant}, and it is socially acceptable to discuss the results. In particular, there is some chance of having the results taken seriously enough to publish in a scientific journal. The number 0.05 is called the \textbf{significance level}. In principle, the exact value of the significance level is arbitrary as long as it is fairly small, but scientific practice has calcified around a suggestion of R. A. Fisher (in whose honour the $F$-test is named), and the 0.05 level is an absolute rule in many journals in the social and biological sciences. We will willingly conform to this convention. We conform \emph{willingly} because we understand that scientists can be highly motivated to get their results into print, even if those ``results" are just trends that could easily be random noise. To restrain these people from filling the scientific literature with random garbage, we need a clear rule.

For those who like precision, the formal definition of a $p$-value is this. It is the minimum significance level $\alpha$ at which the null hypothesis (of no relationship between explanatory variable and response variable in the population) can be rejected. Here is another useful way to talk about $p$-values. \emph{The $p$-value is the probability of getting our results (or better) just by chance.} If $p$ is small enough, then the data are very unlikely to have arisen by chance, assuming there is really no relationship between the explanatory variable and the response variable in the population. In this case we will conclude there really \emph{is} a relationship.
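As a small numerical illustration (a made-up example, not real data), suppose ten tasters each try two recipes of chili, and nine of the ten prefer Recipe A. If recipe were really unrelated to preference, each taster would choose Recipe A with probability one-half, and the probability of a split at least this lopsided (nine or more tasters agreeing, in either direction) would be
\begin{displaymath}
p = 2\left[\binom{10}{9} + \binom{10}{10}\right] \left(\frac{1}{2}\right)^{10}
  = \frac{22}{1024} \approx 0.021.
\end{displaymath}
Since this is less than 0.05, the observed preference for Recipe A would be called statistically significant.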
Of course we seldom or never know for sure what is happening in the entire population. So when we reject a null hypothesis, we may be right or wrong. Sometimes, the null hypothesis is true (nothing is going on) and we mistakenly reject it; this is called a \textbf{Type One Error}. It is also possible that the null hypothesis is false (there really is a relationship between explanatory and response variable in the population) but we fail to reject it. This is called a \textbf{Type Two Error}. This numbering expresses the philosophy that false knowledge is a really bad thing -- it's the Number One kind of mistake you can make.

The probability of correctly rejecting the null hypothesis -- that is, the probability of discovering something that really is present -- is one minus the probability of a Type Two error. This is called the \textbf{Power} of a statistical test. Clearly, more power is a good thing. But there is a tradeoff between power and Type One error, so that it is impossible for any statistical test to simultaneously minimize the chances of Type One error and maximize the power. The accepted solution is to insist that the Type One error probability be no more than some small value (the significance level -- 0.05 for us), and use the test that has the greatest power subject to this constraint. An important part of theoretical statistics is concerned with proving that certain significance tests have the best power, and the tests that are used in practice tend to be the winners of this contest.

If you think about it for a moment, you will realize that most of the time, even a test with good overall power will not have exactly the same power in every situation. The two main principles are:
\begin{itemize}
\item The stronger the relationship between variables in the population, the greater the power.
\item The larger the sample size, the greater the power.
\end{itemize}
These two principles may be combined to yield a method for choosing a sample size based on power, before any data have been collected. You choose a strength of relationship that you want to detect, ideally one that is just barely strong enough to be scientifically meaningful. Then you choose a (fairly high) probability with which you want to be able to detect it. Next, you pick a sample size and calculate the power -- not difficult, in this age of computers. It will almost certainly be too low, though it may be higher than you need if you have started with a huge sample size. So you increase (or decrease) the sample size, and calculate the power again. Continue until you have located the smallest sample size that gives you the power you want for the strength of relationship you have chosen. This is not the only rational way to choose sample size, but it is one of the two standard ones.\footnote{The other standard way is to choose the sample size so that a chosen confidence interval will have at most some specified width.} Examples will be given later.

Closely related to significance tests are \textbf{confidence intervals}. A confidence interval corresponds to a pair of numbers calculated from the sample data, a lower confidence limit and an upper confidence limit. The confidence limits are chosen so that the probability of the interval containing some parameter (or \emph{function} of the parameters, like a difference between population means) equals a large value, say 0.95. Such a confidence interval would be called a ``ninety-five percent confidence interval." The connection between tests and confidence intervals is that a two-tailed $t$-test or $Z$-test will be significant at the 0.05 level if and only if the 95\% confidence interval does not contain zero.
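Returning to the power-based method of choosing a sample size, here is a minimal sketch of how such a search might be carried out with SAS's \texttt{power} procedure. The numbers are invented purely for illustration, and the details of SAS programming are deferred to later chapters: the program asks for the smallest sample size per group giving probability at least 0.90 of detecting a difference of 5 units between two population means, assuming a common within-group standard deviation of 10 and the usual 0.05 significance level.
\begin{verbatim}
proc power;
   twosamplemeans test=diff
      meandiff  = 5      /* Smallest difference worth detecting    */
      stddev    = 10     /* Guess at the common standard deviation */
      power     = 0.90   /* Desired probability of detecting it    */
      npergroup = . ;    /* The dot means: solve for this quantity */
run;
\end{verbatim}
The procedure does the increase-and-recalculate search automatically, reporting the smallest per-group sample size whose power is at least 0.90.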
\subsection{Should You \emph{Accept} the Null Hypothesis?}\label{acceptH0} What should we do if $p>.05$? Fisher suggested that we should not conclude anything. In particular, he suggested that we should \emph{not} conclude that the explanatory and response variables are unrelated. Instead, we can say only that there is insufficient evidence to conclude that there is a relationship. A good reference is Fisher's masterpiece, \emph{Statistical methods for research workers} \cite{fisher}, which had its first edition in 1925, and its 14th and last edition in 1970, eight years after Fisher's death. In some courses, Fisher's advice is given as an absolute rule. Students are told that one \emph{never} accepts the null hypothesis. But in other courses, if the null hypothesis is not rejected, then it is accepted without further question. Who is right? This is the echo of a very old quarrel between Fisher, who is responsible for the concept of hypothesis testing more or less as we know it, and the team of Jerzy Neyman and Egon Pearson, who came along a bit later and cleaned up Fisher's method, putting it on a firm decision-theoretic basis. The \emph{decision} in question is between the null hypothesis and the alternative hypothesis, period. According to Neyman and Pearson, you have to pick one of them, based on the data. Refusal to decide is not an option. During their lifetimes, Fisher fought bitterly with Neyman and Pearson. To Neyman and Pearson, Fisher was creative but mathematically unsophisticated. To Fisher, Neyman and Pearson were good mathematicians, but they were missing the point, because science does not proceed by simple yes or no decisions made in isolation from one another. Today, Neyman-Pearson theory usually dominates in theoretical research and theoretical courses, while Fisher's approach dominates in applications and applied courses. One might think that because this is an applied course, we'll just side with Fisher. But it's a bit trickier than that. In the typical data analysis project, the first step is to assemble the data file and check it for errors. Then, the usual practice is to carry out a variety of statistical tests to get a preliminary idea of how the variables are related to each other. This phase can be automated (as in stepwise regression) or not, but in general you try a lot of tests, and if a potential explanatory variable is not significantly related to the response variable in question, you usually just drop it and look elsewhere. That is, the null hypothesis is freely accepted, and the Neyman-Pearson approach seems to govern this most applied of statistical pursuits. You can't fault this; scientists must explore their data, and statistical testing is a good way to do it. But it is helpful to distinguish between \emph{exploratory} and \emph{confirmatory} statistical analysis. In an exploratory analysis, the researcher carries out a large number of tests in an attempt to understand how the variables are related to one another. Various statistical models are employed, variables may be defined and re-defined several times, and the sample may be subdivided in various ways. Anything reasonable may be (and should be) attempted. Numerous null hypotheses may be tentatively rejected, and numerous others may be tentatively accepted. Properly speaking, the product of an exploratory analysis is hypotheses, not conclusions. It is rare for all the details of an exploratory analysis to be given in writing, though it is good practice to keep a record of what has been tried. 
In a confirmatory analysis, a more limited number of tests are carried out with the intention of coming to firm conclusions.\footnote{Ideally, exploratory and confirmatory analyses should be carried out on different data sets, possibly by randomly splitting the data into exploratory and confirmatory sub-samples. But this is only feasible when data are not too expensive or time-consuming to collect. In practice, researchers often explore their data thoroughly, and then report the most interesting results as if they were a confirmatory analysis. This practice is almost guaranteed to inflate the probability of Type One error, so it is wise to treat the results of most scientific investigations as tentative until they have been independently replicated. In any case, it is useful to distinguish \emph{conceptually} between exploratory and confirmatory analysis, even though the pure forms may be seen only rarely in practice.} The results of confirmatory analyses \emph{are} often written up, because communication of results is in many ways the most important phase of any investigation.

It is clear that acceptance of the null hypothesis is a standard feature of good exploratory analysis, even if it is not recognized as such. The argument between Fisher and Neyman-Pearson is whether the null hypothesis should be accepted in confirmatory analysis.

First of all, it's clear that Fisher is right in a way. Suppose you wish to compare two methods of teaching the piano. You randomly assign three students to one method and two students to the other. After some reasonable period of time, you compare ratings of their performance, using a two-sample $t$-test or something. Suppose the results are not statistically significant. Does it make sense to conclude that the two methods are equally effective? Obviously not; the sample size is so small that we probably don't have enough power to detect even a fairly large effect. But Neyman and Pearson do not give up, even in this situation. They say that if one had to choose based just on this tiny data set, the conclusion of no effect would be the rational choice. Meanwhile, Fisher is going crazy. Who would decide anything based on such inadequate evidence? He does not know whether to laugh at them or tear his hair out, so he does both, in public. On their side, Neyman and Pearson are irritated by Fisher's unwillingness (or inability) to appreciate that when statistical tests emerge as mathematical consequences of a general theory, this is better than just making them up out of thin air.

Fisher wins this round, but it's not over. The trouble with his approach is that it \emph{never} allows one to conclude that the null hypothesis is true. But sometimes, experimental treatments just don't do anything, and it is of scientific and practical importance to be able to say so. For example, medical researchers frequently conclude that drugs don't work. On what basis are they drawing these conclusions? On what basis \emph{should} they draw such conclusions? Unfortunately, though there are clear conventional standards for deciding when a relationship is present, there is much less agreement on how to decide that one is absent. In medical research, scientists often get away with such claims based only on the fact that a test fails to attain statistical significance. Then, if the sample size is not unusually small, nobody objects. It seems to depend on the editor of the journal. There are a couple of reasonable suggestions about how to be more systematic (need references here).
Both methods stop short of allowing you to conclude that a relationship is completely absent. Instead, they focus on deciding that the relationship between explanatory variable and response variable is so weak that it does not matter, if it exists at all.

One approach is based on power. Suppose you have selected the sample size so that there is a high probability (maybe 95\%) of detecting a relationship that is just barely meaningful (of course, if the relationship in the population happens to be stronger, the probability of detecting it will be even higher). Then, if the test is non-significant, you conclude that the relationship is not strong enough to be meaningful.

Another approach is based on confidence intervals. Again, you need to be able to specify what's scientifically or perhaps clinically meaningful, in terms of the population parameters. You construct a confidence interval for the quantity in question (for example a difference between means). If the 95\% confidence interval lies entirely within a range of values that is scientifically meaningless, you conclude that the relationship is not strong enough to be meaningful.

These two reasonable methods need not yield the same conclusion for a given data set; the confidence interval approach allows a relationship to be deemed negligible even though it is statistically significant, while the power approach does not. Figure~\ref{AcceptCI} shows how this can happen. Notice that the 95\% confidence interval is entirely within the range of values deemed too small to be meaningful. But the confidence interval does not contain zero, so $p < 0.05$. Any time the true parameter value is in the non-meaningful range but is not exactly zero, a configuration like this is guaranteed to occur if the sample size is large enough.

\begin{figure}% [here]
\caption{A relationship that is significant but too weak to be meaningful.}
\begin{center}
\includegraphics[width=4in]{AcceptH0}
\end{center}
\label{AcceptCI}
\end{figure}

Unfortunately, both the power method and the confidence interval method typically require a very large sample to conclude that a relationship is (virtually) absent. So it often happens that an important test is non-significant, but the power for detecting a marginal effect was fairly low, and the confidence interval includes both zero \emph{and} values that are not trivial. In this situation, the best we can do is follow Fisher's advice, and say that the data do not provide sufficient evidence to conclude that the explanatory and response variables are related.

Frequently, one has to write for a non-technical audience, and an important part of this course is to express conclusions in plain, non-technical language --- language that is understandable to someone with no statistical training, but at the same time acceptable to experts. Suppose you need to state conclusions, and the results are not statistically significant. Most of your primary audience has no statistical background, so you need to speak in clear, non-statistical language. But \emph{some} of the audience (maybe including the technical staff of your main audience) will be very disturbed if you seem to be accepting the null hypothesis; they can make a lot of trouble. How do you finesse this? Here are some statements that are acceptable. It's good not to use exactly the same phrase over and over.
\begin{itemize}
\item The data do not provide evidence that the treatment has any effect.
\item There was no meaningful connection between~\ldots
\item The results were consistent with no treatment effect.
\item The results were consistent with no association between astrological sign and personality type.
\item The small differences in average taste ratings could have been due to sampling error.
\item The small differences in average taste ratings were within the range of sampling error.
\end{itemize}
The nice thing about using this kind of language is that it communicates clearly to non-experts, but it lets the experts read between the lines and see that you are aware of the technical (philosophic) issue, and that you are being careful. There are many, many more examples in Moore and McCabe's \emph{Introduction to the practice of statistics}~\cite{mm93}. This introductory text is simple and non-technical on the surface, but written with all the theoretical complexities clearly in mind and under control. The result is a book that satisfies both the absolute beginner and the professional statistician --- quite an accomplishment.

\subsection{The Format of the Data File is Important!}

If you're the person who will be doing the statistical analysis for a research study, there is an initial period where you are learning the objectives of the study and how the data are going to be collected. For example, perhaps participants are going to watch some commercials and then fill out a questionnaire. From the very beginning, you should be thinking about what the cases are and what the explanatory and response variables are, checking whether determining the relationships between explanatory and response variables will satisfy the objectives of the research, and deciding what statistical tests to employ. All this applies whether you are helping plan the study, or (more likely, if you are a statistician) you are being brought in only after the data have already been collected.

Many scientific questions can be answered by determining whether explanatory variables and response variables are related. This makes it helpful to arrange data files in the row-by-column format suggested at the beginning of this chapter. Again, rows are usually cases, and columns are usually variables. But most data do not automatically come in this format unless a knowledgeable person has arranged it that way.

\begin{hint}\label{rowbycol}
If a data set is not already in a row-by-column format with rows corresponding to cases and columns corresponding to variables, you should put it in this format yourself, or get someone else to do it.
\end{hint}

Statistical software (including SAS) mostly expects data to be arranged this way, so Hint~\ref{rowbycol} is partly a matter of convenience. But there's more to it than that. You might be surprised how much a good data format can support good research design. For example, it is common for people who are very smart in other ways to record data over time at considerable effort and expense, but to change what data are recorded, or the way they are recorded, throughout the course of the study. As a result, almost nothing is comparable, and most of the effort is wasted. An investigator who is thinking in terms of variables and cases is less likely to make this blunder. The row-by-column format forces you to know how many cases there are, and which data come from the same case. Also, thinking in terms of variables helps you decide whether two different variables are intended as measures of the same thing at different times, or as quantities that are completely different.
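As a preview of how such a file is handled, here is a minimal sketch of a SAS program that reads a small row-by-column data file. The file name and variable names are invented for the example (say, data from the hypothetical commercial-watching study mentioned above), and the details of SAS data steps come later.
\begin{verbatim}
data study1;                       /* One row of the file per case */
   infile 'commercials.data';     /* Hypothetical raw data file   */
   input id sex income attitude;  /* One variable name per column */
run;
\end{verbatim}
The point for now is simply that the program assumes exactly the row-by-column arrangement recommended in Hint~\ref{rowbycol}.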
On the other hand, you should keep your mind open. It is possible that for some studies and certain advanced statistical models, a different structure of the data file could be better. But I have never seen an example that applies to real data. In my experience, when data are recorded in a format other than the one advocated here, it is a sign of \emph{lack} of sophistication on the part of the researchers. So in the next section, please pay attention to the format of the data files. Bear in mind, though, that these are all \emph{elementary} tests, with one explanatory variable and one response variable. Almost all real data sets have more than two variables.

\subsection{Standard elementary significance tests}\label{etests}

We will now consider some of the most common elementary statistical methods; these are covered in most introductory statistics courses. There is always just one explanatory variable and one response variable. For each test, you should be able to do the following.
\begin{enumerate}
\item Make up your own original example of a study in which the technique could be used.
\item In your example, what is the explanatory variable?
\item In your example, what is the response variable?
\item Indicate how the data file would be set up.
\end{enumerate}

\paragraph{Independent observations} One assumption shared by most standard methods is that of \emph{``independent observations."} The meaning of the assumption is this. Observations 13 and 14 are independent if and only if the conditional distribution of observation 14 given observation 13 is the same for each possible value of observation 13. For example, if the observations are temperatures on consecutive days, this would not hold. If the response variable is score on a homework assignment and students copy from each other, the observations will not be independent.

When significance testing is carried out under the assumption that observations are independent but really they are not, results that are actually due to chance will often be detected as significant with probability considerably greater than 0.05. This is sometimes called the problem of \emph{inflated} $n$. In other words, you are pretending you have more separate pieces of information than you really do. When observations cannot safely be assumed independent, this should be taken into account in the statistical analysis. We will return to this point again and again.

\subsubsection{Independent (two-sample) $t$-test}

This is a test for whether the means of two independent groups are different. Assumptions are independent observations, normality within groups, equal variances. For large samples, normality does not matter. For large samples with nearly equal sample sizes, the equal variance assumption does not matter. The assumption of independent observations is always important.

\begin{quest} Make up your own original example of a study in which a two-sample $t$-test could be used. \end{quest}
\begin{answ} An agricultural scientist is interested in comparing two types of fertilizer for potatoes. Fifteen small plots of ground receive fertilizer A and fifteen receive fertilizer B. Crop yield for each plot in pounds of potatoes harvested is recorded. \end{answ}
\begin{quest} In your example, what is the explanatory variable (or variables)? \end{quest}
\begin{answ} Fertilizer, a binary variable taking the values A and B. \end{answ}
\begin{quest} In your example, what is the response variable (or variables)? \end{quest}
\begin{answ} Crop yield in pounds. \end{answ}
\begin{quest} Indicate how the data file might be set up. \end{quest}
\begin{answ} \end{answ}
{\begin{center}
\begin{tabular}{cc}
A & 13.1 \\
A & 11.3 \\
\vdots & \vdots \\
B & 12.2 \\
\vdots & \vdots \\
\end{tabular}
\end{center}}
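For an early glimpse of the software, here is a minimal sketch of how the potato study might be analyzed with SAS \texttt{proc ttest}, assuming the data above have been read into a SAS data set called \texttt{potato} with variables named \texttt{fertilizer} and \texttt{yield}; the names are invented for the example, and the mechanics of reading in data are covered later.
\begin{verbatim}
proc ttest data=potato;
   class fertilizer;   /* Explanatory variable: A or B          */
   var yield;          /* Response variable: pounds of potatoes */
run;
\end{verbatim}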
\subsubsection{Matched (paired) $t$-test}

Again comparing two means, but from paired observations. Pairs of observations come from the same case (subject, unit of analysis), and presumably are non-independent. The matched $t$-test takes this lack of independence into account by computing a difference for each pair, reducing the volume of data (and the apparent sample size) by half. This is our first example of a \emph{repeated measures} analysis. Here is a general definition. We will say that there are \textbf{repeated measures} on an explanatory variable if a case (unit of analysis, subject, participant in the study) contributes a value of the response variable for each value of the explanatory variable in question. A variable on which there are repeated measures is sometimes called a \textbf{within-cases} (or within-subjects) variable. When this language is being spoken, variables on which there are not repeated measures are called \textbf{between-cases}. In a within-cases design, each case serves as its own control. When the correlations among data from the same case are substantial, a within-cases design can have higher power than a between-cases design.

The assumptions of the matched $t$-test are that the differences represent independent observations from a normal population. For large samples, normality does not matter. The assumption that different cases represent independent observations is always important.

\begin{quest} Make up your own original example of a study in which a matched $t$-test could be used. \end{quest}
\begin{answ} Before and after a 6-week treatment, participants in a quit-smoking program were asked ``On the average, how many cigarettes do you smoke each day?" \end{answ}
\begin{quest} In your example, what is the explanatory variable (or variables)? \end{quest}
\begin{answ} Presence versus absence of the program, a binary variable taking the values ``Absent" or ``Present" (or maybe ``Before" and ``After"). We can say there are \emph{repeated measures} on this factor, or that it is a \emph{within-subjects} factor. \end{answ}
\begin{quest} In your example, what is the response variable (or variables)? \end{quest}
\begin{answ} Reported number of cigarettes smoked per day. \end{answ}
\begin{quest} Indicate how the data file might be set up. \end{quest}
\begin{answ} The first column is ``Before," and the second column is ``After." \end{answ}
{\begin{center}
\begin{tabular}{cc}
22 & 18 \\
40 & 34 \\
20 & 10 \\
\vdots & \vdots \\
\end{tabular}
\end{center}}
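Continuing the software preview, a matched $t$-test on the quit-smoking data might look like this in SAS, assuming a data set called \texttt{quitsmoke} whose two columns have been named \texttt{before} and \texttt{after} (again, invented names).
\begin{verbatim}
proc ttest data=quitsmoke;
   paired before*after;   /* Tests whether the mean difference is zero */
run;
\end{verbatim}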
\subsubsection{One-way Analysis of Variance}

Extension of the independent $t$-test to two or more groups. Same assumptions, same everything. $F = t^2$ for two groups.

\begin{quest} Make up your own original example of a study in which a one-way analysis of variance could be used. \end{quest}
\begin{answ} Eighty branches of a large bank were chosen to participate in a study of the effect of music on tellers' work behaviour. Twenty branches were randomly assigned to each of the following 4 conditions. 1=No music, 2=Elevator music, 3=Rap music, 4=Individual choice (headphones). Average customer satisfaction and worker satisfaction were assessed for each bank branch, using a standard questionnaire. \end{answ}
\begin{quest} In your example, what are the cases? \end{quest}
\begin{answ} Branches, not people answering the questionnaire. \end{answ}
\begin{quest} Why do it that way? \end{quest}
\begin{answ} To avoid serious potential problems with independent observations within branches. The group of interacting people within a social setting is the natural unit of analysis, like an organism. \end{answ}
\begin{quest} In your example, what is the explanatory variable (or variables)? \end{quest}
\begin{answ} Type of music, a categorical variable taking on 4 values. \end{answ}
\begin{quest} In your example, what is the response variable (or variables)? \end{quest}
\begin{answ} There are 2 response variables, average customer satisfaction and average worker satisfaction. If they were analyzed simultaneously the analysis would be multivariate (and not elementary). \end{answ}
\begin{quest} Indicate how the data file might be set up. \end{quest}
\begin{answ} The columns correspond to Branch, Type of Music, Customer Satisfaction and Worker Satisfaction. \end{answ}
{\begin{center}
\begin{tabular}{cccc}
1 & 2 & 4.75 & 5.31 \\
2 & 4 & 2.91 & 6.82 \\
\vdots & \vdots & \vdots & \vdots \\
80 & 2 & 5.12 & 4.06 \\
\end{tabular}
\end{center}}
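A minimal SAS sketch for this layout might use \texttt{proc glm}, assuming a data set called \texttt{music} with invented variable names \texttt{branch}, \texttt{mtype}, \texttt{custsat} and \texttt{wkrsat}, and analyzing one response variable at a time to keep the analysis elementary.
\begin{verbatim}
proc glm data=music;
   class mtype;             /* Type of music: 4 categories            */
   model custsat = mtype;   /* One-way ANOVA on customer satisfaction */
run;
\end{verbatim}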
\begin{quest} How could this be made into a repeated measures study? \end{quest}
\begin{answ} Let each branch experience each of the 4 music conditions in a random order (or better, use only 72 branches, with 3 branches receiving each of the 24 orders). There would then be 10 pieces of data for each branch: Branch, Order (a number from 1 to 24), and customer satisfaction and worker satisfaction for each of the 4 conditions. \label{counterbal} \end{answ}

Including all orders of presentation in each experimental condition is an example of \textbf{counterbalancing} --- that is, presenting stimuli in such a way that order of presentation is unrelated to experimental condition. That way, the effects of the treatments are not confused with fatigue or practice effects (on the part of the experimenter as well as the subjects). In counterbalancing, it is often not feasible to include \emph{all} possible orders of presentation in each experimental condition, because sometimes there are too many. The point is that order of presentation has to be unrelated to any manipulated explanatory variable.

\subsubsection{Two (and higher) way Analysis of Variance}

Extension of One-Way ANOVA to allow assessment of the joint relationship of several categorical explanatory variables to one quantitative response variable that is assumed normal within treatment combinations. Tests for interactions between explanatory variables are possible. An interaction means that the relationship of one explanatory variable to the response variable \emph{depends} on the value of another explanatory variable. This method is not really elementary, because there is more than one explanatory variable.

\subsubsection{Crosstabs and chi-squared tests}

Cross-tabulations (Crosstabs) are joint frequency distributions of two categorical variables. One can be considered an explanatory variable, the other a response variable if you like. In any case (even when the explanatory variable is manipulated in a true experimental study) we will test for significance using the \emph{chi-squared test of independence}. The assumption is that independent observations are drawn from a multinomial distribution. Violation of the independence assumption is common and very serious.

\begin{quest} Make up your own original example of a study in which this technique could be used. \end{quest}
\begin{answ} For each of the prisoners in a Toronto jail, record the race of the offender and the race of the victim. This is illegal; you could go to jail yourself for publishing the results. It's totally unclear which is the explanatory variable and which is the response variable, so I'll make up another example. For each of the graduating students from a university, record main field of study and gender of the student (male or female). \end{answ}
\begin{quest} In your example, what is the explanatory variable (or variables)? \end{quest}
\begin{answ} Gender. \end{answ}
\begin{quest} In your example, what is the response variable (or variables)? \end{quest}
\begin{answ} Main field of study (many numeric codes). \end{answ}
\begin{quest} Indicate how the data file would be set up. \end{quest}
\begin{answ} The first column is Gender (0=Male, 1=Female). The second column is Field. \end{answ}
{\begin{center}
\begin{tabular}{cc}
1 & 2 \\
0 & 14 \\
0 & 9 \\
\vdots & \vdots \\
\end{tabular}
\end{center}}

\subsubsection{Correlation and Simple Regression}

\paragraph{Correlation} Start with a \textbf{scatterplot} showing the association between two (quantitative, usually continuous) variables. A scatterplot is a set of Cartesian coordinates with a dot or other symbol showing the location of each $(x,y)$ pair. If one of the variables is clearly the explanatory variable, it's traditional to put it on the $x$ axis. There are $n$ points on the scatterplot, where $n$ is the number of cases in the data file. Often, the points in a scatterplot cluster around a straight line. The correlation coefficient (Pearson's $r$) expresses how close the points are to the line.

\pagebreak

Here are some properties of the correlation coefficient $r$:
\begin{itemize}
\item $-1 \leq r \leq 1$
\item $r = +1$ indicates a perfect positive linear relationship. All the points are exactly on a line with a positive slope.
\item $r = -1$ indicates a perfect negative linear relationship. All the points are exactly on a line with a negative slope.
\item $r = 0$ means no \emph{linear} relationship (curve possible)
\item $r^2$ represents explained variation, reduction in (squared) error of prediction. For example, the correlation between scores on the Scholastic Aptitude Test (SAT) and first-year grade point average (GPA) is around +0.50, so we say that SAT scores explain around 25\% of the variation in first-year GPA.
\end{itemize}

The test of significance for Pearson's $r$ assumes a bivariate normal distribution for the two variables; this means that the only possible relationship between them is linear. As usual, the assumption of independent observations is always important. Here are some examples of scatterplots and the associated correlation coefficients. The number 2 on a plot means that two points are on top of each other, or at least too close to be distinguished in this crude line printer graphic.
\begin{scriptsize} \begin{verbatim} - * * C1 - - - * 60+ ** * - * * * * 2* * * - * ** * * * - * - * 2 2* ** * * * 45+ * * *2 * - * * * - * - * * * - * * 30+ - * * - +---------+---------+---------+---------+---------+------C3 20 30 40 50 60 70 Correlation of C1 and C3 = 0.004 \end{verbatim}\end{scriptsize} \begin{scriptsize} \begin{verbatim} 75+ * - C4 - - - * * * 60+ * - * * * * 2 * * * - * ** ** - * * *2 - * ** * * * * 45+ * * ** * * * - 2 * - * 2 *** - * - 30+ * * - ------+---------+---------+---------+---------+---------+C6 112 128 144 160 176 192 Correlation of C4 and C6 = 0.112 \end{verbatim}\end{scriptsize} \begin{scriptsize} \begin{verbatim} 80+ - * C3 - * - - * * 60+ * * * * - * * * * - * * ** * - * * * * *2** * ** 2 * * * - * 2 2 * * 40+ * * * - * ** - * * - - * 20+ - --+---------+---------+---------+---------+---------+----C7 165 180 195 210 225 240 Correlation of C3 and C7 = 0.368 \end{verbatim}\end{scriptsize} \pagebreak \begin{scriptsize} \begin{verbatim} 75+ * - C4 - - - * * * 60+ * - * * * *** * * * - * * * ** - * * 2 * - * * * * * * * 45+ * * ** * * * - ** * - * 2 *** - * - 30+ ** - --+---------+---------+---------+---------+---------+----C7 165 180 195 210 225 240 Correlation of C4 and C7 = 0.547 \end{verbatim}\end{scriptsize} %\pagebreak \begin{scriptsize} \begin{verbatim} - C5 - * * - - * * * * 120+ * - * * - * * * - * * 2 - * ** 100+ * * * * ** * * * - * * * - * * * * - * * * * * - * * * * 80+ ** * - * - * * --+---------+---------+---------+---------+---------+----C7 165 180 195 210 225 240 Correlation of C5 and C7 = 0.733 \end{verbatim}\end{scriptsize} \pagebreak \begin{scriptsize} \begin{verbatim} - C5 - ** - - * * * * 120+ * - * * - * * * - 2** - * ** 100+ * * 2 *2 * * - ** * - **2 - 2 * * * - * * * * 80+ 2 * - * - * * --+---------+---------+---------+---------+---------+----C9 -192 -176 -160 -144 -128 -112 Correlation of C5 and C9 = -0.822 \end{verbatim}\end{scriptsize} % \pagebreak \begin{scriptsize} \begin{verbatim} - - * 100+ * *2 - ** * C2 - 2* * * - ** * * ** - 2* 2 ** 50+ * ** *2 - * ** * - *** 2* ** * *** - * * * * * * - ** * 2 * * * * * * 0+ *** * 2 * * ** - * ** - * * - * - --------+---------+---------+---------+---------+--------C1 -8.0 -4.0 0.0 4.0 8.0 Correlation of C1 and C2 = 0.025 \end{verbatim}\end{scriptsize} \pagebreak \begin{scriptsize} \begin{verbatim} 200+ - C2 - * ** * - ** ** *** * - * * * ** * * * 100+ * ** ** 2** ** * * - **** * * 2 2 - * * * - * * * *** * - * *** *****2 0+ * * * - * - * * * - * ** - ****** -100+ * - --------+---------+---------+---------+---------+--------C1 -8.0 -4.0 0.0 4.0 8.0 Correlation of C1 and C2 = -0.811 \end{verbatim}\end{scriptsize} \paragraph{Simple Regression} One explanatory variable, one dependent. In the usual examples both are quantitative (continuous). We fit a \textbf{least-squares} line to the cloud of points in a scatterplot. The least-squares line is the unique line that minimizes the sum of squared vertical distances between the line and the points in the scatterplot. That is, it minimizes the total (squared) error of prediction. Denoting the slope of the least-squares line by $b_1$ and the intercept of the least-squares line by $b_0$, \begin{displaymath} b_1 = r \frac{s_y}{s_x} \mbox{ and } b_0 = \overline{Y} - b_1 \overline{X}. \end{displaymath} That is, the slope of the least squares has the same sign as the correlation coefficient, and equals zero if and only if the correlation coefficient is zero. Usually, you want to test whether the slope is zero. 
This is the same as testing whether the correlation is zero, and mercifully yields the same $p$-value. Assumptions are independent observations (again) and that within levels of the explanatory variable, the response variable has a normal distribution with the same variance (the variance does not depend on the value of the explanatory variable). Robustness properties are similar to those of the two-sample $t$-test. The assumption of independent observations is always important.

\subsubsection{Multiple Regression}

Regression with several explanatory variables at once; we're fitting a (hyper) plane rather than a line. Multiple regression is very flexible; all the other techniques mentioned above (except the chi-squared test) are special cases of multiple regression. More details will be given later.

\section{Experimental versus observational studies}

Why might someone want to predict a response variable from an explanatory variable? There are two main reasons.
\begin{itemize}
\item There may be a practical reason for prediction. For example, a company might wish to predict who will buy a product, in order to maximize the productivity of its sales force. Or, an insurance company might wish to predict who will make a claim, or a university computer centre might wish to predict the length of time a type of hard drive will last before failing. In each of these cases, there will be some explanatory variables that are to be used for prediction, and although the people doing the study may be curious and may have some ideas about how things might turn out and why, they don't really care why it works, as long as they can predict with some accuracy. Does variation in the explanatory variable \emph{cause} variation in the response variable? Who cares?
\item This may be science (of some variety). The goal may be to understand how the world works --- in particular, to understand the response variable. In this case, most likely we are implicitly or explicitly thinking of a causal relationship between the explanatory variable and response variable. Think of attitude similarity and interpersonal attraction~\ldots.
\end{itemize}

\begin{quest} A study finds that high school students who have a computer at home get higher grades on average than students who do not. Does this mean that parents who can afford it should buy a computer to enhance their children's chances of academic success? \end{quest}

Here is an answer that gets \textbf{zero} points. ``Yes, with a computer the student can become computer literate, which is a necessity in our competitive and increasingly technological society. Also the student can use the computer to produce nice looking reports (neatness counts!), and obtain valuable information on the World Wide Web." \textbf{ZERO}. The problem with this answer is that while it makes some fairly reasonable points, it is based on personal opinion, and fails to address the real question, which is ``\textbf{Does this mean} \ldots" Here is an answer that gets full marks.

\begin{answ} Not necessarily. While it is possible that some students are doing better academically and therefore getting into university because of their computers, it is also possible that their parents have enough money to buy them a computer, and also have enough money to pay for their education. It may be that an academically able student who is more likely to go to university will want a computer more, and therefore be more likely to get one somehow.
Therefore, the study does not provide good evidence that a computer at home will enhance chances of academic success. \end{answ} Note that in this answer, the \emph{focus is on whether the study provides good evidence} for the conclusion, not whether the conclusion is reasonable on other grounds. And the answer gives \emph{specific alternative explanations} for the results as a way of criticizing the study. If you think about it, suggesting plausible alternative explanations is a very damaging thing to say about any empirical study, because you are pointing out that the investigators expended a huge amount of time and energy, but didn't establish anything conclusive. Also, suggesting alternative explanations is extremely valuable, because that is how research designs get improved and knowledge advances. In all these discussions of causality, it is important to understand what the term does \emph{not} mean. If we say that smoking cigarettes causes lung cancer, it does not mean that you will get lung cancer if and only if you smoke cigarettes. It means that smoking \emph{contributes} to the \emph{chances} that you will get cancer. So when we say ``cause," we really mean ``contributing factor." And it is almost always one contributing factor among many. Now here are some general principles. If $X$ and $Y$ are measured at roughly the same time, $X$ could be causing $Y$, Y could be causing $X$, or there might be some third variable (or collection of variables) that is causing both $X$ and $Y$. Therefore we say that "Correlation does not necessarily imply causation." Here, by correlation we mean association (lack of independence) between variables. It is not limited to situations where you would compute a correlation coefficient. A \textbf{confounding variable} is a variable not included as an explanatory variable, that might be related to both the explanatory variable and the response variable -- and that might therefore create a seeming relationship between them where none actually exists, or might even hide a relationship that is present. Some books also call this a ``lurking variable." You are responsible for the vocabulary ``confounding variable." An \textbf{experimental study} is one in which cases are randomly assigned to the different values of an explanatory variable (or variables). An \textbf{observational study} is one in which the values of the explanatory variables are not randomly assigned, but merely observed. Some studies are purely observational, some are purely experimental, and many are mixed. It's not really standard terminology, but in this course we will describe explanatory \emph{variables} as experimental (i.e., randomly assigned, manipulated) or observed. In an experimental study, there is no way the response variable could be causing the explanatory variable, because values of the explanatory variable are assigned by the experimenter. Also, it can be shown (using the Law of Large Numbers) that when units of observation are randomly assigned to values of an explanatory variable, all potential confounding variables are cancelled out as the sample size increases. This is very wonderful. You don't even have to know what they are! \begin{quest} Is it possible for a continuous variable to be experimental, that is, randomly assigned? \label{ndose} % See Sample Question~\ref{ndose} \end{quest} \begin{answ} Sure. In a drug study, let one of the explanatory variables consist of $n$ equally spaced dosage levels spanning some range of interest, where $n$ is the sample size. 
Randomly assign one participant to each dosage level.
\end{answ}

\begin{quest}
Give an original example of a study with one quantitative observed explanatory variable and one categorical manipulated explanatory variable. Make the study multivariate, with one response variable consisting of unordered categories and two quantitative response variables.
\end{quest}

\begin{answ}
Stroke patients in a drug study are randomly assigned to either a standard blood pressure drug or one of three experimental blood pressure drugs; the quantitative observed explanatory variable is the patient's age at the beginning of the study. The categorical response variable is whether the patient is alive or not 5 years after the study begins. The quantitative response variables are systolic and diastolic blood pressure one week after beginning drug treatment.
\end{answ}

In practice, of course there would be a lot more variables; but it's still a good answer. Because of possible confounding variables, only an experimental study can provide good evidence that an explanatory variable \emph{causes} a response variable. Words like ``effect," ``affect," ``leads to" and so on imply claims of causality, and are only justified for experimental studies.

\begin{quest}
Design a study that could provide good evidence of a causal relationship between having a computer at home and academic success.
\end{quest}

\begin{answ}
High school students without computers enter a lottery. The winners (50\% of the sample) get a computer to use at home. The response variable is whether or not the student enters university.
\end{answ}

\begin{quest}
Is there a problem with independent observations here? Can you fix it?
\end{quest}

\begin{answ}
Oops. Yes. Students who win may be talking to each other, sharing software and so on. Actually, the losers will be communicating too. Therefore their behaviour is non-independent and standard significance tests will be invalid. One solution is to hold the lottery in $n$ separate schools, with one winner in each school. If the response variable were GPA, we could do a matched $t$-test comparing the performance of the winner to the average performance of the losers.
\end{answ}

\begin{quest}
What if the response variable is going to university or not?
\end{quest}

\begin{answ}
We are getting into deep water here. Here is how I would do it. In each school, give a score of ``1" to each student who goes to university, and a ``0" to each student who does not. Again, compare the scores of the winners to the average scores of the losers in each school using a matched $t$-test. Note that the mean difference that is to be compared with zero here is the mean difference in probability of going to university, between students who get a computer to use and those who do not. While the differences for each school will not be normally distributed, the central limit theorem tells us that the mean difference will be approximately normal if there are more than about 20 schools, so the $t$-test is valid. In fact, the $t$-test is conservative, because the tails of the $t$ distribution are heavier than those of the standard normal. This answer is actually beyond the scope of the present course.
\end{answ}

\subsubsection{Artifacts and Compromises}\label{artifacts}

Random assignment to experimental conditions will take care of confounding variables, but only if it is done right. It is amazingly easy for confounding variables to sneak back into a true experimental study through defects in the procedure.
For example, suppose you are interested in studying the roles of men and women in our society, and you have a 50-item questionnaire that (you hope) will measure traditional sex role attitudes on a scale from 0 = Very Non-traditional to 50 = Very Traditional. However, you suspect that the details of how the questionnaire is administered could have a strong influence on the results. In particular, the sex of the person administering the questionnaire and how he or she is dressed could be important. Your subjects are university students, who must participate in your study in order to fulfill a course requirement in Introductory Psychology. You randomly assign your subjects to one of four experimental conditions: Female research assistant casually dressed, Female research assistant formally dressed, Male research assistant casually dressed, or Male research assistant formally dressed.

Subjects in each experimental condition are instructed to report to a classroom at a particular time, and they fill out the questionnaire sitting all together. This is an appealing procedure from the standpoint of data collection, because it is fast and easy. However, it is so flawed that it may be a complete waste of time to do the study at all. Here's why. Because subjects are run in four batches, an unknown number of confounding variables may have crept back into the study. To name a few, subjects in different experimental conditions will be run at different times of day or different days of the week. Suppose subjects in the male formally dressed condition fill out the questionnaire at 8 in the morning. Then \emph{all} the subjects in that condition are exposed to the stress and fatigue of getting up early, as well as the treatment to which they have been randomly assigned.

There's more, of course. Presumably there are just two research assistants, one male and one female. So there can be order effects; at the very least, the lab assistant will be more practiced the second time he or she administers the questionnaire. And, though the research assistants will surely try to administer the questionnaire in a standard way, do you really believe that their body language, facial expressions and tone of voice will be identical both times?

Of course, the research assistants know what condition the subjects are in, they know the hypotheses of the study, and they probably have a strong desire to please the boss --- the investigator (professor or whatever) who is directing this turkey, uh, excuse me, I mean this research. Therefore, their behaviour could easily be slanted, perhaps unconsciously so, to produce the hypothesized effects.

This kind of phenomenon is well-documented. It's called \emph{experimenter expectancy}. Experimenters find what they expect to find. If they are led to believe that certain mice are very intelligent, then those mice will do better on all kinds of learning tasks, even though in fact the mice were randomly assigned to be labeled as ``intelligent." This kind of thing applies all the way down to flatworms. The classic reference is Robert Rosenthal's \emph{Experimenter expectancy in behavioral research}~\cite{expexp}. Naturally, the expectancy phenomenon applies to teachers and students in a classroom setting, where it is called \emph{teacher expectancy}. The reference for this is Rosenthal and Jacobson's \emph{Pygmalion in the classroom}~\cite{pyg}. It is wrong (and complacent) to believe that expectancy effects are confined to psychological research.
In medicine, \emph{placebo effects} are well-documented. Patients who are given an inert substance like a sugar pill do better than patients who are not, provided that they or their doctors believe that they are getting medicine that works. Is it the patients' expectancies that matter, or the doctors'? Probably both. The standard solution, and the \emph{only} acceptable solution in clinical trials of new drugs, is the so-called \emph{double blind}, in which subjects are randomly assigned to receive either the drug or a placebo, and neither the patient nor the doctor knows which it is. This is the gold standard. Accept no substitutes.

Until now, we have been discussing threats to the \emph{Internal Validity} of research. A study has good internal validity if it's designed to eliminate the influence of confounding variables, so one can be reasonably sure that the observed effects really are being produced by the explanatory variables of interest. But there's also \emph{External Validity}. External validity refers to how well the phenomena outside the laboratory or data-collection situation are being represented by the study. For example, well-controlled, double-blind taste tests indicated that the Coca-Cola company had a recipe that consumers liked better than the traditional one. But attempts to market ``New" Coke were an epic disaster. There was just more going on in the real world of soft drink consumption than in the artificial laboratory setting of a taste test. Cook and Campbell's \emph{Quasi-experimentation}~\cite{quasi} contains an excellent discussion of internal versus external validity.

In Industrial-Organizational psychology, we have the \emph{Hawthorne Effect}, which takes its name from the Hawthorne plant of Western Electric, where some influential studies of worker productivity were carried out in the 1920's and 1930's. The basic idea is that when workers know that they are part of a study, almost anything you do will increase productivity. Make the lights brighter? Productivity increases. Make the lights dimmer? Productivity increases. This is how the Hawthorne Effect is usually described. The actual details of the studies and their findings are more complex~\cite{hawth}, but the general idea is that when people know they are participating in a study, they tend to feel more valued, and act accordingly. In this respect, the fact that the subjects know that a study is being carried out can introduce a serious distortion into the way things work, and make the results unrepresentative of what normally happens.

Medical research on non-human animals is always at least subject to discussion on grounds of external validity, as is almost any laboratory research in Psychology. Do you know why the blood vessels running away from the heart are called ``arteries?" It's because they were initially thought to contain air. Why? Because medical researchers were basing their conclusions entirely on dissections of dead bodies. In live bodies, the arteries are full of blood.

Generally speaking, the controlled environments that lead to the best internal validity also produce the greatest threats to external validity. Is a given laboratory setup capturing the essence of the phenomena under consideration, or is it artificial and irrelevant? It's usually hard to tell. The best way to make an informed judgement is to compare laboratory studies and field studies that are trying to answer the same questions.
The laboratory studies usually have better internal validity, and the field studies usually have better external validity. When the results are consistent, we feel more comfortable. \chapter{Introduction to SAS} \label{sas} SAS stands for ``Statistical Analysis System." Even though it runs on linux and Windows PCs as well as on bigger computers, it is truly the last of the great old mainframe statistical packages\footnote{This discussion refers to the core applications that are used to conduct traditional statistical analysis: Base SAS, SAS/STAT and SAS/ETS (Econometrics and Time Series). SAS also sells a variety of other software products. They are almost all tools for data extraction, processing and analysis, so they fall under the heading of Statistics broadly defined. However, the details are so shrouded in marketing and corporate IT jargon that you would need specialized (and expensive) training to understand what they do, and even then I assume the details are proprietary. This is a strategy that works well for the SAS Institute.}. The first beta release was in 1971, and the SAS Institute, Inc. was spun off from the University of North Carolina in 1976, the year after Bill Gates dropped out of Harvard. This is a serious pedigree, and it has both advantages and disadvantages. The advantages are that the number of statistical procedures SAS can do is truly staggering, and the most commonly used ones have been tested so many times by so many people that their correctness and numerical efficiency are beyond any question. For the purposes of this course, there are no bugs. The disadvantages of SAS are all related to the fact that it was \emph{designed} to run in a batch-oriented mainframe environment. So, for example, the SAS Institute has tried hard to make SAS an ``interactive" program, but as of January 2016, the interface is still basically file and text oriented, not graphical. \section{The Four Main File Types} A typical SAS job will involve four main types of file. \begin{itemize} \item \textbf{The Raw Data File}: A file consisting of rows and columns of numbers; or maybe some of the columns have letters (character data) instead of numbers. The rows represent observations and the columns represent variables, as described at the beginning of Section~\ref{vocab}. In the first example we will consider below, the raw data file is a plain text file called \texttt{studentsleep.data.txt}. In recent years it has become common for scientists to record their data using Microsoft Excel, so that real (not textbook) data sets will often be in Excel spreadsheets. The best arrangement is for rows to be cases and columns to be variables. SAS can read data directly from an Excel spreadsheet; this is illustrated for Student's sleep data in Section~\ref{EXCEL}. Data sets coming from corporations and other organizations may be in Excel format, or they may be in a relational database produced by software such as Microsoft Access. Databases can be imported using \texttt{proc sql} (Structured Query Language). \item \textbf{The Program File}: The program file consists of commands that the SAS software tries to follow. You create this file with a text editor, either an external editor like Notepad, or a built-in editor. The program file contains a reference to the raw data file (in the \texttt{infile} statement), so SAS knows where to find the data. In the first example we will consider below, the program file is called \texttt{sleep1.sas}. 
SAS expects program files to have the extension \texttt{.sas}, and you should always follow this convention.

\item \textbf{The Log File}: This file is produced by every SAS run, whether it is successful or unsuccessful. It contains a listing of the program file, as well as any error messages or warnings. The name of the log file is automatically generated by SAS; it will be something like \texttt{reading1.log} or \texttt{reading1-log.html}.

\item \textbf{The Output File}: The output file contains the output of the statistical procedures requested in the program file. Output files have names like \texttt{reading1-Results.pdf}, \texttt{reading1-Results.rtf}, or \texttt{reading1-Results.html}. A successful SAS run will almost always produce an output file. The absence of an output file indicates that there was at least one fatal error. The presence of an output file does not mean there were no errors; it just means that SAS was able to do \emph{some} of what you asked it to do. Even if there are errors, the output file will usually not contain any error messages; they will be in the log file.
\end{itemize}

\section{SAS University Edition}

The SAS Institute makes a great deal of money selling software licences to corporations, universities, government agencies, and to a lesser extent, individuals. Perhaps under pressure from the free R statistical software, it has recently been offering its core product free of charge to anyone with a university email address. It's called SAS University Edition. It's so well-designed and so convenient that it's difficult to imagine a professor choosing any other version of SAS for a statistics class. Here's the link:
\begin{center}
\href{http://www.sas.com/en_us/software/university-edition.html}
{\small\texttt{http://www.sas.com/en\_us/software/university-edition.html}}
\end{center}
% Details will be left to lecture, but here are a few comments and suggestions that may be helpful.

Regardless of operating system, SAS University Edition lives in a virtual \texttt{linux} machine.\footnote{A virtual computer is a set of software instructions that act like a complete, separate computer. So, for example, you could have a software version of the original IBM PC with the DOS operating system running on a modern laptop. Virtual machines are great for preserving legacy data and software, experimenting with viruses, and many other uses. In the bad old days, all the hardware in a virtual machine was represented by software instructions, and they were \emph{slow}. Now they can use the hardware of the host computer more directly, and there's not much of a performance hit.} In addition to having SAS installed, the \texttt{linux} machine is a Web server. But the web pages it hosts are not available to the entire internet. They are available only to you. Rather than having a proper IP address, the virtual \texttt{linux} machine has a \texttt{localhost} address: \texttt{http://localhost:10080}. With SAS running in the virtual machine, you point your browser to this address. It looks like you are on the Internet, but really you are on a network located within your computer. It's entirely local, and would work at the bottom of a coal mine.

The browser interface (actually a website located on the virtual \texttt{linux} machine) is called SAS Studio. It's really nice, with tabs rather than separate windows for the program, log and output files. You can print files from the browser, or save output in \texttt{pdf}, \texttt{rtf} or \texttt{html} format.
Because you are interacting with SAS indirectly through Web pages, the operating system on your computer does not matter much, if at all. If you are running Firefox on a Windows PC and I am running Safari on a Mac, the only differences we will experience are differences between Firefox and Safari. It's truly platform independent.

You get your data into SAS via a shared folder -- shared between your computer and the virtual \texttt{linux} machine. In the \texttt{infile} statement of your SAS job, begin the name of the data file with ``\texttt{/folders/myfolders/}." That's the path to the shared folder on the virtual \texttt{linux} machine. The shared folder on \emph{your} machine can be anywhere. When you create the shared folder on your machine, make sure the spelling and capitalization of the folder names are exactly according to the instructions. On your machine, the shared folder must be called \texttt{SASUniversityEdition}, with a sub-folder called \texttt{myfolders}. Sub-folders inside the folder \texttt{myfolders} are okay.

\section{Example 1: Student's sleep data}

\subsection{The raw data file}

The following illustrates a simple SAS run. The first step was to get the raw data file. It's a classic: the data that Student (William Gosset) used to illustrate the $t$-test in the 1908 \emph{Biometrika} paper where he first reported it~\cite{student}. These data are given in Gosset's paper. I created a plain-text version of the raw data file called \texttt{studentsleep.data.txt} by typing the numbers into a text editor and dragging the file to the \texttt{myfolders} sub-folder of the shared folder \texttt{SASUniversityEdition}. Here's the data file. Take a look.

\begin{verbatim}
Patient Drug 1 Drug 2
   1     0.7    1.9
   2    -1.6    0.8
   3    -0.2    1.1
   4    -1.2    0.1
   5    -0.1   -0.1
   6     3.4    4.4
   7     3.7    5.5
   8     0.8    1.6
   9     0.0    4.6
  10     2.0    3.4
\end{verbatim}

Actually, it's so obvious that you should look at your data that it is seldom mentioned. But experienced data analysts always do it --- or else they assume everything is okay and get a bitter lesson in something they already knew. This is so important that it gets the formal status of a \textbf{data analysis hint}.

\begin{hint}
Always look at your raw data file. If the data file is big, do it anyway. At least scroll through it, looking for anything strange. Check the values of all the variables for a few cases. Do they make sense? If you have obtained the data file from somewhere, along with a description of what's in it, never believe that the description you have been given is completely accurate.
\end{hint}

The file \texttt{studentsleep.data.txt} contains two variables for ten patients suffering from insomnia. Notice the variable names on the first line. Some software (like R) can use this information. As far as I know, SAS cannot. Furthermore, if SAS tries to read the data and encounters characters where it expects numbers, the results are unpleasant. One solution is to edit the raw data file and get rid of the labels, but actually labels like this can be useful. We'll get SAS to skip the first line, and start reading data from line two. Each variable is actually a difference, representing how much \emph{extra} sleep a patient got when taking a sleeping pill. Drug 1 is Dextro-hyoscyamine hydrobromide, while Drug 2 is Laevo-hyoscyamine hydrobromide. We want to know whether each drug is effective, and also which drug is more effective.
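For readers who would like to see the formula behind the tests in this example (it is not needed to follow the rest of the example), the one-sample $t$ statistic computed from a column of $n$ differences with sample mean $\overline{d}$ and sample standard deviation $s_d$ is
\begin{displaymath}
t = \frac{\overline{d}}{s_d/\sqrt{n}},
\end{displaymath}
with $n-1$ degrees of freedom, for testing the null hypothesis that the population mean difference is zero. For the Drug 2 minus Drug 1 differences computed below, the mean is 1.58 and the standard deviation is about 1.23, so with $n = 10$ the statistic is roughly $1.58/(1.23/\sqrt{10}) \approx 4.06$, the value that appears in the SAS output later in this example.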
Following Gosset, we'll use one-sample $t$-tests to decide whether each drug is effective; since these one-sample $t$-tests are carried out on differences, they are matched $t$-tests. We'll also compute a matched $t$-test comparing Drug 1 and Drug 2. Notice that this is a within-cases design.

To analyze the data with SAS, we need to create another plain text file containing the SAS program. SAS Studio has a nice built-in editor, and you can compose the whole SAS program with that. Or, you can do the first draft using an external text editor, drag it to \texttt{myfolders}, and then edit it there using the built-in SAS editor. If you do it this way, just make sure the program file has the extension \texttt{.sas}. For Student's sleep data, my program is called \texttt{sleep1.sas}.

\subsection{Structure of the Program File}

A SAS program file is composed of units called \emph{data steps} and \emph{proc steps}. The typical SAS program has one \texttt{data} step and at least one \texttt{proc} step, though other structures are possible.
\begin{itemize}
\item Most SAS commands belong either in a \texttt{data} step or in a \texttt{proc} step; they will generate errors if they are used in the wrong kind of step.
\item Some statements, like the \texttt{title} and \texttt{options} commands, exist outside of the \texttt{data} and \texttt{proc} steps, but there are relatively few of these.
\end{itemize}

\paragraph{The Data Step} The \texttt{data} step takes care of data acquisition and modification. It almost always includes a reference to at least one raw data file, telling SAS where to look for the data. It specifies variable names and labels, and provides instructions about how to read the data; for example, the data might be read from fixed column locations. Variables from the raw data file can be modified, and new variables can be created.

Each data step creates a \textbf{SAS data table}, a file consisting of the data (after modifications and additions), labels, and so on. Statistical procedures operate on SAS data tables, so you must create a SAS data table before you can start computing any statistics. A SAS data table is written in a binary format that is very convenient for SAS to process, but is not readable by humans. In the old days, SAS data tables were written to temporary scratch files on the computer's hard drive; these days, they may be maintained in RAM if they are small enough. In any case, the default is that a SAS data table disappears after the job has run. If the data step is executed again in a later run, the SAS data table is re-created.

Actually, it is possible to save a SAS data table on disk for later use. We won't do this here, but it makes sense when the amount of processing in a data step is large relative to the speed of the computer. As an extreme example, one of my colleagues uses SAS to analyze data from Ontario hospital admissions; the data files have millions of cases. Typically, it takes around 20 hours of CPU time on a very strong \texttt{unix} machine just to read the data and create a SAS data table. The resulting file, hundreds of gigabytes in size, is saved to disk, and then it takes just a few minutes to carry out each analysis. You wouldn't want to try this on a PC.

SAS data tables are not always created by SAS data steps. Some statistical procedures can create SAS data tables, too.
For example, \texttt{proc standard} can take an ordinary SAS data table as input, and produce an output data table that has all the original variables, and also some of the variables converted to $z$-scores (by subtracting off the mean and dividing by the standard deviation). \texttt{Proc reg} (the main multiple regression procedure) can produce a SAS data table containing residuals for plotting and use in further analysis; there are many other examples.

\paragraph{The \texttt{proc} Step} ``Proc" is short for procedure. Most procedures are statistical procedures; the most noticeable exception is \texttt{proc format}, which is used to provide labels for the values of categorical variables. The \texttt{proc} step is where you specify a statistical procedure that you want to carry out. A statistical procedure in the \texttt{proc} step will take a SAS data table as input, and write the results (summary statistics, values of test statistics, $p$-values, and so on) to the output file. The typical SAS program includes one \texttt{data} step and several \texttt{proc} steps, because it is common to produce a variety of data displays, descriptive statistics and significance tests in a single run.

\subsection{\texttt{sleep1.sas}}

Now we will look at \texttt{sleep1.sas} in some detail. This program is very simple; it has just one data step and two proc steps.

{\small
\begin{verbatim}
/* sleep1.sas */
title "t-tests on Student's Sleep data";

data bedtime;
     infile '/folders/myfolders/studentsleep.data.txt' firstobs=2; /* Skip the header */
     input patient xsleep1 xsleep2;
     sleepdif = xsleep2-xsleep1;  /* Create a new variable */

proc print;
     var patient xsleep1 xsleep2 sleepdif;

proc means n mean stddev t probt;
     var xsleep1 xsleep2 sleepdif;
\end{verbatim}
} % End size

\noindent Here are some detailed comments about \texttt{sleep1.sas}.
\begin{itemize}
\item The first line is a comment. Anything between a \texttt{/*} and \texttt{*/} is a comment, and will be listed on the log file but otherwise ignored by SAS. Comments can appear anywhere in a program. You are not required to use comments, but it's a good idea.

The most common error associated with comments is to forget to end them with \texttt{*/}. In the case of \texttt{sleep1.sas}, leaving off the \texttt{*/} (or typing \verb|/*| again by mistake) would cause the whole program to be treated as a comment. It would generate no errors, and no output --- because as far as SAS would be concerned, you never requested any. A longer program would eventually exceed the default length of a comment (it's some large number of characters) and SAS would end the ``comment" for you. At exactly that point (probably in the middle of a command) SAS would begin parsing the program. Almost certainly, the first thing it examined would be a fragment of a legal command, and this would cause an error. The log file would say that the command caused an error, and not much else. It would be \emph{very} confusing, because probably the command would be okay, and there would be no indication that SAS was only looking at part of it.

\item The next statement (the \texttt{title} statement) exists outside the proc steps and outside the data step. This is fairly rare.

\item All SAS statements end with a semi-colon (\texttt{;}). SAS statements can extend for several physical lines in the program file.
Spacing, indentation, breaking up a statement into several lines of text -- these are all for the convenience of the human reader, and are not part of the SAS syntax.

\item By far the most common error in SAS programming is to forget the semi-colon. When this happens, SAS tries to interpret the following statement as part of the one you forgot to end. This often causes not one error, but a cascading sequence of errors. The rule is, \emph{if you have an error and you do not immediately understand what it is, look for a missing semi-colon.} It will probably be \emph{before} the portion of the program that (according to SAS) caused the first error.

\item Cascading errors are not caused just by the dreaded missing semi-colon. They are common in SAS; for example, a runaway comment statement can easily cause a chain reaction of errors (if the program is long enough for it to cause any error messages at all). \emph{If you have a lot of errors in your log file, fix the first one and re-run the job; don't waste time trying to figure out the others.} Some or all of them may well disappear.

\item \texttt{title} This is optional, but recommended. The material between the quotes will appear at the top of each page. This can be a lifesaver when you are searching through a stack of old printouts for something you did a year or two ago.

\item \texttt{data bedtime;} This begins the data step, specifying that the name of the SAS data set being created is ``bedtime." The names of data sets are arbitrary, but you should make them informative. They should begin with a letter.

\item \texttt{infile} Specifies the name of the raw data file. It must begin with \texttt{/folders/myfolders/}, the path to the shared folder in the virtual \texttt{linux} machine.

\item \texttt{firstobs=2} Begin reading the data on line two, skipping the variable names. You can skip any number of lines this way, so a data file could potentially begin with a long description of how the data were collected.

\item \texttt{input} Gives the names of the variables.
\begin{itemize}
\item Variable names should begin with a letter. Avoid special characters like \$ or \#. The variable names will be used to specify statistical procedures requested in the \texttt{proc} step. They should be meaningful (related to what the variable \emph{is}), and easy to remember.
\item This is almost the simplest possible form of the \texttt{input} statement. It can be very powerful; for example, you can read data from different locations and in different orders, depending on the value of a variable you've just read, and so on. It can get complicated, but if the data file has a simple structure, the input statement can be simple too.
\end{itemize}

\item \texttt{sleepdif = xsleep2-xsleep1;} Create a new variable, representing how much more sleep the patient got with Drug 2, compared to Drug 1. This calculation is performed for each case in the data file. Notice that the new variable \texttt{sleepdif} does \emph{not} appear in the \texttt{input} statement. When some variables are to be created from others, it is a very good idea to do the computation within SAS. This makes raw data files smaller and more manageable, and also makes it easier to correct or re-define the computed variables.

\item \texttt{proc print;} Now the first \texttt{proc} step begins. All we are doing is listing the data to make sure we have computed \texttt{sleepdif} correctly. This is actually a good thing to do whenever you compute a new variable.
Of course you never (or very seldom) make hard copy of the complete output of \texttt{proc print}, because it can be very long. Once you're confident the data are what you think, delete the \texttt{proc print}. \item \texttt{var patient xsleep1 xsleep2 sleepdif;} List the variables you want to print. The word ``\texttt{var}" is obligatory, and is among a fairly large number of names reserved by the SAS system. If you tried to name one of your variables \texttt{var}, it wouldn't let you. \item \texttt{proc means;} This is the second \texttt{proc} step. Proc means is most often used to produce simple summary statistics for quantitative variables. The words \texttt{n mean stddev t probt} are optional, and specify that we want to see the following for each variable specified: the sample size, mean, standard deviation, $t$-test for testing whether the mean is different from zero, and the two-tailed $p$-value for the $t$-test. These are the paired $t$-tests we want. With just \texttt{proc means;} and not the option, we would get the default statistics: $n$, mean, standard deviation, minimum and maximum. These last two statistics are very useful, because they can alert you to outliers and errors in the data. \item \texttt{var} is obligatory. It is followed by a list of the variables for which you want to see means and other statistics. \end{itemize} \subsection{\texttt{sleep1.log}} Log files are not very interesting when everything is okay, but here is an example anyway. Notice that in addition to a variety of technical information (where the files are, how long each step took, and so on), it contains a listing of the SAS program --- in this case, \texttt{sleep1.sas}. If there were syntax errors in the program, this is where the error messages would appear. %\begin{scriptsize} \begin{verbatim} 1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 55 56 /* sleep1.sas */ 57 title "t-tests on Student's Sleep data"; 58 59 data mylatebedtime; 60 infile '/folders/myfolders/studentsleep.data.txt' firstobs=2; /* Skip the header */ 61 input patient xsleep1 xsleep2; 62 sleepdif = xsleep2-xsleep1; /* Create a new variable */ 63 NOTE: The infile '/folders/myfolders/studentsleep.data.txt' is: Filename=/folders/myfolders/studentsleep.data.txt, Owner Name=root,Group Name=vboxsf, Access Permission=-rwxrwx---, Last Modified=05Jan2016:14:26:25, File Size (bytes)=544 NOTE: 10 records were read from the infile '/folders/myfolders/studentsleep.data.txt'. The minimum record length was 47. The maximum record length was 47. NOTE: The data set WORK.MYLATEBEDTIME has 10 observations and 4 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 64 proc print; 65 var patient xsleep1 xsleep2 sleepdif; 66 NOTE: There were 10 observations read from the data set WORK.MYLATEBEDTIME. NOTE: PROCEDURE PRINT used (Total process time): real time 0.04 seconds cpu time 0.05 seconds 67 proc means n mean stddev t probt; 68 var xsleep1 xsleep2 sleepdif; 69 70 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; 82 \end{verbatim} %\end{scriptsize} \subsection{Output file} Here is the output file. Notice that the title specified in the \texttt{title} statement appears at the top. Then we get statistical output --- in this case, the listing of raw data and table of means and $t$-tests. 
%\begin{scriptsize}
\begin{verbatim}
               t-tests on Student's Sleep data                       1

        Obs    patient    xsleep1    xsleep2    sleepdif

          1       1         0.7        1.9        1.2
          2       2        -1.6        0.8        2.4
          3       3        -0.2        1.1        1.3
          4       4        -1.2        0.1        1.3
          5       5        -0.1       -0.1        0.0
          6       6         3.4        4.4        1.0
          7       7         3.7        5.5        1.8
          8       8         0.8        1.6        0.8
          9       9         0.0        4.6        4.6
         10      10         2.0        3.4        1.4


               t-tests on Student's Sleep data                       2

                        The MEANS Procedure

 Variable    N            Mean         Std Dev    t Value    Pr > |t|
 ---------------------------------------------------------------------
 xsleep1    10       0.7500000       1.7890097       1.33      0.2176
 xsleep2    10       2.3300000       2.0022487       3.68      0.0051
 sleepdif   10       1.5800000       1.2299955       4.06      0.0028
 ---------------------------------------------------------------------
\end{verbatim}
%\end{scriptsize}

The output is pretty self-explanatory. The $t$-tests do not provide convincing evidence that Drug 1 was effective. They suggest that Drug 2 was effective, and better than Drug 1.

\newpage
\enlargethispage*{1000 pt}
\subsection{Reading from an Excel spreadsheet} \label{EXCEL}

For convenience (my convenience), most of the data files used in this textbook are in plain text format. I have had most of them for quite a while. Data collected more recently tend to be in Microsoft Excel spreadsheets. Whether you find this repulsive or not, it is a fact of life. The following will serve as a model for reading data directly from an Excel spreadsheet.

I pasted Student's sleep data into a spreadsheet called \texttt{sleep1.xlsx}. Here it is. Notice that the column names should be valid SAS names, with no embedded blanks. When the file type is \texttt{xlsx}, watch out for leading and trailing blanks too. If you ignore this advice, SAS will convert the blanks to underscore characters (\_)\footnote{This is true as of SAS Version 9.4.}, and you will need to look carefully at your log file to see what the variable names are.

\vspace{20mm}

\begin{center}
\includegraphics[width=5in]{SleepSheet}
\end{center}

\newpage

\noindent Here's the SAS program.

\begin{verbatim}
/* sleep1c.sas */
title "t-tests on Student's Sleep data";
title2 'Read data from Excel Spreadsheet';

proc import datafile="/folders/myfolders/sleep1.xlsx"
            out=sleepy dbms=xlsx replace;
     getnames=yes;
/* Input data file is sleep1.xlsx
   Output data table is called sleepy
   dbms=xlsx     The input file is an Excel spreadsheet.
                 Necessary to read an Excel spreadsheet directly
                 under unix/linux
                 Works in PC environment too except for
                 Excel 4.0 spreadsheets
                 If there are multiple sheets, use
                 sheet="sheet1" or something.
   replace       If the data table already exists, replace it. Use this!
   getnames=yes  Use column names as variable names.
                 Beware of leading and trailing blanks
*/

/* proc print; */

data sleepy2;
     set sleepy;              /* Now sleepy2=sleepy */
     sleepdif = Drug2-Drug1;  /* Create a new variable */

proc print;
     var patient Drug1 drug2 sleepdif;

proc means n mean stddev t probt;
     var drug1 drug2 sleepdif;
\end{verbatim}

After the title, the first part of the program is a \texttt{proc import}, which imports the data into SAS. The code is thoroughly commented, but here are some details anyway.
\begin{itemize}
\item \texttt{proc import}
\begin{itemize}
\item \texttt{out=sleepy} creates a new data table called \texttt{sleepy}.
\item \texttt{dbms=xlsx} specifies that it's an \texttt{xlsx} spreadsheet. This specification is necessary to read an Excel spreadsheet directly under unix/linux. According to the manuals, it works in a Windows environment too except for Excel 4.0 spreadsheets. If you are reading a spreadsheet in the older \texttt{xls} format, just replace \texttt{xlsx} with \texttt{xls} throughout.
\item \texttt{replace}: If the data table already exists, replace it. Always use this! If you do not, any corrections you make to the spreadsheet will be ignored.
\item \texttt{getnames=yes}: Use column names as variable names. Beware of leading and trailing blanks.
\end{itemize}
\item \texttt{proc print;} This is commented out. It was used to verify that the data were imported correctly. This is highly recommended. You will ultimately save time by cross-checking everything you can.
\item \texttt{data sleepy2;} This data step creates a new data table called \texttt{sleepy2}. The \texttt{proc~import} created the data table \texttt{sleepy}, but you can't get at it directly to do anything else. The solution is to put the contents of \texttt{sleepy} into a new data table and modify that.
\begin{itemize}
\item \texttt{set sleepy;} This brings the contents of \texttt{sleepy} into \texttt{sleepy2}.
\item \texttt{sleepdif = Drug2-Drug1;} This creates the new variable \texttt{sleepdif}. Now it's possible to compute more new variables, add labels and do all the other things you'd do in a \texttt{data} step.
\end{itemize}
\end{itemize}

The rest is the same as the original example, except that I played with the capitalization of variable names to remind you that SAS is not very case sensitive.

\section{SAS Example Two: The statclass data}\label{statclass}

These data come from a statistics class taught many years ago. Students took eight quizzes, turned in nine computer assignments, and also took a midterm and final exam. The data file also includes gender and ethnic background; these last two variables are just guesses by the professor, and there is no way to tell how accurate they were. The data file looks like this. There are 21 columns and 62 rows of data; columns are not aligned and there are no column headers. Here are the first few lines.

\begin{verbatim}
1 2 9 1 7 8 4 3 5 2 6 10 10 10 5 0 0 0 0 55 43
0 2 10 10 5 9 10 8 6 8 10 10 8 9 9 9 9 10 10 66 79
1 2 10 10 5 10 10 10 9 8 10 10 10 10 10 10 9 10 10 94 67
1 2 10 10 8 9 10 7 10 9 10 10 10 9 10 10 9 10 10 81 65
0 1 10 1 0 0 8 6 5 2 10 9 0 0 10 6 0 5 0 54 .
1 1 10 6 7 9 8 8 5 7 10 9 10 9 5 6 4 8 10 57 52
0 1 0 0 9 9 10 5 2 2 8 7 7 10 10 6 3 7 10 49 .
0 1 10 9 5 8 9 8 5 6 8 7 5 6 10 6 5 9 9 77 64
0 1 10 8 6 8 9 5 3 6 9 9 6 9 10 6 5 7 10 65 42
1 1 10 5 6 7 10 4 6 0 10 9 10 9 10 6 7 8 10 73 .
0 1 9 0 4 6 10 5 3 3 10 8 10 5 10 10 9 9 10 71 37
\end{verbatim}
\begin{center}
\vdots
\end{center}

Notice the periods at the ends of lines 5, 7 and 10. The period is the SAS \emph{missing value code}. These people did not show up for the final exam. They may have taken a makeup exam, but if so their scores did not make it into this data file. When a case has a missing value recorded for a variable, SAS automatically excludes that case from any statistical calculation involving the variable. If a new variable is being created based on the value of a variable with a missing value, the new variable will usually have a missing value for that case too.

Here is the SAS program \texttt{statmarks1.sas}. It reads and labels the data, and then does a variety of significance tests. They are all elementary except the last one, which illustrates testing for one set of explanatory variables controlling for another set in multiple regression.
\begin{verbatim}
/* statmarks1.sas */
title 'Grades from STA3000 at Roosevelt University: Fall, 1957';
title2 'Illustrate Elementary Tests';

proc format;  /* Used to label values of the categorical variables */
     value sexfmt   0 = 'Male' 1 = 'Female';
     value ethfmt   1 = 'Chinese' 2 = 'European' 3 = 'Other' ;

data grades;
     infile '/folders/myfolders/statclass1.data.txt';
     input sex ethnic quiz1-quiz8 comp1-comp9 midterm final;
     /* Drop lowest score for quiz & computer */
     quizave = ( sum(of quiz1-quiz8) - min(of quiz1-quiz8) ) / 7;
     compave = ( sum(of comp1-comp9) - min(of comp1-comp9) ) / 8;
     label ethnic  = 'Apparent ethnic background (ancestry)'
           quizave = 'Quiz Average (drop lowest)'
           compave = 'Computer Average (drop lowest)';
     mark = .3*quizave*10 + .1*compave*10 + .3*midterm + .3*final;
     label mark = 'Final Mark';
     diff = quiz8-quiz1;  /* To illustrate matched t-test */
     label diff = 'Quiz 8 minus Quiz 1';
     mark2 = round(mark);
     /* Bump up at grade boundaries */
     if mark2=89 then mark2=90;
     if mark2=79 then mark2=80;
     if mark2=69 then mark2=70;
     if mark2=59 then mark2=60;
     /* Assign letter grade */
     if mark2=. then grade='Incomplete';
        else if mark2 ge 90 then grade = 'A';
        else if 80 le mark2 le 89 then grade='B';
        else if 70 le mark2 le 79 then grade='C';
        else if 60 le mark2 le 69 then grade='D';
        else grade='F';
     format sex sexfmt.;     /* Associates sex & ethnic    */
     format ethnic ethfmt.;  /* with formats defined above */

proc freq;
     title3 'Frequency distributions of the categorical variables';
     tables sex ethnic grade;

proc means;
     title3 'Means and SDs of quantitative variables';
     var quiz1 -- mark;  /* single dash only works with numbered
                            lists, like quiz1-quiz8 */

proc ttest;
     title3 'Independent t-test';
     class sex;
     var mark;

proc means n mean std t probt;
     title3 'Matched t-test: Quiz 1 versus 8';
     var quiz1 quiz8 diff;

proc glm;
     title3 'One-way anova';
     class ethnic;
     model mark = ethnic;
     means ethnic;
     means ethnic / Tukey Bon Scheffe;

proc freq;
     title3 'Chi-squared Test of Independence';
     tables sex*ethnic sex*grade ethnic*grade / chisq;

proc freq;  /* Added after seeing warning from chisq test above */
     title3 'Chi-squared Test of Independence: Version 2';
     tables sex*ethnic grade*(sex ethnic) / norow nopercent chisq expected;

proc corr;
     title3 'Correlation Matrix';
     var final midterm quizave compave;

proc plot;
     title3 'Scatterplot';
     plot final*midterm;  /* Really should do all combinations */

proc reg;
     title3 'Simple regression';
     model final=midterm;

/* Predict final exam score from midterm, quiz & computer */
proc reg simple;
     title3 'Multiple Regression';
     model final = midterm quizave compave / ss1;
     smalstuf: test quizave = 0, compave = 0;
\end{verbatim}

\noindent Noteworthy features of this program include
\begin{itemize}
\item \texttt{title}: Already discussed in connection with \texttt{sleep1.sas}.
\item \texttt{title2}: Subtitle, printed below the main title.
\item \texttt{proc format}: This is a non-statistical procedure -- a rarity in the SAS language. It is the way SAS takes care of labelling categorical variables when the categories are coded as numbers. \texttt{proc format} defines \emph{printing formats}. For any variable associated with the printing format named \texttt{sexfmt}, any time SAS would print the value ``0" (in a table or something) it instead prints the string ``Male." The associations between variables and printing formats are accomplished in the \texttt{format} statement at the end of the data step. The names of formats have a period at the end to distinguish them from variable names.
Of course formats must be defined before they can be associated with variables. This is why \texttt{proc format} precedes the data step.

\item \texttt{quiz1-quiz8}: One may refer to a \emph{range} of variables ending with consecutive numbers using a minus sign. In the \texttt{input} statement, a range can be defined (named) this way. It saves typing and is easy to read.

\item Creating new variables with assignment statements. The variables \texttt{quizave}, \texttt{compave} and \texttt{mark} are not in the original data file. They are created here, and they are appended to the end of the SAS data set in order of creation. Variables like this should never be in the raw data file.

\begin{hint}
When variables are exact mathematical functions of other variables, always create them in the data step rather than including them in the raw data file. It saves data entry, and makes the data file smaller and easier to read. If you want to try out a different definition of the variable, it's easy to change a few statements in the data step.
\end{hint}

\item \texttt{sum(of quiz1-quiz8)}: Without the word ``of," the minus sign is ambiguous. In the SAS language, \texttt{sum(quiz1-quiz8)} is the sum of a single number, the difference between \texttt{quiz1} and \texttt{quiz8}.

\item \texttt{format sex sexfmt.;} Associates the variable \texttt{sex} with its printing format. In questionnaire studies where a large number of items have the same potential responses (like a scale from 1 = Strongly Agree to 7 = Strongly Disagree), it is common to associate a long list of variables with a single printing format.

\item \texttt{quiz1 -- mark} in the first \texttt{proc means}: A double dash refers to a list of variables \emph{in the order of their creation} in the \texttt{data} step. Single dashes are for numerical order, while double dashes are for order of creation; it's very handy.

\item A title inside a procedure labels just that procedure.

\item \texttt{proc means n mean std t probt}: A matched $t$-test is just a single-variable $t$-test carried out on differences, testing whether the mean difference is equal to zero.

\item \texttt{proc glm}
\begin{itemize}
\item \texttt{class} Tells SAS that the explanatory variable \texttt{ethnic} is categorical.
\item \texttt{model} Response variable(s) = explanatory variable(s)
\item \texttt{means ethnic}: Mean of \texttt{mark} separately for each value of \texttt{ethnic}.
\item \texttt{means ethnic / Tukey Bon Scheffe}: Post hoc tests (multiple comparisons, probing, follow-ups). Used if the overall $F$-test is significant, to see which means are different from which other means.
\end{itemize}

\item \texttt{chisq} option on \texttt{proc freq}: Gives a large collection of chi-squared tests. The first one is the familiar Pearson chi-squared test of independence (the one comparing observed and expected frequencies).

\item \texttt{tables sex*ethnic / norow nopercent chisq expected;} In this second version of the crosstab produced by \texttt{proc freq}, we suppress the row and total percentages, and look at the expected frequencies because SAS warned us that some of them were too small. SAS issues a warning if any expected frequency is below 5; this is the old-fashioned rule of thumb. But it has been known for some time that Type~I error rates are affected mostly by expected frequencies smaller than one, not five --- so I wanted to take a look.

\item \texttt{proc corr} After \texttt{var}, list the variables you want to see in a correlation matrix.
\item \texttt{proc plot; plot final*midterm;} Scatterplot: First variable named goes on the $y$ axis. \item \texttt{proc reg}: \texttt{model} Response variable(s) = explanatory variable(s) again \item \texttt{simple} option on \texttt{proc reg} gives simple descriptive statistics. This last procedure is an example of multiple regression, and we will return to it later once we have more background. \end{itemize} \pagebreak \paragraph{The output file} \begin{scriptsize} \begin{verbatim} _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 1 Illustrate Elementary Tests Frequency distributions of the categorical variables The FREQ Procedure Cumulative Cumulative sex Frequency Percent Frequency Percent ----------------------------------------------------------- Male 39 62.90 39 62.90 Female 23 37.10 62 100.00 Apparent ethnic background (ancestry) Cumulative Cumulative ethnic Frequency Percent Frequency Percent ------------------------------------------------------------- Chinese 41 66.13 41 66.13 European 15 24.19 56 90.32 Other 6 9.68 62 100.00 Cumulative Cumulative grade Frequency Percent Frequency Percent --------------------------------------------------------------- A 3 4.84 3 4.84 B 6 9.68 9 14.52 C 18 29.03 27 43.55 D 21 33.87 48 77.42 F 10 16.13 58 93.55 Incomplete 4 6.45 62 100.00 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 2 Illustrate Elementary Tests Means and SDs of quantitative variables The MEANS Procedure Variable Label N Mean Std Dev ---------------------------------------------------------------------------- quiz1 62 9.0967742 2.2739413 quiz2 62 5.8870968 3.2294995 quiz3 62 6.0483871 2.3707744 quiz4 62 7.7258065 2.1590022 quiz5 62 9.0645161 1.4471109 quiz6 62 7.1612903 1.9264641 quiz7 62 5.7903226 2.1204477 quiz8 62 6.3064516 2.3787909 comp1 62 9.1451613 1.1430011 comp2 62 8.8225806 1.7604414 comp3 62 8.3387097 2.5020880 comp4 62 7.8548387 3.2180168 comp5 62 9.4354839 1.7237109 comp6 62 7.8548387 2.4350364 comp7 62 6.6451613 2.7526248 comp8 62 8.8225806 1.6745363 comp9 62 8.2419355 3.7050497 midterm 62 70.1935484 13.6235557 final 58 50.3103448 17.2496701 quizave Quiz Average (drop lowest) 62 7.6751152 1.1266917 compave Computer Average (drop lowest) 62 8.8346774 1.1204997 mark Final Mark 58 68.4830049 10.3902874 ---------------------------------------------------------------------------- Variable Label Minimum Maximum ------------------------------------------------------------------------ quiz1 0 10.0000000 quiz2 0 10.0000000 quiz3 0 10.0000000 quiz4 0 10.0000000 quiz5 4.0000000 10.0000000 quiz6 3.0000000 10.0000000 quiz7 0 10.0000000 quiz8 0 10.0000000 comp1 6.0000000 10.0000000 comp2 0 10.0000000 comp3 0 10.0000000 comp4 0 10.0000000 comp5 0 10.0000000 comp6 0 10.0000000 comp7 0 10.0000000 comp8 0 10.0000000 comp9 0 10.0000000 midterm 44.0000000 103.0000000 final 15.0000000 89.0000000 quizave Quiz Average (drop lowest) 4.5714286 9.7142857 compave Computer Average (drop lowest) 5.0000000 10.0000000 mark Final Mark 48.4821429 95.4571429 ------------------------------------------------------------------------ _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 3 Illustrate Elementary Tests Independent t-test The TTEST Procedure Statistics Lower CL Upper CL Lower CL Variable sex N Mean Mean Mean Std Dev Std Dev mark Male 36 
65.604 68.57 71.535 7.1093 8.7653 mark Female 22 62.647 68.341 74.036 9.8809 12.843 mark Diff (1-2) -5.454 0.2284 5.9108 8.8495 10.482 Statistics Upper CL Variable sex Std Dev Std Err Minimum Maximum mark Male 11.434 1.4609 54.057 89.932 mark Female 18.354 2.7382 48.482 95.457 mark Diff (1-2) 12.859 2.8366 T-Tests Variable Method Variances DF t Value Pr > |t| mark Pooled Equal 56 0.08 0.9361 mark Satterthwaite Unequal 33.1 0.07 0.9418 Equality of Variances Variable Method Num DF Den DF F Value Pr > F mark Folded F 21 35 2.15 0.0443 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 4 Illustrate Elementary Tests Matched t-test: Quiz 1 versus 8 The MEANS Procedure Variable Label N Mean Std Dev t Value --------------------------------------------------------------------------- quiz1 62 9.0967742 2.2739413 31.50 quiz8 62 6.3064516 2.3787909 20.87 diff Quiz 8 minus Quiz 1 62 -2.7903226 3.1578011 -6.96 --------------------------------------------------------------------------- Variable Label Pr > |t| ----------------------------------------- quiz1 <.0001 quiz8 <.0001 diff Quiz 8 minus Quiz 1 <.0001 ----------------------------------------- _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 5 Illustrate Elementary Tests One-way anova The GLM Procedure Class Level Information Class Levels Values ethnic 3 Chinese European Other Number of Observations Read 62 Number of Observations Used 58 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 6 Illustrate Elementary Tests One-way anova The GLM Procedure Dependent Variable: mark Final Mark Sum of Source DF Squares Mean Square F Value Pr > F Model 2 1238.960134 619.480067 6.93 0.0021 Error 55 4914.649951 89.357272 Corrected Total 57 6153.610084 R-Square Coeff Var Root MSE mark Mean 0.201339 13.80328 9.452898 68.48300 Source DF Type I SS Mean Square F Value Pr > F ethnic 2 1238.960134 619.480067 6.93 0.0021 Source DF Type III SS Mean Square F Value Pr > F ethnic 2 1238.960134 619.480067 6.93 0.0021 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 7 Illustrate Elementary Tests One-way anova The GLM Procedure Level of -------------mark------------ ethnic N Mean Std Dev Chinese 37 65.2688224 7.9262171 European 15 76.0142857 11.2351562 Other 6 69.4755952 13.3097753 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 8 Illustrate Elementary Tests One-way anova The GLM Procedure Tukey's Studentized Range (HSD) Test for mark NOTE: This test controls the Type I experimentwise error rate. Alpha 0.05 Error Degrees of Freedom 55 Error Mean Square 89.35727 Critical Value of Studentized Range 3.40649 Comparisons significant at the 0.05 level are indicated by ***. 
Difference ethnic Between Simultaneous 95% Comparison Means Confidence Limits European - Other 6.539 -4.460 17.538 European - Chinese 10.745 3.776 17.715 *** Other - European -6.539 -17.538 4.460 Other - Chinese 4.207 -5.814 14.228 Chinese - European -10.745 -17.715 -3.776 *** Chinese - Other -4.207 -14.228 5.814 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 9 Illustrate Elementary Tests One-way anova The GLM Procedure Bonferroni (Dunn) t Tests for mark NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than Tukey's for all pairwise comparisons. Alpha 0.05 Error Degrees of Freedom 55 Error Mean Square 89.35727 Critical Value of t 2.46941 Comparisons significant at the 0.05 level are indicated by ***. Difference ethnic Between Simultaneous 95% Comparison Means Confidence Limits European - Other 6.539 -4.737 17.814 European - Chinese 10.745 3.600 17.891 *** Other - European -6.539 -17.814 4.737 Other - Chinese 4.207 -6.067 14.480 Chinese - European -10.745 -17.891 -3.600 *** Chinese - Other -4.207 -14.480 6.067 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 10 Illustrate Elementary Tests One-way anova The GLM Procedure Scheffe's Test for mark NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than Tukey's for all pairwise comparisons. Alpha 0.05 Error Degrees of Freedom 55 Error Mean Square 89.35727 Critical Value of F 3.16499 Comparisons significant at the 0.05 level are indicated by ***. Difference ethnic Between Simultaneous 95% Comparison Means Confidence Limits European - Other 6.539 -4.950 18.027 European - Chinese 10.745 3.466 18.025 *** Other - European -6.539 -18.027 4.950 Other - Chinese 4.207 -6.260 14.674 Chinese - European -10.745 -18.025 -3.466 *** Chinese - Other -4.207 -14.674 6.260 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 11 Illustrate Elementary Tests Chi-squared Test of Independence The FREQ Procedure Table of sex by ethnic sex ethnic(Apparent ethnic background (ancestry)) Frequency| Percent | Row Pct | Col Pct |Chinese |European|Other | Total ---------+--------+--------+--------+ Male | 27 | 7 | 5 | 39 | 43.55 | 11.29 | 8.06 | 62.90 | 69.23 | 17.95 | 12.82 | | 65.85 | 46.67 | 83.33 | ---------+--------+--------+--------+ Female | 14 | 8 | 1 | 23 | 22.58 | 12.90 | 1.61 | 37.10 | 60.87 | 34.78 | 4.35 | | 34.15 | 53.33 | 16.67 | ---------+--------+--------+--------+ Total 41 15 6 62 66.13 24.19 9.68 100.00 Statistics for Table of sex by ethnic Statistic DF Value Prob ------------------------------------------------------ Chi-Square 2 2.9208 0.2321 Likelihood Ratio Chi-Square 2 2.9956 0.2236 Mantel-Haenszel Chi-Square 1 0.0000 0.9949 Phi Coefficient 0.2170 Contingency Coefficient 0.2121 Cramer's V 0.2170 WARNING: 33% of the cells have expected counts less than 5. Chi-Square may not be a valid test. 
Sample Size = 62 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 12 Illustrate Elementary Tests Chi-squared Test of Independence The FREQ Procedure Table of sex by grade sex grade Frequency| Percent | Row Pct | Col Pct |A |B |C |D |F |Incomple| Total | | | | | |te | ---------+--------+--------+--------+--------+--------+--------+ Male | 1 | 3 | 13 | 14 | 5 | 3 | 39 | 1.61 | 4.84 | 20.97 | 22.58 | 8.06 | 4.84 | 62.90 | 2.56 | 7.69 | 33.33 | 35.90 | 12.82 | 7.69 | | 33.33 | 50.00 | 72.22 | 66.67 | 50.00 | 75.00 | ---------+--------+--------+--------+--------+--------+--------+ Female | 2 | 3 | 5 | 7 | 5 | 1 | 23 | 3.23 | 4.84 | 8.06 | 11.29 | 8.06 | 1.61 | 37.10 | 8.70 | 13.04 | 21.74 | 30.43 | 21.74 | 4.35 | | 66.67 | 50.00 | 27.78 | 33.33 | 50.00 | 25.00 | ---------+--------+--------+--------+--------+--------+--------+ Total 3 6 18 21 10 4 62 4.84 9.68 29.03 33.87 16.13 6.45 100.00 Statistics for Table of sex by grade Statistic DF Value Prob ------------------------------------------------------ Chi-Square 5 3.3139 0.6517 Likelihood Ratio Chi-Square 5 3.2717 0.6582 Mantel-Haenszel Chi-Square 1 0.2342 0.6284 Phi Coefficient 0.2312 Contingency Coefficient 0.2253 Cramer's V 0.2312 WARNING: 58% of the cells have expected counts less than 5. Chi-Square may not be a valid test. Sample Size = 62 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 13 Illustrate Elementary Tests Chi-squared Test of Independence The FREQ Procedure Table of ethnic by grade ethnic(Apparent ethnic background (ancestry)) grade Frequency| Percent | Row Pct | Col Pct |A |B |C |D |F |Incomple| Total | | | | | |te | ---------+--------+--------+--------+--------+--------+--------+ Chinese | 0 | 2 | 11 | 17 | 7 | 4 | 41 | 0.00 | 3.23 | 17.74 | 27.42 | 11.29 | 6.45 | 66.13 | 0.00 | 4.88 | 26.83 | 41.46 | 17.07 | 9.76 | | 0.00 | 33.33 | 61.11 | 80.95 | 70.00 | 100.00 | ---------+--------+--------+--------+--------+--------+--------+ European | 2 | 4 | 5 | 3 | 1 | 0 | 15 | 3.23 | 6.45 | 8.06 | 4.84 | 1.61 | 0.00 | 24.19 | 13.33 | 26.67 | 33.33 | 20.00 | 6.67 | 0.00 | | 66.67 | 66.67 | 27.78 | 14.29 | 10.00 | 0.00 | ---------+--------+--------+--------+--------+--------+--------+ Other | 1 | 0 | 2 | 1 | 2 | 0 | 6 | 1.61 | 0.00 | 3.23 | 1.61 | 3.23 | 0.00 | 9.68 | 16.67 | 0.00 | 33.33 | 16.67 | 33.33 | 0.00 | | 33.33 | 0.00 | 11.11 | 4.76 | 20.00 | 0.00 | ---------+--------+--------+--------+--------+--------+--------+ Total 3 6 18 21 10 4 62 4.84 9.68 29.03 33.87 16.13 6.45 100.00 Statistics for Table of ethnic by grade Statistic DF Value Prob ------------------------------------------------------ Chi-Square 10 18.2676 0.0506 Likelihood Ratio Chi-Square 10 19.6338 0.0329 Mantel-Haenszel Chi-Square 1 5.6222 0.0177 Phi Coefficient 0.5428 Contingency Coefficient 0.4771 Cramer's V 0.3838 WARNING: 78% of the cells have expected counts less than 5. Chi-Square may not be a valid test. 
Sample Size = 62 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 14 Illustrate Elementary Tests Chi-squared Test of Independence: Version 2 The FREQ Procedure Table of sex by ethnic sex ethnic(Apparent ethnic background (ancestry)) Frequency| Expected | Col Pct |Chinese |European|Other | Total ---------+--------+--------+--------+ Male | 27 | 7 | 5 | 39 | 25.79 | 9.4355 | 3.7742 | | 65.85 | 46.67 | 83.33 | ---------+--------+--------+--------+ Female | 14 | 8 | 1 | 23 | 15.21 | 5.5645 | 2.2258 | | 34.15 | 53.33 | 16.67 | ---------+--------+--------+--------+ Total 41 15 6 62 Statistics for Table of sex by ethnic Statistic DF Value Prob ------------------------------------------------------ Chi-Square 2 2.9208 0.2321 Likelihood Ratio Chi-Square 2 2.9956 0.2236 Mantel-Haenszel Chi-Square 1 0.0000 0.9949 Phi Coefficient 0.2170 Contingency Coefficient 0.2121 Cramer's V 0.2170 WARNING: 33% of the cells have expected counts less than 5. Chi-Square may not be a valid test. Sample Size = 62 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 15 Illustrate Elementary Tests Chi-squared Test of Independence: Version 2 The FREQ Procedure Table of grade by sex grade sex Frequency | Expected | Col Pct |Male |Female | Total -----------+--------+--------+ A | 1 | 2 | 3 | 1.8871 | 1.1129 | | 2.56 | 8.70 | -----------+--------+--------+ B | 3 | 3 | 6 | 3.7742 | 2.2258 | | 7.69 | 13.04 | -----------+--------+--------+ C | 13 | 5 | 18 | 11.323 | 6.6774 | | 33.33 | 21.74 | -----------+--------+--------+ D | 14 | 7 | 21 | 13.21 | 7.7903 | | 35.90 | 30.43 | -----------+--------+--------+ F | 5 | 5 | 10 | 6.2903 | 3.7097 | | 12.82 | 21.74 | -----------+--------+--------+ Incomplete | 3 | 1 | 4 | 2.5161 | 1.4839 | | 7.69 | 4.35 | -----------+--------+--------+ Total 39 23 62 Statistics for Table of grade by sex Statistic DF Value Prob ------------------------------------------------------ Chi-Square 5 3.3139 0.6517 Likelihood Ratio Chi-Square 5 3.2717 0.6582 Mantel-Haenszel Chi-Square 1 0.2342 0.6284 Phi Coefficient 0.2312 Contingency Coefficient 0.2253 Cramer's V 0.2312 WARNING: 58% of the cells have expected counts less than 5. Chi-Square may not be a valid test. 
Sample Size = 62 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 16 Illustrate Elementary Tests Chi-squared Test of Independence: Version 2 The FREQ Procedure Table of grade by ethnic grade ethnic(Apparent ethnic background (ancestry)) Frequency | Expected | Col Pct |Chinese |European|Other | Total -----------+--------+--------+--------+ A | 0 | 2 | 1 | 3 | 1.9839 | 0.7258 | 0.2903 | | 0.00 | 13.33 | 16.67 | -----------+--------+--------+--------+ B | 2 | 4 | 0 | 6 | 3.9677 | 1.4516 | 0.5806 | | 4.88 | 26.67 | 0.00 | -----------+--------+--------+--------+ C | 11 | 5 | 2 | 18 | 11.903 | 4.3548 | 1.7419 | | 26.83 | 33.33 | 33.33 | -----------+--------+--------+--------+ D | 17 | 3 | 1 | 21 | 13.887 | 5.0806 | 2.0323 | | 41.46 | 20.00 | 16.67 | -----------+--------+--------+--------+ F | 7 | 1 | 2 | 10 | 6.6129 | 2.4194 | 0.9677 | | 17.07 | 6.67 | 33.33 | -----------+--------+--------+--------+ Incomplete | 4 | 0 | 0 | 4 | 2.6452 | 0.9677 | 0.3871 | | 9.76 | 0.00 | 0.00 | -----------+--------+--------+--------+ Total 41 15 6 62 Statistics for Table of grade by ethnic Statistic DF Value Prob ------------------------------------------------------ Chi-Square 10 18.2676 0.0506 Likelihood Ratio Chi-Square 10 19.6338 0.0329 Mantel-Haenszel Chi-Square 1 5.6222 0.0177 Phi Coefficient 0.5428 Contingency Coefficient 0.4771 Cramer's V 0.3838 WARNING: 78% of the cells have expected counts less than 5. Chi-Square may not be a valid test. Sample Size = 62 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 17 Illustrate Elementary Tests Correlation Matrix The CORR Procedure 4 Variables: final midterm quizave compave Simple Statistics Variable N Mean Std Dev Sum Minimum Maximum final 58 50.31034 17.24967 2918 15.00000 89.00000 midterm 62 70.19355 13.62356 4352 44.00000 103.00000 quizave 62 7.67512 1.12669 475.85714 4.57143 9.71429 compave 62 8.83468 1.12050 547.75000 5.00000 10.00000 Simple Statistics Variable Label final midterm quizave Quiz Average (drop lowest) compave Computer Average (drop lowest) Pearson Correlation Coefficients Prob > |r| under H0: Rho=0 Number of Observations final midterm quizave compave final 1.00000 0.47963 0.41871 0.06060 0.0001 0.0011 0.6513 58 58 58 58 midterm 0.47963 1.00000 0.59294 0.41277 0.0001 <.0001 0.0009 58 62 62 62 quizave 0.41871 0.59294 1.00000 0.52649 Quiz Average (drop lowest) 0.0011 <.0001 <.0001 58 62 62 62 compave 0.06060 0.41277 0.52649 1.00000 Computer Average (drop lowest) 0.6513 0.0009 <.0001 58 62 62 62 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 18 Illustrate Elementary Tests Scatterplot Plot of final*midterm. Legend: A = 1 obs, B = 2 obs, etc. final | | 90 + A | A | | | 80 + A A A | | | | A 70 + A A A | A | A A | A A A | 60 + A | A AA | A A | A A B A A | A A A A 50 + AA | A | A | AA | A C 40 + A A A A | A A A | | | 30 + A A A | A | A | AA | A 20 + A | | A | | 10 + | -+---------+---------+---------+---------+---------+---------+---------+- 40 50 60 70 80 90 100 110 midterm NOTE: 4 obs had missing values. 
_______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 19 Illustrate Elementary Tests Simple regression The REG Procedure Model: MODEL1 Dependent Variable: final Number of Observations Read 62 Number of Observations Used 58 Number of Observations with Missing Values 4 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 1 3901.64751 3901.64751 16.73 0.0001 Error 56 13059 233.19226 Corrected Total 57 16960 Root MSE 15.27063 R-Square 0.2300 Dependent Mean 50.31034 Adj R-Sq 0.2163 Coeff Var 30.35287 Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Intercept 1 6.88931 10.80304 0.64 0.5263 midterm 1 0.61605 0.15061 4.09 0.0001 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 20 Illustrate Elementary Tests Multiple Regression The REG Procedure Number of Observations Read 62 Number of Observations Used 58 Number of Observations with Missing Values 4 Descriptive Statistics Uncorrected Standard Variable Sum Mean SS Variance Deviation Intercept 58.00000 1.00000 58.00000 0 0 midterm 4088.00000 70.48276 298414 180.35935 13.42979 quizave 451.57143 7.78571 3576.51020 1.06498 1.03198 compave 515.50000 8.88793 4641.50000 1.04862 1.02402 final 2918.00000 50.31034 163766 297.55112 17.24967 Descriptive Statistics Variable Label Intercept Intercept midterm quizave Quiz Average (drop lowest) compave Computer Average (drop lowest) final _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 21 Illustrate Elementary Tests Multiple Regression The REG Procedure Model: MODEL1 Dependent Variable: final Number of Observations Read 62 Number of Observations Used 58 Number of Observations with Missing Values 4 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 3 4995.04770 1665.01590 7.51 0.0003 Error 54 11965 221.58085 Corrected Total 57 16960 Root MSE 14.88559 R-Square 0.2945 Dependent Mean 50.31034 Adj R-Sq 0.2553 Coeff Var 29.58754 Parameter Estimates Parameter Standard Variable Label DF Estimate Error Intercept Intercept 1 9.01839 19.02591 midterm 1 0.50057 0.18178 quizave Quiz Average (drop lowest) 1 4.80199 2.46469 compave Computer Average (drop lowest) 1 -3.53028 2.17562 Parameter Estimates Variable Label DF t Value Pr > |t| Type I SS Intercept Intercept 1 0.47 0.6374 146806 midterm 1 2.75 0.0080 3901.64751 quizave Quiz Average (drop lowest) 1 1.95 0.0566 509.97483 compave Computer Average (drop lowest) 1 -1.62 0.1105 583.42537 _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 22 Illustrate Elementary Tests Multiple Regression The REG Procedure Model: MODEL1 Test smalstuf Results for Dependent Variable final Mean Source DF Square F Value Pr > F Numerator 2 546.70010 2.47 0.0943 Denominator 54 221.58085 \end{verbatim} \end{scriptsize} %\noindent Multiple regression output was deleted. \paragraph{Data in fixed columns} When the data values have at least one space between them, the variables are recorded in the same order for each case, and missing values are indicated by periods, the default version of the \texttt{input} statement (list input) does the job perfectly. It is a bonus that the variables need not always be separated by the same number of spaces for each case. 
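For example, here is a minimal sketch of a data step that uses list input. The file name and variable names are hypothetical, not part of the \texttt{statclass} example; the point is only the form of the \texttt{input} statement.
\begin{verbatim}
data smallstudy;
     infile '/folders/myfolders/smallstudy.dat'; /* Hypothetical raw data file   */
     input id sex age score;  /* List input: values separated by one or more blanks,
                                 with a period standing for a missing value       */
run;
\end{verbatim}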
Also, there can be more than one line of data for each case, and in fact there need not even be the same number of data lines for all the cases, just as long as there are the same number of variables. Another common situation is for the data to be lined up in fixed columns, with blanks for missing values. Sometimes, especially when there are many variables, the data are \emph{packed} together, without spaces between values. For example, the Minnesota Multiphasic Personality Inventory (MMPI) consists of over 300 questions, all to be answered True or False. It would be quite natural to code 1=True and 0=False, and pack the data together. There would still be quite a few data lines for each case. Here is the beginning of the file \texttt{statclass2.dat}. It is the same as \texttt{statclass1.dat}, except that the data are packed together. Most of the blanks occur because two columns are reserved for the marks on quizzes and computer assignments, because 10 out of 10 is possible. Three columns are reserved for the midterm and final scores, because 100\% is possible. For all variables, missing values are represented by blanks. That is, if the field occupied by a variable is completely blank, it's a missing value. \begin{verbatim} 12 9 1 7 8 4 3 5 2 6101010 5 0 0 0 0 55 43 021010 5 910 8 6 81010 8 9 9 9 91010 66 79 121010 5101010 9 8101010101010 91010 94 67 121010 8 910 710 9101010 91010 91010 81 65 0110 1 0 0 8 6 5 210 9 0 010 6 0 5 0 54 1110 6 7 9 8 8 5 710 910 9 5 6 4 810 57 52 01 0 0 9 910 5 2 2 8 7 71010 6 3 710 49 0110 9 5 8 9 8 5 6 8 7 5 610 6 5 9 9 77 64 0110 8 6 8 9 5 3 6 9 9 6 910 6 5 710 65 42 1110 5 6 710 4 6 010 910 910 6 7 810 73 01 9 0 4 610 5 3 310 810 51010 9 910 71 37 \end{verbatim} \begin{center} \vdots \end{center} Now we will take a look at \texttt{statread.sas}. It contains just the \texttt{proc format} and the \texttt{data} step; There are no statistical procedures. This file will be read by programs that invoke statistical procedures, as you will see. \begin{verbatim} /* statread.sas Read the statclass data in fixed format, define and label variables. Use with %include '/folders/myfolders/statread.sas'; */ title 'Grades from STA3000 at Roosevelt University: Fall, 1957'; proc format; /* Used to label values of the categorical variables */ value sexfmt 0 = 'Male' 1 = 'Female'; value ethfmt 1 = 'Chinese' 2 = 'European' 3 = 'Other' ; data grades; infile '/folders/myfolders/statclass2.data' missover; input (sex ethnic) (1.) (quiz1-quiz8 comp1-comp9) (2.) (midterm final) (3.); /* Drop lowest score for quiz & computer */ quizave = ( sum(of quiz1-quiz8) - min(of quiz1-quiz8) ) / 7; compave = ( sum(of comp1-comp9) - min(of comp1-comp9) ) / 8; label ethnic = 'Apparent ethnic background (ancestry)' quizave = 'Quiz Average (drop lowest)' compave = 'Computer Average (drop lowest)'; mark = .3*quizave*10 + .1*compave*10 + .3*midterm + .3*final; label mark = 'Final Mark'; diff = quiz8-quiz1; /* To illustrate matched t-test */ label diff = 'Quiz 8 minus Quiz 1'; mark2 = round(mark); /* Bump up at grade boundaries */ if mark2=89 then mark2=90; if mark2=79 then mark2=80; if mark2=69 then mark2=70; if mark2=59 then mark2=60; /* Assign letter grade */ if mark2=. 
then grade='Incomplete'; else if mark2 ge 90 then grade = 'A'; else if 80 le mark2 le 89 then grade='B'; else if 70 le mark2 le 79 then grade='C'; else if 60 le mark2 le 69 then grade='D'; else grade='F'; format sex sexfmt.; /* Associates sex & ethnic */ format ethnic ethfmt.; /* with formats defined above */ /*************************************************************/ \end{verbatim} The data step in \texttt{statread.sas} differs from the one in \texttt{statmarks1.sas} in only two respects. First, the \texttt{missover} option on the infile statement causes blanks to be read as missing values even if they occur at the end of a line and the line just ends rather than being filled in with space characters. That is, such lines are shorter than the others in the file, and when SAS \texttt{over}-reads the end of the line, it sets all the variables it would have read to missing. This is what we want, so you should always use the \texttt{missover} option when missing values are represented by blanks. The other difference between this data step and the one in \texttt{statmarks1.sas} is in the \texttt{input} statement. Here, we are using \emph{formatted} input. \texttt{sex} and \texttt{ethnic} each occupy 1 column. \texttt{quiz1-quiz8} and \texttt{comp1-comp9} each occupy 2 columns. \texttt{midterm} and \texttt{final} each occupy 3 columns. You can supply a list of formats for each list of variables in parentheses, but if the number of formats is less than the number of variables, they are re-used. That's what's happening in the present case. It is also possible to specify the exact column location in which each variable resides. The \texttt{input} statement is very rich and powerful. The program \texttt{statread.sas} reads and defines the data, but it requests no statistical output; \texttt{statdescribe.sas} pulls in \texttt{statread.sas} using a \texttt{\%include} statement, and produces basic descriptive statistics. Significance tests would be produced by other short programs. Keeping the data definition in a separate file and using \texttt{\%include} (the only part of the powerful \emph{SAS macro language} presented here) is often a good strategy, because most data analysis projects involve a substantial number of statistical procedures. It is common to have maybe twenty program files that carry out various analyses. You \emph{could} have the data step at the beginning of each program, but in many cases the data step is long. And, what happens when (inevitably) you want to make a change in the data step and re-run your analyses? You find yourself making the same change in twenty files. Probably you will forget to change some of them, and the result is a big mess. If you keep your data definition in just one place, you only have to edit it once, and a lot of problems are avoided. \begin{verbatim} /* statdescribe.sas */ %include '/folders/myfolders/statread.sas'; title2 'Basic Descriptive Statistics'; proc freq; title3 'Frequency distributions of the categorical variables'; tables sex ethnic grade; proc means n mean std; title3 'Means and SDs of quantitative variables'; var quiz1 -- mark2; /* single dash only works with numbered lists, like quiz1-quiz8 */ proc univariate normal; /* the normal option gives a test for normality */ title3 'Detailed look at mark and bumped mark (mark2)'; var mark mark2; \end{verbatim} \section{SAS Example Three: The Math data}\label{mathdata} The Math data come from a large multi-campus North American university. 
These are real data, and a fairly complete analysis will be spread throughout parts of this book. The objective is to illustrate some principles of data analysis that have practical importance, but are not exactly part of Statistics.

The Math study came about because some professors and administrators at one of the campuses wanted to predict performance in first-year calculus so they could give better advice to students. For this purpose, one of the professors made up a 20-question multiple choice test; nine questions were on pre-calculus material, and eleven questions were based on the local curriculum in high school calculus. The main question was whether this diagnostic test was useful. That is, if you knew what courses the students took in high school and how well they did, would your predictions be more accurate if you also had their scores on the diagnostic test? And if so, \emph{how much} more accurate would the predictions be?

To find out, all the students who signed up for first-year calculus at one of the campuses were asked to take the diagnostic test in the week before classes started. Most of them (a total of ) did so. At the end of the school year their calculus marks were recorded. This mark, a number from zero to one hundred, was the main dependent variable. But of course not all students remained in the class; some withdrew, and some disappeared in other ways. The reasons for their disappearance were varied, and not part of the data set. Obviously, predictions of numerical grade can only be based on students who stayed in the course until the end, and any advice given to students about marks would have to start out with something like ``Assuming you stay in the course until the end, our best guess of your mark is \ldots" So a second, very important response variable was simply whether the student passed the course, Yes or No. Another potentially useful possibility would be Pass-Fail-Disappear, a categorical response variable with three categories.

The diagnostic test provides at least two explanatory variables: number of pre-calculus questions correct, and number of calculus questions correct. In addition, high school transcripts were available. It is important to recognize that the information in these transcripts was not in a form that could be used directly in statistical analysis. Each transcript was a sizable plain text file --- actually, the disk image of old-fashioned line printer output, designed to be printed on big sheets of paper 132 characters wide. There was a cumulative high school grade point average for most students, and also a mark in an upper level high school English course because it was required for admission to the university. In addition, most students in the sample had taken high school Calculus. Beyond that, they had mostly taken different courses from one another, including similar courses with names that were quite different, and different courses with names that were quite similar. Courses were listed in the order taken. Some students had withdrawn from certain courses more than once before completing them for credit, and some took the same course for credit more than once in an attempt to improve their mark. The second mark was usually higher, but not always.

The point of all this is that while eventually we will analyze a nice orderly data file with rows corresponding to cases and columns corresponding to variables, data do not naturally come that way, most of the time.
As mentioned in Data Analysis Hint~\ref{rowbycol} on page \pageref{rowbycol}, the row-by-column arrangement is something that is imposed on the data by the researchers who gather or analyze the data. Typically, this process involves a large number of semi-arbitrary but critically important decisions. In the math study, the number of variables that \emph{might} have been extracted from the high school transcripts is difficult even to estimate. For example, \emph{number} of math courses taken was an obvious possibility, but it was eliminated on the basis of preliminary analysis. Many other choices were made, and the details are largely undocumented and forgotten\footnote{This may be too bad, but it is typical of most research. On the positive side, it will be described shortly how the data were randomly divided into two sub-samples, an exploratory sample and a confirmatory sample. All the semi-arbitrary decisions were based on the exploratory sample \emph{only}.}. In the end, the following variables were recorded for each student who took the diagnostic test.
\begin{itemize}
\item
\item
\item
\item
\end{itemize}

\section{SAS Reference Materials}

This course is trying to teach you SAS by example, without full explanation, and certainly without discussion of all the options. If you need more detail, the SAS Institute provides online documentation at \texttt{http://support.sas.com/documentation}. Most of the standard statistical procedures you are likely to use are under ``SAS/STAT." For information about the data step (for example, reading a complex data set), choose ``Base SAS Software" and then either ``SAS Language Reference: Concepts" or ``SAS Language Reference: Dictionary." The SAS Institute also publishes hard copy manuals, but most students will prefer the online version. Note that this is reference material. The SAS Institute also publishes a variety of manual-like books that are intended to be more instructional, most of them geared to specific statistical topics (like \emph{The SAS system for multiple regression} and \emph{The SAS system for linear models}). These are more readable than the reference manuals, though it helps to have a real textbook on the topic to fill in the gaps.

A better place to start learning about SAS is a wonderful book by Cody and Smith~\cite{cs91} entitled \emph{Applied statistics and the SAS programming language}. They do a really good job of presenting and documenting the language of the data step, and they also cover a set of statistical procedures ranging from elementary to moderately advanced. If you had to own just one SAS book, this would be it.

If you consult \emph{any} SAS book or manual, you'll need to translate and filter out some details. Here is the main case. Many of the examples you see in Cody and Smith's book and elsewhere will not have separate files for the raw data and the program. They include the raw data in the program file in the data step, after a \texttt{datalines} or \texttt{cards} statement. Here is an example from page 3 of \cite{cs91}.
\begin{verbatim}
data test;
input subject 1-2 gender $ 4 exam1 6-8 exam2 10-12 hwgrade $ 14;
datalines;
10 M  80  84 A
 7 M  85  89 A
 4 F  90  86 B
20 M  82  85 B
25 F  94  94 A
14 F  88  84 C
;
proc means data=test;
run;
\end{verbatim}
Having the raw data and the SAS code together in one display is so attractive for small datasets that most textbook writers cannot resist it. But think how unpleasant it would be if you had 10,000 lines of data.
The way we would do this example is to put the data in a separate file (named, say, \texttt{example1.dat}). The data file would look like this.
\begin{verbatim}
10 M  80  84 A
 7 M  85  89 A
 4 F  90  86 B
20 M  82  85 B
25 F  94  94 A
14 F  88  84 C
\end{verbatim}
and the program file would look like this.
\begin{verbatim}
data test;
     infile '/folders/myfolders/example1.dat'; /* Read data from example1.dat */
     input subject 1-2 gender $ 4 exam1 6-8 exam2 10-12 hwgrade $ 14;
proc means data=test;
\end{verbatim}
Using this as an example, you should be able to translate any textbook example into the program-file data-file format used in this book.

\chapter{Comparing Several Means}\label{ONEWAY}

\section{One-way analysis of variance}\label{ONEWAYANOVA}

This chapter starts with the humble one-way (one-factor) analysis of variance (ANOVA). It is called \emph{one} way because there is a single categorical explanatory variable. This categorical explanatory variable, which may be either observed or experimentally manipulated, divides the sample into \emph{groups} of observations. The objective is to test for differences among means. Note that because the explanatory variable divides the cases into groups, it is a between-subjects factor. Within-subjects (repeated measures) techniques will be discussed later.

\paragraph{Assumptions} The test assumes independent random sampling from each sub-population, and also that the response variable has a conditional distribution that is normal, with equal variances. That is, for each value of the categorical explanatory variable, there is a sub-population (perhaps hypothetical), and the response variable is normally distributed within that sub-population. While the population means of all the normal distributions may differ, their population variances are all identical.

A normal distribution is completely specified by its mean and variance, and we are assuming that the variances are all equal. So if the means of the conditional distributions are also equal, then the conditional distributions are identical. This makes the explanatory and response variables \emph{unrelated} by the definition in Chapter~\ref{rmethods}. % Give page number!
Thus we see that in the one-way ANOVA, the only possible kind of population relationship between the explanatory variable and the response variable is a difference among group means.

The ``assumptions" of a statistical test actually represent a mathematical \emph{model} for the data, and that model is used to formally derive the test. Such derivations are always hidden in applied classes. But the model makes a practical difference, because some assumptions are often violated in practice, and frequently these assumptions were adopted in the first place to make the model mathematically tractable, not because anybody seriously believed they would be true for the typical data set.

Sometimes, the assumptions that allow the mathematical derivation of a test are not really necessary. The test might work, or anyway work pretty well, even if the assumptions are violated. When this is the case, the test is said to be \emph{robust} with respect to those assumptions. Usually, robustness is something that starts to happen as the sample size gets large, if it happens at all. When we say a test ``works," we mean two things:
\begin{itemize}
\item It protects against Type~I error (false significance) at something close to the stated level.
That is, if nothing is really going on, significant effects will be falsely detected at the 0.05 level not much more than 5\% of the time.
\item The power of the test is reasonably good. At the very least, power (the probability of correctly rejecting the null hypothesis) increases when the relationship between explanatory variable and response variable becomes stronger, and also increases with the sample size, approaching one as the sample size approaches infinity for \emph{any} non-zero relationship between variables.
\end{itemize}

For the one-way analysis of variance (and for factorial\footnote{``Factor" is another term for a categorical explanatory variable. Factorial research designs imply analyses with one or more categorical explanatory variables, usually more than one.} ANOVA in general), if the assumption of equal variances holds but the normality assumption does not, the test is robust for large samples. The rough rule would be $n=20$ to 25 for each group, though for data that are sufficiently non-normal, an arbitrarily large sample might be required. If the equal variances assumption is violated, then the test is robust for large samples if the sample sizes for each group are approximately equal. Here, the meaning of ``large" is murky.

\paragraph{\emph{Analysis} of variance} The word \emph{analysis} means to take apart or split up, and in the analysis of variance, variation in the response variable is split into two components: variation of the data values that is explained by the explanatory variable (Sum of Squares Between groups), and variation that is left unexplained (Sum of Squares Within groups).

Here's how it goes. Suppose we want to predict the value of a response variable, without using any explanatory variables yet. The best prediction (in the sense of least squares) is the sample mean. Subtract the sample mean from each response variable value, and we obtain a set of \emph{deviations} representing errors of prediction. Squaring these deviations to remove the sign and adding them up yields a measure of the total variation in the sample data. We call it the Total Sum of Squares, or $SSTO$.

The total sum of squares is the total amount of variation in the response variable. It is what any potential predictor would seek to explain. Here, the word ``explain" really means ``reduce." To the extent that the total squared error of prediction around a predictor is \emph{less} than $SSTO$, the predictor is effective. It has ``explained" part of the variation in the response variable --- at least in the sense of taking care of it.

Now consider a categorical explanatory variable as a predictor of the response variable. This variable (which could be either an experimental treatment or an existing variable that is merely assessed, like breed of dog) subdivides the cases into two or more groups. Now, if you want to predict the response variable, you would use the \emph{group} mean rather than the overall mean. For example, if you want to predict the amount of food eaten by an Irish wolfhound, you would use the mean consumption of the Irish wolfhounds in your sample, not the mean consumption of all the dogs combined.

No matter how good a predictor is, it will not be perfect for real data. For each value of the response variable, subtract off the group mean (not the overall mean, this time). Square those errors of prediction, add them up, and we have the Sum of Squared error of prediction Within groups, where the response variable is being predicted from group membership.
The initials $SSW$ stand for Sum of Squares Within. This quantity represents the variation in the response variable that is \emph{not} explained by the explanatory variable. It is left over, or \emph{residual}.\footnote{The differences between the data values and group means are \emph{residuals}. In regression, the predictions are points on the regression line or surface, and again the residuals are differences between observed and predicted values. In regression, the initials $SSE$ stand for Sum of Squared Error of prediction. $SSW$ is a special kind of $SSE$.} If $SSTO$ is the total amount of variation that could be explained, and $SSW$ is the amount of variation that is left unexplained, then the difference between them must be the variation that is explained. Now suppose that by some amazing coincidence, all the group means were exactly equal. Then $SSW=SSTO$, and absolutely no variation is explained by the explanatory variable. This suggests that explained variation must be linked to variation between group means, and we write \begin{displaymath} SSTO = SSB + SSW, \end{displaymath} where $SSB$, which stands for ``Sum of Squares Between," is the variation that is explained by the categorical explanatory variable. The notation $SSB$ for the explained sum of squares is supported by a set of formulas, which are given because they may be illuminating for some readers, not because you will ever have to use them for calculation. First, suppose that there are $p$ groups\footnote{This $p$ is different from the $p$-value. It connects so well with standard notation in multiple regression that we're going to use it for the number of groups, even though it's unfortunate when the same symbol is used for two different things. You'll just have to realize which $p$ is meant from the context.}, with $n_j$ cases in each group, $j=1,\ldots,p$. The total sample size is $n = \sum_{j=1}^p n_j$. Observation $i$ in group $j$ is denoted by $Y_{i,j}$, and the sample means are \begin{displaymath} \overline{Y}_j = \frac{\sum_{i=1}^{n_j}Y_{i,j}}{n_j} \mbox{ and } \overline{Y} = \frac{\sum_{j=1}^{p}\sum_{i=1}^{n_j}Y_{i,j}}{n}. \end{displaymath} \pagebreak Then, the formulas for the sums of squares are \begin{eqnarray*} SSB & = & \sum_{j=1}^{p} n_j (\overline{Y}_j-\overline{Y})^2 \\ SSW & = & \sum_{j=1}^{p}\sum_{i=1}^{n_j} (Y_{i,j}-\overline{Y}_j)^2 \\ SSTO & = & \sum_{j=1}^{p}\sum_{i=1}^{n_j} (Y_{i,j}-\overline{Y})^2. \end{eqnarray*} You can see that the Sum of Squares Between groups is literally the variation of the group means around the overall mean, with the contribution of each squared deviation determined by the group sample size. Again, the sums of squares add up: $SSTO = SSB + SSW$. \paragraph{ANOVA summary tables} Sums of squares and related quantities are often presented in an \emph{Analysis of variance summary table}. In the old days, these were given in the results sections of journal articles; today, they appear only in the output printed by statistics packages. There are minor differences in detail. SAS \texttt{proc glm} produces one in this format. 
{\begin{center}
\begin{tabular}{lccccc}
 & & \texttt{Sum of} & & & \\
\texttt{Source} & \texttt{DF} & \texttt{Squares} & \texttt{Mean Square} & \texttt{F Value} & \texttt{Pr} $>$ \texttt{F} \\ \\
\texttt{Model} & $p-1$ & $SSB$ & $MSB=SSB/(p-1)$ & $MSB/MSW$ & $p$-value \\ \\
\texttt{Error} & $n-p$ & $SSW$ & $MSW=SSW/(n-p)$ & & \\ \\
\texttt{Corrected Total}~~ & $n-1$ & $SSTO$ & & & \\ \\
\end{tabular}
\end{center}}

\noindent Sums of squares add up, degrees of freedom add up, Mean Square = SS/df, and $F$ is the ratio of two Mean Squares. The $F$ ratio is the test statistic for
\begin{displaymath}
H_0: \mu_1 = \ldots = \mu_p.
\end{displaymath}
That is, under the null hypothesis all the population means are equal.

For a particular data set, the analysis of variance summary table will be filled with numbers. It allows you to calculate a very useful descriptive statistic:
\begin{displaymath}
R^2 = \frac{SSB}{SSTO}.
\end{displaymath}
$R^2$ is the \textbf{proportion of the variation in the response variable that is explained by the explanatory variable}.\footnote{Psychologists often call it the proportion of \emph{variance} that is explained, while statisticians usually call it proportion of sum of squares. The ``proportion of variance" terminology can be justified in a couple of different ways, and is perfectly okay.} This is exactly the interpretation we give to the square of the correlation coefficient; $R^2$ is a reasonable index of how strongly the response variable is related to the explanatory variable.

If the sample size is small, it is possible for $R^2$ to be fairly large even though the differences among means are not statistically significant. Or, if the sample size is huge, even a very weak, trivial relationship can be ``significant." To take an extreme example, one fabled analysis of U. S. census data found virtually \emph{everything} to be statistically significant, even average shoe size East versus West of the Mississippi River. You might say that there are really two kinds of significance: statistical significance and \emph{substantive} significance. $R^2$ can help you assess substantive significance. Confidence intervals can be useful, too.

What's a good value of $R^2$? Traditions vary in different scientific disciplines. Not surprisingly, areas dominated by noisy data and weak relationships are more tolerant of small $R^2$ values. My personal preference is guided by the correlation coefficient. In a scatterplot, the correlation has to be around 0.30 in absolute value before I can really tell whether the relationship is positive or negative. Since $0.30^2=0.09$, I start taking explanatory variables seriously once they explain around nine or ten percent of the variation (or of the \emph{remaining} variation, if there are multiple explanatory variables). But opinions differ. Cohen's (1988) authoritative \emph{Statistical power analysis for the behavioral sciences}~\cite{spa} suggests a much more modest standard.

\section{Testing Contrasts}\label{contrasts}

The $F$-test from a one-way ANOVA is useful, but it usually does not tell you all you need to know. For example, if the test is significant, the conclusion is that not all the group means are equal in the population. But you do not know which means are different from each other. Or, specific comparisons might be of interest. For example, you may have reason to believe that the response to drug $A$ is better than the average response to drugs $B$, $C$ and $D$.
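To anticipate the definitions given below, such a comparison can be written as a weighted sum of population means. Writing $\mu_A, \mu_B, \mu_C$ and $\mu_D$ for the population mean responses to the four drugs (notation introduced here just for this example), the comparison corresponds to the quantity
\begin{displaymath}
\mu_A - \frac{1}{3}\mu_B - \frac{1}{3}\mu_C - \frac{1}{3}\mu_D,
\end{displaymath}
whose weights $1, -\frac{1}{3}, -\frac{1}{3}, -\frac{1}{3}$ add up to zero. Asking whether drug $A$ beats the average of the other three amounts to asking whether this quantity is greater than zero.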
Fortunately, analysis of variance technology can do much more than simply test for equality of several group means. First, we need a few definitions. A \emph{linear combination} is a weighted sum of several quantities. It has the general form
\begin{displaymath}
\mbox{Linear Combination} = a_1 Q_1 + a_2 Q_2 + \ldots + a_p Q_p.
\end{displaymath}
The symbols $a_1$ through $a_p$ stand for numerical constants. We will call these the \emph{weights} of the linear combination.

Suppose there are $p$ treatments (groups, values of the categorical explanatory variable, whatever you want to call them). A \textbf{contrast} is a special kind of linear combination of means in which the weights add up to zero. A population contrast has the form
\begin{displaymath}
c = a_1 \mu_1 + a_2 \mu_2 + \cdots + a_p \mu_p
\end{displaymath}
where $a_1+a_2+ \cdots + a_p = 0$. The case where all of the $a$ values are zero is uninteresting, and is excluded. A population contrast is estimated by a sample contrast:
\begin{displaymath}
\widehat{c} = a_1 \overline{Y}_1 + a_2 \overline{Y}_2 + \cdots + a_p \overline{Y}_p.
\end{displaymath}
With the right software (and that definitely includes SAS), it is easy to test whether any contrast equals zero, and to obtain a confidence interval for a contrast. It is also easy to test several contrasts at once.

By setting $a_1 = 1$, $a_2 = -1$, and the rest of the $a$ values to zero we get $\widehat{c} = \overline{Y}_1 - \overline{Y}_2$, so it's easy to see that any difference between two means is a contrast.\footnote{The test of a contrast between two means is not exactly the same as what you would get if you ignored all the data from the other groups, and just did a two-sample $t$-test or a one-way analysis with two groups. This is because the test of a contrast uses data from \emph{all} the groups to estimate the common within-group variance; it uses Mean Square Within from the full one-way ANOVA.} Also, the average of one set of means minus the average of another set is a contrast.

The $F$ test for equality of $p$ means can be viewed as a simultaneous test of $p-1$ contrasts. For example, suppose there are four treatments, and the null hypothesis of the initial test is $H_0:\mu_1=\mu_2=\mu_3=\mu_4$. The table below gives the $a_1,a_2,a_3,a_4$ values for three contrasts; if all three contrasts equal zero then the four population means are equal, and \emph{vice versa}.

{\begin{center}
\begin{tabular}{|l|l|l|l|} \hline
$a_1$ & $a_2$ & $a_3$ & $a_4$ \\ \hline
1 & -1 & ~0 & ~0 \\ \hline
0 & ~1 & -1 & ~0 \\ \hline
0 & ~0 & ~1 & -1 \\ \hline
\end{tabular}
\end{center}}

The way you read this table is

{\begin{center}
\begin{tabular}{c c c c c c c c c}
$\mu_1$ & - & $\mu_2$ & & & & & = & 0 \\
 & & $\mu_2$ & - & $\mu_3$ & & & = & 0 \\
 & & & & $\mu_3$ & - & $\mu_4$ & = & 0
\end{tabular}
\end{center}}

Clearly, if $\mu_1=\mu_2$ and $\mu_2=\mu_3$ and $\mu_3=\mu_4$, then $\mu_1=\mu_2=\mu_3=\mu_4$, and if $\mu_1=\mu_2=\mu_3=\mu_4$, then $\mu_1=\mu_2$ and $\mu_2=\mu_3$ and $\mu_3=\mu_4$. The simultaneous $F$ test for the three contrasts is 100\% equivalent to what you get from a one-factor ANOVA; it yields the same $F$ statistic, the same degrees of freedom, and the same $p$-value.

There is always more than one way to set up the contrasts to test a given hypothesis.
Staying with the example of testing differences among four means, we could have specified

{\begin{center}
\begin{tabular}{|l|l|l|l|} \hline
$a_1$ & $a_2$ & $a_3$ & $a_4$ \\ \hline
1 & ~0 & ~0 & -1 \\ \hline
0 & ~1 & ~0 & -1 \\ \hline
0 & ~0 & ~1 & -1 \\ \hline
\end{tabular}
\end{center}}

\noindent so that the null hypothesis says all the means are equal to the last one,\footnote{These contrasts (differences between means) are actually \emph{equal} to the regression coefficients in a multiple regression with indicator dummy variables, in which the last category is the reference category. More on this later.} and thus equal to each other. No matter how you set up the collection of contrasts, if you do it correctly you always get the same test statistic and $p$-value.

\section{The Tubes Data}\label{TUBES}

In the \emph{tubes data} (kindly provided by Linda Kohn of the University of Toronto's Botany department), the investigators were studying sclerotial fungi. The fungus they were studying is nasty black stuff that looks much like the fungus that grows between the tiles above your bathtub (well, okay, my bathtub). The fungus is called ``sclerotial" because that is how it reproduces. Sclerotia are little pods that produce spores. When the pod opens and the spores are released, they float through the air, land somewhere, and maybe start to grow. Ordinarily, these sclerotial fungi grow on plants. In fact, they often grow on canola plants, and kill them or impair their growth. The canola plant produces a high-quality vegetable oil, and is one of Canada's biggest cash crops. So this makes a difference, because it is about food.

All these fungi look the same, but they are not. There are different strains of fungus, and the investigators know how to do genetic fingerprinting to tell them apart. The different types are called ``mycelial compatibility groups" (MCG for short), because if you grow two different genetic types together in a dish, they will separate into two visibly distinct colonies, and stay separated. The stuff that grows together is compatible. Before techniques of genetic fingerprinting were developed, this was the only way to tell the strains apart. The MCGs are genetically and spatially distinct, but do some grow faster than others? This could have implications for agricultural practice as well as science.

In this experiment, the fungus is not growing on plants; it's growing in ``race tubes," in a nutrient solution. The implicit assumption here is that types of fungus that grow better in test tubes will also grow better on plants. Is this true? It's definitely an empirical question, because plants fight off these infestations with something like an immune system response, and the fungus that grows best on a completely passive host is not necessarily the one that will grow best on a host that is fighting back. This is an issue of external validity; see Section~\ref{artifacts}.

There are six MCGs, with four test tubes each. So, there are $n=24$ cases in all. This may seem like a very small sample size, and in fact the sample size was not chosen by a power analysis (see Section~\ref{defs} in Chapter~\ref{rmethods} for a brief discussion) or any other systematic method. It was entirely intuitive --- but this is the intuition of scientists with well-deserved international reputations in their field. Here's how they thought about it.
The samples of each fungus type are genetically identical, the test tubes in which they are placed are exactly identical, and the nutrient solution in the tubes comes from one well-mixed batch; it's exactly the same in all tubes. The nutrient solution is placed in each tube by hand, but it's done \emph{very} carefully, by highly trained and experienced personnel. The temperature and humidity of the tubes in the lab are also carefully controlled, so they are the same, except for microscopic differences. Really, the only possible source of variation in measured growth (except for very tiny errors of measurement) is the genetic makeup of the fungus. Under the circumstances, one tube for each fungus type might seem adequate to a biologist (though you couldn't do any significance tests), two tubes would be replicating the study, and four tubes per condition might seem like overkill.\footnote{It is true that with this small sample, the assumptions of normal distribution and equal variance are basically uncheckable. But they can be justified as follows. The only reason that the length measurements for a particular type of fungus would not be completely identical would be a multitude of tiny, more or less independent random shocks arising from tiny errors of measurement (the lab assistant is using a ruler) and even smaller differences in the chemical composition of the nutrient solution and micro-climate within the growth chamber. These random shocks may not be identically distributed, but as long as they are independent and fairly numerous, a version of the Central Limit Theorem assures us that their sum is normally distributed. Also, since code numbers were used to label the test tubes (the lab assistants were blind to experimental condition), there is no reason to expect that the nature of the random shocks would differ for the different fungus types. This justifies the assumption of equal variances.} We will see presently that this intuition is supported by how the statistical analysis turned out.

Every day for two weeks, a lab assistant (maybe a graduate student) measured each tube, once in the morning and once in the evening. She measured the length of fungus in centimeters, and also counted the sclerotia, as well as taking other measurements. We will confine ourselves to a single response variable -- length of the fungus on the evening of day 10. After that point, the fastest-growing strains spread past the end of the test tubes, creating a pattern of missing data that is too challenging to be considered here. So, we have fungus type, a categorical explanatory variable called \texttt{MCG} that takes on six values (the codes are numerical, and they are informative to the botanists); and we have the single response variable \texttt{pmlng10}, which roughly indicates growth rate.

The SAS program \texttt{tubes09f.sas} contains a one-way analysis of variance with many (not all) of the bells and whistles. The strategy will be to present the complete SAS program first, and then go over it piece by piece and explain what is going on -- with one major statistical digression. Here is the program.
\begin{verbatim} /*************** tubes09f.sas ****************/ /* One-way analysis of tubes data */ /*********************************************/ %include '/folders/myfolders/tuberead2.sas'; title2 'One-way analysis of tubes data'; proc freq; tables mcg; proc glm; title3 'Just the defaults'; class mcg; model pmlng10 = mcg; /* For convenience, MCGs are: 198 205 213 221 223 225 */ proc glm; title3 'With contrasts and multiple comparisons'; class mcg; model pmlng10 = mcg / clparm; /* clparm give CI for contrasts down in the estimate statement. */ means mcg; /* Test custom contrasts, or "planned comparisons" */ contrast '198vs205' mcg 1 -1 0 0 0 0; contrast "223vs225" mcg 0 0 0 0 1 -1; contrast '223n225vsRest' mcg -1 -1 -1 -1 2 2; /* Test equality of mcgs excluding 198: a COLLECTION of contrasts */ contrast 'AllBut198' mcg 0 1 -1 0 0 0, mcg 0 0 1 -1 0 0, mcg 0 0 0 1 -1 0, mcg 0 0 0 0 1 -1; /* Replicate overall F test just to check. */ contrast 'OverallF=76.70' mcg 1 -1 0 0 0 0, mcg 0 1 -1 0 0 0, mcg 0 0 1 -1 0 0, mcg 0 0 0 1 -1 0, mcg 0 0 0 0 1 -1; /* Estimate will print the value of a sample contrast and do a t-test of H0: Contrast = 0 */ /* F = t-squared */ estimate '223n225vsRest' mcg -.25 -.25 -.25 -.25 .5 .5; estimate 'AnotherWay' mcg -3 -3 -3 -3 6 6 / divisor=12; /* Multiple Comparisons */ means mcg / Tukey Bon Scheffe; /* Simultaneous Confidence Intervals */ /* Tables of adjusted p-values -- more convenient */ lsmeans mcg / pdiff adjust=bon; lsmeans mcg / pdiff adjust=tukey; lsmeans mcg / pdiff adjust=scheffe; /* Get Scheffe critical value from proc iml */ proc iml; title2 'Scheffe critical value for all possible contrasts'; numdf = 5; /* Numerator degrees of freedom for initial test */ dendf = 17; /* Denominator degrees of freedom for initial test */ alpha = 0.05; critval = finv(1-alpha,numdf,dendf); scrit = critval * numdf; print "Initial test has" numdf " and " dendf "degrees of freedom." "----------------------------------------------------------" "Using significance level alpha = " alpha "------------------------------------------------" "Critical value for the initial test is " critval "------------------------------------------------" "Critical value for Scheffe tests is " scrit "------------------------------------------------"; \end{verbatim} The program begins with \texttt{\%include~'/folders/myfolders/tuberead2.sas';} the data step is contained in a separate file called \texttt{tuberead2.sas}, not shown here. The \texttt{\%include} statement reads in the external file. This is what was done with the \texttt{statclass} data presented in Section~\ref{statclass} of Chapter~\ref{sas}. More detail about \texttt{\%include} is given there. Then (after the second title line) we request a frequency distribution of the explanatory variable -- always a good idea. \begin{verbatim} proc freq; tables mcg; \end{verbatim} Here is the output of \texttt{proc freq}. \begin{verbatim} Fungus Tube data with line1=113 eliminated 1 One-way analysis of tubes data The FREQ Procedure Mycelial Compatibility Group Cumulative Cumulative mcg Frequency Percent Frequency Percent -------------------------------------------------------- 198 4 17.39 4 17.39 205 4 17.39 8 34.78 213 3 13.04 11 47.83 221 4 17.39 15 65.22 223 4 17.39 19 82.61 225 4 17.39 23 100.00 \end{verbatim} The first line of the title contains a reminder that one of the cases (tubes) has been eliminated from the data. 
In the full data set, there was an outlier; when the biologists saw it, they were absolutely convinced that in spite of the great care taken in the laboratory, the tube in question had been contaminated with the wrong strain of fungus. So we set it aside. This is why there are only three test tubes in the \texttt{mcg=213} group, and four in all the others.

Next, we have a bare-bones \texttt{proc glm}. The initials stand for ``General Linear Model," and indeed the procedure is very general. Especially in this first example, we are just scratching the surface. All the parts are obligatory except \texttt{title3}, which produces a third title line that is displayed only for the output of this procedure.
\begin{verbatim}
proc glm;
     title3 'Just the defaults';
     class mcg;
     model pmlng10 = mcg;
\end{verbatim}
The \texttt{class} statement declares \texttt{mcg} to be categorical. Without it, \texttt{proc glm} would do a regression with \texttt{mcg} as a quantitative explanatory variable. The syntax of the minimal \texttt{model} statement is
\begin{verse}
\texttt{model} Response variable(s) \texttt{=} Explanatory variable(s)\texttt{;}
\end{verse}
Here is the output; it's part of the output file.
\begin{verbatim}
_______________________________________________________________________________

Fungus Tube data with line1=113 eliminated                                    2
One-way analysis of tubes data
Just the defaults

The GLM Procedure

          Class Level Information

Class         Levels    Values
mcg                6    198 205 213 221 223 225

Number of Observations Read          23
Number of Observations Used          23

_______________________________________________________________________________

Fungus Tube data with line1=113 eliminated                                    3
One-way analysis of tubes data
Just the defaults

The GLM Procedure

Dependent Variable: pmlng10

                                Sum of
Source                  DF     Squares     Mean Square    F Value    Pr > F
Model                    5 55.43902174     11.08780435      76.70    <.0001
Error                   17  2.45750000      0.14455882
Corrected Total         22 57.89652174

R-Square     Coeff Var      Root MSE    pmlng10 Mean
0.957554      1.500224      0.380209        25.34348

Source                  DF   Type I SS     Mean Square    F Value    Pr > F
mcg                      5 55.43902174     11.08780435      76.70    <.0001

Source                  DF Type III SS     Mean Square    F Value    Pr > F
mcg                      5 55.43902174     11.08780435      76.70    <.0001
\end{verbatim}
First, \texttt{proc glm} gives ``Class Level Information": the name of the explanatory variable, the number of ``Levels" (groups), and the actual values taken on by the explanatory variable. Then we get the sample size ($n=23$). That's all for Page 2 of the output. If not for the \texttt{formdlim} option, SAS would print the next page of output on a new physical sheet of paper.

On the next page of output (that is, the next \emph{logical} page, as opposed to physical page), SAS first prints the title lines, then the name of the response variable, and the first of three analysis of variance summary tables. It's a standard one, and leads to the $F$ value of 76.70; this is the ``numerical value of the test statistic" (so often requested in homework problems) for testing equality of means. The $p$-value is tiny ($p<0.0001$). The differences among means are statistically significant, but with this minimal output we cannot even guess which means might be significantly different from which others; the sample means are not even displayed. On the other hand, we do get some other statistics.
Reading from right to left, we see the sample mean of the response variable, \texttt{Root MSE} (literally the square root of the Mean Square Within groups), The Coefficient of Variation (100 times \texttt{Root MSE} divided by $\overline{Y}$, for what that's worth), and \begin{displaymath} R^2 = \frac{SSB}{SSTO} = \frac{55.4390}{57.8965} = 0.957554. \end{displaymath} That is, nearly 96\% of the variation in growth rate is explained by genetic the type of the fungus. This is an overwhelmingly strong relationship between the explanatory and response variables, and completely justifies the investigators' judgement that a small sample was all they needed. You'd never see anything this strong outside the laboratory (say, in a canola field). Next in the SAS program comes the \emph{real} \texttt{proc glm} --- one that illustrates testing and confidence intervals for contrasts, and also multiple comparisons (sometimes called \emph{post hoc} tests, or \emph{probing}). It starts like the one we've just examined. \begin{verbatim} /* For convenience, MCGs are: 198 205 213 221 223 225 */ proc glm; title3 'With contrasts and multiple comparisons'; class mcg; model pmlng10 = mcg / clparm; /* clparm give CI for contrasts down in the estimate statement. */ means mcg; \end{verbatim} The comment lists the \texttt{mcg}s (values of the explanatory variable) in order; it's useful here for setting up contrasts and remembering what they mean. This \texttt{proc glm} starts out just like the last one, except for the \texttt{clparm} option on the \texttt{model} statement; \texttt{clparm} stands for ``confidence limits for parameters." The parameters in question are contrasts (which are actually \emph{functions} of several model parameters), requested later in the \texttt{estimate} statements. This is the best way to obtain confidence intervals for contrasts. There's also an optional means statement that goes \texttt{means mcg}. It requests a display of the sample means of the response variable, separately for each value of the explanatory variable named. A \texttt{means} statement is really necessary in any oneway ANOVA with \texttt{proc glm} if you are to have any idea of what is going on. But the SAS \emph{syntax} does not require it. Here is the table of means generated by the means statement. \begin{verbatim} The GLM Procedure Level of -----------pmlng10----------- mcg N Mean Std Dev 198 4 28.3250000 0.35939764 205 4 25.8500000 0.28867513 213 3 25.0000000 0.26457513 221 4 23.4000000 0.48304589 223 4 24.8000000 0.16329932 225 4 24.6000000 0.54772256 \end{verbatim} Next, we request test of some contrasts, and also tests of two \emph{collections} of contrasts. As the comment in the program indicates, these are sometimes called ``planned comparisons" of treatment means. The implication is that they are tests of specific hypotheses that were developed before looking at the data -- maybe the hypotheses that the study was designed to test in the first place. Maybe. \begin{verbatim} /* Test custom contrasts, or "planned comparisons" */ contrast '198vs205' mcg 1 -1 0 0 0 0; contrast "223vs225" mcg 0 0 0 0 1 -1; contrast '223n225vsRest' mcg -1 -1 -1 -1 2 2; /* Test equality of mcgs excluding 198: a COLLECTION of contrasts */ contrast 'AllBut198' mcg 0 1 -1 0 0 0, mcg 0 0 1 -1 0 0, mcg 0 0 0 1 -1 0, mcg 0 0 0 0 1 -1; /* Replicate overall F test just to check. 
*/ contrast 'OverallF=76.70' mcg 1 -1 0 0 0 0, mcg 0 1 -1 0 0 0, mcg 0 0 1 -1 0 0, mcg 0 0 0 1 -1 0, mcg 0 0 0 0 1 -1; \end{verbatim} The syntax of the \texttt{contrast} statement is (reading left to right): \begin{enumerate} \item The word \texttt{contrast} \item A label for the contrast (or set of contrasts), enclosed in single or double quotation marks \item The name of the categorical explanatory variable. If there is more than one categorical explanatory variable (factor), you'll get a contrast of the \emph{marginal} means averaging across the other factors. \item The weights of the contrast --- the constants $a_1, \ldots, a_p$ described in Section~\ref{contrasts}. \item If you want to test more than one contrast simultaneously, separate the contrasts by commas, as in the example. You must repeat the name of the categorical explanatory variable each time. \item End the statement with a semicolon, as usual. \end{enumerate} If the weights $a_1, \ldots, a_p$ do not add up to zero, you won't get a test of whether the resulting linear combination equals zero. You don't even get an error message or warning, just a "Note" on the log file saying something like ``CONTRAST LC is not estimable." This actually makes perfectly good sense if you understand the way that \texttt{proc glm} parameterizes linear models that have categorical explanatory variables. But the waters are a bit deep here, so we'll let it go for now. The output of the contrast statement comes after the ANOVA summary table and after the output of the means statement (and \texttt{lsmeans}), even if you request means after you've requested contrasts. They are nicely labelled, using the labels supplied in the \texttt{contrast} statements. Naturally, the overall $F$ value of 76.70 appearing in the label of the last test was obtained in an earlier run. \begin{verbatim} The GLM Procedure Dependent Variable: pmlng10 Contrast DF Contrast SS Mean Square F Value Pr > F 198vs205 1 12.25125000 12.25125000 84.75 <.0001 223vs225 1 0.08000000 0.08000000 0.55 0.4671 223n225vsRest 1 4.62182432 4.62182432 31.97 <.0001 AllBut198 4 12.39526316 3.09881579 21.44 <.0001 OverallF=76.70 5 55.43902174 11.08780435 76.70 <.0001 \end{verbatim} Next we have the \texttt{estimate} statement, which has a syntax similar to \texttt{contrast}. It is limited to single contrasts. They have to be actual contrasts, and not just generic linear combinations of cell means. The \texttt{estimate} statement prints the value of the sample contrast, a number that is an \emph{estimate} of the population contrast. You also get a two-sided $t$-test of the null hypothesis that the contrast equals zero in the population. This is equivalent to the $F$-test generated by \texttt{contrast}; $F=t^2$, and the $p$-values are identical. Notice that if you are just interested in a test for whether a contrast equals zero, multiplying by a constant has no effect -- so the test of $-0.5,-0.5,1.0$ is the same as the test for $1,1,-2$; you'd probably use \texttt{contrast}. But if you are using \texttt{estimate}, you probably are interested in the numerical value of the contrast, often the difference between two means or averages of means. Some of these can be awkward to specify in decimal form, so you can use integers and give a divisor, as shown below. 
\begin{verbatim} /* Estimate will print the value of a sample contrast and do a t-test of H0: Contrast = 0 */ /* F = t-squared */ estimate '223n225vsRest' mcg -.25 -.25 -.25 -.25 .5 .5; estimate 'AnotherWay' mcg -3 -3 -3 -3 6 6 / divisor=12; \end{verbatim} Here is the output of \texttt{estimate}. As mentioned earlier, the confidence limits were produced by the \texttt{clparm} option on the \texttt{model} statement. \begin{verbatim} Standard Parameter Estimate Error t Value Pr > |t| 223n225vsRest -0.94375000 0.16690623 -5.65 <.0001 AnotherWay -0.94375000 0.16690623 -5.65 <.0001 Parameter 95% Confidence Limits 223n225vsRest -1.29589137 -0.59160863 AnotherWay -1.29589137 -0.59160863 \end{verbatim} \section{Multiple Comparisons}\label{MULTIPLECOMPARISONS} The \texttt{means} statement of \texttt{proc glm} lets you look at the group means, but it does not tell you which means are significantly different from which other means. Before we lose control and start doing all possible $t$-tests, consider the following. \paragraph{The curse of a thousand $t$-tests} Significance tests are supposed to help screen out random garbage, so we can disregard ``trends" that could easily be due to chance. But all the common significance tests are designed in isolation, as if each one were the only test you would ever be doing. The chance of getting significant results when nothing is going on may well be about 0.05, depending on how well the assumptions of the test are met. But suppose you do a \emph{lot} of tests on a data set that is purely noise, with no true relationships between any explanatory variable and any response variable. Then the chances of false significance mount up. It's like looking for your birthday in tables of stock market prices. If you look long enough, you will find it. This problem definitely applies when you have a significant difference among more than two treatment means, and you want to know which ones are different from each other. For example, in an experiment with 10 treatment conditions (this is not an unusually large number, for real experiments), there are 45 pairwise differences among means. In the tubes data, there are 6 different fungus types, and thus 15 potential pairwise comparisons. You have to pity the poor scientist\footnote{Let's use the term ``scientist" generously to apply to anyone trying to obtain informmation from a set of numerical data.} who learns about this and is honest enough to take the problem seriously. On one hand, good scientific practice and common sense dictate that if you have gone to the trouble to collect data, you should explore thoroughly and try to learn something from the data. But at the same time, it appears that some stern statistical entity is scolding you, and saying that you're naughty if you peek. There are several ways to resolve the problem. One way is to basically ignore it, while perhaps acknowledging that it is there. According to this point of view, well, you're crazy if you don't explore the data. Maybe the true significance level for the entire process is greater than 0.05, but still the use of significance tests is a useful way to decide which results might be real. Nothing's perfect; let's carry on. My favourite solution is to collect enough data so that they can be randomly split into an exploratory and a replication sample. You explore one of the samples thoroughly, doing all sorts of tests, maybe re-defining the variables in the process. The result is a set of very specific hypotheses. 
Then you test the hypotheses on the second data set. This is great, unless the data are very time-consuming or expensive to collect. In that case, you're lucky to have one small data set, and you have to use all of it at once or you won't have enough power to detect anything. Taking this unfortunate reality into account, statisticians have looked for ways that significance tests can be modified to allow for the fact that we're doing a lot of them. What we want are methods for holding the chances of false significance to a single low level for a \emph{set} of tests, simultaneously. The general term for such methods is \textbf{multiple comparison} procedures. Often, when a significance test (like a one-way ANOVA) tests several things simultaneously and turns out to be significant, multiple comparison procedures are used as a second step, to investigate where the effect came from. In cases like this, the multiple comparisons are called \textbf{follow-up} tests, or \textbf{post hoc} tests, or sometimes \textbf{probing}. It is generally acknowledged that multiple comparison methods are often helpful (even necessary) for following up significant $F$-tests in order to see where an effect comes from. For now, let's concentrate on following up a significant $F$ test in a one-way analysis of variance. Three approaches will be presented, named after their originators: Bonferroni\footnote{Actually, Mr. Bonferroni is only indirectly responsible for the Bonferroni method of multiple comparisons. He gets credit for the probability inequality that says $P(\cup_{j=1}^k A_j) \leq \sum_{j=1}^k P(A_j)$. Letting $A_j$ be the event that null hypothesis $j$ is rejected (assume they are all true), we get the Bonferroni multiple comparison method quite easily.}, Tukey and Scheff\'e. There are many more. \subsection{Bonferroni} The Bonferroni method is very general, and extends far beyond pairwise comparisons of means. It is a simple correction that can be applied when you are performing multiple tests, and you want to hold the chances of false significance to a single low level for all the tests simultaneously. \emph{It applies when you are testing multiple sets of explanatory variables, multiple response variables, or both.} The Bonferroni correction consists of simply dividing the desired significance level (that's $\alpha$, the maximum probability of getting significant results when actually nothing is happening, usually $\alphaÊ=Ê0.05$) by the number of tests. In a way, you're splitting the alpha equally among the tests you do. For example, if you want to perform 5 tests at joint significance level 0.05, just do everything as usual, but only declare the results significant at the \emph{joint} 0.05 level if one of the tests gives you $pÊ<Ê0.01 $ (0.01Ê=Ê0.05/5). If you want to perform 20 tests at joint significance level 0.05, do the individual tests and calculate individual $p$-values as usual, but only believe the results of tests that give $pÊ<Ê0.0025$ (0.0025Ê=Ê0.05/20). Say something like ``Protecting the 20 tests at joint significance level 0.05 by means of a Bonferroni correction, the difference in reported liking between worms and spinach souffl\'e was the only significant food category effect." The Bonferroni correction is conservative. That is, if you perform 20 tests, the probability of getting significance at least once just by chance with a Bonferroni correction is less than or equal to 0.05 -- almost always less. The big advantages of the Bonferroni approach are simplicity and flexibility. 
It is the only way I know to analyze quantitative and categorical response variables simultaneously. The main disadvantages of the Bonferroni approach are \begin{enumerate} \item \emph{You have to know how many tests you want to perform in advance, and you have to know what they are.} In a typical data analysis situation, not all the significance tests are planned in advance. The results of one test will give rise to ideas for other tests. If you do this and then apply a Bonferroni correction to all the tests that you happened to do, it no longer protects all the tests simultaneously at the level you want\footnote{On the other hand, you could randomly split your data into an exploratory sample and a replication sample. Test to your heart's content on the first sample, without any correction for multiple testing. Then, when you think you know what your results are, perform only those tests on the replication sample, and protect them simultaneously with a Bonferroni correction. This could be called "Bonferroni-protected cross-validation." It sounds good, eh? This will be illustrated using the Math data described at the end of Chapter~\ref{sas}}. \item \emph{The Bonferroni correction can be too conservative,} especially when the number of tests becomes large. For example, to simultaneously test all 780 correlations in a 40 by 40 correlation matrix at joint $\alphaÊ=Ê0.05$, you'd only believe correlations with $pÊ<Ê0.0000641 = 0.05/780$. Is this ``too" conservative? Well, with $n$ = 200 in that 40 by 40 example, you'd need $r$ = 0.27 for significance (compared to $r$ = .14 with no correction). With $n$ = 100 you'd need $r$ = .385, or about 14.8\% of one variable explained by another \emph{single} variable. Is this too much to ask? You decide. \end{enumerate} \subsection{Tukey} This is Tukey's Honestly Significant Difference (HSD) method. It is not his Least Significant Different (LSD) method, which has a better name but does not really get the job done. Tukey tests apply only to pairwise differences among means in ANOVA. It is based on a deep study of the probability distribution of the difference between the largest sample mean and the smallest sample mean, assuming the population means are in fact all equal. \begin{itemize} \item If you are interested in all pairwise differences among means and nothing else, and if the sample sizes are equal, Tukey is the best (most powerful) test, period. \item If the sample sizes are unequal, the Tukey tests still get the job of simultaneous protection done, but they are a bit conservative. When sample sizes are unequal, Bonferroni or Scheff\'e can sometimes be more powerful. \end{itemize} \subsection{Scheff\'e} \label{ONEWAYSCHEFFE} It is very easy for me to say too much about Scheff\'e tests, so this discussion will be limited to testing whether certain linear combinations of treatment means (in a one-way design) are significantly different from zero. The Scheff\'e tests allow testing whether \emph{any} contrast of treatment means differs significantly from zero, with the tests for all possible contrasts simultaneously protected. When asked for Scheff\'e followups to a one-way ANOVA, SAS tests all pairwise differences between means, but \emph{there are infinitely many more contrasts in the same family that it does not do} --- and they are all jointly protected against false significance at the 0.05 level. You can do as many of them as you want easily, with SAS and a calculator. It's a miracle. 
You can do infinitely many tests, all simultaneously protected. You do not have to know what they are in advance. It's a license for unlimited data fishing, at least within the class of contrasts of treatment means. Two more miracles: \begin{itemize} \item If the initial one-way ANOVA is not significant, it's \emph{impossible} for any of the Scheff\'e follow-ups to be significant. This is not quite true of Bonferroni or Tukey. \item If the initial one-way ANOVA \emph{is} significant, there \emph{must} be a single contrast that is significantly different from zero. It may not be a pairwise difference, you may not think of it, and if you do find one it may not be easy to interpret, but there is at least one out there. Well, actually, there are infinitely many, but they may all be extremely similar to one another. \end{itemize} Here's how you do it. First find the critical value of $F$ for the initial oneway ANOVA (Recall that if a test statistic is greater than the critical value, it's statistically significant). This is part of the default output from \texttt{proc glm} when you request Scheff\'e tests using the \texttt{means} statement -- or you can use \texttt{proc iml}\footnote{Or, you could even use a table of critical values in the back of a Statistics text book. The exact degrees of freedom you want probably won't be in there, so you'll have to interpolate. Yuk.}. A contrast is significantly different from zero by a Scheff\'e test if the $F$ statistic is greater than the usual critical value \emph{multiplied by $p-1$}, where $p$ is the number of groups. You can get the $F$ statistics with \texttt{contrast}. Keep doing tests until you run out of ideas. Notice that multiplying by the number of means (minus one) is a kind of penalty for the richness of the infinite family of tests you could do. As soon as Mr. Scheff\'e discovered these tests, people started complaining that the penalty was very severe, and it was too hard to get significance. In my opinion, what's remarkable is not that a license for unlimited fishing is expensive, but that it's for sale at all. The power of a Scheff\'e test is the probability of getting a value of $F$ that is bigger than the critical value \emph{multiplied by $p-1$}. You can pay for it by increasing the sample size. \paragraph{Which method should you use?} In most practical data analysis situations, you would only use one of the three multiple comparison methods. Here are some guidelines. \begin{itemize} \item If the sample sizes are nearly equal and you are only interested in pairwise comparisons, use Tukey because it's most powerful in this situation. \item If the sample sizes are not close to equal and you are only interested in pairwise comparisons, there is (amazingly, just this once) no harm in applying all three methods and picking the one that gives you the greatest number of significant results. This is because you \emph{could} calculate the three types of adjusted critical value in advance before seeing the data, and choose the smallest one. \item If you are interested in testing contrasts that go beyond pairwise comparisons and you can specify \emph{all} of them (exactly what they are, not just how many) before seeing the data, Bonferroni is almost always more powerful than Scheff\'e. Tukey is out, because it applies only to pairwise comparisons. \item If you want lots of special contrasts but you don't know exactly what they all are, Scheff\'e is the only honest way to go, unless you have a separate replication data set. 
\end{itemize} \subsection{Simultaneous confidence intervals and adjusted $p$-values} The Bonferroni and Scheff\'e methods allow you to test an arbitrary family of contrasts simultaneously, while holding down the \emph{joint} Type~I error rate. If you want to test a contrast that is a little special or unusual, you'd use the test from the \texttt{contrast} or \texttt{estimate} statement, along with an adjusted critical value. But if you're only interested in comparing all possible pairs of group means, you don't have to specify all those contrasts; SAS does it for you. Two equivalent formats are available, simultaneous confidence intervals and adjusted $p$-values. \emph{Equivalent} means that both methods label exactly the same differences as significant;the only difference is in how the results are printed. \paragraph{Simultaneous confidence intervals} When you invoke multiple comparisons using the \texttt{means} statement (this is the older way), as in \begin{verbatim} means package / Tukey Bon Scheffe; \end{verbatim} you get our three favourite kinds of multiple comparisons for all pairwise differences among means. (SAS is not case sensitive, so capitalizing the names is not necessary.) The multiple comparisons are presented in the form of simultaneous confidence intervals. If the 95\% confidence interval does not include zero, the test (Bonferroni, Tukey or Scheff\'e) is significant at the joint 0.05 level. The confidence intervals are correct, but they are ugly to look at and not recommended. No output from the command above will be shown. \paragraph{Adjusted $p$-values} Adjusted $p$-values are adjusted for the fact that you are doing multiple tests; you believe the results when the adjusted $p$-value is less than 0.05. The adjustment is easy to describe for the Bonferroni method; just multiply the ordinary $p$-value by the number of tests, and if the resulting value is more than one, call it 1.00. For the Scheff\'e method, divide the computed value of $F$ by $p-1$; the Scheff\'e adjusted $p$-value is the tail area of the $F$ distribution above this value. % Checked this with mcg 223 vs 225, gettig an adjusted p-value of 0.9885722 from % R with v <- 0.55/5; 1-pf(v,5,17) - compare 0.9884 from lsmeans I don't know exactly how the Tukey $p$-value adjustment works, but if you really need to know you can look it up in the SAS documentation. While the \texttt{means} statement allows you to request several different multiple comparison methods at once, \texttt{lsmeans} must be invoked separately for each method you want. Here is the syntax. \begin{verbatim} lsmeans mcg / pdiff adjust=bon; lsmeans mcg / pdiff adjust=tukey; lsmeans mcg / pdiff adjust=scheffe; \end{verbatim} The keyword \texttt{lsmeans} stands for ``least squares means," which are the group means adjusted for one or more quantitative explanatory variables (covariates). Since there are no quantitative explanatory variables here, the least squares means are the same as ordinary means.\footnote{Least squares means will be explained properly in a later chapter, using concepts from multiple regression.} The syntax of the \texttt{lsmeans} is (reading from left to right) \begin{itemize} \item \texttt{lsmeans} \item The name of the explanatory variable \item A slash; options are given to the right of the slash. \item \texttt{pdiff} requests a table of $p$-values for testing all pairwise differences between means. \item \texttt{adjust=} and the name of the method. Use ``bon" or ``Bon" instead of the full name. 
\end{itemize} Here is the Scheff\'e output. First we get the (least squares) means, and then a table showing the adjusted $p$-values. The number in row $j$, column $k$ contains the adjusted $p$-value for the test of mean $j$ against mean $k$. \begin{verbatim} The GLM Procedure Least Squares Means Adjustment for Multiple Comparisons: Scheffe pmlng10 LSMEAN mcg LSMEAN Number 198 28.3250000 1 205 25.8500000 2 213 25.0000000 3 221 23.4000000 4 223 24.8000000 5 225 24.6000000 6 Least Squares Means for effect mcg Pr > |t| for H0: LSMean(i)=LSMean(j) Dependent Variable: pmlng10 i/j 1 2 3 4 5 6 1 <.0001 <.0001 <.0001 <.0001 <.0001 2 <.0001 0.1854 <.0001 0.0381 0.0101 3 <.0001 0.1854 0.0021 0.9918 0.8559 4 <.0001 <.0001 0.0021 0.0037 0.0142 5 <.0001 0.0381 0.9918 0.0037 0.9884 6 <.0001 0.0101 0.8559 0.0142 0.9884 \end{verbatim} For comparison, here is the table of adjusted $p$-values for the Tukey method. \begin{verbatim} i/j 1 2 3 4 5 6 1 <.0001 <.0001 <.0001 <.0001 <.0001 2 <.0001 0.0838 <.0001 0.0122 0.0026 3 <.0001 0.0838 0.0005 0.9808 0.7392 4 <.0001 <.0001 0.0005 0.0008 0.0039 5 <.0001 0.0122 0.9808 0.0008 0.9732 6 <.0001 0.0026 0.7392 0.0039 0.9732 \end{verbatim} You can see that the Tukey $p$-values are almost all smaller than the Scheff\'e $p$-values, except when the values are near one. This is to be expected; the Tukey method is theoretically more powerful because the sample sizes are almost equal. Still, the two methods point to exactly the same conclusions for these particular data (and so does the Bonferroni method). How would you \emph{describe} these conclusions? This is the answer to the standard question ``Which means are different from each other?" or just ``What do you conclude?" If the question asks for ``plain, non-statistical language," then you don't mention the multiple comparison method at all. Otherwise, you should add something like ``These conclusions are based on a set of Bonferroni multiple comparisons using a joint 0.05 significance level." But how much detail do you give, and what do you say? You can see that the Tables of adjusted $p$-values may be almost okay for a technical audience, but one can do a lot better. Here is an example. The format is based on one that SAS produces in connection with some multiple comparison methods you seldom want to do. Curiously, it is not available with \texttt{lsmeans}. I started by editing the list of means from \texttt{lsmeans} to put them in numerical order. \pagebreak The table below shows mean length on the evening of day 10. Means that are not significantly different by a Scheff\'e test are connected by a common letter. \begin{verbatim} mcg Mean Length on Day 10 (pm) 198 28.3250000 205 25.8500000 a 213 25.0000000 a b 223 24.8000000 b 225 24.6000000 b 221 23.4000000 \end{verbatim} Here are the conclusions in plain language. \begin{enumerate} \item \texttt{mcg} 198 grows fastest. \item \texttt{mcg} 221 grows slowest. \item We cannot conclude that the growth rates of \texttt{mcg}s 205 and 213 are different. \item \texttt{mcg} 205 grows faster than \texttt{mcg}s 221, 223 and 225. \item \texttt{mcg} 213 grows faster than 221, but there is not enough evidence to conclude that it is different from 223 or 225. \item There is little difference between the growth rates of \texttt{mcg}s 223 and 225. \end{enumerate} This example illustrates something that can be a source of discomfort. The conclusions of multiple significance tests, even when they are multiple comparisons, need not be logically consistent with one another. 
Here, growth for mcg 205 is not different from 213, and 213 is not different from 223 --- but 205 \emph{is} different from 223. All I can say is that it would be worse if you were formally accepting the null hypothesis. Another weird thing is that it's mathematically possible for the overall $F$ test to be significant, so you conclude that the population means are not all equal. But then \emph{none} of the pairwise comparisons are significant, no matter what multiple comparison method you use. Ouch. If you plan to use Scheff\'e's method to test contrasts other than (or in addition to) pairwise comparisons, it helps to have the adjusted critical value in front of you. Then you can just compare the $F$ values from your \texttt{contrast} statements to the critical value. You could do it with a table of the $F$ distribution and a calculator, but \texttt{proc iml} (which stands for ``Interactive Matrix Language," and is very powerful) is more convenient, because the critical value appears on your output. Here is the code. \begin{verbatim} proc iml; title3 'Scheffe critical value for all possible contrasts'; numdf = 5; /* Numerator degrees of freedom for initial test */ dendf = 17; /* Denominator degrees of freedom for initial test */ alpha = 0.05; critval = finv(1-alpha,numdf,dendf); scrit = critval * numdf; print "Initial test has" numdf " and " dendf "degrees of freedom." "----------------------------------------------------------" "Using significance level alpha = " alpha "------------------------------------------------" "Critical value for the initial test is " critval "------------------------------------------------" "Critical value for Scheffe tests is " scrit "------------------------------------------------"; \end{verbatim} And here is the output. \begin{verbatim} Scheffe critical value for all possible contrasts numdf dendf Initial test has 5 and 17 degrees of freedom. ---------------------------------------------------------- alpha Using significance level alpha = 0.05 ------------------------------------------------ critval Critical value for the initial test is 2.8099962 ------------------------------------------------ scrit Critical value for Scheffe tests is 14.049981 ------------------------------------------------ \end{verbatim} \subsection{Scheff\'e tests for \emph{collections} of contrasts} \label{SCHEFFECONTRASTS} Scheff\'e tests actually protect a family of tests that include tests for infinitely many \emph{collections} of contrasts, not just single contrasts. Suppose the initial $F$ test is significant, and you have a follow-up null hypothesis saying that $s$ non-redundant\footnote{Linearly independent.} contrasts all equal zero. In the \texttt{TUBES} example, such a null hypothesis would be that the population means for all MCGs except 198 are equal -- in other words, the test of whether the MCGs other than 198 have different growth rates. This involves $s=4$ contrasts. We did it as a one-at-a-time test in \texttt{tubes09f.sas}; the contrast was named \texttt{AllBut198}. To convert such a ``planned" comparison to a Scheff\'e test, just use the adjusted critical value \begin{equation}\label{scrit} f_{Sch} = f_{crit} \frac{p-1}{s}, \end{equation} where $f_{crit}$ is the usual critical value for the initial test. Then, considered as a Scheff\'e follow-up, the test is significant at the \emph{joint} 0.05 level if the computed value of $F$ for the collection of contrasts is greater than $f_{Sch}$. For the example of \texttt{AllBut198}, $f_{crit}=2.81,p=6$ and $s=4$. 
So \begin{displaymath} f_{Sch} = 2.81 \frac{5}{4} = 3.51. \end{displaymath} The test we got from \texttt{contrast} gave us $F=21.44$, which is bigger than 3.51. So we conclude that those other growth rates are not all equal. If you plan to test collections of contrasts with Scheff\'e tests, it is helpful to have a table of all the adjusted critical values you might need. Here is a \texttt{proc~iml} that does the job. The details are not explained, but the code can easily be adapted to fit any example. All you need are the numerator degrees of freedom $(p-1)$ and denominator degrees of freedom $(n-p)$ from an ANOVA summary table. \begin{verbatim} proc iml; title3 'Table of Scheffe critical values for COLLECTIONS of contrasts'; numdf = 5; /* Numerator degrees of freedom for initial test */ dendf = 17; /* Denominator degrees of freedom for initial test */ alpha = 0.05; critval = finv(1-alpha,numdf,dendf); zero = {0 0}; S_table = repeat(zero,numdf,1); /* Make empty matrix */ /* Label the columns */ namz = {"Number of Contrasts in followup test" " Scheffe Critical Value"}; mattrib S_table colname=namz; do i = 1 to numdf; s_table(|i,1|) = i; s_table(|i,2|) = numdf/i * critval; end; reset noname; /* Makes output look nicer in this case */ print "Initial test has" numdf " and " dendf "degrees of freedom." "Using significance level alpha = " alpha; print s_table; \end{verbatim} Here is the output. \begin{verbatim} Table of Scheffe critical values for COLLECTIONS of contrasts Initial test has 5 and 17 degrees of freedom. Using significance level alpha = 0.05 Number of Contrasts in followup test Scheffe Critical Value 1 14.049981 2 7.0249904 3 4.683327 4 3.5124952 5 2.8099962 \end{verbatim} When you do Scheff\'e tests for collections of contrasts, several comforting rules apply. \begin{itemize} \item If the initial test is not significant, it's a mathematical fact that no test for a collection of contrasts can be significant by a Scheff\'e test, so don't even bother. \item Suppose the Scheff\'e test for a collection is significant. Now consider the collection of all single contrasts that are equal to zero if all members of the collection equal zero\footnote{Technically, the set of all vectors of weights that lie in the linear subspace spanned by the weights of the collection.}. The Scheff\'e test for at least one of those contrasts will be significant --- if you can find it. \item Suppose the Scheff\'e test for a collection of $s$ contrasts is \emph{not} significant. If the truth of $H_0$ for the collection implies that a contrast is equal to zero, then the Scheff\'e test for that contrast cannot be significant either. \item The last point applies to smaller collections of contrasts, that is, to collections involving fewer than $s$ contrasts. \end{itemize} \subsection{Proper Follow-ups}\label{PROPERFOLLOWUPS} We will describe a set of tests as \emph{proper follow-ups} to to an initial test if \begin{enumerate} \item The null hypothesis of the initial test logically implies the null hypotheses of all the tests in the follow-up set. \item All the tests are jointly protected against Type I error (false significance) at a known significance level, usually $\alpha=0.05$. \end{enumerate} The first property requires explanation. First, consider that the Tukey tests, which are limited to pairwise differences between means, automatically satisfy this, because if all the population means are equal, then each pair is equal to each other. 
But it's possible to make mistakes with Bonferroni and Scheff\'e if you're not careful. Here's why the first property is important. Suppose the null hypothesis of a follow-up test \emph{does} follow logically from the null hypothesis of the initial test. Then, if the null hypothesis of the follow-up is false (there's really something going on), then the null hypothesis of the initial test must be incorrect too, and this is one way in which the initial null hypothesis is false. Thus if we correctly reject the follow-up null hypothesis, we have uncovered one of the ways in which the initial null hypothesis is false. In other words, we have (partly, perhaps) identified where the initial effect comes from. On the other hand, if the null hypothesis of a potential follow-up test is \emph{not} implied by the null hypothesis of the initial test, then the truth or untruth of the follow-up null hypothesis does not tell us \emph{anything} about the null hypothesis of the initial test. They are in different domains. For example, suppose we conclude $2\mu_1$ is different from $3\mu_2$. Great, but if we want to know how the statement $\mu_1=\mu_2=\mu_3$ might be wrong, it's irrelevant. If you stick to testing contrasts as a follow-up to a one-way ANOVA, you're fine. This is because if a set of population means are all equal, then any contrast of those means is equal to zero. That is, the null hypothesis of the initial test automatically implies the null hypotheses of any potential follow-up test, and everything is okay. Furthermore, if you try to specify a linear combination that is not a contrast with the \texttt{contrast} statement of \texttt{proc glm}, SAS will just say something like \texttt{NOTE: CONTRAST SOandSO is not estimable} in the log file. There is no other error message or warning; the test just does not appear in your output file. %two-tailed example some day. % Do a quick non-parametric? \chapter{More Than One Explanatory Variable at a Time}\label{berkeley} The standard elementary tests typically involve one explanatory variable and one response variable. Now we will see why this can make them very misleading. The lesson you should take away from this discussion is that when important variables are ignored in a statistical analysis --- particularly in an observational study --- the result can be that we draw incorrect conclusions from the data. Potential confounding variables really need to be included in the analysis. \section{The chi-squared test of independence} In order to make sure the central example in this chapter is clear, it may be helpful to give a bit more background on the common Pearson chi-square test of independence. As stated earlier, the chi-square test of independence is for judging whether two categorical variables are related or not. It is based upon a \emph{cross-tabulation}, or \emph{joint frequency distribution} of the two variables. For example, suppose that in the statclass data, we are interested in the relationship between sex and apparent ethnic background. If the ratio of females to males depended upon ethnic background, this could reflect an interesting cultural difference in sex roles with respect to men and women going to university (or at least, taking Statistics classes). In \texttt{statmarks1.sas}, we did this test and obtained a chisquare statistic of 2.92 (df=2, $p=0.2321$), which is not statistically significant. Now we'll do it just a bit differently to illustrate the details. First, here is the program \texttt{ethsex.sas}. 
\begin{verbatim} /* ethsex.sas */ %include '/folders/myfolders/statread.sas'; title2 'Sex by Ethnic'; proc freq; tables sex*ethnic / chisq norow nocol nopercent expected; \end{verbatim} \noindent And here is the output. \pagebreak \begin{verbatim} _______________________________________________________________________________ Grades from STA3000 at Roosevelt University: Fall, 1957 1 Sex by Ethnic 19:55 Tuesday, August 30, 3005 The FREQ Procedure Table of sex by ethnic sex ethnic(Apparent ethnic background (ancestry)) Frequency| Expected |Chinese |European|Other | Total ---------+--------+--------+--------+ Male | 27 | 7 | 5 | 39 | 25.79 | 9.4355 | 3.7742 | ---------+--------+--------+--------+ Female | 14 | 8 | 1 | 23 | 15.21 | 5.5645 | 2.2258 | ---------+--------+--------+--------+ Total 41 15 6 62 Statistics for Table of sex by ethnic Statistic DF Value Prob ------------------------------------------------------ Chi-Square 2 2.9208 0.2321 Likelihood Ratio Chi-Square 2 2.9956 0.2236 Mantel-Haenszel Chi-Square 1 0.0000 0.9949 Phi Coefficient 0.2170 Contingency Coefficient 0.2121 Cramer's V 0.2170 WARNING: 33% of the cells have expected counts less than 5. Chi-Square may not be a valid test. Sample Size = 62 \end{verbatim} In each cell of the table, we have an observed frequency and an expected frequency. The expected frequency is the frequency one would expect by chance if the two variables were completely unrelated.\footnote{The formula for the expected frequency in a given cell is (row~total)~$\times$~(column~total)/(sample~size). This follows from the definition of independent events given in introductory probability: the events $A$ and $B$ are independent if $P(A\cap B)=P(A)P(B)$. But this is too much detail, and we're not going there.} If the observed frequencies are different enough from the expected frequencies, one would tend to disbelieve the null hypothesis that the two variables are unrelated. But how should one measure the difference, and what is the meaning of different ``enough?" The Pearson chi-square statistic (named after Karl Pearson, a famous racist, uh, I mean statistician) is defined by \begin{equation} \label{chisq} \chi^2 = \sum_{\mbox{\scriptsize{cells}}}\frac{(f_o-f_e)^2}{f_e}, \end{equation} where $f_o$ refers to the observed frequence, $f_e$ refers to expected frequency, and as indicated, the sum is over all the cells in the table. If the two variables are really independent, then as the total sample size increases, the probability distribution of this statistic approaches a chisquare with degrees of freedom equal to (Number of rows - 1)$\times$(Number of columns - 1). Again, this is an approximate, large-sample result, one that obtains exactly only in the limit as the sample size approaches infinity. A traditional ``rule of thumb" is that the approximation is okay if no expected frequency is less than five. This is why SAS gave us a warning. More recent research suggests that to avoid inflated Type~I error (false significance at a rate greater than 0.05), all you need is for no expected frequency to be less than one. You can see from formula~(\ref{chisq}) why an expected frequency less than one would be a problem. Division by a number close to zero can yield a very large quantity even when the observer and expected frequencies are fairly close, and the so-called chisquare value will be seriously inflated. Anyway, The $p$-value for the chisquare test is the upper tail area, the area under the chi-square curve beyond the observed value of the test statistic. 
In the example from the statclass data, the test was not significant and we conclude nothing. \section{The Berkeley Graduate Admissions data} Now we're going to look at another example, one that should surprise you. In the 1970's the University of California at Berkeley was accused of discriminating against women in graduate admissions. Data from a large number of applicants are available. The three variables we will consider are sex of the person applying for graduate study, department to which the person applied, and whether or not the person was admitted. First, we will look at the table of sex by admission. \pagebreak \begin{verbatim} Table of sex by admit sex admit Frequency| Row Pct |No |Yes | Total ---------+--------+--------+ Male | 1493 | 1198 | 2691 | 55.48 | 44.52 | ---------+--------+--------+ Female | 1278 | 557 | 1835 | 69.65 | 30.35 | ---------+--------+--------+ Total 2771 1755 4526 The FREQ Procedure Statistics for Table of sex by admit Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 92.2053 <.0001 \end{verbatim} It certainly looks suspicious. Roughly forty-five percent of the male applicants were admitted, compared to thirty percent of the female applicants. This difference in percentages (equivalent to the relationship between variables here) is highly significant; with $n=4526$, the $p$-value is very close to zero. \section{Controlling for a variable by subdivision} However, things look different when we take into account the department to which the person applied. Think of a \emph{three-dimensional} table in which the rows are sex, the columns are admission, and the third dimension (call it layers) is department. Such tables are easy to generate with SAS and other statistical packages. The three-dimensional table is displayed by printing each layer on a separate page, along with test statistics (if requested) for each sub-table. This is equivalent to dividing the cases into sub-samples, and doing the chisquare test separately for each sub-sample. A useful way to talk about this is to say that that we are \emph{controlling} for the third variable; that is, we are looking at the relationship between the other two variables with the third variable held constant. We will have more to say about controlling for collections of explanatory variables when we get to regression. Here are the six sub-tables of sex by admit, one for each department, with a brief comment after each table. The SAS output is edited a bit to save paper. \begin{verbatim} Table 1 of sex by admit Controlling for dept=A sex admit Frequency| Row Pct |No |Yes | Total ---------+--------+--------+ Male | 313 | 512 | 825 | 37.94 | 62.06 | ---------+--------+--------+ Female | 19 | 89 | 108 | 17.59 | 82.41 | ---------+--------+--------+ Total 332 601 933 Statistics for Table 1 of sex by admit Controlling for dept=A Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 17.2480 <.0001 \end{verbatim} For department $A$, 62\% of the male applicants were admitted, while 82\% of the female applicants were admitted. That is, women were \emph{more} likely to get in than men. This is a \emph{reversal} of the relationship that is observed when the data for all departments are pooled! 
\pagebreak \begin{verbatim} Table 2 of sex by admit Controlling for dept=B sex admit Frequency| Row Pct |No |Yes | Total ---------+--------+--------+ Male | 207 | 353 | 560 | 36.96 | 63.04 | ---------+--------+--------+ Female | 8 | 17 | 25 | 32.00 | 68.00 | ---------+--------+--------+ Total 215 370 585 Statistics for Table 2 of sex by admit Controlling for dept=B Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 0.2537 0.6145 \end{verbatim} For department $B$, women were somewhat more likely to be admitted (another reversal), but it's not statistically significant. \pagebreak \begin{verbatim} Table 3 of sex by admit Controlling for dept=C sex admit Frequency| Row Pct |No |Yes | Total ---------+--------+--------+ Male | 205 | 120 | 325 | 63.08 | 36.92 | ---------+--------+--------+ Female | 391 | 202 | 593 | 65.94 | 34.06 | ---------+--------+--------+ Total 596 322 918 Statistics for Table 3 of sex by admit Controlling for dept=C Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 0.7535 0.3854 \end{verbatim} For department $C$, men were slightly more likely to be admitted, but the 3\% difference is much smaller than we observed for the pooled data. Again, it's not statistically significant. \pagebreak \begin{verbatim} Table 4 of sex by admit Controlling for dept=D sex admit Frequency| Row Pct |No |Yes | Total ---------+--------+--------+ Male | 279 | 138 | 417 | 66.91 | 33.09 | ---------+--------+--------+ Female | 244 | 131 | 375 | 65.07 | 34.93 | ---------+--------+--------+ Total 523 269 792 Statistics for Table 4 of sex by admit Controlling for dept=D Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 0.2980 0.5852 \end{verbatim} For department $D$, women were a bit more likely to be admitted (a reversal), but it's far from statistically significant. Now department $E$: \pagebreak \begin{verbatim} Table 5 of sex by admit Controlling for dept=E sex admit Frequency| Row Pct |No |Yes | Total ---------+--------+--------+ Male | 138 | 53 | 191 | 72.25 | 27.75 | ---------+--------+--------+ Female | 299 | 94 | 393 | 76.08 | 23.92 | ---------+--------+--------+ Total 437 147 584 Statistics for Table 5 of sex by admit Controlling for dept=E Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 1.0011 0.3171 \end{verbatim} This time it's a non-significant tendency for men to get in more. Finally, department $F$: \begin{verbatim} Table 6 of sex by admit Controlling for dept=F sex admit Frequency| Row Pct |No |Yes | Total ---------+--------+--------+ Male | 351 | 22 | 373 | 94.10 | 5.90 | ---------+--------+--------+ Female | 317 | 24 | 341 | 92.96 | 7.04 | ---------+--------+--------+ Total 668 46 714 Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 0.3841 0.5354 \end{verbatim} For department $F$, women were slightly more likely to get in, but once again it's not significant. So in summary, the pooled data show that men were more likely to be admitted to graduate study. But when take into account the department to which the student is applying, there is a significant relationship between sex and admission for only one department, and in that department, women are more likely to be accepted. How could this happen? I generated two-way tables of sex by department and department by admit; both relationships were highly significant. 
Instead of displaying the SAS output, I have assembled some numbers from these two tables. The same thing could be accomplished with SAS \texttt{proc tabulate}, but it's too much trouble, so I did it by hand. \begin{table}% [here] \label{getin} \caption{Percentage of female applicants and overall percentage of applicants accepted for six departments} \begin{center} \begin{tabular}{|c|c|c|} \hline Department & Percent applicants female & Percentage applicants accepted \\ \hline $A$ & 11.58\% & 64.42\% \\ \hline $B$ & 4.27 & 63.25 \\ \hline $C$ & 64.60 & 35.08 \\ \hline $D$ & 47.35 & 33.96 \\ \hline $E$ & 67.29 & 25.17 \\ \hline $F$ & 47.76 & 6.44 \\ \hline \end{tabular} \end{center} \end{table} Now it is clear. The two departments with the lowest percentages of female applicants ($A$ and $B$) also had the highest overall percentage of applicants accepted, while the department with the highest percentage of female applicants ($E$) also had the second-lowest overall percentage of applicants accepted. That is, the departments most popular with men were easiest to get into, and those most popular with women were more difficult. Clearly, this produced the overall tendency for men to be admitted more than women. By the way, does this mean that the University of California at Berkeley was \emph{not} discriminating against women? By no means. Why does a department admit very few applicants relative to the number who apply? Because they do not have enough professors and other resources to offer more classes. This implies that the departments popular with men were getting more resources, relative to the level of interest measured by number of applicants. Why? Maybe because men were running the show. The ``show," by the way definitely includes the U. S. military, which funds a lot of engineering and similar stuff at big American universities. The Berkeley data, a classic example of \emph{Simpson's paradox}, illustrate the following uncomfortable fact about observational studies. When you include a new variable in an analysis, the results you have could get weaker, they could get stronger, or they could reverse direction --- all depending upon the inter-relations of the explanatory variables. Basically, if an observational study does not include every potential confounding variable you can think of, there is going to be trouble.\footnote{And even if you \emph{do} include all the potential confounding variables, there is trouble if those confounding variables are measured with error. More on this in a moment.} Now, the distinguishing feature of the ``elementary" tests is that they all involve one explanatory variable and one response variable. Consequently, they can be \emph{extremely} misleading when applied to the data from observational studies, and are best used as tools for preliminary exploration. \paragraph{Pooling the chi-square tests} When using sub-tables to control for a categorical explanatory variable, it is helpful to have a single test that allows you to answer a question like this: If you control for variable $A$, is $B$ related to $C$? For the chi-square test of independence, it's quite easy. Under the null hypothesis that $B$ is unrelated to $C$ for each value of $A$, the test statistics for the sub-tables are independent chisquare random variables. Therefore, there sum is also chisquare, with degrees of freedom equal to the sum of degrees of freedom for the sub-tables. 
In the Berkeley example, we have a pooled chisquare value of \begin{displaymath} 17.2480+0.2537+0.7535+0.2980+1.0011+0.3841 = 19.9384 \end{displaymath} with 6 degrees of freedom. Using any statistics text (except this one), we can look up the critical value at the 0.05 significance level. It's 12.59; since 19.9 > 12.59, the pooled test is significant at the 0.05 level. To get a $p$-value for our pooled chisquare test, we can use SAS. See the program in the next section. In summary, we need to use statistical methods that incorporate more than one explanatory variable at the same time; multiple regression is the central example. But even with advanced statistical tools, the most important thing in any study is to collect the right data in the first place. Looking at it the right way is critical too, but no statistical analysis can compensate for having the wrong data. For more detail on the Berkeley data, see the 1975 article in \emph{Science} by Bickel Hammel and O'Connell~\cite{berk}. For the principle of adding chisquare values and adding degrees of freedom from sub-tables, a good reference is Feinberg's (1977) \emph{The analysis of cross-classified categorical data}~\cite{feinberg}. \section{The SAS program} Here is the program \texttt{berkeley.sas}. It has several features that you have not seen yet, so a discussion follows the listing of the program. \pagebreak \begin{verbatim} /*************************** berkeley.sas *********************************/ title 'Berkeley Graduate Admissions Data: '; proc format; value sexfmt 1 = 'Female' 0 = 'Male'; value ynfmt 1 = 'Yes' 0 = 'No'; data berkley; input line sex dept $ admit count; %$ format sex sexfmt.; format admit ynfmt.; datalines; 1 0 A 1 512 2 0 B 1 353 3 0 C 1 120 4 0 D 1 138 5 0 E 1 53 6 0 F 1 22 7 1 A 1 89 8 1 B 1 17 9 1 C 1 202 10 1 D 1 131 11 1 E 1 94 12 1 F 1 24 13 0 A 0 313 14 0 B 0 207 15 0 C 0 205 16 0 D 0 279 17 0 E 0 138 18 0 F 0 351 19 1 A 0 19 20 1 B 0 8 21 1 C 0 391 22 1 D 0 244 23 1 E 0 299 24 1 F 0 317 ; proc freq; tables sex*admit / nopercent nocol chisq; tables dept*sex / nopercent nocol chisq; tables dept*admit / nopercent nocol chisq; tables dept*sex*admit / nopercent nocol chisq; weight count; /* Get p-value */ proc iml; x = 19.9384; pval = 1-probchi(x,6); print "Chisquare = " x "df=6, p = " pval; \end{verbatim} The first unusual feature of \texttt{berkeley.sas} is in spite of recommendations to the contrary in Chapter~\ref{sas}, the data are in the program itself rather than in a separate file. The data are in the data step, following the \texttt{datalines} command and ending with a semicolon. You can always do this, but usually it's a bad idea; here, it's a good idea. This is why. I did not have access to a raw data file, just a 2 by 6 by 2 table of sex by department by admission. So I just created a data set with 24 lines, even though there are 4526 cases. Each line of the data set has values for the three variables, and also a variable called \texttt{count}, which is just the observed cell frequency for that combination of sex, department and admission. Then, using the \texttt{weight} statement in \texttt{proc freq}, I just ``weighted" each of the 24 cases in the data file by \texttt{count}, essentially multiplying the sample size by count for each case. The advantages are several. First, such a data set is easy to create from published tables, and is much less trouble than a raw data file with thousands of cases. 
Second, the data file is so short that it makes sense to put it in the data set for portability and ease of reference. Finally, this is the way you can get the data from published tables (which may not include any significance tests at all) into SAS, where you can compute any statistics you want, including sophisticated analyses based on log-linear models. The last \texttt{tables} statement in the \texttt{proc freq} gives us the three-dimensional table. For a two-dimensional table, the first variable you mention will correspond to rows and the second will correspond to columns. For higher-dimensional tables, the second-to-last variable mentioned is rows, the last is columns, and combinations of the variables listed first are the control variables for which sub-tables are produced. Finally, the \texttt{iml} in \texttt{proc iml} stands for ``Interactive Matrix Language," and you can use it to perform useful calculations in a syntax that is very similar to standard matrix algebra notation; this can be very convenient when formulas you want to compute are in that notation. Here, we're just using it to calculate the area under the curve of the chisquare density with 6 degrees of freedom, beyond the observed test statistic of 19.9384. The \texttt{probchi} function is the cumulative distribution function of the chisquare distribution; the second argument (6 in this case) is the degrees of freedom. \texttt{probchi($x$,6)} gives the area under the curve between zero and $x$, and \texttt{1-probchi($x$,6)} gives the tail area above $x$ -- that is, the $p$-value. \paragraph{Summary} The example of the Berkeley graduate admissions data teaches us that potential confounding variables need to be explicitly included in a statistical analysis. Otherwise, the results can be very misleading. In the Berkeley example, first we ignored department and there was a relationship between sex and admission that was statistically significant in one direction. Then, when we \emph{controlled} for department --- that is, when we took it into account --- the relationship was either significant in the opposite direction, or it was not significant (depending on which department). We also saw how to pool chi-square values and degrees of freedom by adding over sub-tables, obtaining a useful test of whether two categorical variables are related, while controlling for one or more other categorical variables. This is something SAS will not do for you, but it's easy to do with \texttt{proc freq} output and a calculator. \paragraph{Measurement Error} In this example, the confounding variable Department was measured without error; there was no uncertainty about the department to which the student applied. But sometimes, categorical explanatory variables are subject to \emph{classification error}. That is. the actual category to which a case belongs may not correspond to what's in your data file. For example, if you want to ``control" for whether people have ever been to prison and you determine this by asking them, what you see is not necessarily what you get. The rule, which applies to all sorts of measurement error and to all sorts of statistical analysis, is simple, and very unpleasant. If you want to test explanatory variable $A$ controlling for $B$, and \begin{itemize} \item $B$ is related to the response variable, \item $A$ and $B$ are related to each other, and \item $B$ is measured with error, \end{itemize} then the results you get from standard methods do not quite work. 
In particular, when there is really \emph{no} relationship between $A$ and the response variable for any value of $B$ (the null hypothesis is true), you will still reject the null hypothesis more than 5\% of the time. In fact, the chance of false significance may approach 1.00 (not 0.05) for large samples. Full details are given in a 2009 article by Brunner and Austin~\cite{mereg}. We will return to this ugly truth in connection with multiple regression.

\chapter{Multiple Regression} \label{REGRESSION}
% Need to replace the textbook example with Math.
% Notation: Let's say we're testing r constraints, possibly following up with s < r. This is good for the scheffe tests I think. My Scheffe section has d initial tests, but fix it. r is the number of rows. Chase this down in the slides too (too bad).

\section{Three Meanings of Control}

In this course, we will use the word \textbf{control} to refer to procedures designed to reduce the influence of extraneous variables on our results. The definition of extraneous is ``not properly part of a thing," and we will use it to refer to variables we're not really interested in, and which might get in the way of understanding the relationship between the explanatory variable and the response variable.

There are two ways an extraneous variable might get in the way. First, it could be a confounding variable -- related to both the explanatory variable and the response variable, and hence capable of creating, masking, or even reversing relationships that would otherwise be evident. Second, it could be unrelated to the explanatory variable and hence not a confounding variable, but it could still have a substantial relationship to the response variable. If it is ignored, the variation that it could explain will be part of the ``background noise," making it harder to see the relationship between explanatory variable and response variable, or at least causing it to appear relatively weak, and possibly to be non-significant.

The main way to control potential extraneous variables is by holding them constant. In \textbf{experimental control}, extraneous variables are literally held constant by the procedure of data collection or sampling of cases. For example, in a study of problem solving conducted at a high school, background noise might be controlled by doing the experiment at the same time of day for each subject (and not when classes are changing). In learning experiments with rats, males are often employed because their behavior is less variable than that of females. And a very good example is provided by the \texttt{TUBES} data of Chapter~\ref{ONEWAY}, where experimental conditions were so tightly controlled that there was practically no available source of variation in growth rate except for the genetic character of the fungus.

An alternative to experimental control is \textbf{statistical control}, which takes two main forms. One version, \textbf{subdivision}, is to subdivide the sample into groups with identical or nearly identical values of the extraneous variable(s), and then to examine the relationship between explanatory and response variable separately in each subgroup -- possibly pooling the subgroup analyses in some way. The analysis of the Berkeley graduate admissions data in Chapter~\ref{berkeley} is our prime example. As another example, where the relationship of interest is between quantitative rather than categorical variables, the correlation of education with income might be studied separately for men and women.
The drawback of this subdivision approach is that if extraneous variables have many values or combinations of values, you need a very large sample. The second form of statistical control, \textbf{model-based} control, is to exploit details of the statistical model to accomplish the same thing as the subdivision approach, but without needing a huge sample size. The primary example is multiple linear regression, which is the topic of this chapter.

\section{Population Parameters}

Recall we said two variables are ``related" if the distribution of the response variable \emph{depends} on the value of the explanatory variable. Classical regression and analysis of variance are concerned with a particular way in which the explanatory and response variables might be related, one in which the \emph{population mean} of $Y$ depends on the value of $X$.

Think of a population histogram manufactured out of a thin sheet of metal. The point (along the horizontal axis) where the histogram balances is called the \textbf{expected value} or population mean; it is usually denoted by $E[Y]$ or $\mu$ (the Greek letter mu). The \emph{conditional} population mean of $Y$ given $X=x$ is just the balance point of the conditional distribution. It will be denoted by $E[Y|X=x]$. The vertical bar $|$ should be read as ``given."

Again, for every value of $X$, there is a separate distribution of $Y$, and the expected value (population mean) of that distribution depends on the value of $X$. Furthermore, that dependence takes a very specific and simple form. When there is only one explanatory variable, the population mean of $Y$ is
\begin{equation}
E[Y|X=x] = \beta_0 + \beta_1x. \label{simpleregmodel}
\end{equation}
This is the equation of a straight line. The slope (rise over run) is $\beta_1$ and the intercept is $\beta_0$. If you want to know the population mean of $Y$ for any given $x$ value, all you need are the two numbers $\beta_0$ and $\beta_1$.

But in practice, we never know $\beta_0$ and $\beta_1$. To \emph{estimate} them, we use the slope and intercept of the least-squares line:
\begin{equation}
\widehat{Y} = b_0 + b_1x. \label{simpleyhat}
\end{equation}
If you want to estimate the population mean of $Y$ for any given $x$ value, all you need are the two numbers $b_0$ and $b_1$, which are calculated from the sample data.

This has a remarkable implication, one that carries over into multiple regression. Ordinarily, if you want to estimate a population mean, you need a reasonable amount of data. You calculate the sample mean of those data, and that's your estimate of the population mean. If you want to estimate a \emph{conditional} population mean, that is, the population mean of the conditional distribution of $Y$ given a particular $X=x$, you need a healthy amount of data with that value of $x$. For example, if you want to estimate the average weight of 50-year-old women, you need a sample of 50-year-old women --- unless you are willing to make some assumptions.

What kind of assumptions? Well, the simple structure of (\ref{simpleregmodel}) means that you can use formula (\ref{simpleyhat}) to estimate the population mean of $Y$ for a given value of $X=x$ \emph{without having any data} at that $x$ value. This is not ``cheating," or at any rate, it need not be. If
\begin{itemize}
\item the $x$ value in question is comfortably within the range of the data in your sample, and if
\item the straight-line model is a reasonable approximation of reality within that range,
\end{itemize}
then the estimate can be quite good.
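To make this concrete, here is a small SAS sketch of the idea. The file name \texttt{weight.data.txt} and the variable names \texttt{age} and \texttt{weight} are made up for illustration; the point is just that \texttt{proc reg} calculates $b_0$ and $b_1$ from the sample data, and an extra line of data with a missing value of the response variable yields $\widehat{Y}$ at $x=50$ without requiring any 50-year-old women in the sample.
\begin{verbatim}
/* A hypothetical sketch: the file weight.data.txt and the variable
   names age and weight are made up.  The extra case with age = 50 and
   a missing value of weight is left out of the least-squares fit, but
   proc reg still computes a predicted value (b0 + b1*50) for it.      */
data women;
     infile 'weight.data.txt';
     input age weight;
data extra;
     age = 50; weight = .;   /* The dot means missing */
data both;
     set women extra;
proc reg;
     model weight = age;
     output out = withyhat  p = yhat;
proc print data = withyhat;
\end{verbatim}
Of course, the quality of the resulting $\widehat{Y}$ at $x=50$ depends entirely on the two conditions just listed.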
The ability to estimate a conditional population mean without a lot of data at any given $x$ value means that we will be able to control for extraneous variables, and remove their influence from a given analysis without having the massive amounts of data required by the subdivision approach to statistical control. We are getting away with this because we have adopted a \emph{model} for the data that makes reasonably strong assumptions about the way in which the population mean of $Y$ depends on $X$. If those assumptions are close to the truth, then the conclusions we draw will be reasonable. If the assumptions are badly wrong, we are just playing silly games. There is a general principle here, one that extends far beyond multiple regression. \begin{hint}\label{tradeoff} There is a direct tradeoff between amount of data and the strength (restrictiveness) of model assumptions. If you have a lot of data, you do not need to assume as much. If you have a small sample, you will probably have to adopt fairly restrictive assumptions in order to conclude anything from your data. \end{hint} \paragraph{Multiple Regression} Now consider the more realistic case where there is more than one explanatory variable. With two explanatory variables, the model for the population mean of $Y$ is \begin{displaymath} E[Y|\boldsymbol{X}=\boldsymbol{x}] = \beta_0 + \beta_1x_1 + \beta_2x_2, \end{displaymath} which is the equation of a plane in 3 dimensions ($x_1,x_2,y$). The general case is \begin{displaymath} E[Y|\boldsymbol{X}=\boldsymbol{x}] = \beta_0 + \beta_1x_1 + \ldots + \beta_{p-1}x_{p-1}, \end{displaymath} which is the equation of a hyperplane in $p$ dimensions. \paragraph{Comments} \begin{itemize} \item Since there is more than one explanatory variable, there is a conditional distribution of $Y$ for every \emph{combination} of explanatory variable values. Matrix notation (boldface) is being used to denote a collection of explanatory variables. \item There are $p-1$ explanatory variables. This may seem a little strange, but we're doing this to keep the notation consistent with that of standard regression texts such as \cite{neter96}. If you want to think of an explanatory variable $X_0=1$, then there are $p$ explanatory variables. \item What is $\beta_0$? It's the height of the population hyperplane when all the explanatory variables are zero, so it's the \emph{intercept}. \item Most regression models have an intercept term, but some do not ($X_0 = 0$); it depends on what you want to accomplish. \item $\beta_0$ is the intercept. We will now see that the other $\beta$ values are slopes. \end{itemize} Consider \begin{displaymath} E[Y|\boldsymbol{X}=\boldsymbol{x}] = \beta_0 + \beta_1x_1 + \beta_2x_2 +\beta_3x_3 + \beta_4x_4 \end{displaymath} What is $\beta_3$? If you speak calculus, $\frac{\partial}{\partial x_3} E[Y] = \beta_3$, so $\beta_3$ is the rate at which the population mean is increasing as a function of $x_3$, when other explanatory variables are \emph{held constant} (this is the meaning of a partial derivative). If you speak high school algebra, $\beta_3$ is the change in the population mean of $Y$ when $x_3$ is increased by one unit and all other explanatory variables are \emph{held constant}. 
Look at \begin{equation}\label{holdconst} \begin{array}{lrll} & \beta_0 + \beta_1x_1 + \beta_2x_2 & +\beta_3(x_3+1) & + \beta_4x_4 \\ - & (\beta_0 + \beta_1x_1 + \beta_2x_2 & +\beta_3x_3 & + \beta_4x_4) \end{array} \end{equation} \begin{displaymath} \begin{array}{llll} = & \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 & +\beta_3 & + \beta_4x_4 \\ - & \beta_0 - \beta_1x_1 - \beta_2x_2 - \beta_3x_3 & & - \beta_4x_4 \\ & & & \\ = & \beta_3 & & \end{array} \end{displaymath} The mathematical device of \emph{holding other variables constant} is very important. This is what is meant by statements like ``\textbf{Controlling for} parents' education, parents' income and number of siblings, quality of day care is still positively related to academic performance in Grade 1." We have just seen the prime example of model-based statistical control --- the third type of control in the ``Three meanings of control" section that began this chapter. We will describe the relationship between $X_k$ and $Y$ as \textbf{positive} (controlling for the other explanatory variables) if $\beta_k>0$ and \textbf{negative} if $\beta_k<0$. Recall from Chapter~\ref{ONEWAY} that a quantity (say $w$) is a \textbf{linear combination} of quantities $z_1, z_2$ and $z_3$ if $w=a_1z_1+a_2z_2+a_3z_3$, where $a_1, a_2$ and $a_3$ are constants. Common multiple regression is \emph{linear} regression because the population mean of $Y$ is a linear combination of the $\beta$ values. It does \emph{not} refer to the shape of the curve relating $x$ to $E[Y|X=x]$. For example, \begin{displaymath} \begin{array}{llll} E[Y|X=x] & = & \beta_0 + \beta_1x & \mbox{Simple linear regression} \\ E[Y|X=x] & = & \beta_0 + \beta_1x^2 & \mbox{Also simple linear regression} \\ E[Y|X=x] & = & \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 & \mbox{Polynomial regression -- still linear} \\ E[Y|X=x] & = & \beta_0 + \beta_1x + \beta_2 \cos(1/x) & \mbox{Still linear in the } \beta \mbox{ values} \\ E[Y|X=x] & = & \beta_0 + \beta_1 \cos(\beta_2x) & \mbox{Truly non-linear} \end{array} \end{displaymath} When the relationship between the explanatory and response variables is best represented by a curve, we'll call it \textbf{curvilinear}, whether the regression model is linear or not. All the examples just above are curvilinear, except the first one. Notice that in the polynomial regression example, there is really only one explanatory variable, $x$. But in the regression model, $x$, $x^2$ and $x^3$ are considered to be three separate explanatory variables in a multiple regression. Here, fitting a curve to a cloud of points in two dimensions is accomplished by fitting a hyperplane in four dimensions. The origins of this remarkable trick are lost in the mists of time, but whoever thought of it was having a good day. \section{Estimation by least squares} In the last section, the conditional population mean of the response variable was modelled as a (population) hyperplane. It is natural to estimate a population hyperplane with a sample hyperplane. This is easiest to imagine in three dimensions. Think of a three-dimensional scatterplot, in a room. The explanatory variables are $X_1$ and $X_2$. The $(x_1,x_2)$ plane is the floor, and the value of $Y$ is height above the floor. Each subject (case) in the sample contributes three coordinates $(x_1,x_2,y)$, which can be represented by a soap bubble floating in the air. In simple regression, we have a two-dimensional scatterplot, and we seek the best-fitting straight line. 
In multiple regression, we have a three (or higher) dimensional scatterplot, and we seek the best-fitting plane (or hyperplane). Think of lifting and tilting a piece of plywood until it fits the cloud of bubbles as well as possible. What is the ``best-fitting" plane? We'll use the \textbf{least-squares plane}, the one that minimizes the sum of squared vertical distances of the bubbles from the piece of plywood. These vertical distances can be viewed as errors of prediction.

It's hard to visualize in higher dimensions, but the algebra is straightforward. Any sample hyperplane may be viewed as an estimate (maybe good, maybe terrible) of the population hyperplane. Following the statistical convention of putting a hat on a population parameter to denote an estimate of it, the equation of a sample hyperplane is
\begin{displaymath}
\widehat{\beta_0} + \widehat{\beta_1}x_1 + \ldots + \widehat{\beta}_{p-1}x_{p-1},
\end{displaymath}
and the error of prediction (vertical distance) is the difference between $y$ and the quantity above. So, the least squares plane must minimize
\begin{displaymath}
Q = \sum_{i=1}^n \left( y_i - \widehat{\beta_0} - \widehat{\beta_1}x_{i,1} - \ldots - \widehat{\beta}_{p-1}x_{i,p-1} \right)^2
\end{displaymath}
over all combinations of $\widehat{\beta_0}, \widehat{\beta_1}, \ldots , \widehat{\beta}_{p-1}$.

Provided that no explanatory variable (including the peculiar $X_0=1$) is a perfect linear combination of the others, the $\widehat{\beta}$ quantities that minimize the sum of squares $Q$ exist and are unique. We will denote them by $b_0$ (the estimate of $\beta_0$), $b_1$ (the estimate of $\beta_1$), and so on.

Again, \emph{a population hyperplane is being estimated by a sample hyperplane}.
\begin{displaymath}
\begin{array}{cll}
E[Y|\boldsymbol{X}=\boldsymbol{x}] & = & \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 \\
\widehat{Y} & = & b_0 + b_1x_1 + b_2x_2 + b_3x_3 + b_4x_4
\end{array}
\end{displaymath}
\begin{itemize}
\item $\widehat{Y}$ means \emph{predicted} $Y$. It is the height of the best-fitting (least squares) piece of plywood above the floor, at the point represented by the combination of $x$ values. The equation for $\widehat{Y}$ is the equation of the least-squares hyperplane.
\item ``Fitting the model" means calculating the $b$ values.
\end{itemize}

\section{Residuals} \label{RESIDUALS}

A \textbf{residual}, or error of prediction, is
\begin{displaymath}
e_i = Y_i - \widehat{Y}_i.
\end{displaymath}
The residuals (there are $n$ of them) represent errors of prediction. Each one is the vertical distance of $Y_i$ (the value of the response variable) from the regression hyperplane. It can be shown that for any regression analysis, the sample mean of the residuals is exactly zero. A positive residual means over-performance (or under-prediction). A negative residual means under-performance (or over-prediction).

Examination of residuals can reveal a lot, since we can't look at 12-dimensional scatterplots. Single-variable plots of the residuals (histograms, box plots, stem and leaf diagrams etc.) can identify possible outliers. These might reveal data errors or be a source of new ideas. Theoretically, residuals should be normally distributed, though they are not quite independent and do not have equal variances. Testing for normality of residuals is an indirect way of checking the normal assumption of the regression model\footnote{What might a bimodal distribution of residuals indicate?}. It is easy with SAS \texttt{proc univariate}.
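Here is a hedged sketch of how that might go. The data set \texttt{study1} and the variable names are made up; the \texttt{output} statement of \texttt{proc reg} saves the residuals (along with some re-scaled versions of them that are discussed below) in a new data set, and \texttt{proc univariate} then examines them.
\begin{verbatim}
/* A hypothetical sketch; the data set study1 and the variable names
   are made up.  The output statement saves the residuals, along with
   some re-scaled versions of them that are discussed below.           */
proc reg data = study1;
     model y = x1 x2 x3;
     output out = resdata
            p        = yhat       /* Predicted values                  */
            r        = resid      /* Ordinary residuals e = Y - Yhat   */
            student  = stud       /* Studentized residuals             */
            rstudent = delstud;   /* Studentized deleted residuals     */
proc univariate data = resdata normal plot;
     var resid;
\end{verbatim}
The \texttt{normal} option of \texttt{proc univariate} requests tests for normality, and the \texttt{plot} option gives the single-variable plots mentioned above.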
Application of standard time-series diagnostics to residuals is promising too.

\subsubsection{Outlier Detection}

Looking at plots, it is sometimes easy to see residuals that seem very large in absolute value. But this can be a bit subjective, and it would be nice to know exactly what it means for a residual to be ``big." There are various ways to re-scale the residuals, so they have a variance close to one. This way, the value of the residual tells you how many standard deviations it is from the mean. When residuals are divided by their standard errors (estimated standard deviations) in this way, they are sometimes called \emph{Studentized}, because of the connection to Student's $t$ distribution (all the usual $t$-tests are based on normally distributed quantities divided by their standard errors). Here are some typical ways to re-scale residuals, along with fairly standard terminology. Remember that the residuals already have a mean of zero.
\begin{itemize}
\item \textbf{Standardized residuals}: Calculate the sample standard deviation of the residuals, and divide by that. The resulting variable has a sample mean of zero and a sample variance of one.
\item \textbf{Semi-Studentized residuals}: Divide each residual by the square root of Mean Square Error ($MSE$) from the regression.
\item \textbf{Studentized residuals}: Theoretically, the variances of the residuals are not all the same. But they are easy to derive. The only problem is that they depend on the unknown parameter $\sigma^2$, the common variance of all the conditional distributions of the response variable in the regression model. So estimate the variance of each residual by substituting $MSE$ for $\sigma^2$, and divide each residual by the square root of its estimated variance.
\item \textbf{Studentized deleted residuals}: These are like Studentized residuals, except that for each observation (case) in the data, the response variable is estimated from all the \emph{other} cases, but \emph{not} the one in question. That is, one performs $n$ regressions\footnote{Not literally. There is a mathematical shortcut.}, leaving out each observation in turn. Then each response variable value is predicted from the other $n-1$ observations. The differences between the observed and predicted $Y_i$ values are called \emph{deleted} residuals. Dividing the deleted residuals by their respective estimated standard deviations, we obtain the \emph{Studentized deleted residuals}.
\end{itemize}
The Studentized deleted residuals deserve extra discussion, and even a bit of notation. First of all, think of a high-dimensional scatterplot, with a least-squares hyperplane fitting the points as well as possible. Suppose one of the points is extremely far from the plane. It's a true outlier. Not only might the plane be pulled out of an optimal position to accommodate that one point, but the \emph{squared} distance of the point from the plane will still be huge. Thus $MSE$ (roughly, the \emph{average} squared distance of the points from the plane) will be inflated. So an ordinary Studentized residual (with $\sqrt{MSE}$ somewhere in the denominator) might not stand out from the pack as much as it should. But a regression analysis \emph{without} that point would not only have a larger absolute error of prediction for the deleted observation, but the denominator would be based on a smaller Mean Square Error. This is why the Studentized deleted residual is a promising way to detect potential outliers.
Another advantage is that if the statistical assumptions of the regression model are correct, the Studentized deleted residual has a probability distribution that is exactly Student's $t$. Probability statements about the other kinds of re-scaled residual are just approximations.

The predicted value of $Y_i$ based on the other $n-1$ observations will be denoted $\widehat{Y}_{i(i)}$. Then the deleted residual may be written
\begin{displaymath}
d_i = Y_i - \widehat{Y}_{i(i)}.
\end{displaymath}
The estimated standard deviation of the deleted residual is $s\{d_i\}$; the exact way to calculate it may be left to your favourite software\footnote{Details may be found in almost any regression text, such as Neter et al.'s \emph{Applied linear statistical models}~\cite{neter96}.}. Then the \emph{Studentized} deleted residual is
\begin{displaymath}
t_i = \frac{d_i}{s\{d_i\}}.
\end{displaymath}
If the regression model is correct, the Studentized deleted residual has a $t$ distribution with $n-p-1$ degrees of freedom.

But what if $t_i$ is very large in absolute value? Maybe the observation really comes from a different population, one where a different regression model applies. Most likely, in this case the expected value (population mean) of the deleted residual would not be zero. So the Studentized deleted residual may be used directly as a test statistic. The null hypothesis is that the regression model is true for observation $i$, and it will be a good, sensitive (powerful) test when the model is true for the other observations, but not observation $i$.

So it seems clear what we should do. Compare the absolute value of the Studentized deleted residual to the critical value of a $t$ distribution with $n-p-1$ degrees of freedom. If it's bigger than the critical value, conclude that there's something funny about observation $i$ and look into it more closely.

This would be fine if we were only suspicious about one of the $n$ observations, and we had identified it in advance \emph{before} looking at the actual data. But in practice we will be carrying out $n$ non-independent significance tests, and all the discussion of multiple comparisons in Section~\ref{MULTIPLECOMPARISONS} of Chapter~\ref{ONEWAY} (starting on Page \pageref{MULTIPLECOMPARISONS}) applies. The simplest thing to do is to apply a Bonferroni correction, and use the $0.05/n$ significance level in place of the usual $0.05$ level. This means that if the model is correct, the chances of incorrectly designating \emph{one or more} observations as outliers will be less than $0.05$.

In summary, we let the software calculate the Studentized deleted residuals. Then we obtain the critical value of a $t$ distribution with $n-p-1$ degrees of freedom at the $0.05/n$ significance level --- easy with \texttt{proc iml}. Then we are concerned about an observation and look into it further if the absolute value of the Studentized deleted residual is bigger than the critical value. This treatment of outlier detection as a multiple comparison problem is satisfying and pretty sophisticated.

Studentized deleted residuals have another important application. They are the basis of \emph{prediction intervals}, a topic that will be addressed in Section~\ref{PREDICTIONINTERVALS}.

\subsubsection{Plots against other variables}

A plot of $Y$ versus $\widehat{Y}$ is also informative: the correlation between them cannot be negative, and the square of the correlation coefficient is exactly $R^2$.
\begin{itemize}
\item Single-variable plots (histograms, box plots, stem and leaf diagrams etc.) can identify possible outliers.
(Data errors? Source of new ideas? What might a bimodal distribution of residuals indicate?) \item Plot (scatterplot) of residuals versus potential explanatory variables not in the model might suggest they be included, or not. How would you plot residuals vs a categorical explanatory variable? \item Plot of residuals vs. variables that are in the model may reveal \begin{itemize} \item Curvilinear trend (may need transformation of $x$, or polynomial regression, or even real non-linear regression) \item Non-constant variance over the range of $x$, so the response variable may depend on the explanatory variable not just through the mean. May need transformation of $Y$, or weighted least squares, or a different model. \end{itemize} \item Plot of residuals vs. $\widehat{Y}$ may also reveal unequal variance. \end{itemize} \section{Prediction Intervals} \label{PREDICTIONINTERVALS} \newpage \section{Categorical Explanatory Variables} \label{DUMMYVARS} \subsection{Indicator Dummy Variables}\label{INDICATORCODING} Explanatory variables need not be continuous -- or even quantitative. For example, suppose subjects in a drug study are randomly assigned to either an active drug or a placebo. Let $Y$ represent response to the drug, and \begin{displaymath} x = \left\{ \begin{array}{ll} % ll means left left 1 & \mbox{if the subject received the active drug, or} \\ 0 & \mbox{if the subject received the placebo.} \end{array} \right. \end{displaymath} The model is $E[Y|X=x] = \beta_0 + \beta_1x$. For subjects who receive the active drug (so $x=1$), the population mean is \begin{displaymath} \beta_0 + \beta_1x = \beta_0 + \beta_1 \end{displaymath} For subjects who receive the placebo (so $x=0$), the population mean is \begin{displaymath} \beta_0 + \beta_1x = \beta_0. \end{displaymath} Therefore, $\beta_0$ is the population mean response to the placebo, and $\beta_1$ is the difference between response to the active drug and response to the placebo. We are very interested in testing whether $\beta_1$ is different from zero, and guess what? We get exactly the same $t$ value as from a two-sample $t$-test, and exactly the same $F$ value as from a one-way ANOVA for two groups. \paragraph{Exercise} Suppose a study has 3 treatment conditions. For example Group 1 gets Drug 1, Group 2 gets Drug 2, and Group 3 gets a placebo, so that the Explanatory Variable is Group (taking values 1,2,3) and there is some Response Variable $Y$ (maybe response to drug again). \begin{quest} Why is $E[Y|X=x] = \beta_0 + \beta_1x$ (with $x$ = Group) a silly model? \end{quest} \begin{answ} Designation of the Groups as 1, 2 and 3 is completely arbitrary. \end{answ} \begin{quest} Suppose $x_1=1$ if the subject is in Group 1, and zero otherwise, and $x_2=1$ if the subject is in Group 2, and zero otherwise, and $E[Y|\boldsymbol{X}=\boldsymbol{x}] = \beta_0 + \beta_1x_1 + \beta_2 x_2$. Fill in the table below. \end{quest} {\begin{center} \begin{tabular}{|c|c|c|l|} \hline Group & $x_1$ & $x_2$ & $\beta_0 + \beta_1x_1 + \beta_2 x_2$ \\ \hline 1 & & & $\mu_1$ = \\ \hline 2 & & & $\mu_2$ = \\ \hline 3 & & & $\mu_3$ = \\ \hline \end{tabular} \end{center}} \pagebreak \begin{answ} \end{answ} {\begin{center} \begin{tabular}{|c|c|c|l|} \hline Group & $x_1$ & $x_2$ & $\beta_0 + \beta_1x_1 + \beta_2 x_2$ \\ \hline 1 & 1 & 0 & $\mu_1$ = $\beta_0 + \beta_1$ \\ \hline 2 & 0 & 1 & $\mu_2$ = $\beta_0 + \beta_2$ \\ \hline 3 & 0 & 0 & $\mu_3$ = $\beta_0$ \\ \hline \end{tabular} \end{center}} \begin{quest} What does each $\beta$ value mean? 
\end{quest}
\begin{answ}
$\beta_0=\mu_3$, the population mean response to the placebo. $\beta_1$ is the difference between mean response to Drug 1 and mean response to the placebo. $\beta_2$ is the difference between mean response to Drug 2 and mean response to the placebo.
\end{answ}
\begin{quest}
Why would it be nice to simultaneously test whether $\beta_1$ and $\beta_2$ are different from zero?
\end{quest}
\begin{answ}
This is the same as testing whether all three population means are equal; this is what a one-way ANOVA does. And we get the same $F$ and $p$ values (not really part of the sample answer).
\end{answ}

Notice that $x_1$ and $x_2$ contain the same information as the three-category variable Group. If you know Group, you know $x_1$ and $x_2$, and if you know $x_1$ and $x_2$, you know Group. In models with an intercept term, a categorical explanatory variable with $k$ categories is always represented by $k-1$ dummy variables. If the dummy variables are indicators, the category that does not get an indicator is actually the most important. The intercept is that category's mean, and it is called the \textbf{reference category}, because the remaining regression coefficients represent differences between the other categories and the reference category. To compare several treatments to a control, make the control group the reference category by \emph{not} giving it an indicator.

It is worth noting that all the traditional one-way and higher-way models for analysis of variance and covariance emerge as special cases of multiple regression, with dummy variables representing the categorical explanatory variables.

\subsubsection{Add a quantitative explanatory variable}

Now suppose we include patient's age in the regression model. When there are both quantitative and categorical explanatory variables, the quantitative variables are often called \emph{covariates}, particularly if the categorical part is experimentally manipulated. Tests of the categorical variables controlling for the quantitative variables are called \emph{analysis of covariance}.

The usual practice is to put the covariates first. So, we'll let $X_1$ represent age, and let $X_2$ and $X_3$ be the indicator dummy variables for experimental condition. The model now is that all conditional distributions are normal with the same variance $\sigma^2$, and population mean
\begin{displaymath}
E[Y|\boldsymbol{X}=\boldsymbol{x}] = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3.
\end{displaymath}
\begin{quest}
Fill in the table.
\end{quest}
\begin{center}
\begin{tabular}{|c|c|c|l|} \hline
Group & $x_2$ & $x_3$ &$\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3$ \\ \hline
A & & & $\mu_1$ = \\ \hline
B & & & $\mu_2$ = \\ \hline
Placebo & & & $\mu_3$ = \\ \hline
\end{tabular}
\end{center}
\begin{answ}
\end{answ}
\begin{center}
\begin{tabular}{|c|c|c|l|} \hline
Group & $x_2$ & $x_3$ &$\beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3$\\ \hline
A & 1 & 0 & $\mu_1$ = $(\beta_0+\beta_2)+\beta_1x_1$ \\ \hline
B & 0 & 1 & $\mu_2$ = $(\beta_0+\beta_3)+\beta_1x_1$ \\ \hline
Placebo & 0 & 0 & $\mu_3$ = ~~~~~$\beta_0$~~~~~$+\beta_1x_1$ \\ \hline
\end{tabular}
\end{center}
This is a \emph{parallel slopes model}. That is, there is a regression line for each group, with the same slope $\beta_1$ for each line. Only the intercepts are different. This means that for any fixed value of $x_1$ (age), the differences among population means are the same.
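Here is a hedged SAS sketch of how the dummy variables and the covariate might be set up. The data set \texttt{drugstudy} and the variable names are made up; the logical expressions in the data step evaluate to 1 or 0, which is a convenient way to create indicator dummy variables, and the \texttt{test} statement of \texttt{proc reg} (discussed further below) tests hypotheses about the corresponding regression coefficients.
\begin{verbatim}
/* A hypothetical sketch; the data set drugstudy and the variable names
   are made up.  group = 1 means Drug A, group = 2 means Drug B, and
   group = 3 means placebo (the reference category).                    */
data drugstudy;
     infile 'drugstudy.data.txt';
     input group age y;
     x2 = (group = 1);   /* Indicator dummy variable for Drug A */
     x3 = (group = 2);   /* Indicator dummy variable for Drug B */
proc reg;
     model y = age x2 x3;
     drugs: test x2 = 0, x3 = 0;   /* Any differences among the three
                                       groups, controlling for age?    */
\end{verbatim}
The labelled \texttt{test} statement produces an $F$-test of $H_0: \beta_2=\beta_3=0$, the analysis of covariance null hypothesis discussed next.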
For any value of age (that is, holding age constant, or \emph{controlling} for age), the difference between response to Drug $A$ and the placebo is $\beta_2$. And controlling for age, the difference between response to Drug $B$ and the placebo is $\beta_3$. The three group means are equal for each constant value of age if (and only if) $\beta_2=\beta_3=0$. This is the null hypothesis for the analysis of covariance.

It is easy (and often very useful) to have more than one covariate. In this case we have parallel planes or hyperplanes. And at any fixed set of covariate values, the distances among hyperplanes correspond exactly to the differences among the intercepts. This means we are usually interested in testing null hypotheses about the regression coefficients corresponding to the dummy variables.

\begin{quest}
Suppose we want to test the difference between response to Drug $A$ and Drug $B$, controlling for age. What is the null hypothesis?
\end{quest}
\begin{answ}
$H_0: \beta_2=\beta_3$
\end{answ}
\begin{quest}
Suppose we want to test whether, controlling for age, the average response to Drug $A$ and Drug $B$ is different from response to the placebo. What is the null hypothesis?
\end{quest}
\begin{answ}
$H_0: \beta_2+\beta_3=0$
\end{answ}
\begin{quest}
Huh? Show your work.
\end{quest}
\begin{answ}
\end{answ}
\begin{center}
\begin{tabular}{l l}
& $\frac{1}{2}[\, (\beta_0+\beta_2+\beta_1x_1)+(\beta_0+\beta_3+\beta_1x_1) \,] = \beta_0+\beta_1x_1$ \\ \\
$\iff$ & $\beta_0+\beta_2+\beta_1x_1 + \beta_0+\beta_3+\beta_1x_1 = 2\beta_0+2\beta_1x_1$ \\ \\
$\iff$ & $2\beta_0+\beta_2+\beta_3+2\beta_1x_1 = 2\beta_0+2\beta_1x_1$ \\ \\
$\iff$ & $\beta_2+\beta_3=0$
\end{tabular}
\end{center}
The symbol $\iff$ means ``if and only if." The arrows can logically be followed in both directions.

This last example illustrates several important points.
\begin{itemize}
\item Contrasts can be tested with indicator dummy variables.
\item If there are covariates, the ability to test contrasts \emph{controlling} for the covariates is very valuable.
\item Sometimes, the null hypothesis for a contrast of interest might not be what you expect, and you might have to derive it algebraically. This can be inconvenient, and it is too easy to make mistakes.
\end{itemize}

\subsection{Cell means coding}\label{CELLMEANSCODING}

When students are setting up dummy variables for a categorical explanatory variable with $p$ categories, the most common mistake is to define an indicator dummy variable for every category, resulting in $p$ dummy variables rather than $p-1$ --- and of course there is an intercept too, because it's a regression model and regression software almost always includes an intercept unless you explicitly suppress it. But then the $p$ population means are represented by $p+1$ regression coefficients, and mathematically, the representation cannot be unique. In this situation the least-squares estimators are not unique either, and all sorts of technical problems arise. Your software might try to save you by throwing one of the dummy variables out, but which one would it discard? And would you notice that it was missing from your output?

Suppose, however, that you used $p$ dummy variables but \emph{no intercept} in the regression model. Then there are $p$ regression coefficients corresponding to the $p$ population means, and all the technical problems go away. The correspondence between regression coefficients and population means is unique, and the model can be handy.
In particular, null hypotheses can often be written down immediately without any high school algebra. Here is how it would look for the study with two drugs and a placebo. The conditional population mean is
\begin{displaymath}
E[Y|\boldsymbol{X}=\boldsymbol{x}] = \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3,
\end{displaymath}
and the table of population means has a very simple form:
\begin{center}
\begin{tabular}{|c|c|c|c|c|} \hline
Drug &$x_1$&$x_2$&$x_3$&$\beta_1x_1+\beta_2x_2+\beta_3x_3$ \\ \hline
A & 1 & 0 & 0 &$\mu_1=\beta_1$ \\ \hline
B & 0 & 1 & 0 &$\mu_2=\beta_2$ \\ \hline
Placebo & 0 & 0 & 1 &$\mu_3=\beta_3$ \\ \hline
\end{tabular}
\end{center}
The regression coefficients correspond directly to population (cell) means for any number of categories; this is why it's called \emph{cell means coding}. Contrasts are equally easy to write in terms of $\mu$ or $\beta$ quantities.

Cell means coding works nicely in conjunction with quantitative covariates. In the drug study example, represent age by $X_4$. Now the conditional population mean is
\begin{displaymath}
E[Y|\boldsymbol{X}=\boldsymbol{x}] = \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4,
\end{displaymath}
and the cell means (for any fixed value of age equal to $x_4$) are
\begin{center}
\begin{tabular}{|c|c|c|c|c|} \hline
Drug &$x_1$&$x_2$&$x_3$&$\beta_1x_1+\beta_2x_2+\beta_3x_3+\beta_4x_4$ \\ \hline
A & 1 & 0 & 0 &$\beta_1+\beta_4x_4$ \\ \hline
B & 0 & 1 & 0 &$\beta_2+\beta_4x_4$ \\ \hline
Placebo & 0 & 0 & 1 &$\beta_3+\beta_4x_4$ \\ \hline
\end{tabular}
\end{center}
This is another parallel slopes model, completely equivalent to the earlier one. The regression coefficients for the dummy variables are the intercepts, and because the lines are parallel, the differences among population means at any fixed value of $x_4$ are exactly the differences among intercepts. Note that
\begin{itemize}
\item It is easy to write the null hypothesis for any contrast or collection of contrasts. Little or no algebra is required.
\item This extends to categorical explanatory variables with any number of categories.
\item With more than one covariate, we have a parallel planes model, and it is still easy to express the null hypotheses.
\item The \texttt{test} statement of \texttt{proc reg} is a particularly handy tool.
\end{itemize}

\subsection{Effect Coding}\label{EFFECTCODING}

In \emph{effect coding} there are $p-1$ dummy variables for a categorical explanatory variable with $p$ categories, and the intercept is included. Effect coding looks just like indicator dummy variable coding with an intercept, except that the last (reference) category gets $-1$ instead of zero. Here's how it looks for the hypothetical drug study.
\begin{center}
\begin{tabular}{|c|c|c|l|} \hline
Group & $x_1$ & $x_2$ &$E[Y|\boldsymbol{X}=\boldsymbol{x}] = \beta_0+\beta_1x_1+\beta_2x_2$\\ \hline
A & ~1 & ~0 & $\mu_1$ = $\beta_0+\beta_1$ \\ \hline
B & ~0 & ~1 & $\mu_2$ = $\beta_0+\beta_2$ \\ \hline
Placebo & -1 & -1 & $\mu_3$ = $\beta_0-\beta_1-\beta_2$ \\ \hline
\end{tabular}
\end{center}
To see what the regression coefficients mean, first define $\mu$ to be the average of the three population means. Then
\begin{displaymath}
\mu = \frac{1}{3}(\mu_1+\mu_2+\mu_3) = \beta_0,
\end{displaymath}
so that the intercept is the mean of population means --- sometimes called the \emph{grand mean}. Now we can see right away that
\begin{itemize}
\item $\beta_1$ is the difference between $\mu_1$ and the grand mean.
\item $\beta_2$ is the difference between $\mu_2$ and the grand mean.
\item $-\beta_1-\beta_2$ is the difference between $\mu_3$ and the grand mean.
\item Equality of the population means is equivalent to zero coefficients for all the dummy variables.
\item The last category is not a reference category. It's just the category with the least convenient expression for the deviation from the grand mean.
\item This pattern holds for any number of categories.
\end{itemize}

In the standard language of analysis of variance, \emph{effects} are deviations from the grand mean. That's why this dummy variable coding scheme is called ``effect coding." When there is more than one categorical explanatory variable, the average cell mean for a particular category (averaging across other explanatory variables) is called a \emph{marginal mean}, and the so-called \emph{main effects} are deviations of the marginal means from the grand mean; these are represented nicely by effect coding. Equality of marginal means implies that all main effects for the variable are zero, and vice versa.

Sometimes, people speak of testing for the ``main effect" of a categorical explanatory variable. This is a loose way of talking, because there is not just one main effect for a variable. There are at least two, one for each marginal mean. Possibly, this use of ``effect" blends the effect of an experimental variable with the technical statistical meaning of effect. However, it's a way of talking that does no real harm, and you may see it from time to time in this text.

We will see later that effect coding is very useful when there is more than one categorical explanatory variable and we are interested in \emph{interactions} --- ways in which the relationship of an explanatory variable with the response variable depends on the value of another explanatory variable.

Covariates work nicely with effect coding. There is no need to make a table of expected values, unless a question explicitly asks you to do so. For example, suppose you add the covariate $X_1$ = Age to the drug study. The treatment means (which depend on $X_1$) are as follows:
\begin{center}
\begin{tabular}{|c|c|c|l|} \hline
Group & $x_2$ & $x_3$ &$E[Y|\boldsymbol{X}=\boldsymbol{x}] = \beta_0+\beta_1x_1+\beta_2x_2+\beta_3x_3$\\ \hline
A & ~1 & ~0 & $\mu_1$ = $\beta_0+\beta_2\mbox{~~~~~~}\,+\beta_1x_1$ \\ \hline
B & ~0 & ~1 & $\mu_2$ = $\beta_0+\beta_3\mbox{~~~~~~}\,+\beta_1x_1$ \\ \hline
Placebo & -1 & -1 & $\mu_3$ = $\beta_0-\beta_2-\beta_3+\beta_1x_1$ \\ \hline
\end{tabular}
\end{center}
Regression coefficients are deviations from the average conditional population mean (conditional on $x_1$). So, if the regression coefficients for all the dummy variables equal zero, the categorical explanatory variable is unrelated to the response variable, when one controls for the covariates.

Finally, it's natural for a student to wonder: What dummy variable coding scheme should I use? Use whichever is most convenient. They are all equivalent, if done correctly. They yield the same test statistics, and the same conclusions.

\section{Explained Variation} \label{EXPLAINEDVARIATION}

Before considering any explanatory variables, there is a certain amount of variation in the response variable. The sample mean is the value around which the sum of squared errors of prediction is at a minimum, so it's a least squares estimate of the population mean of $Y$ when there are no explanatory variables. We will measure the total variation to be explained by the sum of squared deviations around the mean of the response variable.
When we do a regression, variation of the data around the least-squares plane represents errors of prediction. It is variation that is \emph{unexplained} by the regression. But it can never be more than the variation around the sample mean (Why? Because the least-squares plane could be horizontal, at the height of the sample mean). So, the explanatory variables in the regression have explained \emph{some} of the variation in the response variable. Variation in the residuals is variation that is still \emph{unexplained}.

Variation to explain: \textbf{Total Sum of Squares}
\begin{displaymath}
\mbox{SSTO} = \sum_{i=1}^n (Y_i - \overline{Y})^2
\end{displaymath}
Variation that the regression does not explain: \textbf{Error Sum of Squares}
\begin{displaymath}
\mbox{SSE} = \sum_{i=1}^n (e_i - \overline{e})^2 = \sum_{i=1}^n e_i^2 = \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2
\end{displaymath}
Variation that is explained: \textbf{Regression (or Model) Sum of Squares}
\begin{displaymath}
\mbox{SSR} = \sum_{i=1}^n (Y_i - \overline{Y})^2 - \sum_{i=1}^n (Y_i - \widehat{Y}_i)^2 = \sum_{i=1}^n (\widehat{Y}_i - \overline{Y})^2
\end{displaymath}
Regression software (including SAS) displays the sums of squares above in an \emph{analysis of variance summary table}. ``Analysis" means to ``split up," and that's what we're doing here --- splitting up the variation in the response variable into explained and unexplained parts.
{\begin{center}
\texttt{Analysis of Variance}
\begin{tabular}{c c c c c c}
& & & & & \\
& & \texttt{Sum of} & \texttt{Mean} & & \\
\texttt{Source} & \texttt{DF} & \texttt{Squares} & \texttt{Square} & \texttt{F Value} & \texttt{Prob$>$F} \\
& & & & & \\
\texttt{Model} & $p-1$ & $SSR$ & $MSR=SSR/(p-1)$ & $F = \frac{MSR}{MSE}$ & $p$-value \\
\texttt{Error} & $n-p$ & $SSE$ & $MSE=SSE/(n-p)$ & & \\
\texttt{Total} & $n-1$ & $SSTO$ & & & \\
\end{tabular}
\end{center}}
Variance estimates consist of sums of squares divided by degrees of freedom. ``\texttt{DF}" stands for Degrees of Freedom. Sums of squares and degrees of freedom each add up to Total. The $F$-test is for whether $\beta_1 = \beta_2 = \ldots = \beta_{p-1} = 0$ -- that is, for whether \emph{any} of the explanatory variables makes a difference.

The proportion of variation in the response variable that is explained by the explanatory variables (representing \emph{strength of relationship}) is
\begin{displaymath}
R^2 = \frac{\mbox{SSR}}{\mbox{SSTO}}
\end{displaymath}
The $R^2$ from a simple regression is the same as the square of the correlation coefficient: $R^2=r^2$. For a general multiple regression, the square of the correlation between the $Y$ and $\widehat{Y}$ (predicted $Y$) values is also equal to $R^2$.

What is a good value of $R^2$? Well, the weakest relationship I can visually perceive in a scatterplot is around $r=.3$, so I am unimpressed by $R^2$ values under 0.09. By this criterion, most published results in the social sciences, and many published results in the biological sciences, are not strong enough to be scientifically interesting. But this is just my opinion.

\section{Testing for Statistical Significance in Regression}

We are already assuming that there is a separate population defined by each combination of values of the explanatory variables (the conditional distributions of $Y$ given $\mathbf{X}$), and that the conditional population mean is a linear combination of the $\beta$ values; the weights of this linear combination are 1 for $\beta_0$, and the $x$ values for the other $\beta$ values.
The classical assumptions are that in addition,
\begin{itemize}
\item Sample values of $Y$ represent independent observations, conditionally upon the values of the explanatory variables.
\item Each conditional distribution is normal.
\item Each conditional distribution has the same population variance.
\end{itemize}

How important are the assumptions? Well, important for what? The main thing we want to avoid is incorrect $p$-values, specifically ones that appear smaller than they are -- so that we conclude a relationship is present when really we should not. This ``Type~I error" is very undesirable, because it tends to load the scientific literature with random garbage.

For large samples, the assumption of normality is not important provided no single observation has too much influence. What is meant by a ``large" sample? It depends on how severe the violations are. What is ``too much" influence? The influence of the most influential observation must tend to zero as the sample size approaches infinity. You're welcome.

The assumption of equal variances can be safely violated provided that the numbers of observations at each combination of explanatory variable values are large and close to equal. This is most likely to be the case with designed experiments having categorical explanatory variables.

The assumption of independent observations is very important, almost always. Examples where it does not hold are when a student takes a test more than once, when members of the same family respond to the same questionnaire about eating habits, when litter-mates are used in a study of resistance to cancer in mice, and so on. When you know in advance which observations form non-independent sets, one option is to average them, and let $n$ be the number of independent sets of observations. There are also ways to incorporate non-independence into the statistical model. We will discuss repeated measures designs, multivariate analysis and other examples later.

\subsection{The standard $F$ and $t$-tests}

SAS \texttt{proc reg} (like other programs) usually starts with an overall $F$-test, which tests all the explanatory variables in the equation simultaneously. If this test is significant, we can conclude that one or more of the explanatory variables is related to the response variable.

Again like most programs that do multiple regression, SAS produces $t$-tests for the individual regression coefficients. If one of these is significant, we can conclude that controlling for all other explanatory variables in the model, the explanatory variable in question is related to the response variable. That is, each variable is tested controlling for all the others.

It is also possible to test subsets of explanatory variables, controlling for all the others. For example, in an educational assessment where students use 4 different textbooks, the variable ``textbook" would be represented by 3 dummy variables. These variables could be tested simultaneously, controlling for several other variables such as parental education and income, child's past academic performance, experience of teacher, and so on.

In general, to test a subset $A$ of explanatory variables while controlling for another subset $B$, fit a model with both sets of variables, and simultaneously test the $b$ coefficients of the variables in subset $A$; there is an $F$ test for this.

This is 100\% equivalent to the following. Fit a model with just the variables in subset $B$, and calculate $R^2$.
Then fit a second model with the $A$ variables as well as the $B$ variables, and calculate $R^2$ again. Test whether the increase in $R^2$ is significant. It's the same $F$ test. Call the regression model with all the explanatory variables the \textbf{Full Model}, and call the model with fewer explanatory variables (that is, the model without the variables being tested) the \textbf{Reduced Model}. Let $SSR_F$ represent the explained sum of squares from the full model, and $SSR_R$ represent the explained sum of squares from the reduced model. \begin{quest} Why is $SSR_F \geq SSR_R$? \end{quest} \begin{answ} In the full model, if the best-fitting hyperplane had all the $b$ coefficients corresponding to the extra variables equal to zero, it would fit exactly as well as the hyperplane of the reduced model. It could not do any worse. \end{answ} Since $R^2=\frac{SSR}{SSTO}$, it is clear that $SSR_F \geq SSR_R$ implies that adding explanatory variables to a regression model can only increase $R^2$. When these additional explanatory variables are correlated with explanatory variables already in the model (as they usually are in an observational study), \begin{itemize} \item Statistical significance can appear when it was not present originally, because the additional variables reduce error variation, and make estimation and testing more precise. \item Statistical significance that was originally present can disappear, because the new variables explain some of the variation previously attributed to the variables that were significant, so when one controls for the new variables, there is not enough explained variation left to be significant. This is especially true of the $t$-tests, in which each variable is being controlled for all the others. \item Even the signs of the $b$s can change, reversing the interpretation of how their variables are related to the response variable. This is why it's very important not to leave out important explanatory variables in an observational study. \end{itemize} The $F$-test for the full versus reduced model is based on the test statistic \begin{equation} F = \frac{(SSR_F-SSR_R)/r}{MSE_F}, \label{ExtraSS} \end{equation} where $r$ is the number of variables that are being simultaneously tested. That is, $r$ is the number of explanatory variables that are in the full model but not the reduced model. $MSE_F$ is the mean square error for the full model: $MSE_F = \frac{SSE_F}{n-p}$. Equation~\ref{ExtraSS} is a very general formula. As we will see, all the standard tests in regression and the usual (fixed effects) Analysis of Variance are special cases of this $F$-test. \subsubsection{Looking at the Formula for $F$} Formula~\ref{ExtraSS} reveals some important properties of the $F$-test. Bear in mind that the $p$-value is the area under the $F$-distribution curve \emph{above} the value of the $F$ statistic. Therefore, anything that makes the $F$ statistic bigger will make the $p$-value smaller, and if it is small enough, the results will be significant. And significant results are what we want, if in fact the full model is closer to the truth than the reduced model. \begin{itemize} \item Since there are $r$ more variables in the full model than in the reduced model, the numerator of (\ref{ExtraSS}) is the \emph{average} improvement in explained sum of squares when we compare the full model to the reduced model. 
Thus, some of the extra variables might be useless for prediction, but the test could still be significant if at least one of them contributes a lot to the explained sum of squares, so that the \emph{average} increase is substantially more than one would expect by chance.
\item On the other hand, useless extra explanatory variables can dilute the contribution of extra explanatory variables with modest but real explanatory power.
\item The denominator is a variance estimate based on how spread out the residuals are. The smaller this denominator is, the larger the $F$ statistic is, and the more likely it is to be significant. Therefore, for a more sensitive test, it's desirable to \emph{control} extraneous sources of variation.
\begin{itemize}
\item If possible, always collect data on any potential explanatory variable that is known to have a strong relationship to the response variable, and include it in both the full model and the reduced model. This will make the analysis more sensitive, because increasing the explained sum of squares will reduce the unexplained sum of squares. You will be more likely to detect a real result as significant, because it will be more likely to show up against the reduced background noise.
\item On the other hand, the denominator of formula~(\ref{ExtraSS}) for $F$ is $MSE_F = \frac{SSE_F}{n-p}$, where the number of explanatory variables is $p-1$. Adding useless explanatory variables to the model will increase the explained sum of squares by at least a little, but the denominator of $MSE_F$ will go down by one, making $MSE_F$ bigger, and $F$ smaller. The smaller the sample size $n$, the worse the effect of useless explanatory variables. You have to be selective.
\item The (internal) validity of most experimental research depends on experimental designs and procedures that balance sources of extraneous variation evenly across treatments. But even better are careful experimental procedures that eliminate random noise altogether, or at least hold it to very low levels. Reduce sources of random variation, and the residuals will be smaller. The $MSE_F$ will be smaller, and $F$ will be bigger if something is really going on.
\item Most response variables are just indirect reflections of what the investigator would really like to study, and in designing their studies, scientists routinely make decisions that are tradeoffs between expense (or convenience) and data quality. When response variables represent low-quality measurement, they essentially contain random variation that cannot be explained. This variation will show up in the denominator of (\ref{ExtraSS}), reducing the chance of detecting real results against the background noise. An example of a response variable that might have too much noise would be a questionnaire or subscale of a questionnaire with just a few items.
\end{itemize}
\end{itemize}

The comments above sneaked in the topic of \textbf{statistical power} by discussing the formula for the $F$-test. Statistical power is \emph{the probability of getting significant results when something is really going on in the population}. It should be clear that high power is good. We have just seen that statistical power can be increased by including important explanatory variables in the study, by carefully controlled experimental conditions, and by quality measurement. Power can also be increased by increasing the sample size. All this is true in general, and does not depend on the use of the traditional $F$ test.
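Returning to the mechanics of the test itself, here is a hedged SAS sketch of testing a subset of explanatory variables while controlling for the others. The data set \texttt{study2} and the variable names are made up; $x_1$ and $x_2$ play the role of subset $B$, while $x_3$ and $x_4$ play the role of subset $A$. The labelled \texttt{test} statement produces the $F$ statistic of Equation~(\ref{ExtraSS}), and fitting the reduced model separately lets you see how much $R^2$ increases when the two extra variables are added.
\begin{verbatim}
/* A hypothetical sketch of testing a subset of explanatory variables
   while controlling for the others.  Data set and variable names are
   made up.  The test statement gives the full-versus-reduced-model F. */
proc reg data = study2;
     full:    model y = x1 x2 x3 x4;
     subsetA: test x3 = 0, x4 = 0;   /* Test x3 and x4, controlling
                                         for x1 and x2                 */
     reduced: model y = x1 x2;       /* For comparing the R-squares    */
\end{verbatim}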
Power and sample size are discussed further in Chapter~\ref{SAMPLESIZE}.

\subsection{Connections between Explained Variation and Significance Testing}

If you divide the numerator and denominator of Equation~(\ref{ExtraSS}) by $SSTO$, the numerator becomes $(R^2_F - R^2_R)/r$, so we see that the $F$ test is based on the change in $R^2$ when one moves from the reduced model to the full model. But the $F$ test for the extra variables (controlling for the ones in the reduced model) is based not just on $R^2_F - R^2_R$, but on a quantity that will be denoted by
\begin{equation}\label{aformula}
a = \frac{R^2_F - R^2_R}{1-R^2_R}.
\end{equation}
This expresses the change in $R^2$ as a \emph{proportion} of the variation left unexplained by the reduced model. That is, it's the \emph{proportion of remaining variation} that the additional variables explain.

This is actually a more informative quantity than the simple change in $R^2$. For example, suppose you're controlling for a set of variables that explain 80\% of the variation in the response variable, and you test a variable that accounts for an additional 5\%. You have explained 25\% of the remaining variation -- much more impressive than 5\%.

The $a$ notation is non-standard. It's sometimes called a squared multiple partial correlation, but the usual notation for partial correlations is intricate and hard to look at, so we'll just use $a$.

You may recall that an $F$ test has two degrees of freedom values, a numerator degrees of freedom and a denominator degrees of freedom. In the $F$ test for a full versus reduced model, the numerator degrees of freedom is $r$, the number of extra variables. The denominator degrees of freedom is $n-p$. Recall that the sample size is $n$, and if the regression model has an intercept, there are $p-1$ explanatory variables. Applying a bit of high school algebra to Equation~(\ref{ExtraSS}), we see that the relationship between $F$ and $a$ is
\begin{equation}
F = \left( \frac{n-p}{r} \right) \left( \frac{a}{1-a} \right),
\label{F(a)}
\end{equation}
so that for any given sample size, the bigger $a$ is, the bigger $F$ becomes. Also, for a given value of $a \neq 0$, $F$ increases as a function of $n$. This means you can get a large $F$ (and if it's large enough it will be significant) from strong results and a small sample, \emph{or} from weak results and a large sample. Again, examining the formula for the $F$ statistic yields a valuable insight.

Expression~(\ref{F(a)}) for $F$ can be turned around to express $a$ in terms of $F$, as follows:
\begin{equation}
a = \frac{rF}{n-p+rF}
\label{a(F)}
\end{equation}
This is a useful formula, because scientific journals often report just $F$ values, degrees of freedom and $p$-values. It's easy to tell whether the results are significant, but not whether the results are strong in the sense of explained variation. But the equality~(\ref{a(F)}) above lets you recover information about strength of relationship from the $F$ statistic and its degrees of freedom. For example, based on a three-way ANOVA where the response variable is rot in potatoes, suppose the authors write ``The interaction of bacteria by temperature was just barely significant ($F$=3.26,~df=2,36,~$p$=0.05)." What we want to know is, once one controls for other effects in the model, what proportion of the remaining variation is explained by the temperature-by-bacteria interaction? We have $r$=2, $n-p=36$, and $a = \frac{2 \times 3.26}{36+(2 \times 3.26)} = 0.153$.
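Because \texttt{proc iml} is handy for little computations like this one, here is a hedged sketch of the calculation based on Formula~(\ref{a(F)}); the numbers are the ones from the hypothetical potato example, and the variable names are made up.
\begin{verbatim}
/* Proportion of remaining variation recovered from a reported F
   statistic.  F = 3.26 with df = 2,36 comes from the hypothetical
   potato example in the text; ndf means n - p.                    */
proc iml;
     F = 3.26;  r = 2;  ndf = 36;
     a = r*F / (ndf + r*F);
     print "Proportion of remaining variation: " a;
\end{verbatim}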
So this effect is explaining a respectable 15\% of the variation that remains after controlling for all the other main effects and interactions in the model.

\section{Interactions in Regression: It Depends}
\label{INTERACTIONSINREGRESSION} % See old word processed Ch. 7, p. 60, metric cars data!

Rough draft begins on the next page.

\includepdf[pages=-]{Interactions.pdf} % Include all pages

% \subsection{Categorical by Quantitative}\label{QUANTBYCAT}
% \subsection{Quantitative by Quantitative}\label{QUANTBYQUANT}
% \subsection{Categorical by Categorical}\label{CATBYCAT}

The discussion of interactions involving two or more categorical explanatory variables will be continued in Chapter~\ref{FACTORIALANOVA}. The details begin on page~\pageref{EFFECTSASCONTRASTS}.

\section{Scheff\'e Tests for Regression}
\label{SCHEFFEREGRESSION}

This section provides a brief but very powerful extension of the Scheff\'e tests to multiple regression. Suppose the initial hypothesis is that $r$ regression coefficients are all equal to zero. We will follow up the initial test by testing whether $s$ linear combinations of these regression coefficients are different from zero; $s \leq r$. Notice that now we are testing \emph{linear combinations}, not just contrasts. If a set of coefficients are all zero, then any linear combination (weighted sum) of the coefficients is also zero. Thus the null hypotheses of the follow-up tests are implied by the null hypothesis of the initial test. As in the case of Scheff\'e tests for contrasts in one-way ANOVA, using an adjusted critical value guarantees simultaneous protection for all the follow-up tests at the same significance level as the initial test. This means we have proper follow-ups (see Section~\ref{PROPERFOLLOWUPS}). The formula for the adjusted Scheff\'e critical value is \begin{equation} \label{scrit2} f_{Sch} = \frac{r}{s} f_{crit}, \end{equation} where again, the null hypothesis of the initial test is that $r$ regression coefficients are all zero, and the null hypothesis of the follow-up test is that $s$ linear combinations of those coefficients are equal to zero. Actually, Formula~\ref{scrit2} is even more general. It applies to testing arbitrary linear combinations of regression coefficients. The initial test is a test of $r$ linear constraints\footnote{A linear constraint is just a statement that some linear combination equals a constant.} on the regression coefficients, and the follow-up test is a test of $s$ linear constraints, where $s \leq r$.

Here is part of the output file from running \texttt{proc reg} on the Dwaine Studios data of Neter et al., in which annual sales of photographic portrait studios (in thousands of dollars) are predicted from the number of children 16 and under (in thousands) and average family income (in thousands of dollars) for each of 21 towns.

\begin{scriptsize}\begin{verbatim}
                              Analysis of Variance

                                     Sum of           Mean
    Source                   DF     Squares         Square    F Value    Pr > F
    Model                     2       24015          12008      99.10    <.0001
    Error                    18  2180.92741      121.16263
    Corrected Total          20       26196

    Root MSE             11.00739    R-Square     0.9167
    Dependent Mean      181.90476    Adj R-Sq     0.9075
    Coeff Var             6.05118

                              Parameter Estimates

                           Parameter       Standard
      Variable     DF       Estimate          Error    t Value    Pr > |t|
      Intercept     1      -68.85707       60.01695      -1.15      0.2663
      kids          1        1.45456        0.21178       6.87      <.0001
      income        1        9.36550        4.06396       2.30      0.0333
\end{verbatim}\end{scriptsize}

\noindent Here are some comments on the output file. \begin{itemize} \item First comes the ANOVA summary table for the overall $F$-test, testing all the explanatory variables simultaneously. In \texttt{Corrected Total}, ``Corrected'' means corrected for the sample mean. \item \texttt{Root MSE} is the square root of Mean Square Error (MSE). \item \texttt{Dependent Mean} is the mean of the response variable. \item \texttt{Coeff Var} is the coefficient of variation -- the standard deviation divided by the mean. Who cares?
\item \texttt{R-square} is $R^2$ \item \texttt{Adj R-sq}: Since $R^2$ never goes down when you add explanatory variables, models with more variables always look as if they are doing better. Adjusted $R^2$ is an attempt to penalize the usual $R^2$ for the number of explanatory variables in the model. It can be useful if you are trying to compare the predictive usefulness of models with different numbers of explanatory variables. \item \texttt{Parameter Estimates} are the $b$ values corresponding to the explanatory variables listed. The one corresponding to \texttt{Intercept} is $b_0$. \texttt{Standard Error} is the (estimated) standard deviation of the sampling distribution of $b$. It's the denominator of the $t$ test in the next column. \item The last column is a two-tailed $p$-value for the $t$-test, testing whether the regression coefficient is zero. \end{itemize} \noindent Here are some sample questions based on the output file. %%%%%%%% \begin{quest} Suppose we wish to test simultaneously whether number of kids 16 and under and average family income have any relationship to sales. Give the value of the test statistic, and the associated $p$-value. \end{quest} \begin{answ} $F=99.103$, $p<0.0001$ \end{answ} %%%%%%%% \begin{quest} What can you conclude from just this one test? \end{quest} \begin{answ} Sales is related to either number of kids 16 and under, or average family income, or both. But you'd never do this. You have to look at the rest of the printout to tell what's happening. \end{answ} %%%%%%%% \begin{quest} What percent of the variation in sales is explained by number of kids 16 and under and average family income? \end{quest} \begin{answ} 91.67\% \end{answ} %%%%%%%% \begin{quest} Controlling for average family income, is number of kids 16 and under related to sales? \begin{enumerate} \item What is the value of the test statistic? \item What is the $p$-value? \item Are the results significant? Answer Yes or No. \item Is the relationship positive, or negative? \end{enumerate} \end{quest} \begin{answ} \end{answ} \begin{enumerate} \item $t$ = 6.868 \item $p<0.0001$ \item Yes. \item Positive. \end{enumerate} \begin{quest} Controlling for number of kids 16 and under is average family income related to sales? \begin{enumerate} \item What is the value of the test statistic? \item What is the $p$-value? \item Are the results significant? Answer Yes or No. \item Is the relationship positive, or negative? \end{enumerate} \end{quest} \begin{answ} \end{answ} \begin{enumerate} \item $t$ = 2.305 \item $p=0.0333$ \item Yes. \item Positive. \end{enumerate} \begin{quest} What do you conclude from this entire analysis? Direct your answer to a statistician or researcher. \end{quest} \begin{answ} Number of kids 16 and under and average family income are both related to sales, even when each variable is controlled for the other. \end{answ} \begin{quest} What do you conclude from this entire analysis? Direct your answer to a person without statistical training. \end{quest} \begin{answ} Even when you allow for the number of kids 16 and under in a town, the higher the average family income in the town, the higher the average sales. When you allow for the average family income in a town, the higher the number of children under 16, the higher the average sales. \end{answ} \begin{quest} A new studio is to be opened in a town with 65,400 children 16 and under, and an average household income of \$17,600. What annual sales do you predict? 
\label{yhatquest} \end{quest} \begin{answ} $\widehat{Y} = b_0 + b_1x_1 + b_2x_2$ = -68.857073 + 1.454560*65.4 + 9.365500*17.6 = 191.104, so predicted annual sales = \$191,104. \end{answ} \begin{quest} For any fixed value of average income, what happens to predicted annual sales when the number of children under 16 increases by one thousand? \end{quest} \begin{answ} Predicted annual sales goes up by \$1,454. \end{answ} \begin{quest} What do you conclude from the $t$-test for the intercept? \end{quest} \begin{answ} Nothing. Who cares if annual sales equals zero for towns with no children under 16 and an average household income of zero? \end{answ}

The final two questions ask for a proportion of remaining variation, the quantity we are denoting by $a$. In the published literature, sometimes all you have are reports of $t$-tests for regression coefficients. \begin{quest} Controlling for average household income, what proportion of the remaining variation is explained by number of children under 16? \end{quest} \begin{answ} Using $F=t^2$ and plugging into (\ref{a(F)}), we have $a=\frac{1 \times 6.868^2}{21-3+1 \times 6.868^2}$ = 0.7238, or about 72\% of the remaining variation. \end{answ} \begin{quest} Controlling for number of children under 16, what proportion of the remaining variation is explained by average household income? \label{asampq} \end{quest} \begin{answ} $a = \frac{2.305^2}{18+2.305^2} = 0.2278994$, or about 23\%. \end{answ} These $a$ values are large, but the sample size is small; after all, it's a textbook example, not real data. Now here is a program file that illustrates some options, and gives you a hint of what a powerful tool SAS \texttt{proc~reg} can be. \pagebreak

\begin{verbatim}
/* appdwaine2.sas */
title 'Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al';
title2 'With bells and whistles';

data portrait;
     infile '/folders/myfolders/dwaine.data';
     input kids income sales;

proc reg simple corr;    /* "simple" prints simple descriptive statistics */
     model sales = kids income / ss1;  /* "ss1" prints Sequential SS */
     output out=resdata predicted=presale residual=resale;
     /* Creates new SAS data set with Y-hat and e as additional variables*/
     /* Now all the default F-tests, in order */
     allivs: test kids = 0, income = 0;
     inter:  test intercept=0;
     child:  test kids=0;
     money:  test income=0;

proc iml;  /* Income controlling for kids: Full vs reduced by "hand" */
     fcrit = finv(.95,1,18); print fcrit;
     /* Had to look at printout from an earlier run to get these numbers*/
     f = 643.475809 / 121.16263;  /* Using the first F formula */
     pval = 1-probf(f,1,18);
     tsq = 2.305**2;              /* t-squared should equal F*/
     a = 643.475809/(26196.20952 - 23372);
     print f tsq pval;
     print "Proportion of remaining variation is " a;

proc glm;  /* Use proc glm to get a y-hat more easily */
     model sales=kids income;
     estimate 'Xh p249' intercept 1 kids 65.4 income 17.6;

proc print;  /* To see the new data set with residuals*/

proc univariate normal plot;
     var resale;

proc plot;
     plot resale * (kids income sales);
\end{verbatim}

Here are some comments on \texttt{appdwaine2.sas}. \begin{itemize} \item \texttt{simple corr} You could get means and standard deviations from \texttt{proc means} and correlations from \texttt{proc corr}, but this is convenient. \item \texttt{ss1} These are Type I Sums of Squares, produced by default in \texttt{proc glm}. In \texttt{proc reg}, you must request them with the \texttt{ss1} option if you want to see them.
The explanatory variables in the \texttt{model} statement are added to the model in order. For each variable, the \texttt{Type~I~SS} is the \emph{increase} in explained sum of squares that comes from adding each variable to the model, in the order they appear in the \texttt{model} statement. The $t$-tests correspond to \texttt{proc glm}'s Type~III sums of squares; everything is controlled for everything else. \item \texttt{output} creates a new sas data set called \texttt{resdata}. It has all the variables in the data set \texttt{portrait}, and in addition it has $\widehat{Y}$ (named \texttt{presale} for predicted sales) and $e$ (named \texttt{resale} for residual of sales). \item Then we have some custom tests, all of them equivalent to what we would get by testing a full versus reduced model. SAS takes the approach of testing whether $s$ linear combinations of $\beta$ values equal $s$ specified constants (usually zero). Again, this is the same thing as testing a full versus a reduced model. The form of a custom test in \texttt{proc reg} is \begin{enumerate} \item A name for the test, 8 characters or less, followed by a colon; this name will be used to label the output. \item the word \texttt{test}. \item $s$ linear combinations of explanatory variable names, each set equal to some constant, separated by commas. \item A semi-colon to end, as usual. \end{enumerate} If you want to think of the significance test in terms of a collection of linear combinations that specify constraints on the $\beta$ values (this is what a statistician would appreciate), then we would say that the names of the explanatory variables (including the weird variable ``intercept") are being used to refer to the corresponding $\beta$s. But usually, you are testing a subset of explanatory variables controlling for some other subset. In this case, include all the variables in the \texttt{model} statement, and set the variables you are testing equal to zero in the \texttt{test} statement. Commas are optional. As an example, for the test \texttt{allivs} (all explanatory variables) we could have written \texttt{allivs:~test~kids~=~income~=~0;}. \item Now suppose you wanted to use the Sequential Sums of Squares to test \texttt{income} controlling for \texttt{kids}. You could use a calculator and a table of the $F$ distribution from a textbook, but for larger sample sizes the exact denominator degrees of freedom you need are seldom in the table, and you have to interpolate in the table. With \texttt{proc iml} (Interactive Matrix Language), which is actually a nice programming environment, you can use SAS as your calculator. Among other things, you can get exact critical values and $p$-values quite easily. Statistical tables are obsolete. In this example, we first get the \textbf{critical value} for $F$; \emph{if the test statistic is bigger than the critical value, the result is significant}. Then we calculate $F$ using formula \ref{ExtraSS}, and obtain its $p$-value. This $F$ should be equal to the square of the $t$ statistic from the printout, so we check. Then we use (\ref{a(F)}) to calculate $a$, and print the results. \item \texttt{proc glm} The \texttt{glm} procedure is very useful when you have categorical explanatory variables, because it makes your dummy variables for you. But it also can do multiple regression. This example calls attention to the \texttt{estimate} command, which lets you calculate $\widehat{Y}$ values more easily and with less chance of error compared to a calculator or \texttt{proc iml}. 
\item \texttt{proc print} prints all the data values, for all the variables. This is a small data set, so it's not producing a telephone book here. You can limit the variables and the number of cases it prints; see the manual or \emph{Applied statistics and the SAS programming language} \cite{cs91}. By default, all SAS procedures use the most recently created SAS data set; this is \texttt{resdata}, which was created by \texttt{proc reg} -- so the predicted values and residuals will be printed by \texttt{proc print}. \item You didn't notice, but \texttt{proc glm} also used \texttt{resdata} rather than \texttt{portrait}. But it was okay, because \texttt{resdata} has all the variables in \texttt{portrait}, and \emph{also} the predicted $Y$ and the residuals. \item \texttt{proc univariate} produces a lot of useful descriptive statistics, along with a fair amount of junk. The \texttt{normal} option gives some tests for normality, and texttt{plot} generates some line-printer plots like boxplots and stem-and-leaf displays. These are sometimes informative. It's a good idea to run the residuals (from the full model) through \texttt{proc univariate} if you're starting to take an analysis seriously. \item \texttt{proc plot} This is how you would plot residuals against variables in the model. It the data file had additional variables you were \emph{thinking} of including in the analysis, you could plot them against the residuals too, and look for a correlation. My personal preference is to start plotting residuals fairly late in the exploratory game, once I am starting to get attached to a regression model. \end{itemize} \noindent Here is the output. \begin{scriptsize}\begin{verbatim} Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 1 With bells and whistles The REG Procedure Number of Observations Read 21 Number of Observations Used 21 Descriptive Statistics Uncorrected Standard Variable Sum Mean SS Variance Deviation Intercept 21.00000 1.00000 21.00000 0 0 kids 1302.40000 62.01905 87708 346.71662 18.62033 income 360.00000 17.14286 6190.26000 0.94157 0.97035 sales 3820.00000 181.90476 721072 1309.81048 36.19130 Correlation Variable kids income sales kids 1.0000 0.7813 0.9446 income 0.7813 1.0000 0.8358 sales 0.9446 0.8358 1.0000 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 2 With bells and whistles The REG Procedure Model: MODEL1 Dependent Variable: sales Number of Observations Read 21 Number of Observations Used 21 Analysis of Variance Sum of Mean Source DF Squares Square F Value Pr > F Model 2 24015 12008 99.10 <.0001 Error 18 2180.92741 121.16263 Corrected Total 20 26196 Root MSE 11.00739 R-Square 0.9167 Dependent Mean 181.90476 Adj R-Sq 0.9075 Coeff Var 6.05118 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 3 With bells and whistles The REG Procedure Model: MODEL1 Dependent Variable: sales Parameter Estimates Parameter Standard Variable DF Estimate Error t Value Pr > |t| Type I SS Intercept 1 -68.85707 60.01695 -1.15 0.2663 694876 kids 1 1.45456 0.21178 6.87 <.0001 23372 income 1 9.36550 4.06396 2.30 0.0333 643.47581 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 4 With bells and whistles The REG Procedure Model: MODEL1 Test allivs Results for Dependent Variable sales Mean Source DF Square F Value Pr > F Numerator 2 12008 99.10 <.0001 Denominator 18 121.16263 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 5 With bells and whistles The REG Procedure Model: MODEL1 Test inter Results for Dependent 
Variable sales Mean Source DF Square F Value Pr > F Numerator 1 159.48430 1.32 0.2663 Denominator 18 121.16263 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 6 With bells and whistles The REG Procedure Model: MODEL1 Test child Results for Dependent Variable sales Mean Source DF Square F Value Pr > F Numerator 1 5715.50583 47.17 <.0001 Denominator 18 121.16263 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 7 With bells and whistles The REG Procedure Model: MODEL1 Test money Results for Dependent Variable sales Mean Source DF Square F Value Pr > F Numerator 1 643.47581 5.31 0.0333 Denominator 18 121.16263 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 8 With bells and whistles fcrit 4.4138734 f tsq pval 5.3108439 5.313025 0.0333214 a Proportion of remaining variation is 0.2278428 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 9 With bells and whistles The GLM Procedure Number of Observations Read 21 Number of Observations Used 21 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 10 With bells and whistles The GLM Procedure Dependent Variable: sales Sum of Source DF Squares Mean Square F Value Pr > F Model 2 24015.28211 12007.64106 99.10 <.0001 Error 18 2180.92741 121.16263 Corrected Total 20 26196.20952 R-Square Coeff Var Root MSE sales Mean 0.916746 6.051183 11.00739 181.9048 Source DF Type I SS Mean Square F Value Pr > F kids 1 23371.80630 23371.80630 192.90 <.0001 income 1 643.47581 643.47581 5.31 0.0333 Source DF Type III SS Mean Square F Value Pr > F kids 1 5715.505835 5715.505835 47.17 <.0001 income 1 643.475809 643.475809 5.31 0.0333 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 11 With bells and whistles The GLM Procedure Dependent Variable: sales Standard Parameter Estimate Error t Value Pr > |t| Xh p249 191.103930 2.76679783 69.07 <.0001 Standard Parameter Estimate Error t Value Pr > |t| Intercept -68.85707315 60.01695322 -1.15 0.2663 kids 1.45455958 0.21178175 6.87 <.0001 income 9.36550038 4.06395814 2.30 0.0333 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 12 With bells and whistles Obs kids income sales presale resale 1 68.5 16.7 174.4 187.184 -12.7841 2 45.2 16.8 164.4 154.229 10.1706 3 91.3 18.2 244.2 234.396 9.8037 4 47.8 16.3 154.6 153.329 1.2715 5 46.9 17.3 181.6 161.385 20.2151 6 66.1 18.2 207.5 197.741 9.7586 7 49.5 15.9 152.8 152.055 0.7449 8 52.0 17.2 163.2 167.867 -4.6666 9 48.9 16.6 145.4 157.738 -12.3382 10 38.4 16.0 137.2 136.846 0.3540 11 87.9 18.3 241.9 230.387 11.5126 12 72.8 17.1 191.1 197.185 -6.0849 13 88.4 17.4 232.0 222.686 9.3143 14 42.9 15.8 145.3 141.518 3.7816 15 52.5 17.8 161.1 174.213 -13.1132 16 85.7 18.4 209.7 228.124 -18.4239 17 41.3 16.5 146.4 145.747 0.6530 18 51.7 16.3 144.0 159.001 -15.0013 19 89.6 18.1 232.6 230.987 1.6130 20 82.7 19.1 224.1 230.316 -6.2161 21 52.3 16.0 166.5 157.064 9.4356 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 13 With bells and whistles The UNIVARIATE Procedure Variable: resale (Residual) Moments N 21 Sum Weights 21 Mean 0 Sum Observations 0 Std Deviation 10.442527 Variance 109.046371 Skewness -0.0970495 Kurtosis -0.7942686 Uncorrected SS 2180.92741 Corrected SS 2180.92741 Coeff Variation . Std Error Mean 2.27874622 Basic Statistical Measures Location Variability Mean 0.000000 Std Deviation 10.44253 Median 0.744918 Variance 109.04637 Mode . 
Range 38.63896 Interquartile Range 15.65166 Tests for Location: Mu0=0 Test -Statistic- -----p Value------ Student's t t 0 Pr > |t| 1.0000 Sign M 2.5 Pr >= |M| 0.3833 Signed Rank S 1.5 Pr >= |S| 0.9599 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 14 With bells and whistles The UNIVARIATE Procedure Variable: resale (Residual) Tests for Normality Test --Statistic--- -----p Value------ Shapiro-Wilk W 0.954073 Pr < W 0.4056 Kolmogorov-Smirnov D 0.147126 Pr > D >0.1500 Cramer-von Mises W-Sq 0.066901 Pr > W-Sq >0.2500 Anderson-Darling A-Sq 0.432299 Pr > A-Sq >0.2500 Quantiles (Definition 5) Quantile Estimate 100% Max 20.215072 99% 20.215072 95% 11.512629 90% 10.170574 75% Q3 9.435601 50% Median 0.744918 25% Q1 -6.216062 10% -13.113212 5% -15.001313 1% -18.423890 0% Min -18.423890 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 15 With bells and whistles The UNIVARIATE Procedure Variable: resale (Residual) Extreme Observations ------Lowest----- ------Highest----- Value Obs Value Obs -18.4239 16 9.75858 6 -15.0013 18 9.80368 3 -13.1132 15 10.17057 2 -12.7841 1 11.51263 11 -12.3382 9 20.21507 5 Stem Leaf # Boxplot 2 0 1 | 1 | 1 0002 4 | 0 99 2 +-----+ 0 011124 6 *--+--* -0 | | -0 665 3 +-----+ -1 332 3 | -1 85 2 | ----+----+----+----+ Multiply Stem.Leaf by 10**+1 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 16 With bells and whistles The UNIVARIATE Procedure Variable: resale (Residual) Normal Probability Plot 22.5+ *++++ | +++++ | ++*+* | **+*+* 2.5+ *****+* | *+++ | +++** | ++*+* * -17.5+ *++++* +----+----+----+----+----+----+----+----+----+----+ -2 -1 0 +1 +2 Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 17 With bells and whistles Plot of resale*kids. Legend: A = 1 obs, B = 2 obs, etc. | | 20 + A | | | | A 10 + A A A A A R | e | s | A i | A A d 0 + A A A u | a | A l | A A | -10 + | A A | A | A | A -20 + | -+---------+---------+---------+---------+---------+---------+---------+- 30 40 50 60 70 80 90 100 kids Plot of resale*income. Legend: A = 1 obs, B = 2 obs, etc. | | 20 + A | | | | A 10 + A A A B R | e | s | A i | A A d 0 + A A A u | a | A l | A A | -10 + | AA | A | A | A -20 + | ---+-------+-------+-------+-------+-------+-------+-------+-------+-- 15.5 16.0 16.5 17.0 17.5 18.0 18.5 19.0 19.5 income Dwaine Studios Example from Chapter 6 (Section 6.9) of Neter et al 19 With bells and whistles Plot of resale*sales. Legend: A = 1 obs, B = 2 obs, etc. | | 20 + A | | | | A 10 + AA A A A R | e | s | A i | A A d 0 + A A A u | a | A l | A A | -10 + | A A | A | A | A -20 + | -+---------+---------+---------+---------+---------+---------+---------+- 120 140 160 180 200 220 240 260 sales \end{verbatim}\end{scriptsize} \noindent Here are some comments. \begin{itemize} \item \texttt{proc reg} \begin{itemize} \item In the descriptive statistics produced by the \texttt{simple} option, one of the ``variables" is \texttt{INTERCEP}; it's our friend $X_0=1$. The SAS programmers (or the statisticians directing them) are really thinking of this as an explanatory variable. \item The Type I (sequential) sum of squares starts with \texttt{INTERCEP}, and a really big number for the explained sum of squares. Well, think of a reduced model that does not even have an intercept --- that is, one in which there are not only no explanatory variables, but the population mean is zero. Then add an intercept, so the full model is $E[Y]=\beta_0$. 
The least squares estimate of $\beta_0$ is $\overline{Y}$, so the improvement in explained sum of squares is $\sum_{i=1}^n (Y_i - \overline{Y})^2 = SSTO$. That's the first line. It makes sense, in a twisted way. \item Then we have the custom tests, which reproduce the default tests, in order. See how useful the \emph{names} of the custom tests can be? \end{itemize} \item \texttt{proc iml}: Everything works as advertised. $F=t^2$ except for rounding error, and $a$ is exactly what we got as the answer to Sample Question~\ref{asampq}. \item \texttt{proc glm} \begin{itemize} \item After an overall test, we get tests labelled \texttt{Type I SS} and \texttt{Type III SS}. As mentioned earlier, Type One sums of squares are sequential. Each variable is added in turn to the model, in the order specified by the model statement. Each one is tested controlling for the ones that precede it --- except that the denominator of the $F$ ratio is MSE from the model including \emph{all} the explanatory variables. \item When explanatory variables are correlated with each other and with the response variable, some of the variation in the response variable is being explained by the variation \emph{shared} by the correlated explanatory variables. Which one should get credit? If you use sequential sums of squares, the variable named first \emph{by you} gets all the credit. And your conclusions can change radically as a result of the order in which you name the explanatory variables. This may be okay, if you have strong reasons for testing $A$ controlling for $B$ and not the other way around. In Type Three sums of squares, each variable is controlled for \emph{all} the others. This way, nobody gets credit for the overlap. It's conservative, and valuable. Naturally, the last lines of Type I and Type III summary tables are identical, because in both cases, the last variable named is being controlled for all the others. \item I can never remember what Type II and Type IV sums of squares are. \item The \texttt{estimate} statement yielded an \texttt{Estimate}, that is, a $\widehat{Y}$ value, of 191.103930, which is what we got with a calculator as the answer to Sample Question \ref{yhatquest}. We also get a $t$-test for whether this particular linear combination differs significantly from zero --- insane in this particular case, but useful at other times. The standard error would be very useful if we were constructing confidence intervals or prediction intervals around the estimate, but we are not. \item Then we get a display of the $b$ values and associated $t$-tests, as in \texttt{proc reg}. \texttt{proc glm} produces these by default only when none of the explanatory variables is declared categorical with the \texttt{class} statement. If you have categorical explanatory variables, you can request parameter estimates with the \texttt{parms} option. \end{itemize} \item \texttt{proc print} output is self-explanatory. If you are using \texttt{proc print} to print a large number of cases, consider specifying a large page size in the \texttt{options} statement. Then, the \emph{logical} page length will be very long, as if you were printing on a long roll of paper, and SAS will not print a new page header with the date and title and so on every 24 line or 35 lines or whatever. \item \texttt{proc univariate}: There is so much output to explain, I almost can't stand it. I'll just hit a few high points here. \begin{itemize} \item \texttt{T:Mean=0} A $t$-test for whether the mean is zero. 
If the variable consisted of difference scores, this would be a matched $t$-test. Here, because the mean of residuals from a multiple regression is \emph{always} zero as a by-product of least-squares, $t$ is exactly zero and the $p$-value is exactly one. \item \texttt{M(Sign)} Sign test, a non-parametric equivalent to the matched $t$. \item \texttt{Sgn Rank} Wilcoxon's signed rank test, another non-parametric equivalent to the matched $t$. \item \texttt{W:Normal} A test for normality. As you might infer from \texttt{Pr < W}, the null hypothesis is that the data come from a normal distribution, so a small $p$-value is evidence that the variable (here, the residuals) is \emph{not} normally distributed. \end{itemize} \end{itemize}

\chapter{Logistic Regression}

In this chapter the response variable is \emph{binary}: it takes just two values, which will be denoted $Y=1$ and $Y=0$. Instead of modelling the conditional expected value of $Y$ as a linear function of the explanatory variables, we model the natural log of the \emph{odds} that $Y=1$. If an event has probability $P$, the odds of the event are $\frac{P}{1-P}$ -- the probability that the event happens, divided by the probability that it does not happen. Here are some properties of odds and log odds. \begin{itemize} \item If $P(A)>P(B)$, then $\mbox{Odds}(A)>\mbox{Odds}(B)$, and therefore $\ln(\mbox{Odds}(A))>\ln(\mbox{Odds}(B))$. That is, the bigger the probability, the bigger the log odds. \item Notice that the natural log is only defined for positive numbers. This is usually fine, because odds are always positive or zero. But if the odds are zero, then the natural log is either minus infinity or undefined -- so the methods we are developing here will not work for events of probability exactly zero or exactly one. What's wrong with a probability of one? You'd be dividing by zero when you calculated the odds. \item The natural log is the inverse of exponentiation, meaning that $\ln(e^x)=e^{\ln(x)}=x$, where $e$ is the magic non-repeating decimal number 2.71828\ldots. The number $e$ really is magical, appearing in such seemingly diverse places as the mathematical theory of epidemics, the theory of compound interest, and the normal distribution. \item The log of a product is the sum of logs: $\ln(ab)=\ln(a)+\ln(b)$, and $\ln(\frac{a}{b})=\ln(a)-\ln(b)$. This means the log of an odds \emph{ratio} is the difference between the two log odds quantities. \end{itemize}

To get back to the main point, we adopt a linear regression model for the log odds of the event $Y=1$. As in normal regression, there is a conditional distribution of the response variable $Y$ for every configuration of explanatory variable values. Keeping the notation consistent with ordinary regression, we have $p-1$ explanatory variables, and the conditional distribution of the binary response variable $Y$ is completely specified by the log odds \begin{equation}\label{logodds} \ln\left(\frac{P(Y=1|\mathbf{X=x})}{P(Y=0|\mathbf{X=x})} \right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}. \end{equation} This is equivalent to a \emph{multiplicative} model for the odds \begin{eqnarray}\label{multmodel} \frac{P(Y=1|\mathbf{X=x})}{P(Y=0|\mathbf{X=x})} & = & e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}} \\ & = & e^{\beta_0} e^{\beta_1 x_1} \cdots e^{\beta_{p-1} x_{p-1}}, \nonumber \end{eqnarray} and to a distinctly non-linear model for the conditional probability of $Y=1$ given $\mathbf{X}=(x_1, \ldots, x_{p-1})$: \begin{equation} \label{probformula} P(Y=1|x_1, \ldots, x_{p-1}) = \frac{e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}}} {1+e^{\beta_0 + \beta_1 x_1 + \ldots + \beta_{p-1} x_{p-1}}}. \end{equation}

\section{The meaning of the regression coefficients}

In the log odds world, the interpretation of regression coefficients is similar to what we have seen in ordinary regression. $\beta_0$ is the intercept. It's the log odds of $Y=1$ when all explanatory variables equal zero. And $\beta_k$ is the increase in log odds of $Y=1$ when $x_k$ is increased by one unit, and all other explanatory variables are held constant. This is on the scale of log odds. But frequently, people choose to think in terms of plain old odds rather than log odds.
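To see how the log odds, odds and probability scales fit together -- that is, how Equations~(\ref{logodds}), (\ref{multmodel}) and~(\ref{probformula}) are three ways of writing the same model -- here is a small \texttt{proc iml} sketch. The $\beta$ values and explanatory variable values are completely made up; any numbers would do.

\begin{verbatim}
proc iml;
     /* Hypothetical regression coefficients and one set of
        explanatory variable values, purely for illustration.       */
     beta    = {-2, 0.8, 0.5};       /* beta0, beta1, beta2          */
     x       = {1, 2, 1};            /* 1 for the intercept, x1, x2  */
     logodds = t(beta)*x;            /* Linear model for log odds    */
     odds    = exp(logodds);         /* Multiplicative model for odds*/
     p       = odds/(1+odds);        /* Conditional probability      */
     check   = exp(logodds)/(1+exp(logodds));  /* Same thing         */
     print logodds odds p check;
quit;
\end{verbatim}

The printed probability and the \texttt{check} quantity agree, as they must.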
The rest of this section is an explanation of the following statement: \emph{When $x_k$ is increased by one unit, and all other explanatory variables are held constant, the odds of $Y=1$ are multiplied by $e^{\beta_k}$.} That is, $e^{\beta_k}$ is an \textbf{odds ratio} --- the ratio of the odds of $Y=1$ when $x_k$ is increased by one unit, to the odds of $Y=1$ when $x_k$ is left alone. As in ordinary regression, this idea of holding all the other variables constant is what we mean when we speak of ``controlling" for them.

\paragraph{Odds ratio with a single dummy variable} Here is a statement that makes sense and seems like it should be approximately true: ``Among 50 year old men, the odds of being dead before age 60 are three times as great for smokers." We are talking about an odds ratio. \begin{displaymath} \frac{\mbox{Odds of death given smoker}}{\mbox{Odds of death given nonsmoker}} = 3 \end{displaymath} The point is not that the true odds ratio is exactly 3. The point is that this is a reasonable way to express how the chances of being alive might depend on whether you smoke cigarettes.

Now represent smoking status by an indicator dummy variable, with $X=1$ meaning Smoker, and $X=0$ meaning nonsmoker; let $Y=1$ mean death within 10 years and $Y=0$ mean life. The logistic regression model~(\ref{logodds}) for the log odds of death given $x$ is \begin{displaymath} \mbox{Log odds} = \beta_0 + \beta_1 x, \end{displaymath} and from~(\ref{multmodel}), the odds of death given $x$ are \begin{displaymath} \mbox{Odds} = e^{\beta_0} e^{\beta_1 x}. \end{displaymath} The table below shows the odds of death for smokers and non-smokers. {\begin{center} \begin{tabular}{|l|c|l|} \hline \textbf{Group} & $x$ & \textbf{Odds of Death} \\ \hline Smokers & 1 & $e^{\beta_0} e^{\beta_1}$ \\ \hline Non-smokers & 0 & $e^{\beta_0}$ \\ \hline \end{tabular} \end{center}} \noindent Now it's easy to see that the odds ratio is \begin{displaymath} \frac{\mbox{Odds of death given smoker}}{\mbox{Odds of death given nonsmoker}} = \frac{e^{\beta_0} e^{\beta_1}}{e^{\beta_0}} = e^{\beta_1}. \end{displaymath}

Our understanding of the regression coefficient $\beta_1$ follows from several properties of the function $f(t)=e^t$. \begin{itemize} \item $e^t$ is always positive. This is good because odds are non-negative, but the fact that $e^t$ is never zero reminds us that the logistic regression model cannot accommodate events of probability zero or one. \item $e^0=1$. So when $\beta_1=0$, the odds ratio is one. That is, the odds of $Y=1$ (and hence the probability that $Y=1$) are the same when $X=0$ and $X=1$. That is, the conditional distribution of $Y$ is identical for both values of $X$, meaning that $X$ and $Y$ are unrelated. \item $f(t)=e^t$ is an increasing function. So, when $\beta_1$ is negative, $e^{\beta_1}<1$. Therefore, the probability of $Y=1$ would be \emph{less} when $X=1$. But if $\beta_1$ is positive, then the odds ratio is greater than one, and the probability of $Y=1$ would be greater when $X=1$, as in our example. In this sense, the sign of $\beta_1$ tells us the direction of the relationship between $X$ and $Y$ --- just as in ordinary regression.
\end{itemize} % R code % t <- seq(from=-3,to=3,by=.1) % y <- exp(t) % plot(t,y,type='l') % title("The Exponential Function y = e^t") \begin{center} \includegraphics[width=4in]{exponential} \end{center} It should be clear that all this discussion applies when \emph{any} single explanatory variable is increased by one unit; the increase does not have to be from zero to one. Now suppose that there are several explanatory variables. We hold all variables constant except $x_k$, and form an odds ratio. In the numerator is the odds of $Y=1$ when $x_k$ is increased by one unit, and in the denominator is the odds of $Y=1$ when $x_k$ is left alone. Both numerator and denominator are products (see Equation~\ref{multmodel}) and there is a lot of cancellation in numerator and denominator. We are left with $e^{\beta_k}$. These calculations are a lot like the ones shown in~(\ref{holdconst}) for regular regression; they will not be repeated here. But the conclusion is this. \emph{When $x_k$ is increased by one unit, and all other explanatory variables are held constant, the odds of $Y=1$ are multiplied by $e^{\beta_k}$.} \paragraph{``Analysis of covariance" with a binary outcome} Here is one more example. Suppose the cases are patients with cancer, and we are comparing three treatments -- radiation, chemotherapy and both. There is a single quantitative variable $X$, representing severity of the disease (a clinical judgement by the physician). The response variable is $Y=1$ if the patient is alive 12 months later, zero otherwise. The question is which treatment is most effective, controlling for severity of disease. Treatment will be represented by two indicator dummy variables: $d_1=1$ if the patient receives chemotherapy only, and $d_2=1$ if the patient receives radiation only. Odds of survival are shown in the table below. {\begin{center} \begin{tabular}{|l|c|c|c|} \hline \textbf{Treatment} &$d_1$&$d_2$& \textbf{Odds of Survival} = $e^{\beta_0}e^{\beta_1d_1}e^{\beta_2d_2}e^{\beta_3x}$ \\ \hline Chemotherapy & 1 & 0 &$e^{\beta_0}e^{\beta_1}e^{\beta_3x}$ \\ \hline Radiation & 0 & 1 &$e^{\beta_0}e^{\beta_2}e^{\beta_3x}$ \\ \hline Both & 0 & 0 &$e^{\beta_0}e^{\beta_3x}$ \\ \hline \end{tabular} \end{center}} For any given disease severity $x$, \begin{displaymath} \frac{\mbox{Survival odds with Chemo}}{\mbox{Survival odds with Both}} = \frac{e^{\beta_0}e^{\beta_1}e^{\beta_3x}}{e^{\beta_0}e^{\beta_3x}} = e^{\beta_1} \end{displaymath} and \begin{displaymath} \frac{\mbox{Survival odds with Radiation}}{\mbox{Survival odds with Both}} = \frac{e^{\beta_0}e^{\beta_2}e^{\beta_3x}}{e^{\beta_0}e^{\beta_3x}} = e^{\beta_2}. \end{displaymath} If $\beta_1=\beta_2=0$, then for any given level of disease severity, the odds of survival are the same in all three experimental conditions. So the test of $H_0: \beta_1=\beta_2=0$ would tell us whether, controlling for severity of disease, the three treatments differ in their effectiveness. \begin{quest} What would $\beta_1>0$ mean? \end{quest} \begin{answ} Allowing for severity of disease, chemotherapy alone yields a higher one-year survival rate than the combination treatment. This could easily happen. Chemotherapy drugs and radiation are both dangerous poisons. \end{answ} This example shows that as in ordinary regression, categorical explanatory variables may be represented by collections of dummy variables. 
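As a preview of how such a test might be set up in SAS, here is a minimal sketch using \texttt{proc logistic}. The data set name \texttt{cancer} and the variable names are hypothetical, and the \texttt{test} statement produces a Wald chi-square test; Wald tests are discussed later in this chapter.

\begin{verbatim}
/* Sketch only: assumes a data set called cancer with the variables
   alive (1 = alive at 12 months, 0 = not), the treatment dummy
   variables d1 and d2, and severity.  All names are made up.      */
proc logistic data=cancer;
     model alive(event='1') = d1 d2 severity;
     /* Wald test of H0: beta1 = beta2 = 0,
        controlling for severity of disease                        */
     treatment: test d1=0, d2=0;
run;
\end{verbatim}

The \texttt{test} statement here plays much the same role as the custom \texttt{test} statements in \texttt{proc reg}.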
Notice that in the table, the coefficient $\beta_3$ of disease severity is the same for all three treatments, so on the log odds scale the three regression lines are parallel. But parallel slopes on the log odds scale translate to \emph{proportional} odds -- like the odds of $Y=1$ for Group 1 are always 1.3 times the odds of $Y=1$ for Group 2, regardless of the value of $x$. How realistic this is will depend upon the particular application.

\section{Parameter Estimation by Maximum Likelihood}

Using formula~\ref{probformula} for the probability of $Y=1$ given the explanatory variable values, it is possible to calculate the probability of observing the data we did observe, for any set of $\beta$ values. One of R. A. Fisher's many good suggestions was to take as our estimates of $\beta_0$, $\beta_1$ and so forth, those values that make the probability of getting the data we actually \emph{did} observe as large as possible. Viewed as a function of the parameter values, the probability that we will get the data we actually did get is called the \emph{likelihood}. The parameter values that make this thing as big as possible are called \emph{maximum likelihood estimates}. Figure~\ref{likeplot} is a picture of this for one explanatory variable. The $\beta_0,\beta_1$ pair located right under the peak is our set of maximum likelihood estimates. Of course it's hard to visualize in higher dimensions, but the idea is the same.

\begin{figure}% [here]
\caption{Graph of the Likelihood Function for Simple Logistic Regression}
\label{likeplot}
\begin{center} \includegraphics[width=4in]{like} \end{center}
\end{figure}

In regular regression, maximum likelihood estimates are identical to least squares estimates, but not here (though they may be close for large samples). Also, the $\widehat{\beta}$ quantities can be calculated by an explicit formula for regular regression, while for logistic regression they need to be found numerically. That is, a program like SAS must calculate the likelihood function for a bunch of sets of $\beta$ values, and somehow find the top of the mountain. Numerical routines for maximum likelihood estimation essentially march uphill until they find a place where it is downhill in every direction. Then they stop.

For some statistical methods, the place you find this way could be a so-called ``local maximum," something like the top of a foothill. You don't know you're not at the top of the highest peak, because you're searching blindfolded, just walking uphill and hoping for the best. Fortunately, this cannot happen with logistic regression. There is only one peak, and no valleys. Start anywhere, walk uphill, and when it levels off you're at the top. This is true regardless of the particular data values and the number of explanatory variables.

\section{Chi-square tests}

As in regular regression, you can test hypotheses by comparing a full, or unrestricted model to a reduced, or restricted model. Typically the reduced model is the same as the full, except that it's missing one or more explanatory variables. But the reduced model may be restricted in other ways, for example by setting a collection of regression coefficients equal to one another, but not necessarily equal to zero.

There are many ways to test hypotheses in logistic regression; most are large-sample chi-square tests. Two popular ones are likelihood ratio tests and Wald tests.

\subsection{Likelihood ratio tests}

Likelihood ratio tests are based on a direct comparison of the likelihood of the observed data assuming the full model to the likelihood of the data assuming the reduced model.
Let ${\cal{L}}_F$ stand for the maximum probability (likelihood) of the observed data under the full model, and ${\cal{L}}_R$ stand for the maximum probability of the observed data under the reduced model. Dividing the latter quantity by the former yields a \emph{likelihood ratio}: $\frac{{\cal{L}}_R}{{\cal{L}}_F}$. It is the maximum probability of obtaining the sample data under the reduced model (null hypothesis), \emph{relative} to the maximum probability of obtaining the sample data under the full, or unrestricted model. As with regular regression, the model cannot fit the data better when it is more restricted, so the likelihood of the reduced model is always less than the likelihood of the full model. If it's a \emph{lot} less -- that is, if the observed data are a lot less likely assuming the reduced model than assuming the full model -- then this is evidence against the null hypothesis, and perhaps the null hypothesis should be rejected.

Well, if the likelihood ratio is small, then the natural log of the likelihood ratio is a big negative number, and minus the natural log of the likelihood ratio is a big positive number. So is twice minus the natural log of the likelihood ratio. It turns out that if the null hypothesis is true and the sample size is large, then the quantity \begin{displaymath} G = -2 \ln \left( \frac{{\cal{L}}_R}{{\cal{L}}_F} \right) \end{displaymath} has an approximate chi-square distribution, with degrees of freedom equal to the number of non-redundant restrictions that the null hypothesis places on the set of $\beta$ parameters. For example, if three regression coefficients are set to zero under the null hypothesis, the degrees of freedom equal three.

\subsection{Wald tests}

You may recall that the Central Limit Theorem says that even when data come from a non-normal distribution, the sampling distribution of the sample mean is approximately normal for large samples. The Wald tests are based on a kind of Central Limit Theorem for maximum likelihood estimates. Under very general conditions that include logistic regression, a collection of maximum likelihood estimates has an approximate multivariate normal distribution, with means approximately equal to the parameters, and a variance-covariance matrix that has a complicated form, but can be calculated (or approximated as a by-product of the most common types of numerical maximum likelihood). This was discovered and proved by Abraham Wald, and is the basis of the Wald tests. It is pretty remarkable that he was able to prove this even for maximum likelihood estimates with no explicit formula. Wald was quite a guy. Anyway, if the null hypothesis is true, then a certain sum of squares of the maximum likelihood estimates has a large sample chi-square distribution. The degrees of freedom are the same as for the likelihood ratio tests, and for large enough sample sizes, the numerical values of the two test statistics get closer and closer.

SAS makes it convenient to do Wald tests and inconvenient to do most likelihood ratio tests, so we'll stick to the Wald tests in this course.
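Either way, the test statistic is referred to a chi-square distribution, and just as with the $F$ distribution earlier, \texttt{proc iml} can serve as the calculator. In the sketch below the value of the test statistic and its degrees of freedom are made up; in practice the statistic would either be $G$, computed from two maximized log likelihoods on the printout, or a Wald chi-square reported directly by SAS.

\begin{verbatim}
proc iml;
     /* Hypothetical chi-square test statistic and degrees of freedom.
        The statistic could be G = -2 ln(LR/LF), or a Wald chi-square
        taken from the output.                                        */
     chisq = 11.72;                   /* Made-up value of the statistic */
     df    = 3;                       /* Number of restrictions on the betas */
     crit  = cinv(.95, df);           /* Critical value at the 0.05 level */
     pval  = 1 - probchi(chisq, df);  /* p-value */
     print crit chisq pval;
quit;
\end{verbatim}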
\section{Logistic Regression with SAS} \section{Outcomes with more than two categories} % Logistic regression with more than 2 outcomes: latexit work % Model for three categories % \ln\left(\frac{\pi_1}{\pi_3} \right ) & = & % \beta_{0,1} + \beta_{1,1} x_1 + \ldots + \beta_{p-1,1} x_{p-1} \\ \\ % \ln\left(\frac{\pi_2}{\pi_3} \right ) & = & % \beta_{0,2} + \beta_{1,2} x_1 + \ldots + \beta_{p-1,2} x_{p-1} % Meaning of the regression coefficients % \ln\left(\frac{\pi_1}{\pi_3} \right ) & = & L_1 \\ \\ % \ln\left(\frac{\pi_2}{\pi_3} \right ) & = & L_2 % Solve for the probabilities % \frac{\pi_1}{\pi_3} & = & e^{L_1} \\ \\ % \frac{\pi_2}{\pi_3} & = & e^{L_2} % \pi_1 & = & \pi_3 e^{L_1} \\ \\ % \pi_2 & = & \pi_3 e^{L_2} % \pi_1 & = & \pi_3 e^{L_1} \\ \\ % \pi_2 & = & \pi_3 e^{L_2} \\ \\ % \pi_1 + \pi_2 + \pi_3 & = & 1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Three linear equations in 3 unknowns % \pi_1 & = & \frac{e^{L_1}}{1+\sum_{j=1}^{k-1}e^{L_j}} \\ % \pi_2 & = & \frac{e^{L_2}}{1+\sum_{j=1}^{k-1}e^{L_j}} \\ % \pi_3 & = & \frac{1}{1+\sum_{j=1}^{k-1}e^{L_j}} % \pi_1 & = & \pi_k e^{L_1} \\ % & \vdots & \\ % \pi_{k-1} & = & \pi_k e^{L_{k-1}} \\ % \pi_1 + \cdots + \pi_k & = & 1 % Solution % \pi_1 & = & \frac{e^{L_1}}{1+e^{L_1}+e^{L_2}} \\ \\ % \pi_2 & = & \frac{e^{L_2}}{1+e^{L_1}+e^{L_2}} \\ \\ % \pi_k & = & \frac{1}{1+e^{L_1}+e^{L_2}} \section{Scheff\'e-like Tests for Logistic Regression} For logistic regression, there are Scheff\'e-like follow-up tests called \emph{union-intersection} tests. The primary source for union-intersection multiple comparisons is Gabriel's (1969) article~\cite{gabriel69}. Hochberg and Tamhane's (1987) monograph \emph{Multiple comparison procedures}~\cite{ht87} present Gabriel's discovery in an appendix. The true Scheff\'e tests are a special kind of union-intersection method that applies to the (multivariate) normal linear model. Scheff\'e tests have one property that is not true of union-intersection follow-ups in general: the guaranteed existence of a significant one-degree-of-freedom test. This is tied to geometric properties of the multivariate normal distribution. Just as in normal regression, suppose the initial null hypothesis is that $r$ coefficients in the logistic regression model are all equal to zero. We will follow up by testing whether $s$ linear combinations of these regression coefficients are different from zero; $s \leq r$. \textbf{The critical value for the follow-up tests is exactly that of the initial test: a chi-square with $r$ degrees of freedom}. This principle applies to both likelihood ratio and Wald tests. In fact, it is true of likelihood ratio and Wald tests in general, not just for logistic regression. Theoretically, the family of union-intersection follow-ups is embedded in the initial test, and it does not inflate the Type I error rate at all to take a look. \chapter{Factorial Analysis of Variance} \label{FACTORIALANOVA} \section{Concepts} % Feel free to repeat some of this in the SAS part A \emph{factor} is just another name for a categorical explanatory variable. The term is usually used in experimental studies with more than one categorical explanatory variable, where cases (subjects, patients, experimental units) are randomly assigned to treatment conditions that represent \emph{combinations} of the explanatory variable values. 
For example, consider an agricultural study in which the cases are plots of land (small fields), the response variable is crop yield in kilograms, and the explanatory variables are fertilizer type (three values) and type of irrigation (Sprinkler \emph{versus} Drip). Table~\ref{farmfactorial1} shows the six treatment combinations, one for each \emph{cell} of the table.

\begin{table}% [here]
\caption{A Two-Factor Design}
\begin{center} \begin{tabular}{|l|c|c|c|} \hline & Fertilizer $1$ & Fertilizer $2$ & Fertilizer $3$ \\ \hline Sprinkler Irrigation & & & \\ \hline Drip Irrigation & & & \\ \hline \end{tabular}\end{center}
\label{farmfactorial1}
\end{table}

Table \ref{farmfactorial1} is an example of a \emph{complete} factorial design, in which data are collected for all combinations of the explanatory variable values. In an \emph{incomplete}, or \emph{fractional} factorial design, certain treatment combinations are deliberately omitted, leading to $n=0$ in one or more cells. When done in an organized way\footnote{If it is safe to assume that certain contrasts of the treatment means equal zero, it is often possible to estimate and test other contrasts of interest even with zero observations in some cells. The feasibility of substituting \emph{assumptions} for missing data is an illustration of Data Analysis Hint~\ref{tradeoff} on page~\pageref{tradeoff}.}, this practice can save quite a bit of money --- say, in a crash test study where the cases are automobiles. In this course, we shall mostly confine our attention to complete factorial designs.

Naturally, a factorial study can have more than two factors. The only limitations are imposed by time and budget. And there is more than one vocabulary floating around\footnote{This is typical. There are different \emph{dialects} of Statistics, corresponding roughly to groups of users from different disciplines. These groups tend not to talk with one another, and often each one has its own tame experts. So the language they use, since it develops in near isolation, tends to diverge in minor ways.}. A three-factor design can also be described as a three-\emph{way} design; there is one ``way" for each dimension of the table of treatment means.

When Sir Ronald Fisher (in whose honour the $F$-test is named) dreamed up factorial designs, he pointed out that they enable the scientist to investigate the effects of several explanatory variables at much less expense than if a separate experiment had to be conducted to test each one. In addition, they allow one to ask systematically whether the effect of one explanatory variable depends on the value of another explanatory variable. If the effect of one explanatory variable depends on another, we will say there is an \emph{interaction} between those variables. This kind of ``it depends" conclusion is a lot easier to see when both factors are systematically varied in the same study. Otherwise, one might easily think that the results of two studies carried out under somewhat different conditions were inconsistent with one another. We talk about an $A$ ``by" $B$ or $A \times B$ interaction. Again, an interaction means ``it depends."

A common beginner's mistake is to confuse the idea of an \emph{interaction} between variables with the idea of a \emph{relationship} between variables. They are different. Consider a version of Table~\ref{farmfactorial1} in which the cases are farms and the study is purely observational.
A \emph{relationship} between Irrigation Type and Fertilizer Type would mean that farms using different types of fertilizer tend to use different irrigation systems; in other words, the percentage of farms using Drip irrigation would not be the same for Fertilizer Types $1$, $2$ and $3$. This is something that you might assess with a chi-square test of independence. But an \emph{interaction} between Irrigation Type and Fertilizer Type would mean that the effect of Irrigation Type on average crop yield \emph{depends} on the kind of fertilizer used. As we will see, this is equivalent to saying that certain contrasts of the treatment means are not all equal to zero.

\subsection{Main Effects and Interactions as Contrasts} \label{EFFECTSASCONTRASTS}

\paragraph{Testing for main effects by testing contrasts} Table~\ref{farmfactorial2} is an expanded version of Table~\ref{farmfactorial1}. In addition to population crop yield for each treatment combination (denoted by $\mu_1$ through $\mu_6$), it shows \emph{marginal means} -- quantities like $\frac{\mu_1+\mu_4}{2}$, which are obtained by averaging over rows or columns. If there are differences among marginal means for a categorical explanatory variable in a two-way (or higher) layout like this, we say there is a main effect for that variable. Tests for main effects are of great interest; they can indicate whether, averaging over the values of the other categorical explanatory variables in the design, the explanatory variable in question is related to the response variable. Note that averaging over the values of other explanatory variables is not the same thing as controlling for them, but it can still be very informative.

\begin{table}% [here]
\caption{A Two-Factor Design with Population Means}
\begin{center} \begin{tabular}{c|c|c|c||c} %\hline
 & \multicolumn{3}{c||}{\textbf{Fertilizer}} & \\ \hline \textbf{Irrigation} & $1$ & $2$ & $3$ & \\ \hline Sprinkler & $\mu_1$ & $\mu_2$ & $\mu_3$ & $\frac{\mu_1+\mu_2+\mu_3}{3}$ \\ \hline Drip & $\mu_4$ & $\mu_5$ & $\mu_6$ & $\frac{\mu_4+\mu_5+\mu_6}{3}$ \\ \hline\hline & $\frac{\mu_1+\mu_4}{2}$ & $\frac{\mu_2+\mu_5}{2}$ & $\frac{\mu_3+\mu_6}{2}$ & \\ % \hline
\end{tabular}\end{center}
\label{farmfactorial2}
\end{table}

Notice how any difference between marginal means corresponds to a \emph{contrast} of the treatment means. It helps to string out all the combinations of factor levels into one long categorical explanatory variable. Let's call this a \emph{combination variable}. For the crop yield example of Tables~\ref{farmfactorial1} and~\ref{farmfactorial2}, the combination variable has six values, corresponding to the six treatment means $\mu_1$ through $\mu_6$ in the table. Suppose we wanted to test whether, averaging across fertilizer types, the two irrigation methods result in different average crop yield. This is another way of saying we want to test for difference between two different marginal means.

\begin{quest} \label{maineffectirr} \end{quest} \noindent For the crop yield study of Table~\ref{farmfactorial2}, suppose we wanted to know whether, averaging across different fertilizers, method of irrigation is related to average crop yield. \begin{enumerate} \item Give the null hypothesis in symbols. \item Make a table showing the weights of the contrast or contrasts of treatment means you would test to answer the question. There should be one row for each contrast. The null hypothesis will be that all the contrasts equal zero.
\end{enumerate} \begin{answ} \end{answ} \begin{enumerate} \item $\frac{\mu_1+\mu_2+\mu_3}{3}=\frac{\mu_4+\mu_5+\mu_6}{3}$ \item \begin{tabular}{|r|r|r|r|r|r|} \hline $a_1$ & $a_2$ & $a_3$ & $a_4$ & $a_5$ & $a_6$ \\ \hline\hline 1 & 1 & 1 & -1 & -1 & -1 \\ \hline \end{tabular} \end{enumerate}

\begin{quest} \end{quest} \noindent Suppose we wanted to test for the main effect(s) of Irrigation Type. \begin{enumerate} \item Give the null hypothesis in symbols. \item Make a table showing the weights of the contrast or contrasts of treatment means you would test to answer the question. There should be one row for each contrast. The null hypothesis will be that all the contrasts equal zero. \end{enumerate} \begin{answ} \end{answ} This is the same as Sample Question~\ref{maineffectirr}, and has the same answer.

\begin{quest}\label{maineffectfert} \end{quest} \noindent Suppose we wanted to know whether, averaging across different methods of irrigation, type of fertilizer is related to average crop yield. \begin{enumerate} \item Give the null hypothesis in symbols. \item Make a table showing the weights of the contrast or contrasts of treatment means you would test to answer the question. There should be one row for each contrast. The null hypothesis will be that all the contrasts equal zero. \end{enumerate} \begin{answ} \end{answ} \begin{enumerate} \item $\frac{\mu_1+\mu_4}{2} = \frac{\mu_2+\mu_5}{2} = \frac{\mu_3+\mu_6}{2}$ \item \begin{tabular}{|r|r|r|r|r|r|} \hline $a_1$ & $a_2$ & $a_3$ & $a_4$ & $a_5$ & $a_6$ \\ \hline\hline 1 & -1 & 0 & 1 & -1 & 0 \\ \hline 0 & 1 & -1 & 0 & 1 & -1 \\ \hline \end{tabular} \end{enumerate}

%\noindent
In the answers to Sample Questions~\ref{maineffectirr} and~\ref{maineffectfert}, notice that we are testing differences between marginal means, and the number of contrasts is equal to the number of equals signs in the null hypothesis.

\paragraph{Testing for interactions by testing contrasts} Now we will see that tests for interactions --- that is, tests for whether the effect of a factor \emph{depends} on the level of another factor --- can also be expressed as tests of contrasts. For the crop yield example, consider this question: Does the effect of Irrigation Type depend on the type of fertilizer used? For Fertilizer Type $1$, the effect of Irrigation Type is represented by $\mu_1-\mu_4$. For Fertilizer Type $2$, it is represented by $\mu_2-\mu_5$, and for Fertilizer Type $3$, the effect of Irrigation Type is $\mu_3-\mu_6$. Thus the null hypothesis of \emph{no} interaction may be written \begin{equation}\label{interactionH0a} H_0: \mu_1-\mu_4 = \mu_2-\mu_5 = \mu_3-\mu_6. \end{equation} Because it contains two equals signs, the null hypothesis~(\ref{interactionH0a}) is equivalent to saying that two contrasts of the treatment means are equal to zero. Here are the weights of the contrasts, in tabular form. \begin{center} \begin{tabular}{|r|r|r|r|r|r|} \hline $a_1$ & $a_2$ & $a_3$ & $a_4$ & $a_5$ & $a_6$ \\ \hline\hline 1 & -1 & 0 & -1 & 1 & 0 \\ \hline 0 & 1 & -1 & 0 & -1 & 1 \\ \hline \end{tabular} \end{center}

One way of saying that there is an \emph{interaction} between Irrigation Method and Fertilizer Type is to say that the effect of Irrigation Method depends on Fertilizer Type, and now it is clear how to set up the null hypothesis. But what if the interaction were expressed in the opposite way, by saying that the effect of Fertilizer Type depends on Irrigation Method? It turns out these two ways of expressing the concept are 100\% equivalent.
\subsection{Graphing Interactions}

Figure~\ref{IPlot1} shows a hypothetical pattern of population treatment means. There are main effects for both factors, but no interaction.

\begin{figure}% [here]
\caption{Main Effects But No Interaction}
\begin{center}
\includegraphics[width=4in]{CropYieldInteractionPlot1}
\end{center}
\label{IPlot1}
\end{figure}

For each irrigation method, the effect of fertilizer type corresponds to a \emph{profile} -- a curve showing the pattern of means for the various fertilizer types. If the profiles are parallel, then the effects of fertilizer type are the same within each irrigation method. In Figure~\ref{IPlot1}, the profiles are parallel, meaning there is no interaction. Of course Fertilizer Type is a nominal scale variable; it consists of unordered categories. Still, even though there is nothing in between Fertilizer Types 1 and 2 or between 2 and 3, it helps visually to connect the dots.

There are two natural ways to express the parallel profiles in Figure~\ref{IPlot1}. One way is to say that the distance between the curves is the same at every point along the Fertilizer Type axis. This directly gives the null hypothesis in Expression~(\ref{interactionH0a}). The other way for the profiles to be parallel is for the line segments connecting the means for Fertilizer Types 1 and 2 to have the same slope, \emph{and} for the line segments connecting the means for Fertilizer Types 2 and 3 to have the same slope. That is,
\begin{equation}\label{interactionH0b}
H_0: \mu_2-\mu_1 = \mu_5-\mu_4 \mbox{ and } \mu_3-\mu_2 = \mu_6-\mu_5.
\end{equation}
The first statement in Expression~(\ref{interactionH0b}) may easily be re-arranged to yield $\mu_2-\mu_5 = \mu_1-\mu_4$, while the second statement may be re-arranged to yield $\mu_3-\mu_6 = \mu_2-\mu_5$. Thus, the null hypotheses in Expressions~(\ref{interactionH0a}) and~(\ref{interactionH0b}) are algebraically equivalent. They are just different ways of writing the same null hypothesis, and it doesn't matter which one you use. Fortunately, this is a very general phenomenon.
%\emph{Regardless of the number of factors or the number of levels for each factor, all correct ways of writing interactions in terms of contrasts are mathematically equivalent}.

\subsection{Higher order designs (More than two factors)}

The extension to more than two factors is straightforward. Suppose that for each combination of Irrigation Method and Fertilizer Type, a collection of plots was randomly assigned to several different types of pesticide (weed killer). Then we would have three factors: Irrigation Method, Fertilizer Type and Pesticide Type.
\begin{itemize}
\item For each explanatory variable, averaging over the other two variables would give marginal means -- the basis for estimating and testing for main effects. That is, there are three (sets of) main effects: one for Irrigation method, one for Fertilizer type, and one for Pesticide type.
\item Averaging over each of the explanatory variables in turn, we would have a two-way marginal table of means for the other two variables, and the pattern of means in that table could show a two-way interaction. That is, there are three two-factor interactions: Irrigation by Fertilizer, Irrigation by Pesticide, and Fertilizer by Pesticide.
\end{itemize}
The full three-dimensional table of means would provide a basis for looking at a three-way, or three-factor interaction.
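Before returning to the interpretation of such a design, it may help to see how compactly a three-factor model can be specified in SAS. In \texttt{proc glm}, joining factor names with vertical bars requests all the main effects and all the interactions; this is the same syntax used for the two-factor greenhouse study later in this chapter. Here is a minimal sketch, in which the data set \texttt{farm3} and the variable names \texttt{water}, \texttt{fert}, \texttt{pest} and \texttt{yield} are hypothetical.
\begin{verbatim}
proc glm data=farm3;
     class water fert pest;
     /* water|fert|pest expands to water fert pest water*fert
        water*pest fert*pest water*fert*pest */
     model yield = water|fert|pest;
\end{verbatim}
The output would include an $F$-test for each of the three main effects, each of the three two-factor interactions, and the three-factor interaction.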
The interpretation of a three-way interaction is that the nature of the two-way interaction depends on the value of the third variable. This principle extends to any number of factors, so we would interpret a six-way interaction to mean that the nature of the 5-way interaction depends on the value of the sixth variable. How would you graph a three-factor interaction? For each value of the third factor, make a separate two-factor plot like Figure~\ref{IPlot1}.

Fortunately, the order in which one considers the variables does not matter. For example, we can say that the A by B interaction depends on the value of C, or that the A by C interaction depends on the value of B, or that the B by C interaction depends on the value of A. The translations of these statements into algebra are all equivalent to one another, and lead to exactly the same test statistics and $p$-values for any set of data, always. Here are the three ways of describing the three-factor interaction for the Crop Yield example.
\begin{itemize}
\item The nature of the Irrigation method by Fertilizer type interaction depends on the type of Pesticide.
\item The nature of the Irrigation method by Pesticide type interaction depends on the type of Fertilizer.
\item The nature of the Pesticide type by Fertilizer interaction depends on the Irrigation method.
\end{itemize}
Again, these statements are all equivalent. Use the one that is easiest to think about and talk about. This principle extends to any number of factors.

As you might imagine, things get increasingly complicated as the number of factors becomes large. For a four-factor design, there are
\begin{itemize}
\item Four (sets of) main effects
\item Six two-factor interactions
\item Four three-factor interactions
\item One four-factor interaction; the nature of the three-factor interaction depends on the value of the 4th factor \ldots
\item There is an $F$-test for each one
\end{itemize}
Also, interpreting higher-way interactions -- that is, figuring out what they mean -- becomes more and more difficult for experiments with large numbers of factors. I once knew a Psychology graduate student who obtained a significant 5-way interaction when she analyzed the data for her Ph.D. thesis. Nobody could understand it, so she disappeared for a week. When she came back, she said ``I've got it!" But nobody could understand her explanation. For reasons like this, sometimes the higher-order interactions are deliberately omitted from the full model in big experimental designs; they are never tested. Is this reasonable? Most of my answers are just elaborate ways to say I don't know.

Regardless of how many factors we have, or how many levels there are in each factor, one can always form a combination variable -- that is, a single categorical explanatory variable whose values represent all the combinations of explanatory variable values in the factorial design. Then, tests for main effects and interactions appear as tests for collections of contrasts on the combination variable. This is helpful, for at least three reasons.
\begin{enumerate}
\item Thinking of an interaction as a collection of contrasts can really help you understand what it \emph{means}. And especially for big designs, you need all the help you can get.
\item Once you have seen the tests for main effects and interactions as collections of contrasts, it is straightforward to compose a test for any collection of effects (or components of an effect) that is of interest.
\item Seeing main effects and interactions in terms of contrasts makes it easy to see how they can be modified to become Bonferroni or Scheff\'e follow-ups to an initial significant one-way ANOVA on the combination variable --- if you choose to follow this conservative data analytic strategy.
\end{enumerate}

\subsection{Effect coding} \label{EFFECTCODINGANOVA}

While it is helpful to think of main effects and interactions in terms of contrasts, the details become unpleasant for designs with more than two factors. The combination variables become \emph{long}, and thinking of interactions as collections of differences between differences of differences can give you a headache. An alternative is to use a regression model with dummy variable coding. For almost any regression model with interactions between categorical explanatory variables, the easiest dummy variable coding scheme is \emph{effect coding}.

Recall from Section~\ref{EFFECTCODING} (see page~\pageref{EFFECTCODING}) that effect coding is just like indicator dummy variable coding with an intercept, except that the last category gets a minus one instead of a zero. For a single categorical explanatory variable (factor), the regression coefficients are deviations of the treatment means from the \emph{grand mean}, or mean of treatment means. Thus, the regression coefficients are exactly the \emph{effects} as described in standard textbooks on the analysis of variance. For the two-factor Crop Yield study of Table~\ref{farmfactorial1} on page~\pageref{farmfactorial1}, here is how the effect coding dummy variables would be defined for Fertilizer type and Irrigation method (Water).
\begin{center}
\begin{tabular}{|c|r|r|} \hline
Fertilizer & $f_1$ & $f_2$ \\ \hline\hline
1 & 1 & 0 \\ \hline
2 & 0 & 1 \\ \hline
3 & -1 & -1 \\ \hline
\end{tabular} ~~~~~~~~
\begin{tabular}{|c|r|} \hline
Water & $w$ \\ \hline\hline
Sprinkler & 1 \\ \hline
Drip & -1 \\ \hline
\end{tabular}
\end{center}
As in the quantitative by quantitative case (page~\pageref{QUANTBYQUANT}) and the quantitative by categorical case (page~\pageref{QUANTBYCAT}), the interaction effects are the regression coefficients corresponding to \emph{products} of explanatory variables. For a two-factor design, the products come from multiplying each dummy variable for one factor by each dummy variable for the other factor. You \emph{never} multiply dummy variables for the same factor with each other. Here is the regression equation for conditional expected crop yield.
\begin{displaymath}
E[Y|\mathbf{X}] = \beta_0 + \beta_1 f_1 + \beta_2 f_2 + \beta_3 w + \beta_4 f_1w + \beta_5 f_2w
\end{displaymath}
The last two explanatory variables are quite literally the products of the dummy variables for Fertilizer type and Irrigation method. To understand what we have, let's make a table showing the conditional expected value of the dependent variable for each treatment combination.
\begin{table}% [here]
\caption{Expected values in terms of regression coefficients with effect coding: Crop yield study}
\begin{center}
\begin{tabular}{|c|r|r|r|r|r|r|c|} \hline
Fertilizer & Water & $f_1$ & $f_2$ & $w$ & $f_1w$ & $f_2w$ & $E[Y|\mathbf{X}]$ \\ \hline \hline
1 & Sprinkler &  1 &  0 &  1 &  1 &  0 & $\beta_0+\beta_1+\beta_3+\beta_4$ \\ \hline
1 & Drip      &  1 &  0 & -1 & -1 &  0 & $\beta_0+\beta_1-\beta_3-\beta_4$ \\ \hline
2 & Sprinkler &  0 &  1 &  1 &  0 &  1 & $\beta_0+\beta_2+\beta_3+\beta_5$ \\ \hline
2 & Drip      &  0 &  1 & -1 &  0 & -1 & $\beta_0+\beta_2-\beta_3-\beta_5$ \\ \hline
3 & Sprinkler & -1 & -1 &  1 & -1 & -1 & $\beta_0-\beta_1-\beta_2+\beta_3-\beta_4-\beta_5$ \\ \hline
3 & Drip      & -1 & -1 & -1 &  1 &  1 & $\beta_0-\beta_1-\beta_2-\beta_3+\beta_4+\beta_5$ \\ \hline
\end{tabular}
\end{center}
\label{effcodingwork} % Label must come after the table or numbering is wrong.
\end{table}

That's correct but not very informative, yet. In Table~\ref{farmfactorial3}, the means are arranged in a row by column form like Table~\ref{farmfactorial2}, except that rows and columns are transposed because it fits better on the page that way.

\begin{table}% [here]
\caption{Cell and marginal means in terms of regression coefficients with effect coding}
\begin{center}
\begin{tabular}{c|l|l||l}
%\hline
 & \multicolumn{2}{c||}{\textbf{Irrigation}} & \\ \hline
\textbf{Fert} & ~~~~~~~~~~Sprinkler & ~~~~~~~~~~~~Drip & \\ \hline
$1$ & $\mu_1 = \beta_0+\beta_1+\beta_3+\beta_4$ & $\mu_4 = \beta_0+\beta_1-\beta_3-\beta_4$ & $\frac{\mu_1+\mu_4}{2} = \beta_0+\beta_1$ \\ \hline
$2$ & $\mu_2 = \beta_0+\beta_2+\beta_3+\beta_5$ & $\mu_5 = \beta_0+\beta_2-\beta_3-\beta_5$ & $\frac{\mu_2+\mu_5}{2} = \beta_0+\beta_2$ \\ \hline
$3$ & $\mu_3 = \beta_0-\beta_1-\beta_2+\beta_3-\beta_4-\beta_5$ & $\mu_6 = \beta_0-\beta_1-\beta_2-\beta_3+\beta_4+\beta_5$ & $\frac{\mu_3+\mu_6}{2} = \beta_0-\beta_1-\beta_2$ \\ \hline\hline
 & $\frac{\mu_1+\mu_2+\mu_3}{3} = \beta_0+\beta_3$ & $\frac{\mu_4+\mu_5+\mu_6}{3} = \beta_0-\beta_3$ & $\frac{1}{6}\sum_{j=1}^6 \mu_j = \beta_0$ \\
% \hline
\end{tabular}\end{center}
\label{farmfactorial3}
\end{table}

Immediately, it is clear what $\beta_0, \beta_1, \beta_2$ and $\beta_3$ mean.
\begin{itemize}
\item The intercept $\beta_0$ is the \emph{grand mean} --- the mean of (population) treatment means. It is also the mean of the marginal means, averaging over either rows or columns.
\item $\beta_1$ is the difference between the marginal mean for Fertilizer Type 1 and the grand mean.
\item $\beta_2$ is the difference between the marginal mean for Fertilizer Type 2 and the grand mean.
\item So $\beta_1$ and $\beta_2$ are main effects for Fertilizer Type\footnote{Technically, there is a third main effect for Fertilizer Type: $-\beta_1-\beta_2$. Any factor with $k$ levels has $k$ main effects that add up to zero.}. The marginal means for Fertilizer Type are equal if and only if $\beta_1=\beta_2=0$.
\item $\beta_3$ is the difference between the marginal mean for Irrigation by Sprinkler and the grand mean. And, $\beta_3=0$ if and only if the two marginal means for Irrigation method are equal.
\end{itemize}
Furthermore, the two remaining regression coefficients --- the ones corresponding to the product terms --- are interaction effects. On page~\pageref{interactionH0a}, the interaction between Irrigation method and Fertilizer type was expressed by saying that the effect of Irrigation method depended on Fertilizer type.
The null hypothesis was that the effect of Irrigation method was identical for the three Fertilizer types. In other words, from Equation~(\ref{interactionH0a}) we had
\begin{displaymath}
H_0: \mu_1-\mu_4 = \mu_2-\mu_5 = \mu_3-\mu_6.
\end{displaymath}
Using Table~\ref{farmfactorial3} and substituting for the $\mu$s in terms of $\beta$s, a little algebra shows that this null hypothesis is equivalent to
\begin{displaymath}
\beta_4=\beta_5 = -\beta_4-\beta_5.
\end{displaymath}
This, in turn, is equivalent to saying that $\beta_4=\beta_5=0$. So to test for an interaction, we just test whether the regression coefficients for the product terms equal zero.

\paragraph{General Rules}
Everything in this example generalizes nicely to an arbitrary number of factors.
\begin{itemize}
\item The regression model has an intercept.
\item Define effect coding dummy variables for each factor. If the factor has $k$ levels, there will be $k-1$ dummy variables. Each dummy variable has a one for one of the factor levels, minus one for the last level, and zero for the rest.
\item Form new explanatory variables that are products of the dummy variables. For any pair of factors $A$ and $B$, multiply each dummy variable for $A$ by each dummy variable for $B$.
\item If there are more than two factors, form all three-way products, 4-way products, and so on.
\item It's not hard to get all the products for a multifactor design without missing any. After you have calculated all the products for factors $A$ and $B$, take the dummy variables for factor $C$ and
\begin{itemize}
\item Multiply each dummy variable for $C$ by each dummy variable for $A$. These products correspond to the $A \times C$ interaction.
\item Multiply each dummy variable for $C$ by each dummy variable for $B$. These products correspond to the $B \times C$ interaction.
\item Multiply each dummy variable for $C$ by each $A \times B$ product. These three-variable products correspond to the $A \times B \times C$ interaction.
\end{itemize}
\item It is straightforward to extend the process, multiplying each dummy variable for a fourth factor $D$ by the dummy variables and products in the $A \times B \times C$ set. And so on.
\item To test main effects (differences between marginal means) for a factor, the null hypothesis is that the regression coefficients for that factor's dummy variables are all equal to zero.
\item For any two-factor interaction, test the regression coefficients corresponding to the two-way products. For three-factor interactions, test the three-way products, and so on.
\item Quantitative covariates may be included in the model, with or without interactions between covariates, or between covariates and factors. They work as expected. Multi-factor analysis of covariance is just a big multiple regression model.
\end{itemize}
% Some sample questions?
% Include one for regular indicator dummy var coding, or make HW.
% Does this still work with interaction omitted? I think so.
% Even count the dummmy variables, like a 3x4x2.
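To make the two-factor case concrete, here is a sketch of how the effect coding scheme above might be set up and tested in SAS. The data set \texttt{farm} and the variable names \texttt{fert}, \texttt{water} and \texttt{yield} are hypothetical, and the treatment of missing values is omitted for brevity; the greenhouse study later in this chapter (\texttt{green1.sas}) does the same thing on a larger scale, including the handling of missing values.
\begin{verbatim}
data farmeff;
     set farm;
     /* Effect coding dummy variables. The hypothetical codes are
        fert = 1, 2, 3 and water = 1 (Sprinkler) or 2 (Drip). */
     if fert=1  then f1 =  1; else if fert=3 then f1 = -1; else f1 = 0;
     if fert=2  then f2 =  1; else if fert=3 then f2 = -1; else f2 = 0;
     if water=1 then w  =  1; else w = -1;
     /* Product terms for the interaction */
     f1w = f1*w;  f2w = f2*w;
proc reg;
     model yield = f1 f2 w f1w f2w;
     fert:   test f1=f2=0;     /* Main effect of Fertilizer type       */
     water:  test w=0;         /* Main effect of Irrigation method     */
     f_by_w: test f1w=f2w=0;   /* Fertilizer by Irrigation interaction */
\end{verbatim}
The three $F$-tests correspond to the null hypotheses $\beta_1=\beta_2=0$, $\beta_3=0$ and $\beta_4=\beta_5=0$ respectively, matching the discussion above.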
\section{Two-factor ANOVA with SAS: The Potato Data}
This was covered in class.
% This comment is just for the 2012 edition. Later, simulate data for the crop yield example and integrate it into the discussion.

\section{Another example: The Greenhouse Study} \label{GREENHOUSE}

This is an extension of the \emph{tubes} example (see page~\pageref{TUBES}) of Section~\ref{TUBES}. The seeds of the canola plant yield a high-quality cooking oil. Canola is one of Canada's biggest cash crops. But each year, millions of dollars are lost because of a fungus that kills canola plants. Or is it just one fungus? All this stuff looks the same. It's a nasty black rot that grows fastest under moist, warm conditions. It looks quite a bit like the fungus that grows in between shower tiles.

A team of botanists recognized that although the fungus may look the same, there are actually several different kinds that are genetically distinct. There are also quite a few strains of canola plant, so the questions arose
\begin{itemize}
\item Are some strains of fungus more aggressive than others? That is, do they grow faster and overwhelm the plant's defenses faster?
\item Are some strains of canola plant more vulnerable to infection than others?
\item Are some strains of fungus more dangerous to certain strains of plant and less dangerous to others?
\end{itemize}
These questions can be answered directly by looking at main effects and the interaction, so a factorial experiment was designed in which canola plants of three different varieties were randomly selected to be infected with one of six genetically different types of fungus. The way they did it was to scrape a little patch at the base of the plant, and wrap the wound with a moist band-aid that had some fungus on it. Then the plant was placed in a very moist dark environment for three days. After three days the bandage was removed and the plant was put in a commercial greenhouse. On each of 14 consecutive days, various measurements were made on the plant. Here, we will be concerned with lesion length, the length of the fungus patch on the plant, measured in millimeters. The response variable will be mean lesion length; the mean is over the 14 daily lesion length measurements for each plant.

The explanatory variables are Cultivar (type of canola plant) and MCG (type of fungus). Type of plant is called cultivar because the fungus grows (is ``cultivated'') on the plant. MCG stands for ``Mycelial Compatibility Group." This strange name comes from the way that the botanists decided whether two types of fungus were genetically distinct. They would grow two samples on the same dish in a nutrient solution, and if the two fungus patches stayed separate, they were genetically different. If they grew together into a single patch of fungus (that is, they were compatible), then they were genetically identical. Apparently, this phenomenon is well established.

Here is the SAS program \texttt{green1.sas}. As usual, the entire program is listed first. Then pieces of the program are repeated, together with pieces of output and discussion.
\begin{verbatim} /* green1.sas */ %include '/folders/myfolders/ghread.sas'; options pagesize=100; proc freq; tables plant*mcg /norow nocol nopercent; proc glm; class plant mcg; model meanlng = plant|mcg; means plant|mcg; proc tabulate; class mcg plant; var meanlng ; table (mcg all),(plant all) * (mean*meanlng); /* Replicate tests for main effects and interactions, using contrasts on a combination variable. This is the hard way to do it, but if you can do this, you understand interactions and you can test any collection of contrasts.
The definition of the variable combo could have been in ghread.sas */ data slime; set mould; /* mould was created by ghread91.sas */ if plant=1 and mcg=1 then combo = 1; else if plant=1 and mcg=2 then combo = 2; else if plant=1 and mcg=3 then combo = 3; else if plant=1 and mcg=7 then combo = 4; else if plant=1 and mcg=8 then combo = 5; else if plant=1 and mcg=9 then combo = 6; else if plant=2 and mcg=1 then combo = 7; else if plant=2 and mcg=2 then combo = 8; else if plant=2 and mcg=3 then combo = 9; else if plant=2 and mcg=7 then combo = 10; else if plant=2 and mcg=8 then combo = 11; else if plant=2 and mcg=9 then combo = 12; else if plant=3 and mcg=1 then combo = 13; else if plant=3 and mcg=2 then combo = 14; else if plant=3 and mcg=3 then combo = 15; else if plant=3 and mcg=7 then combo = 16; else if plant=3 and mcg=8 then combo = 17; else if plant=3 and mcg=9 then combo = 18; label combo = 'Plant-MCG Combo'; /* Getting main effects and the interaction with CONTRAST statements */ proc glm; class combo; model meanlng = combo; contrast 'Plant Main Effect' combo 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 0 0 0 0 0 0, combo 0 0 0 0 0 0 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1; contrast 'MCG Main Effect' combo 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1 0 0 0 0, combo 0 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1 0 0 0, combo 0 0 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1 0 0, combo 0 0 0 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1 0, combo 0 0 0 0 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1; contrast 'Plant by MCG Interaction' combo -1 1 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0, combo 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1 0 0 0 0, combo 0 -1 1 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0, combo 0 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1 0 0 0, combo 0 0 -1 1 0 0 0 0 1 -1 0 0 0 0 0 0 0 0, combo 0 0 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1 0 0, combo 0 0 0 -1 1 0 0 0 0 1 -1 0 0 0 0 0 0 0, combo 0 0 0 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1 0, combo 0 0 0 0 -1 1 0 0 0 0 1 -1 0 0 0 0 0 0, combo 0 0 0 0 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1; /* proc reg's test statement may be easier, but first we need to make 16 dummy variables for cell means coding. This will illustrate arrays and loops, too */ data yucky; set slime; array mu{18} mu1-mu18; do i=1 to 18; if combo=. then mu{i}=.; else if combo=i then mu{i}=1; else mu{i}=0; end; proc reg; model meanlng = mu1-mu18 / noint; alleq: test mu1=mu2=mu3=mu4=mu5=mu6=mu7=mu8=mu9=mu10=mu11=mu12 = mu13=mu14=mu15=mu16=mu17=mu18; plant: test mu1+mu2+mu3+mu4+mu5+mu6 = mu7+mu8+mu9+mu10+mu11+mu12, mu7+mu8+mu9+mu10+mu11+mu12 = mu13+mu14+mu15+mu16+mu17+mu18; fungus: test mu1+mu7+mu13 = mu2+mu8+mu14 = mu3+mu9+mu15 = mu4+mu10+mu16 = mu5+mu11+mu17 = mu6+mu12+mu18; p_by_f: test mu2-mu1=mu8-mu7=mu14-mu13, mu3-mu2=mu9-mu8=mu15-mu14, mu4-mu3=mu10-mu9=mu16-mu15, mu5-mu4=mu11-mu10=mu17-mu16, mu6-mu5=mu12-mu11=mu18-mu17; /* Now illustrate effect coding, with the interaction represented by a collection of product terms. */ data nasty; set yucky; /* Two dummy variables for plant */ if plant=. then p1=.; else if plant=1 then p1=1; else if plant=3 then p1=-1; else p1=0; if plant=. then p2=.; else if plant=2 then p2=1; else if plant=3 then p2=-1; else p2=0; /* Five dummy variables for mcg */ if mcg=. then f1=.; else if mcg=1 then f1=1; else if mcg=9 then f1=-1; else f1=0; if mcg=. then f2=.; else if mcg=2 then f2=1; else if mcg=9 then f2=-1; else f2=0; if mcg=. then f3=.; else if mcg=3 then f3=1; else if mcg=9 then f3=-1; else f3=0; if mcg=. then f4=.; else if mcg=7 then f4=1; else if mcg=9 then f4=-1; else f4=0; if mcg=. 
then f5=.; else if mcg=8 then f5=1; else if mcg=9 then f5=-1; else f5=0; /* Product terms for interactions */ p1f1 = p1*f1; p1f2=p1*f2 ; p1f3=p1*f3 ; p1f4=p1*f4; p1f5=p1*f5; p2f1 = p2*f1; p2f2=p2*f2 ; p2f3=p2*f3 ; p2f4=p2*f4; p2f5=p2*f5; proc reg; model meanlng = p1 -- p2f5; plant: test p1=p2=0; mcg: test f1=f2=f3=f4=f5=0; p_by_f: test p1f1=p1f2=p1f3=p1f4=p1f5=p2f1=p2f2=p2f3=p2f4=p2f5 = 0; \end{verbatim} The SAS program starts with a \texttt{\%include} statement that reads \texttt{ghread.sas}. The file \texttt{ghread.sas} consists of a single big data step. We'll skip it, because all we really need are the two explanatory variables \texttt{plant} and \texttt{mcg}, and the response variable \texttt{meanlng}. Just to see what we've got, we do a \texttt{proc freq} to show the sample sizes. \begin{verbatim} proc freq; tables plant*mcg /norow nocol nopercent; \end{verbatim} and we get \begin{verbatim} TABLE OF PLANT BY MCG PLANT(Type of Plant) MCG(Mycelial Compatibility Group) Frequency| 1| 2| 3| 7| 8| 9| Total ---------+--------+--------+--------+--------+--------+--------+ GP159 | 6 | 6 | 6 | 6 | 6 | 6 | 36 ---------+--------+--------+--------+--------+--------+--------+ HANNA | 6 | 6 | 6 | 6 | 6 | 6 | 36 ---------+--------+--------+--------+--------+--------+--------+ WESTAR | 6 | 6 | 6 | 6 | 6 | 6 | 36 ---------+--------+--------+--------+--------+--------+--------+ Total 18 18 18 18 18 18 108 \end{verbatim} So it's a nice 3 by 6 factorial design, with 6 plants in each treatment combination. The \texttt{proc glm} for analyzing this is straightforward. Again, we get all main effects and interactions for the factor names separated by vertical bars. \begin{verbatim} proc glm; class plant mcg; model meanlng = plant|mcg; means plant|mcg; \end{verbatim} And the output is \begin{verbatim} General Linear Models Procedure Class Level Information Class Levels Values PLANT 3 GP159 HANNA WESTAR MCG 6 1 2 3 7 8 9 Number of observations in data set = 108 ------------------------------------------------------------------------------- 1991 Greenhouse Study 3 General Linear Models Procedure Dependent Variable: MEANLNG Average Lesion length Sum of Mean Source DF Squares Square F Value Pr > F Model 17 328016.87350 19295.11021 19.83 0.0001 Error 90 87585.62589 973.17362 Corrected Total 107 415602.49939 R-Square C.V. Root MSE MEANLNG Mean 0.789256 48.31044 31.195731 64.573479 Source DF Type I SS Mean Square F Value Pr > F PLANT 2 221695.12747 110847.56373 113.90 0.0001 MCG 5 58740.26456 11748.05291 12.07 0.0001 PLANT*MCG 10 47581.48147 4758.14815 4.89 0.0001 Source DF Type III SS Mean Square F Value Pr > F PLANT 2 221695.12747 110847.56373 113.90 0.0001 MCG 5 58740.26456 11748.05291 12.07 0.0001 PLANT*MCG 10 47581.48147 4758.14815 4.89 0.0001 \end{verbatim} Notice that the Type I and Type III tests are the same. This always happens when the sample sizes are equal. Now we take a look at marginal means and cell (treatment) means. This is the output of the \texttt{means} statement of \texttt{proc glm}. 
\begin{verbatim} 1991 Greenhouse Study 4 General Linear Models Procedure Level of -----------MEANLNG----------- PLANT N Mean SD GP159 36 14.055159 12.1640757 HANNA 36 55.700198 30.0137912 WESTAR 36 123.965079 67.0180440 Level of -----------MEANLNG----------- MCG N Mean SD 1 18 41.4500000 33.6183462 2 18 92.1333333 78.3509451 3 18 87.5857143 61.7086751 7 18 81.7603175 82.6711755 8 18 50.8579365 39.3417859 9 18 33.6535714 39.1480830 Level of Level of -----------MEANLNG----------- PLANT MCG N Mean SD GP159 1 6 12.863095 12.8830306 GP159 2 6 21.623810 17.3001296 GP159 3 6 14.460714 7.2165396 GP159 7 6 17.686905 16.4258441 GP159 8 6 8.911905 7.3162618 GP159 9 6 8.784524 6.5970501 HANNA 1 6 45.578571 26.1430472 HANNA 2 6 67.296429 30.2424997 HANNA 3 6 94.192857 20.2877876 HANNA 7 6 53.621429 24.8563497 HANNA 8 6 47.838095 12.6419109 HANNA 9 6 25.673810 17.1723150 WESTAR 1 6 65.908333 35.6968616 WESTAR 2 6 187.479762 45.1992178 WESTAR 3 6 154.103571 26.5469183 WESTAR 7 6 173.972619 79.1793105 WESTAR 8 6 95.823810 22.3712022 WESTAR 9 6 66.502381 52.5253101 \end{verbatim}
The marginal means are fairly easy to look at, and we definitely can construct a plot from the 18 cell means (or copy them into a nicer-looking table). But the following \texttt{proc tabulate} does the grunt work. In general, it's usually preferable to get the computer to do clerical tasks for you, especially if it's something you might want to do more than once.
\begin{verbatim} proc tabulate; class mcg plant; var meanlng ; table (mcg all),(plant all) * (mean*meanlng); \end{verbatim}
The syntax of proc tabulate is fairly elaborate, but at times it's worth the effort. Any reader who has seen the type of stub-and-banner tables favoured by professional market researchers will be impressed to hear that proc tabulate can come close to that. I figured out how to make the table below by looking in the manual. I then promptly forgot the overall principles, because it's not a tool I use a lot -- and the syntax is rather arcane. However, this example is easy to follow if you want to produce good-looking two-way tables of means. Here's the output.
\begin{verbatim} ----------------------------------------------------------------------- | | Type of Plant | | | |--------------------------------------| | | | GP159 | HANNA | WESTAR | ALL | | |------------+------------+------------+------------| | | MEAN | MEAN | MEAN | MEAN | | |------------+------------+------------+------------| | | Average | Average | Average | Average | | | Lesion | Lesion | Lesion | Lesion | | | length | length | length | length | |-----------------+------------+------------+------------+------------| |Mycelial | | | | | |Compatibility | | | | | |Group | | | | | |-----------------| | | | | |1 | 12.86| 45.58| 65.91| 41.45| |-----------------+------------+------------+------------+------------| |2 | 21.62| 67.30| 187.48| 92.13| |-----------------+------------+------------+------------+------------| |3 | 14.46| 94.19| 154.10| 87.59| |-----------------+------------+------------+------------+------------| |7 | 17.69| 53.62| 173.97| 81.76| |-----------------+------------+------------+------------+------------| |8 | 8.91| 47.84| 95.82| 50.86| |-----------------+------------+------------+------------+------------| |9 | 8.78| 25.67| 66.50| 33.65| |-----------------+------------+------------+------------+------------| |ALL | 14.06| 55.70| 123.97| 64.57| ----------------------------------------------------------------------- \end{verbatim} The proc tabulate output makes it easy to graph the means. But before we do so, let's look at the main effects and interactions as collections of contrasts. This will actually make it easier to figure out what the results mean, once we see what they are. We have a three by six factorial design that looks like this. Population means are shown in the cells. The single-subscript notation encourages us to think of the combination of MCG and cultivar as a single categorical explanatory variable with 18 categories. \begin{table}% [here] \caption{Cell Means for the Greenhouse Study} \begin{center} \begin{tabular}{|c||c|c|c|c|c|c|} \hline & \multicolumn{6}{c||}{\textbf{MCG (Type of Fungus)}} \\ \hline \hline \textbf{Cultivar (Type of Plant)} & 1 & 2 & 3 & 7 & 8 & 9 \\ \hline GP159 & $\mu_1$ & $\mu_2$ & $\mu_3$ & $\mu_4$ & $\mu_5$ & $\mu_6$ \\ \hline Hanna & $\mu_7$ & $\mu_8$ & $\mu_9$ & $\mu_{10}$ & $\mu_{11}$ & $\mu_{12}$ \\ \hline Westar & $\mu_{13}$ & $\mu_{14}$ & $\mu_{15}$ & $\mu_{16}$ & $\mu_{17}$ & $\mu_{18}$ \\ \hline \end{tabular} \end{center} \label{ghcellmeans} % Label must come after the table or numbering is wrong. \end{table} Next is the part of the SAS program that creates the combination variable. Notice that it involves a data step that comes after the \texttt{proc glm}. This usually doesn't happen. I did it by creating a new data set called \texttt{slime} that starts by being identical to \texttt{mould}, which was created in the file \texttt{ghread.sas}. The \texttt{set} command is used to read in the data set \texttt{mould}, and then we start from there. This is done just for teaching purposes. Ordinarily, I would not create multiple data sets that are mostly copies of each other. I'd put the whole thing in one data step. Here's the code. Because all 18 possibilities are mentioned explicitly, anything else (like a missing value) is automatically missing. 
\begin{verbatim} data slime; set mould; /* mould was created by ghread91.sas */ if plant=1 and mcg=1 then combo = 1; else if plant=1 and mcg=2 then combo = 2; else if plant=1 and mcg=3 then combo = 3; else if plant=1 and mcg=7 then combo = 4; else if plant=1 and mcg=8 then combo = 5; else if plant=1 and mcg=9 then combo = 6; else if plant=2 and mcg=1 then combo = 7; else if plant=2 and mcg=2 then combo = 8; else if plant=2 and mcg=3 then combo = 9; else if plant=2 and mcg=7 then combo = 10; else if plant=2 and mcg=8 then combo = 11; else if plant=2 and mcg=9 then combo = 12; else if plant=3 and mcg=1 then combo = 13; else if plant=3 and mcg=2 then combo = 14; else if plant=3 and mcg=3 then combo = 15; else if plant=3 and mcg=7 then combo = 16; else if plant=3 and mcg=8 then combo = 17; else if plant=3 and mcg=9 then combo = 18; label combo = 'Plant-MCG Combo'; \end{verbatim}
From Table~\ref{ghcellmeans} on page~\pageref{ghcellmeans}, it is clear that the absence of a main effect for Cultivar is the same as
\begin{equation} \label{maineffcultivar}
\mu_1+\mu_2+\mu_3+\mu_4+\mu_5+\mu_6 = \mu_7+\mu_8+\mu_9+\mu_{10}+\mu_{11}+\mu_{12} = \mu_{13}+\mu_{14}+\mu_{15}+\mu_{16}+\mu_{17}+\mu_{18}.
\end{equation}
There are two equalities here, and they are saying that two contrasts of the eighteen cell means are equal to zero. To see why this is true, recall that a contrast of the 18 treatment means is a linear combination of the form
\begin{displaymath}
L = a_1\mu_1 + a_2\mu_2 + \ldots + a_{18}\mu_{18},
\end{displaymath}
where the $a$ weights add up to zero. The table below gives the weights of the contrasts defining the test for the main effect of plant, one set of weights in each row. The first row corresponds to the first equals sign in Equation~\ref{maineffcultivar}. It says that
\begin{displaymath}
\mu_1+\mu_2+\mu_3+\mu_4+\mu_5+\mu_6 - (\mu_7+\mu_8+\mu_9+\mu_{10}+\mu_{11}+\mu_{12}) = 0.
\end{displaymath}
The second row corresponds to the second equals sign in Equation~\ref{maineffcultivar}. It says that
\begin{displaymath}
\mu_7+\mu_8+\mu_9+\mu_{10}+\mu_{11}+\mu_{12} - (\mu_{13}+\mu_{14}+\mu_{15}+\mu_{16}+\mu_{17}+\mu_{18}) = 0.
\end{displaymath}

\begin{table}% [here]
\caption{Weights of the linear combinations for testing a main effect of cultivar}
\begin{center}
\begin{tabular}{|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|} \hline
$a_{1}$ & $a_{2}$ & $a_{3}$ & $a_{4}$ & $a_{5}$ & $a_{6}$ & $a_{7}$ & $a_{8}$ & $a_{9}$ & $a_{10}$ & $a_{11}$ & $a_{12}$ & $a_{13}$ & $a_{14}$ & $a_{15}$ & $a_{16}$ & $a_{17}$ & $a_{18}$ \\ \hline
1&1&1&1&1&1&-1&-1&-1&-1&-1&-1&0&0&0&0&0&0 \\ \hline
0&0&0&0&0&0&1&1&1&1&1&1&-1&-1&-1&-1&-1&-1 \\ \hline
\end{tabular}
\end{center}
\label{maineffcultivarweights} % Label must come after the table or numbering is wrong.
\end{table}

Table~\ref{maineffcultivarweights} is the basis of the first \texttt{contrast} statement in \texttt{proc glm}. Notice how the contrasts are separated by commas. Also notice that the variable on which we're doing contrasts (\texttt{combo}) has to be repeated for each contrast.
\begin{verbatim} /* Getting main effects and the interaction with CONTRAST statements */ proc glm; class combo; model meanlng = combo; contrast 'Plant Main Effect' combo 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 0 0 0 0 0 0, combo 0 0 0 0 0 0 1 1 1 1 1 1 -1 -1 -1 -1 -1 -1; \end{verbatim}
If there is no main effect for MCG, we are saying
\begin{displaymath}
\mu_1+\mu_7+\mu_{13} = \mu_2+\mu_8+\mu_{14} = \mu_3+\mu_9+\mu_{15} = \mu_4+\mu_{10}+\mu_{16} = \mu_5+\mu_{11}+\mu_{17} = \mu_6+\mu_{12}+\mu_{18}.
\end{displaymath}
There are five contrasts here, one for each equals sign; in general, there is one contrast for each equals sign in the null hypothesis. Table~\ref{maineffmcgweights} shows the weights of the contrasts.

\begin{table}% [here]
\caption{Weights of the linear combinations for testing a main effect of MCG (Fungus type)}
\begin{center}
\begin{tabular}{|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|r|} \hline
$a_{1}$ & $a_{2}$ & $a_{3}$ & $a_{4}$ & $a_{5}$ & $a_{6}$ & $a_{7}$ & $a_{8}$ & $a_{9}$ & $a_{10}$ & $a_{11}$ & $a_{12}$ & $a_{13}$ & $a_{14}$ & $a_{15}$ & $a_{16}$ & $a_{17}$ & $a_{18}$ \\ \hline
1&-1&0&0&0&0&1&-1&0&0&0&0&1&-1&0&0&0&0 \\ \hline
0&1&-1&0&0&0&0&1&-1&0&0&0&0&1&-1&0&0&0 \\ \hline
0&0&1&-1&0&0&0&0&1&-1&0&0&0&0&1&-1&0&0 \\ \hline
0&0&0&1&-1&0&0&0&0&1&-1&0&0&0&0&1&-1&0 \\ \hline
0&0&0&0&1&-1&0&0&0&0&1&-1&0&0&0&0&1&-1 \\ \hline
\end{tabular}
\end{center}
\label{maineffmcgweights} % Label must come after the table or numbering is wrong.
\end{table}

And here is the corresponding \texttt{contrast} statement in \texttt{proc glm}.
\begin{verbatim} contrast 'MCG Main Effect' combo 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1 0 0 0 0, combo 0 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1 0 0 0, combo 0 0 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1 0 0, combo 0 0 0 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1 0, combo 0 0 0 0 1 -1 0 0 0 0 1 -1 0 0 0 0 1 -1; \end{verbatim}
To compose the Plant by MCG interaction, consider the hypothetical graph in Figure~\ref{ghnointeraction}. You can think of the ``effect" of MCG as a profile, representing a pattern of differences among means. If the three profiles are the same shape for each type of plant -- that is, if they are parallel -- the effect of MCG does not depend on the type of plant, and there is no interaction.

\begin{figure}% [here]
\caption{No Interaction}
\begin{center}
\includegraphics[width=4in]{HypoPlantByMCG}
\end{center}
\label{ghnointeraction}
\end{figure}
% Need to make an open source version of this graph. I just printed this version to pdf from an old document and cropped it.

For the profiles to be parallel, each set of corresponding line segments must be parallel. To start with the three line segments on the left, the rise represented by $\mu_2-\mu_1$ must equal the rise $\mu_8-\mu_7$, and $\mu_8-\mu_7$ must equal $\mu_{14}-\mu_{13}$. These are two contrasts that equal zero under the null hypothesis
\begin{displaymath}
\mu_2-\mu_1-\mu_8+\mu_7=0 \mbox{ and } \mu_8-\mu_7-\mu_{14}+\mu_{13}=0.
\end{displaymath}
There are two contrasts for each of the four remaining sets of three line segments, for a total of ten contrasts. They appear directly in the \texttt{contrast} statement of \texttt{proc glm}. Notice how each row adds to zero; these are \emph{contrasts}, not just linear combinations.
\begin{verbatim} contrast 'Plant by MCG Interaction' combo -1 1 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0, combo 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1 0 0 0 0, combo 0 -1 1 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0, combo 0 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1 0 0 0, combo 0 0 -1 1 0 0 0 0 1 -1 0 0 0 0 0 0 0 0, combo 0 0 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1 0 0, combo 0 0 0 -1 1 0 0 0 0 1 -1 0 0 0 0 0 0 0, combo 0 0 0 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1 0, combo 0 0 0 0 -1 1 0 0 0 0 1 -1 0 0 0 0 0 0, combo 0 0 0 0 0 0 0 0 0 0 -1 1 0 0 0 0 1 -1; \end{verbatim}
Now we can compare the tests we get from these contrast statements with what we got from a two-way ANOVA. For easy reference, here is part of the two-way output.
\begin{verbatim} Source DF Type III SS Mean Square F Value Pr > F PLANT 2 221695.12747 110847.56373 113.90 0.0001 MCG 5 58740.26456 11748.05291 12.07 0.0001 PLANT*MCG 10 47581.48147 4758.14815 4.89 0.0001 \end{verbatim} And here is the output from the contrast statements. \begin{verbatim} Contrast DF Contrast SS Mean Square F Value Pr > F Plant Main Effect 2 221695.12747 110847.56373 113.90 0.0001 MCG Main Effect 5 58740.26456 11748.05291 12.07 0.0001 Plant by MCG Interac 10 47581.48147 4758.14815 4.89 0.0001 \end{verbatim} So it worked. Here are some comments. \begin{itemize} \item Of course this is not the way you'd want to test for main effects and interactions. On the contrary, it makes you appreciate all the work that glm does for you when you say \texttt{model~meanlng~=~plant|mcg;} \item These contrasts are supposed to be an aid to understanding --- understanding what main effects and interactions really are, and understanding how you can test nearly any hypothesis you can think of in a multi-factor design. Almost without exception, what you want to do is test whether some collection of contrasts are equal to zero. Now you can do it, whether the collection you're interested in happens to be standard, or not. \item On the other hand, this was brutal. The size of the design made specifying those contrasts an unpleasant experience. There is an easier way. \end{itemize} \paragraph{Cell means coding} Because the \texttt{test} statement of \texttt{proc reg} has a more flexible syntax than the \texttt{contrast} statement of \texttt{proc glm}, it's a lot easier if you use cell means dummy variable coding, fit a model with no intercept in proc reg, and use test statements. In the following example, the indicator dummy variables are named $\mu_1$ to $\mu_{18}$. This choice makes it possible to directly transcribe statements about the population cell means into test statements\footnote{Here's why it works. In \texttt{test} statements, \texttt{proc reg} uses the name of the explanatory variable to stand for the regression coefficient for that explanatory variable. And with cell means coding, the regression coefficients ($\beta$ values) are identical to the cell means ($\mu$ values). So if the name of each cell means coding indicator is the same as the $\mu$ for that cell in the first place, you can just directly state the null hypothesis in the test statement.}. I highly recommend it. Of course if you really hate Greek letters, you could always name them $m_1$ to $m_{18}$ or something. First, we need to define 18 dummy variables. In general, it's a bit more tedious to define dummy variables than to make a combination variable. Here, I use the combination variable \texttt{combo} (which has already been created) to make the task a bit easier -- and also to illustrate the use of arrays and loops in the data step. The data set \texttt{yucky} below is the same as \texttt{slime}, except that it also has the eighteen indicators for the 18 combinations of \texttt{plant} and \texttt{mcg}. It's pretty self-explanatory, except that the name of the array does not need to be the same as the names of the variables. All you need is a valid SAS name for the array, and a list of variables. There can be more than one \texttt{array} statement, so you can have more than one array. \begin{verbatim} /* proc reg's test statement may be easier, but first we need to make 16 dummy variables for cell means coding. 
This will illustrate arrays and loops, too */ data yucky; set slime; array mu{18} mu1-mu18; do i=1 to 18; if combo=. then mu{i}=.; else if combo=i then mu{i}=1; else mu{i}=0; end; proc reg; model meanlng = mu1-mu18 / noint; alleq: test mu1=mu2=mu3=mu4=mu5=mu6=mu7=mu8=mu9=mu10=mu11=mu12 = mu13=mu14=mu15=mu16=mu17=mu18; plant: test mu1+mu2+mu3+mu4+mu5+mu6 = mu7+mu8+mu9+mu10+mu11+mu12, mu7+mu8+mu9+mu10+mu11+mu12 = mu13+mu14+mu15+mu16+mu17+mu18; fungus: test mu1+mu7+mu13 = mu2+mu8+mu14 = mu3+mu9+mu15 = mu4+mu10+mu16 = mu5+mu11+mu17 = mu6+mu12+mu18; p_by_f: test mu2-mu1=mu8-mu7=mu14-mu13, mu3-mu2=mu9-mu8=mu15-mu14, mu4-mu3=mu10-mu9=mu16-mu15, mu5-mu4=mu11-mu10=mu17-mu16, mu6-mu5=mu12-mu11=mu18-mu17; \end{verbatim} Looking again at the table of means (Table~\ref{ghcellmeans} on page~\pageref{ghcellmeans}), it's easy to see how natural the syntax is. And again, the tests are correct. First, repeat the output from the \texttt{contrast} statements of \texttt{proc glm} (which matched the \texttt{proc glm} two-way ANOVA output). \begin{verbatim} Contrast DF Contrast SS Mean Square F Value Pr > F Plant Main Effect 2 221695.12747 110847.56373 113.90 0.0001 MCG Main Effect 5 58740.26456 11748.05291 12.07 0.0001 Plant by MCG Interac 10 47581.48147 4758.14815 4.89 0.0001 \end{verbatim} Then, compare output from the test statements of proc reg. \begin{verbatim} Dependent Variable: MEANLNG Test: ALLEQ Numerator: 19295.1102 DF: 17 F value: 19.8270 Denominator: 973.1736 DF: 90 Prob>F: 0.0001 Dependent Variable: MEANLNG Test: PLANT Numerator: 110847.5637 DF: 2 F value: 113.9032 Denominator: 973.1736 DF: 90 Prob>F: 0.0001 Dependent Variable: MEANLNG Test: FUNGUS Numerator: 11748.0529 DF: 5 F value: 12.0719 Denominator: 973.1736 DF: 90 Prob>F: 0.0001 Dependent Variable: MEANLNG Test: P_BY_F Numerator: 4758.1481 DF: 10 F value: 4.8893 Denominator: 973.1736 DF: 90 Prob>F: 0.0001 \end{verbatim} Okay, now we know how to do anything. Finally, it is time to graph the interaction, and find out what these results mean! \begin{figure}% [here] \caption{Plant by MCG: Mean Lesion Length} \begin{center} \includegraphics[width=4in]{PlantByMCG} \end{center} \label{plantbymcg} \end{figure} % Need to make an open source version of this graph. I just printed this version to pdf from an old document and cropped it. First, we see a sizable and clear main effect for Plant. In fact, going back to the analysis of variance summary tables and dividing the Sum of Squares explained by Plant by the Total Sum of Squares, we observe that Plant explains around 53 percent of the variation in mean lesion length. That's huge. We will definitely want to look at pairwise comparisons of marginal means, too; we'll get back to this later. Looking at the pattern of means, it's clear that while the main effect of fungus type is statistically significant, this is not something that should be interpreted, because which one is best (worst) depends on the type of plant. That is, we need to look at the interaction. Before proceeding I should mention that many text advise us to \emph{never} interpret main effects if the interaction is statistically significant. I disagree, and Figure~\ref{plantbymcg} is a good example of why. It is clear that while the magnitudes of the differences depend on type of fungus, the lesion lengths are generally largest on Westar and smallest on GP159. So averaging over fungus types is a reasonable thing to do. This does not mean the interaction should be ignored; the three profiles really look different. 
In particular, GP159 not only has a smaller average lesion length, but it seems to exhibit less responsiveness to different strains of fungus. A test for the equality of $\mu_1$ through $\mu_6$ would be valuable. Pairwise comparisons of the 6 means for Hanna and the 6 means for Westar look promising, too.

\paragraph{A Brief Consideration of Multiple Comparisons}
The mention of pairwise comparisons brings up the issue of formal multiple comparison follow-up tests for this problem. The way people often do follow-up tests for factorial designs is to make a combination variable and then do all pairwise comparisons. It seems like they do this because they think it's the only thing the software will let them do. Certainly it's better than nothing. Here are some comments:
\begin{itemize}
\item With SAS, pairwise comparisons of cell means are not the only thing you can do. \texttt{Proc glm} will do all pairwise comparisons of marginal means quite easily. This means it's easy to follow up a significant and meaningful main effect.
\item For the present problem, there are 153 possible pairwise comparisons of the 18 cell means. If we do all these as one-at-a-time tests, the chances of false significance are certainly mounting. There is a strong case here for protecting the tests at a single joint significance level.
\item Since the sample sizes are equal, Tukey tests are most powerful for all pairwise comparisons. But it's not so simple. Pairwise comparisons within plants (for example, comparing the 6 means for Westar) are interesting, and pairwise comparisons within fungus types (for example, comparison of Hanna, Westar and GP159 for fungus Type 1) are interesting, but the remaining 90 pairwise comparisons are a lot less so.
\item Also, pairwise comparisons of cell means are not all we want to do. We've already mentioned the need for pairwise comparisons of the marginal means for plants, and we'll soon see that other, less standard comparisons are of interest.
\end{itemize}
Everything we need to do will involve testing collections of contrasts. The approach we'll take is to do everything as a one-at-a-time custom test initially, and then figure out how we should correct for the fact that we've done a lot of tests. It's good to be guided by the data. Here we go.

The analyses will be done in the SAS program \texttt{green2.sas}. As usual, the entire program is given first. But you should be aware that the program was written one piece at a time and executed many times, with later analyses being suggested by the earlier ones. The program starts by reading in the file \texttt{ghbread.sas}, which is just \texttt{ghread.sas} with the additional variables (especially \texttt{combo} and \texttt{mu1} through \texttt{mu18}) that were defined in \texttt{green1.sas}.
\begin{verbatim} /* green2.sas: */ %include '/folders/myfolders/ghbread.sas'; options pagesize=100; proc glm; title 'Repeating initial Plant by MCG ANOVA, full design'; class plant mcg; model meanlng = plant|mcg; means plant|mcg; /* A. Pairwise comparisons of marginal means for plant, full design B. Test all GP159 means equal, full design C.
Test profiles for Hanna & Westar parallel, full design */ proc reg; model meanlng = mu1-mu18 / noint; A_GvsH: test mu1+mu2+mu3+mu4+mu5+mu6 = mu7+mu8+mu9+mu10+mu11+mu12; A_GvsW: test mu1+mu2+mu3+mu4+mu5+mu6 = mu13+mu14+mu15+mu16+mu17+mu18; A_HvsW: test mu7+mu8+mu9+mu10+mu11+mu12 = mu13+mu14+mu15+mu16+mu17+mu18; B_G159eq: test mu1=mu2=mu3=mu4=mu5=mu6; C_HWpar: test mu8-mu7=mu14-mu13, mu9-mu8=mu15-mu14, mu10-mu9=mu16-mu15, mu11-mu10=mu17-mu16, mu12-mu11=mu18-mu17; /* D. Oneway on mcg, GP158 subset */ data just159; /* This data set will have just GP159 */ set mould; if plant=1; proc glm data=just159; title 'D. Oneway on mcg, GP158 subset'; class mcg; model meanlng = mcg; /* E. Plant by MCG, Hanna-Westar subset */ data hanstar; /* This data set will have just Hanna and Westar */ set mould; if plant ne 1; proc glm data=hanstar; title 'E. Plant by MCG, Hanna-Westar subset'; class plant mcg; model meanlng = plant|mcg; /* F. Plant by MCG followup, Hanna-Westar subset Interaction: Follow with all pairwise differences of Westar minus Hanna differences G. Differences within Hanna? H. Differences within Westar? */ proc reg; model meanlng = mu7-mu18 / noint; F_inter: test mu13-mu7=mu14-mu8=mu15-mu9 = mu16-mu10=mu17-mu11=mu18-mu12; F_1vs2: test mu13-mu7=mu14-mu8; F_1vs3: test mu13-mu7=mu15-mu9; F_1vs7: test mu13-mu7=mu16-mu10; F_1vs8: test mu13-mu7=mu17-mu11; F_1vs9: test mu13-mu7=mu18-mu12; F_2vs3: test mu14-mu8=mu15-mu9; F_2vs7: test mu14-mu8=mu16-mu10; F_2vs8: test mu14-mu8=mu17-mu11; F_2vs9: test mu14-mu8=mu18-mu12; F_3vs7: test mu15-mu9=mu16-mu10; F_3vs8: test mu15-mu9=mu17-mu11; F_3vs9: test mu15-mu9=mu18-mu12; F_7vs8: test mu16-mu10=mu17-mu11; F_7vs9: test mu16-mu10=mu18-mu12; F_8vs9: test mu17-mu11=mu18-mu12; G_Hanaeq: test mu7=mu8=mu9=mu10=mu11=mu12; H_Westeq: test mu13=mu14=mu15=mu16=mu17=mu18; proc glm data=hanstar; class combo; model meanlng = combo; lsmeans combo / pdiff adjust=scheffe; proc iml; title 'Table of Scheffe critical values for COLLECTIONS of contrasts'; title2 'Start with interaction'; numdf = 5; /* Numerator degrees of freedom for initial test */ dendf = 60; /* Denominator degrees of freedom for initial test */ alpha = 0.05; critval = finv(1-alpha,numdf,dendf); zero = {0 0}; S_table = repeat(zero,numdf,1); /* Make empty matrix */ /* Label the columns */ namz = {"Number of Contrasts in followup test" " Scheffe Critical Value"}; mattrib S_table colname=namz; do i = 1 to numdf; s_table(|i,1|) = i; s_table(|i,2|) = numdf/i * critval; end; reset noname; /* Makes output look nicer in this case */ print "Initial test has" numdf " and " dendf "degrees of freedom." "Using significance level alpha = " alpha; print s_table; proc iml; title 'Table of Scheffe critical values for COLLECTIONS of contrasts'; title2 'Start with all means equal'; numdf = 11; /* Numerator degrees of freedom for initial test */ dendf = 60; /* Denominator degrees of freedom for initial test */ alpha = 0.05; critval = finv(1-alpha,numdf,dendf); zero = {0 0}; S_table = repeat(zero,numdf,1); /* Make empty matrix */ /* Label the columns */ namz = {"Number of Contrasts in followup test" " Scheffe Critical Value"}; mattrib S_table colname=namz; do i = 1 to numdf; s_table(|i,1|) = i; s_table(|i,2|) = numdf/i * critval; end; reset noname; /* Makes output look nicer in this case */ print "Initial test has" numdf " and " dendf "degrees of freedom." 
"Using significance level alpha = " alpha; print s_table; proc reg data=hanstar; title 'One more try at following up the interaction'; model meanlng = mu7-mu18 / noint; onemore: test mu8-mu7 = mu14-mu13; \end{verbatim} After reading and defining the data with a \texttt{\%include} statement, the program repeats the initial three by six ANOVA from \texttt{green1.sas}. This is just for completeness. Then the SAS program performs tasks labelled \textbf{A} through \textbf{H}. \paragraph{Task A} \texttt{proc reg} is used to fit a cell means model, and then test for all three pairwise differences among Plant means. They are all significantly different from each other, confirming what appears visually in the interaction plot. \begin{verbatim} proc reg; model meanlng = mu1-mu18 / noint; A_GvsH: test mu1+mu2+mu3+mu4+mu5+mu6 = mu7+mu8+mu9+mu10+mu11+mu12; A_GvsW: test mu1+mu2+mu3+mu4+mu5+mu6 = mu13+mu14+mu15+mu16+mu17+mu18; A_HvsW: test mu7+mu8+mu9+mu10+mu11+mu12 = mu13+mu14+mu15+mu16+mu17+mu18; ------------------------------------------------------------------------------- Dependent Variable: MEANLNG Test: A_GVSH Numerator: 31217.5679 DF: 1 F value: 32.0781 Denominator: 973.1736 DF: 90 Prob>F: 0.0001 Dependent Variable: MEANLNG Test: A_GVSW Numerator: 217443.4318 DF: 1 F value: 223.4374 Denominator: 973.1736 DF: 90 Prob>F: 0.0001 Dependent Variable: MEANLNG Test: A_HVSW Numerator: 83881.6915 DF: 1 F value: 86.1940 Denominator: 973.1736 DF: 90 Prob>F: 0.0001 \end{verbatim} As mentioned earlier, GP159 not only has a smaller average lesion length, but it seems to exhibit less variation in its vulnerability to different strains of fungus. Part of the significant interaction must come from this, and part from differences in the profiles of Hanna and Westar. Two questions arise: \begin{enumerate} \item Are $\mu_1$ through $\mu_6$ (the means for GP159) actually different from each other? \item Are the profiles for Hanna and Westar different? \end{enumerate} There are two natural ways to address these questions. The naive way is to subset the data --- that is, do a one-way ANOVA to compare the 6 means for GP159, and a two-way (2 by 6) on the Hanna-Westar subset. In the latter analysis, the interaction of Plant by MCG would indicate whether the two profiles were different. A more sophisticated approach is not to subset the data, but to recognize that both questions can be answered by testing collections of contrasts of the entire set of 18 means; it's easy to do with the test statement of \texttt{proc reg}. The advantage of the sophisticated approach is this. Remember that the model specifies a conditional normal distribution of the response variable for each combination of explanatory variable values (in this case there are 18 combinations of explanatory variable values), and that each conditional distribution has the \emph{same variance}. The test for, say, the equality of $\mu_1$ through $\mu_6$ would use only $\overline{Y}_1$ through $\overline{Y}_6$ (that is, just GP159 data) to estimate the 5 contrasts involved, but it would use \emph{all} the data to estimate the common error variance. From both a commonsense viewpoint and the deepest possible theoretical viewpoint, it's better not to throw information away. This is why the sophisticated approach should be better. However, this argument is convincing only if it's really true that the response variable has the same variance for every combination of explanatory variable values. 
Repeating some output from the means command of the very first \texttt{proc glm},
\begin{verbatim} Level of Level of -----------MEANLNG----------- PLANT MCG N Mean SD GP159 1 6 12.863095 12.8830306 GP159 2 6 21.623810 17.3001296 GP159 3 6 14.460714 7.2165396 GP159 7 6 17.686905 16.4258441 GP159 8 6 8.911905 7.3162618 GP159 9 6 8.784524 6.5970501 HANNA 1 6 45.578571 26.1430472 HANNA 2 6 67.296429 30.2424997 HANNA 3 6 94.192857 20.2877876 HANNA 7 6 53.621429 24.8563497 HANNA 8 6 47.838095 12.6419109 HANNA 9 6 25.673810 17.1723150 WESTAR 1 6 65.908333 35.6968616 WESTAR 2 6 187.479762 45.1992178 WESTAR 3 6 154.103571 26.5469183 WESTAR 7 6 173.972619 79.1793105 WESTAR 8 6 95.823810 22.3712022 WESTAR 9 6 66.502381 52.5253101 \end{verbatim}
We see that the sample standard deviations for GP159 look quite a bit smaller on average. Without bothering to do a formal test, we have some reason to doubt the equal variances assumption.

It's easy to see why GP159 would have less plant-to-plant variation in lesion length. It's so resistant to the fungus that there's just not that much fungal growth, period. So there's less \emph{opportunity} for variation.

Note that the equal variances assumption is essentially just a mathematical convenience. Here, it's clearly unrealistic. But what's the consequence of violating it? It's well known that the equal variance assumption can be safely violated if the cell sample sizes are equal and large. Well, here they're equal, but $n=6$ is not large. So this is not reassuring. It's not easy to say in general \emph{how} the tests will be affected when the equal variance assumption is violated, but for the two particular cases we're interested in here (are the GP159 means equal and are the Hanna and Westar profiles parallel), we can figure it out.

Formula~\ref{ExtraSS} for the $F$-test (see page~\pageref{ExtraSS}) says
\begin{displaymath}
F = \frac{(SSR_F-SSR_R)/r}{MSE_F}.
\end{displaymath}
The denominator (Mean Squared Error from the full model) is the estimated population error variance. That's the variance that's supposed to be the same for each conditional distribution. Since
\begin{displaymath}
MSE = \frac{\sum_{i=1}^n(Y_i-\widehat{Y}_i)^2}{n-p}
\end{displaymath}
and the predicted value $\widehat{Y}_i$ is always the cell mean, we can draw the following conclusions. Assume that the true variance is smaller for GP159.
\begin{enumerate}
\item When we test for equality of the GP159 means, using the Hanna-Westar data to help compute $MSE$ will make the denominator of $F$ bigger than it should be. So $F$ will be smaller, and the test is too conservative. That is, it is less likely to detect differences that are really present.
\item When we test whether the Hanna and Westar profiles are parallel, use of the GP159 data to help compute $MSE$ will make the denominator of $F$ \emph{smaller} than it should be -- so $F$ will be bigger, and the test will not be conservative enough. That is, the chance of significance if the effect is absent will be greater than 0.05. And a Type I error rate above 0.05 is always to be avoided if possible.
\end{enumerate}
This makes me inclined to favour the ``naive'' subsetting approach. Because the GP159 means \emph{look} so equal, and I want them to be equal, I'd like to give the test for difference among them the best possible chance.
And because it looks like the profiles for Hanna and Westar are not parallel (and I want them to be non-parallel, because it's more interesting if the effect of Fungus type depends on type of Plant), I want a more conservative test.

Another argument in favour of subsetting is based on botany rather than statistics. Hanna and Westar are commercial canola crop varieties, but while GP159 is definitely in the canola family, it is more like a hardy weed than a food plant. It's just a different kind of entity, and so analyzing its data separately makes a lot of sense. You may wonder, if it's so different, why was it included in the design in the first place? Well, taxonomically it's quite similar to Hanna and Westar; really, no one knew it would be such a vigorous monster in terms of resisting fungus. That's why people do research -- to find out things they didn't already know.

Anyway, we'll do the analysis both ways -- both the seemingly naive way, which is probably better once you think about it, and the sophisticated way that uses the complete set of data for all analyses.

\paragraph{Tasks B and C} These represent the ``sophisticated'' approach that does not subset the data.
\begin{itemize}
\item[\textbf{B}:] Test all GP159 means equal, full design
\item[\textbf{C}:] Test profiles for Hanna and Westar parallel, full design
\end{itemize}

\begin{verbatim}
proc reg;
     model meanlng = mu1-mu18 / noint;
     A_GvsH:   test mu1+mu2+mu3+mu4+mu5+mu6 = mu7+mu8+mu9+mu10+mu11+mu12;
     A_GvsW:   test mu1+mu2+mu3+mu4+mu5+mu6 = mu13+mu14+mu15+mu16+mu17+mu18;
     A_HvsW:   test mu7+mu8+mu9+mu10+mu11+mu12 = mu13+mu14+mu15+mu16+mu17+mu18;
     B_G159eq: test mu1=mu2=mu3=mu4=mu5=mu6;
     C_HWpar:  test mu8-mu7=mu14-mu13, mu9-mu8=mu15-mu14, mu10-mu9=mu16-mu15,
                    mu11-mu10=mu17-mu16, mu12-mu11=mu18-mu17;

-------------------------------------------------------------------------------

Dependent Variable: MEANLNG
Test: B_G159EQ  Numerator:    151.5506  DF:    5   F value:   0.1557
                Denominator:  973.1736  DF:   90   Prob>F:    0.9778

Dependent Variable: MEANLNG
Test: C_HWPAR   Numerator:   5364.0437  DF:    5   F value:   5.5119
                Denominator:  973.1736  DF:   90   Prob>F:    0.0002
\end{verbatim}

This confirms the visual impression of no differences among means for GP159, and non-parallel profiles for Hanna and Westar.

\paragraph{Task D} Now, for comparison, the subsetting approach. We will carry out a one-way ANOVA on MCG, using just the GP159 subset. Notice the creation of SAS data sets with subsets of the data.

\begin{verbatim}
data just159;     /* This data set will have just GP159 */
     set mould;
     if plant=1;
proc glm data=just159;
     title 'D. Oneway on mcg, GP159 subset';
     class mcg;
     model meanlng = mcg;

-------------------------------------------------------------------------------
                       D. Oneway on mcg, GP159 subset                         2

                        General Linear Models Procedure

Dependent Variable: MEANLNG   Average Lesion length

                                  Sum of            Mean
Source                  DF       Squares           Square    F Value    Pr > F
Model                    5    757.75319161    151.55063832      1.03    0.4189
Error                   30   4421.01258503    147.36708617
Corrected Total         35   5178.76577664

                  R-Square            C.V.       Root MSE     MEANLNG Mean
                  0.146319        86.37031      12.139485        14.055159

Source                  DF       Type I SS      Mean Square   F Value   Pr > F
MCG                      5    757.75319161     151.55063832      1.03   0.4189

Source                  DF     Type III SS      Mean Square   F Value   Pr > F
MCG                      5    757.75319161     151.55063832      1.03   0.4189
\end{verbatim}

This analysis is consistent with what we got without subsetting the data. That is, it does not provide evidence that the means for GP159 are different. But when we didn't subset the data, we had $p = 0.9778$.
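Comparing the two outputs makes the role of the error term explicit. The numerator mean square is the same in both tests, because it is computed from the same six GP159 cell means; only the denominator changes:
\begin{displaymath}
F = \frac{151.55}{973.17} \approx 0.16 \mbox{ (full data)}
\qquad \mbox{versus} \qquad
F = \frac{151.55}{147.37} \approx 1.03 \mbox{ (GP159 subset)}.
\end{displaymath}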
The larger $p$-value without subsetting happened exactly because including the Hanna and Westar data made $MSE$ larger, $F$ smaller, and hence $p$ bigger.

\paragraph{Task E} Now we will do a Plant by MCG analysis, using just the Hanna-Westar subset of the data.

\begin{verbatim}
data hanstar;     /* This data set will have just Hanna and Westar */
     set mould;
     if plant ne 1;
proc glm data=hanstar;
     title 'E. Plant by MCG, Hanna-Westar subset';
     class plant mcg;
     model meanlng = plant|mcg;

-------------------------------------------------------------------------------
                     E. Plant by MCG, Hanna-Westar subset                     3

                        General Linear Models Procedure
                           Class Level Information

                            Class    Levels    Values
                            PLANT         2    HANNA WESTAR
                            MCG           6    1 2 3 7 8 9

                      Number of observations in data set = 72

-------------------------------------------------------------------------------
                     E. Plant by MCG, Hanna-Westar subset                     4

                        General Linear Models Procedure

Dependent Variable: MEANLNG   Average Lesion length

                                  Sum of            Mean
Source                  DF       Squares           Square    F Value    Pr > F
Model                   11    189445.68433      17222.33494    12.43    0.0001
Error                   60     83164.61331       1386.07689
Corrected Total         71    272610.29764

                  R-Square            C.V.       Root MSE     MEANLNG Mean
                  0.694932        41.44379      37.230054        89.832639

Source                  DF       Type I SS      Mean Square   F Value   Pr > F
PLANT                    1    83881.691486     83881.691486    60.52    0.0001
MCG                      5    78743.774570     15748.754914    11.36    0.0001
PLANT*MCG                5    26820.218272      5364.043654     3.87    0.0042

Source                  DF     Type III SS      Mean Square   F Value   Pr > F
PLANT                    1    83881.691486     83881.691486    60.52    0.0001
MCG                      5    78743.774570     15748.754914    11.36    0.0001
PLANT*MCG                5    26820.218272      5364.043654     3.87    0.0042
\end{verbatim}

The significant interaction indicates that the profiles for Hanna and Westar are non-parallel, confirming the visual impression we got from the interaction plot. But the $p$-value is larger this time. When all the data were used to calculate the error term, we had $p = 0.0002$; now it rises to $p=0.0042$. This is definitely due to the low variation in GP159: leaving those cells out of the error term makes $MSE$ larger. Further analyses will be limited to the Hanna-Westar subset.

Now think of the interaction in a different way. Overall, Westar is more vulnerable than Hanna, but the interaction says that the degree of that greater vulnerability depends on the type of fungus. For each of the 6 types of fungus, there is a \emph{difference} between Hanna and Westar. Let's look at pairwise differences of these differences. We might be able to say, then, something like this: ``The difference in vulnerability between Hanna and Westar is greater for Fungus Type 2 than for Fungus Type 1.''

\paragraph{Task F} Plant by MCG follow-up, Hanna-Westar subset. First, verify that the interaction can be expressed as a collection of differences between differences. Of course it can.
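In symbols, the no-interaction hypothesis for the Hanna-Westar subset says that the Westar-minus-Hanna difference is the same for all six fungus types:
\begin{displaymath}
\mu_{13}-\mu_7 = \mu_{14}-\mu_8 = \mu_{15}-\mu_9 = \mu_{16}-\mu_{10} = \mu_{17}-\mu_{11} = \mu_{18}-\mu_{12}.
\end{displaymath}
That is a collection of five contrasts of the cell means, and it is exactly what the \texttt{F\_inter} line in the program below tests; its $F$ statistic reproduces the interaction test from \texttt{proc glm}.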
\begin{verbatim} proc reg; model meanlng = mu7-mu18 / noint; F_inter: test mu13-mu7=mu14-mu8=mu15-mu9 = mu16-mu10=mu17-mu11=mu18-mu12; F_1vs2: test mu13-mu7=mu14-mu8; F_1vs3: test mu13-mu7=mu15-mu9; F_1vs7: test mu13-mu7=mu16-mu10; F_1vs8: test mu13-mu7=mu17-mu11; F_1vs9: test mu13-mu7=mu18-mu12; F_2vs3: test mu14-mu8=mu15-mu9; F_2vs7: test mu14-mu8=mu16-mu10; F_2vs8: test mu14-mu8=mu17-mu11; F_2vs9: test mu14-mu8=mu18-mu12; F_3vs7: test mu15-mu9=mu16-mu10; F_3vs8: test mu15-mu9=mu17-mu11; F_3vs9: test mu15-mu9=mu18-mu12; F_7vs8: test mu16-mu10=mu17-mu11; F_7vs9: test mu16-mu10=mu18-mu12; F_8vs9: test mu17-mu11=mu18-mu12; ------------------------------------------------------------------------------- Dependent Variable: MEANLNG Test: F_INTER Numerator: 5364.0437 DF: 5 F value: 3.8699 Denominator: 1386.077 DF: 60 Prob>F: 0.0042 Dependent Variable: MEANLNG Test: F_1VS2 Numerator: 14956.1036 DF: 1 F value: 10.7902 Denominator: 1386.077 DF: 60 Prob>F: 0.0017 Dependent Variable: MEANLNG Test: F_1VS3 Numerator: 2349.9777 DF: 1 F value: 1.6954 Denominator: 1386.077 DF: 60 Prob>F: 0.1979 Dependent Variable: MEANLNG Test: F_1VS7 Numerator: 15006.4293 DF: 1 F value: 10.8265 Denominator: 1386.077 DF: 60 Prob>F: 0.0017 Dependent Variable: MEANLNG Test: F_1VS8 Numerator: 1147.2776 DF: 1 F value: 0.8277 Denominator: 1386.077 DF: 60 Prob>F: 0.3666 Dependent Variable: MEANLNG Test: F_1VS9 Numerator: 630.3018 DF: 1 F value: 0.4547 Denominator: 1386.077 DF: 60 Prob>F: 0.5027 Dependent Variable: MEANLNG Test: F_2VS3 Numerator: 5449.1829 DF: 1 F value: 3.9314 Denominator: 1386.077 DF: 60 Prob>F: 0.0520 Dependent Variable: MEANLNG Test: F_2VS7 Numerator: 0.0423 DF: 1 F value: 0.0000 Denominator: 1386.077 DF: 60 Prob>F: 0.9956 Dependent Variable: MEANLNG Test: F_2VS8 Numerator: 7818.7443 DF: 1 F value: 5.6409 Denominator: 1386.077 DF: 60 Prob>F: 0.0208 Dependent Variable: MEANLNG Test: F_2VS9 Numerator: 9445.7674 DF: 1 F value: 6.8147 Denominator: 1386.077 DF: 60 Prob>F: 0.0114 Dependent Variable: MEANLNG Test: F_3VS7 Numerator: 5479.5767 DF: 1 F value: 3.9533 Denominator: 1386.077 DF: 60 Prob>F: 0.0513 Dependent Variable: MEANLNG Test: F_3VS8 Numerator: 213.3084 DF: 1 F value: 0.1539 Denominator: 1386.077 DF: 60 Prob>F: 0.6962 Dependent Variable: MEANLNG Test: F_3VS9 Numerator: 546.1923 DF: 1 F value: 0.3941 Denominator: 1386.077 DF: 60 Prob>F: 0.5326 Dependent Variable: MEANLNG Test: F_7VS8 Numerator: 7855.1432 DF: 1 F value: 5.6672 Denominator: 1386.077 DF: 60 Prob>F: 0.0205 Dependent Variable: MEANLNG Test: F_7VS9 Numerator: 9485.7704 DF: 1 F value: 6.8436 Denominator: 1386.077 DF: 60 Prob>F: 0.0112 Dependent Variable: MEANLNG Test: F_8VS9 Numerator: 76.8370 DF: 1 F value: 0.0554 Denominator: 1386.077 DF: 60 Prob>F: 0.8147 \end{verbatim} \paragraph{Tasks G and H} Finally we test separately for MCG differences within Hanna and within Westar. \begin{verbatim} G_Hanaeq: test mu7=mu8=mu9=mu10=mu11=mu12; H_Westeq: test mu13=mu14=mu15=mu16=mu17=mu18; ------------------------------------------------------------------------------- E. 
Plant by MCG, Hanna-Westar subset 31 The REG Procedure Test G_Hanaeq Results for Dependent Variable meanlng Mean Source DF Square F Value Pr > F Numerator 5 3223.58717 2.33 0.0536 Denominator 60 1386.07689 ------------------------------------------------------------------------------- Test H_Westeq Results for Dependent Variable meanlng Mean Source DF Square F Value Pr > F Numerator 5 17889 12.91 <.0001 Denominator 60 1386.07689 \end{verbatim} There is evidence of differences in mean lesion length within Westar, but not Hanna. It makes sense to follow up with pairwise comparisons of the MCG means for just Westar, but first let's review what we've done so far, limiting the discussion to just the Hanna-Westar subset of the data. We've tested \begin{itemize} \item Overall difference among the 12 means \item Main effect for PLANT \item Main effect for MCG \item PLANT*MCG interaction \item 15 pairwise comparisons of the Hanna-Westar difference, following up the interaction \item One comparison of the 6 means for Hanna \item One comparison of the 6 means for Westar \end{itemize} That's 21 tests in all, and we really should do at least 15 more, testing for pairwise differences among the Westar means. Somehow, we should make this into a set of proper post-hoc tests, and correct for the fact that we've done a lot of them. But how? Tukey tests are only good for pairwise comparisons, and a Bonferroni correction is very ill-advised, since these tests were not all planned before seeing the data. This pretty much leaves us with Scheff\'e or nothing. \paragraph{Scheff\'e Tests} Because some of the tests we've done are for more than one contrast at a time, the discussion of Scheff\'e tests for \emph{collections} of contrasts in Section~\ref{SCHEFFECONTRASTS} (page~\pageref{SCHEFFECONTRASTS}) is relevant. But Section~\ref{SCHEFFECONTRASTS} is focused on the case where we are following up a significant difference among \emph{all} the treatment means. Here, the initial test may or may not be a test for equality of all the means. We might start somewhere else, like with a test for an interaction or main effect. It's a special case of Scheff\'e tests for regression (Section~\ref{SCHEFFEREGRESSION}, page~\pageref{SCHEFFEREGRESSION}). Assume a multifactor design. Create a combination explanatory variable whose values are all combinations of factor levels. All the tests we do will be tests for collections consisting of one or more contrasts of the cell means. Start with a statistically significant initial test, an $F$-test for $r$ contrasts. A Scheff\'e follow-up test will be a test for $s$ contrasts, not necessarily a subset of the contrasts of the initial test. The follow-up test must obey these rules: \begin{itemize} \item $s