USEFUL R COMMANDS FOR ASSIGNMENT 2 Written 2010-11-26. There may be updates later. Note that I don't repeat things you might need that are in the hints for Assignment 1. SORTING AND ORDERING You may find it useful to sort a vector in increasing order, which you can do with sort(v), or to find the vector of indexes which give this sorted vector. For instance, if M is an n by m matrix and v is a vector of dimension n, then M[order(v),] will give you the matrix M with rows rearranged in the same order that makes v be in increasing order. COMPARING AND COUNTING If v is a vector of numbers, the expression v>0.5 will be a vector of FALSE/TRUE indicators of whether each element of v is greater than 0.5 or not. The sum function adds up all the elements of a vector. If the vector is of FALSE/TRUE indicators rather than numbers, sum adds up the elements counting FALSE as 0 and TRUE as 1. So sum(v>0.5) gives you the number of elements of v that are greater than 0.5. SUBTRACTING A VALUE FROM EACH ROW If m is a matrix or data frame with n rows, and v is a vector of n elements, then m-v will be the matrix or data frame in which v[1] is subtracted from all the elements of the first row of m, v[2] is subtracted from all the elements of the second row of m, etc. LOGISTIC REGRESSION Logistic regression can be done with the "glm" function. This function actually many "generalized linear models", of which logistic regression is one, so you have to tell it to use logistic regression (ordinary least squares regression, as done by lm, is the default). Example: fit = glm (y ~ X + v, family=binomial) This fits a logistic regression model for 0/1 values in the vector y in terms of covariates in X and v. If the response vector y has length 100, then we might use a matrix X with 100 rows and 4 columns, each column being a covariate, and a vector v of length 100 as another covariate. Note that this doesn't work if X is a data frame (but you can convert the data frame to a matrix). To include squares of variables as extra covariates in a logistic regression, you can use a command like glm (y ~ x + I(x^2), family=binomial) You can't just put in x^2 without I(...) around it, since then R interprets the ^2 as meaning something other than squaring. The summary function can be used to look at the regression coefficients (which are estimated by maximum likelihood), and some other information: summary(fit) Note that in a script you will need to write print(summary(fit)) to see the results. You can get the vector of regression coefficients (starting with the intercept) with coef(fit). One of the things printed by summary is the "AIC" (Akaike Information Criterion). This is an attempt to combine the likelihood at the maximum and the number of covariates to get an indication of how good the model is. Lower AIC is better. You can find the probability (according to the fitted model) of the response being 1 for each observation with the command predict(fit,type="response") You can use this same function for a model fit by least squares with the lm function, except that the results aren't guaranteed to be legal probabilities (between 0 and 1). T TESTS The t.test function does various sorts of univariate t tests. For the assignment, you need two-sample t tests with variances for the two groups not assumed to be equal. This is done with print (t.test (x, y, var.equal=FALSE)) where x and y are vectors of observations in the two groups (not necessarily the same number in each group). Instead of printing the result of the test, you can just get the p-value with t.test (x, y, var.equal=FALSE) $ p.value FOR LOOPS For the assignment, you will need to do many t tests (for all 1644 genes). You'll want to use a loop for this, not write 1644 commands by hand. If X and Y are matrices with the same number (k) of columns, you can use the following commands to do tests with each column of X against the corresponding column of Y: p.values <- numeric(k) # a vector to hold all the p-values for (i in 1:k) # loop over all k columns of X and Y { p.values[i] <- t.test (X[,i], Y[,i], var.equal=FALSE) $ p.value } After this, the vector p.values will have the results of the k tests.