USEFUL R COMMANDS FOR ASSIGNMENT 1 Written 2010-100-08. There may be updates later. HELP ON COMMANDS To get some help on the command called "foo", use help(foo) This may be useful in finding out some details that I don't mention here. READING THE DATA In all the data files for this assignment, the first line is a "header" with the names of the variables. You should therefore read it using read.table with the "head=TRUE" option. For example: d <- read.table("ass1-patients.txt",head=TRUE) Some of the data files also have names for the observations, which get printed when you display the data. For the read.table command above to work, you have to set the working directory to the one containing the ass1-patients.txt file (where ever you downloaded it). You can do this with the setwd command, or with a menu option if you're using MS Windows or a Mac. The result of read.table is a "data frame". You can access the individual variables in a data frame with, for instance, d$recur to get the "recur" variable from the data frame above. Note that the webpage provides different versions of the data files for Windows and for Unix/Linux/Mac. The Windows ones start with "w-", so unless you've changed that, the file name in the example above would be "w-ass1-patients.txt". LOOKING AT DATA Just typing the name of the variable you saved the data in (from read.table) will display the values of all the variables for all the observations. For some of the data files, this is a lot of numbers. So you you may want to look at subsets... SUBSETS OF DATA You can look at just a subset of the data by selecting certain rows or columns of a data frame or matrix. For instance, if d is a data frame with 10 variables and 20 observations, you can look at the following: d[1:10,] first 10 cases, all variables d[c(-8,-12),] 18 cases, dropping 8th and 12th, all variables d[,c(-1,-10)] 8 variables (dropping the first and last), all 20 cases d[c(1,15),1:5] just cases 1 and 15, and just the first 5 variables d[,c("MAL","BIR")] just the "MAL" and "BIR" variables, all cases If the name of the 4th variable is "xyz", it can be extracted using either d[,4] or d$xyz. DATA FRAMES VERSUS MATRICES The R "data frame" and "matrix" data types are similar - each can hold a set of observations (rows) with values for variables (columns). The read.table function gets you a data frame. If you want a matrix (sometimes necessary when using "lm"), you can say m <- as.matrix(d). If you have a matrix (say from "prcomp") and want a data frame (eg, for "plot"), use d <- as.data.frame(m). Note that d$xxx can be used to access variable "xxx" in a data frame, but not in a matrix (though d[,"xxx"] works for both). ARITHMETIC ON VECTORS, MATRICES, AND DATA FRAMES You can do arithmetic operations on whole vectors, matrices, or data frames. For example, if "m" is a matrix, log(m) will produce a matrix with elements that are the logs of the corresponding elements of m. SCATTERPLOTS You can make a scatterplot of two variables with a command like plot (d$KIL, d$CLO) To make pairwise scatterplots of all variables in a data frame, d, use plot (d) Of course, this isn't a good idea for the gene expression data with 1644 variables. If there are a lot of variables, the plots will be small, and will look better if the points are dots rather than circles. This can be done with the pch option: plot (d, pch=".") or for somewhat bigger dots: plot (d, pch=20) You can change colours of points with the "col" option, which should get a vector of colours, one for each observations that is plotted. For instance, plot (d[,1], d[,2], col=c("green","red")[recur+1]) plots observations in green if recur is 0 and in red if recur is 1. HISTOGRAMS The make a histogram of a variable, use a command like hist(d$MAL) or to set how many bins there are in the histogram, hist(d$MAL,nclass=30) QQ PLOTS To produce a QQ plot of variable "x" in data frame "d", versus standard normal quantiles, use qqnorm(d$x) SAVING AND PRINTING PLOTS You can view plots on the screen without doing anything special. But if you want to save them or print them, putting the plots in files is convenient. You can put a plot in PDF format in a file with pdf("myplot.pdf") # Make plots go to the file myplot.pdf plot(... whatever ...) # Create a plot dev.off() # Finish off. After, plots again go to the screen You can then print myplot.pdf using the Acrobat Reader program (freely downloadable), usually called acroread on Unix systems. Just visiting the file in your browser may also work. If you'd like smaller characters (allowing more plot to fit), you can use pdf("myplot.pdf",pointsize=9) To put more than one plot on a page, follow up the "pdf" command with a command such as par(mfrow=c(3,2)) which puts six plots on a page, arranged in 3 rows of 2. SAMPLE COVARIANCE AND CORRELATION You can find the covariance of two variables a command like cov (d$ROS, d$KIL) and similar for correlation (using "cor" instead of "cov"). The expression cov (d, d$KIL) will compute the covariances of all variables in d with the KIL variable. The expression cov (d) produces the whole covariance matrix of variables in d. It may be useful to round these before printing them, so they fit on the page or screen. Eg, round(cov(d),2) Note that to get things printed in a script, you need to use "print": print(round(cov(d),2)) SAMPLE MEANS / MEDIANS You can find the sample mean of each variable in a data frame or matrix (say d) with colMeans(d) Similarly, you can find the means of all variables for each observation using rowMeans(d) PRINCIPAL COMPONENT ANALYSIS PCA can be done using the buit-in prcomp function. It takes a matrix or data frame as its argument, and returns a list in which the element "rotation" has all the principal components (ie, the eigenvectors) as columns of a matrix, and "sdev" has the standard deviations in those directions (ie, the square roots of the eigenvalues). The "x" field has the projections of the observations on the principal components. For example, pc = prcomp(X) e1 = pc$rotation[,1] e2 = pc$rotation[,2] lambda1 = pc$sdev[1]^2 lambda2 = pc$sdev[2]^2 plot(pc$x[,1],pc$x[,2]) The last command produces a scatterplot of the observations reduced to just the first two principal components. Note that prcomp centres the data by default (ie, looks at the data after subtracting the sample means, as is standard). By default, prcomp does not scale the data to have variance one, but it can be made to do so with the scale=T option (this has the effect of looking at the sample correlation matrix rather than the sample covariance matrix). LINEAR REGRESSION Ordinary least-squares linear regression can be done in R using the lm function. Example: fit = lm (y ~ X + v) This fits a regression model for values in the vector y in terms of variables in X and v. If the response vector y has length 100, then we might use a matrix X with 100 rows and 4 columns, each column being a covariate, and a vector v of length 100 as another covariate. Of course, we could also just say lm(y~X) or lm(y~v) to use just the covariates in X or the covariate in v. Note that this doesn't work if X is a data frame (but you can convert the data frame to a matrix). The summary function can be used to look at the estimated coefficients, the Adjusted R-Squared value, and other things of interest: summary(fit) Note that in a script you will need to write print(summary(fit)) to see the results. You can get the vector of regression coefficients (starting with the intercept) with coef(fit).