Hints on using R for Assignment 3

First, you need to read the data. The data file for this assignment contains names for both the columns (prev, alpha..., cdc15...) and the rows (gene names). R figures this out with the default options, so you can just use

    d <- read.table ("gene.txt")

assuming you stored the data file from the web page in your working directory with the name gene.txt.

To use the data, you need to convert the data frame to a matrix and then transpose it, so that the columns (variables) correspond to the genes and the rows (cases) to the various experimental measurements. (The exception is the first row, which is the flag saying whether the gene was previously thought to vary in activity over the cell cycle. You should ignore it until the very end, as discussed in the assignment sheet.) So you might do something like

    m <- t (as.matrix (d))

to produce a suitable matrix from the data frame.

You'll then need to extract the rows you need for each task. E.g., m[2:19,] gets you the measurements for the "alpha" experiment. (Remember that the first row is the "prev" flag, not to be used until the end!) The row and column names from the data frame are preserved in the matrix, and often in results (e.g., vectors) obtained using the matrix.

Next, to get the functions in the pca.r file from the web page defined, you need to say

    source("pca.r")

replacing "pca.r" with whatever file name you stored the downloaded file under, in quotes. You may need to give a full path, e.g. "c:/pca.r" if you stored the file as "pca.r" in the top level of your C drive.

You use these functions by calling them like this:

    pcv <- pca.vectors (datamatrix, k, scale=TF)
    pcp <- pca.proj (pcv, datamatrix)

where datamatrix is the matrix of observations you want to find principal components from (which will be some subset of the data for this assignment), k is the number of principal components you want, and scale=TF should be either scale=TRUE or scale=FALSE.
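The reading, transposing, and row-extraction steps can be tried out on a small made-up table before touching the real file. The sketch below is only an illustration: the data frame, its gene names, its column names, and its values are all invented stand-ins for what read.table would return from gene.txt, and the real file has many more rows and columns.

```r
# Toy stand-in for the result of read.table("gene.txt").
# Rows are genes; columns are the "prev" flag plus a few measurements.
# All names and numbers here are made up for illustration.
d <- data.frame(
  prev   = c(1, 0, 0),
  alpha1 = c(0.2, -0.1, 0.5),
  alpha2 = c(0.4, 0.0, -0.3),
  cdc1   = c(-0.6, 0.3, 0.1),
  row.names = c("geneA", "geneB", "geneC")
)

# Convert to a matrix and transpose: rows become measurements,
# columns become genes.
m <- t(as.matrix(d))

dim(m)        # number of rows (measurements) and columns (genes)
rownames(m)   # row names carried over from the data frame's column names

# Extract the made-up "alpha" rows, leaving out row 1 (the prev flag).
alpha <- m[2:3, ]

# Names can be used for indexing as well as positions.
alpha["alpha2", "geneC"]
```

With the real data the same pattern applies, with the row range changed to match the experiment you want (such as 2:19 for "alpha").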
If scale=TRUE, the variables are scaled by their standard deviations before use, which is equivalent to doing PCA on the correlation matrix rather than the covariance matrix.

Once the principal components are found with pca.vectors, you can find the projections of the observations on these principal components with pca.proj. The data matrix for pca.proj can contain observations that weren't used in pca.vectors. The result of pca.proj is a reduced data set, with the number of variables being k rather than the original number.

When producing plots for this assignment, you may find it useful to have both dots at the data points and lines connecting them. The following plot options accomplish this:

    plot (stuff, type="b", pch=20)

You need to come up with some figure for how "cyclical" each gene is. The principal component vectors (i.e., eigenvectors) are useful for this. You get them from the result of pca.vectors as element "e" of the result. For instance, after the pca.vectors command above, pcv$e[,2] will be the eigenvector for the second principal component. You can do vector arithmetic on these to get some suitable measure of how cyclic each gene is.

Once you have such a measure, the "sort", "order", and "quantile" functions may be useful in selecting a subset of about 800 genes that seem to have cyclic activity. Remember that you can select a subset of a vector using logical operations: e.g., a[a>10] is the subset of the vector a in which the elements are greater than 10.
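As a rough illustration of the selection step only, here is how "order" and "quantile" might be used to pick the top-scoring genes. The score vector below is made up, and the sketch assumes that a larger score means "more cyclic"; it says nothing about how the measure itself should be computed, which is up to you.

```r
# Hypothetical cyclicity scores, one per gene (made-up numbers).
# Assumption: a larger score means the gene looks more cyclic.
score <- c(g1 = 0.9, g2 = 0.1, g3 = 0.5, g4 = 0.7, g5 = 0.3)

n.keep <- 2   # would be about 800 for the real data

# Way 1: order() gives the indices that sort the vector, here from
# largest to smallest; keep the first n.keep of them.
top1 <- names(score)[order(score, decreasing = TRUE)[1:n.keep]]

# Way 2: quantile() gives a cutoff value, and a logical comparison
# selects the genes whose scores lie above it.
cutoff <- quantile(score, 1 - n.keep / length(score))
top2 <- names(score)[score > cutoff]

top1
top2
```

The two approaches agree here, but with tied scores the quantile cutoff can select slightly more or fewer genes than requested, which is fine when the target is only "about 800".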