Hints on using R for Assignment 3


First, you need to read the data.  The data file for this assignment
contains names for both the columns (prev, alpha..., cdc15...) and
the rows (gene names).  R figures this out with the default options,
so you can just use 

   d <- read.table ("gene.txt")

assuming you stored the data file from the web page in your working
directory with the name gene.txt.  To use the data, you need to 
convert the data frame to a matrix, and then transpose it, so that
the columns (variables) correspond to the genes, and the rows (cases) 
to the various experimental measurements (except that the first row
is the flag saying whether the gene was previously thought to vary
in activity over the cell cycle, which you should ignore until the
very end, as discussed in the assignment sheet).  So you might do
something like

   m <- t (as.matrix (d))

to produce a suitable matrix from the data frame.  You'll then need
to extract the rows you need for each task.  Eg, m[2:19,] gets you the
measurements for the "alpha" experiment.  (Remember that the first
row is the "prev" flag, not to be used until the end!)  The
row and column names from the data frame are preserved in the matrix,
and often in results (eg, vectors) obtained using the matrix.
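
As a rough illustration of these steps, here is a small sketch that
uses a made-up three-gene file in place of gene.txt.  The file
contents and the names alpha1, alpha2, cdc1, geneA, etc. are
invented for the example; only the layout (a header line with one
fewer field than the data lines) is meant to mimic the real file.

```r
# Toy stand-in for gene.txt (invented values).  The header line has one
# fewer field than the data lines, so read.table takes it as the column
# names and uses the first field of each data line as the row name.
tmp <- tempfile(fileext = ".txt")
writeLines(c("prev alpha1 alpha2 cdc1",
             "geneA 1 0.5 -0.2 0.1",
             "geneB 0 -0.3 0.4 0.0",
             "geneC 0 0.2 0.1 -0.6"), tmp)

d <- read.table(tmp)
m <- t(as.matrix(d))   # rows: prev flag, then measurements; columns: genes

dim(m)                 # 4 rows, 3 columns
rownames(m)            # names carried over from the data frame columns
colnames(m)            # gene names

alpha <- m[2:3, ]      # toy "alpha" rows, skipping row 1 (the prev flag)
```

With the real data the row range would be 2:19 rather than 2:3.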
 
Next, to get the functions in the pca.r file from the web page defined,
you need to say

   source("pca.r")

or replace "pca.r" by whatever file name you stored the downloaded
file under, in quotes.  You may need to give a full path, eg "c:/pca.r",
if you stored the file as "pca.r" in the top level of your c drive.

You use these functions by calling them like this:

   pcv <- pca.vectors (datamatrix, k, scale=TF)
   pcp <- pca.proj (pcv, datamatrix)

where datamatrix is the matrix of observations you want to find
principal components from (which will be some subset of the data 
for this assignment), k is the number of principal components you
want, and scale=TF should be either scale=TRUE or scale=FALSE.  If 
scale=TRUE, the variables are scaled by their standard deviation
before use, which is equivalent to doing PCA on the correlation
matrix, rather than the covariance matrix.  Once the principal
components are found with pca.vectors, you can find the projections
of the observations on these principal components with pca.proj.
The data matrix for pca.proj can contain observations that weren't
used in pca.vectors.  The result of pca.proj is a reduced data
set, with the number of variables being k rather than the original
number.
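
If you want to see the covariance versus correlation distinction
outside of pca.r, the following sketch uses only base R functions
(eigen, cov, cor, scale, prcomp) on random made-up data.  It is not
how pca.r is implemented, and the names X, e_cov, e_cor, Xs, proj
and p are just for this example.  It also assumes the data are
centred before projecting, which may or may not match what pca.proj
does.

```r
set.seed(1)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)  # 10 cases, 4 variables
X[, 2] <- 5 * X[, 2]                         # one variable with larger spread
k <- 2

# Eigenvectors of the covariance matrix (the scale=FALSE situation)
e_cov <- eigen(cov(X))$vectors[, 1:k]

# Eigenvectors of the correlation matrix (the scale=TRUE situation)
e_cor <- eigen(cor(X))$vectors[, 1:k]

# Project the standardized cases on the first k correlation-matrix
# eigenvectors, giving a reduced data set with k variables
Xs <- scale(X, center = TRUE, scale = TRUE)
proj <- Xs %*% e_cor
dim(proj)

# Cross-check against prcomp, which should agree up to the sign of
# each component
p <- prcomp(X, scale. = TRUE)
max(abs(abs(p$x[, 1:k]) - abs(proj)))
```

Comparing e_cov with e_cor shows how much the variable with the
larger spread dominates when the variables are not scaled.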

When producing plots for this assignment, you may find it useful
to have BOTH dots at the data points AND lines connecting them.
The following plot options accomplish this:

   plot (stuff, type="b", pch=20)

You need to come up with some figure for how "cyclical" each gene is.
The principal component vectors (ie, eigenvectors) are useful for
this.  You get them from the result of pca.vectors as element "e" of
the result.  For instance, after the pca.vectors command above,
pcv$e[,2] will be the eigenvector for the second principal component.
You can do vector arithmetic on these to get some suitable measure
of how cyclic each gene is.  
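
To show what "vector arithmetic" looks like in practice, here is a
sketch using a small made-up matrix e in place of pcv$e (the numbers
are arbitrary and are not real eigenvectors).  The particular
combination computed is only one possibility, chosen to illustrate
elementwise operations; whether it is a suitable measure is for you
to decide.

```r
# Hypothetical stand-in for pcv$e: one row per gene, one column per
# principal component (arbitrary numbers, for illustration only)
e <- matrix(c(0.6, -0.1,  0.3,
              0.2,  0.7, -0.4),
            nrow = 3,
            dimnames = list(c("geneA", "geneB", "geneC"), NULL))

# Arithmetic on columns works element by element, one value per gene,
# and the gene names are carried along
score <- sqrt(e[, 1]^2 + e[, 2]^2)
score
```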

Once you get such a measure, the "sort", "order", and "quantile"
functions may be useful in selecting a subset of about 800 genes that
seem to have cyclic activity.  Remember that you can select a subset
of a vector using logical operations:  eg, a[a>10] is the subset of
the vector a in which the elements are greater than 10.
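
As a sketch of how these functions fit together, the following picks
the top 5 of 20 made-up scores in two ways.  The names score, n_keep,
top, cutoff and picked are invented for the example; for the
assignment you would use your own measure and about 800 in place
of 5.

```r
set.seed(2)
score <- runif(20)                  # made-up measure, one value per gene
names(score) <- paste0("g", 1:20)
n_keep <- 5

# Option 1: order gives the indexes that sort the vector, so the first
# n_keep indexes (in decreasing order) are the largest values
top <- names(score)[order(score, decreasing = TRUE)[1:n_keep]]

# Option 2: find a cutoff with quantile, then use logical subsetting
cutoff <- quantile(score, 1 - n_keep / length(score))
picked <- names(score)[score > cutoff]

sort(score, decreasing = TRUE)[1:n_keep]   # the selected values themselves
```

With a quantile cutoff, the number selected may be off by one or two
from the target, depending on ties and how the quantile is
interpolated.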