STA 437 Assignment 1, Fall 2010, Solution to Part II

See the assignment handout here. The R commands used are here (Unix/Linux/Mac) or here (MS Windows).

Should we do a log transformation?

Histograms of all the gene expression measurements and their logs, and of the logs of the gene expression measurements for a few arbitrarily selected genes, are here. One can see that without taking logs, the distribution of gene expression measurements is highly skewed to the right. After taking logs, the overall distribution seems close to normal, except that the tails seem a bit heavy. The distributions of the log measurements for the individual genes examined also seem roughly normal, though with only 39 measurements one cannot draw any firm conclusions.

From this, it seems reasonable to use the logs of the measurements, though below I also try using the measurements without taking logs.
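
As a rough sketch, this check could be done as follows, assuming the raw measurements are in a numeric matrix X with one row per subject and one column per gene (the name X is an assumption, not from the handout):

  # Compare the overall distribution of raw and log measurements.
  par(mfrow=c(1,2))
  hist(as.vector(X), breaks=50, main="Raw measurements")
  hist(as.vector(log(X)), breaks=50, main="Log measurements")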

Any outliers?

To look for outliers, I plotted here the means and standard deviations over subjects of the log expression levels for each gene, and the means and standard deviations over genes of the log expression levels for each subject. One can see a couple dozen genes with unusually high average expression, and one with unusually low average expression. It's not clear that any of these genes should be regarded as outliers, however - the distributions may just be heavy tailed. Some genes have high standard deviations, but not so high as to indicate that there are really extreme measurements. There are no subjects with extreme means or standard deviations for expression levels.

I concluded that there is no compelling reason to delete any genes or subjects as outliers.
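
A sketch of these checks, again assuming the log expression levels are in a matrix log.X with subjects as rows and genes as columns (log.X is a hypothetical name):

  # Means and standard deviations over subjects (one per gene)
  # and over genes (one per subject).
  gene.mean <- apply(log.X, 2, mean)
  gene.sd <- apply(log.X, 2, sd)
  subj.mean <- apply(log.X, 1, mean)
  subj.sd <- apply(log.X, 1, sd)
  par(mfrow=c(2,2))
  plot(gene.mean); plot(gene.sd); plot(subj.mean); plot(subj.sd)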

Finding principal components

I found principal components in four ways - with and without taking logs, and using either the covariance matrix or the correlation matrix. Screeplots of the first 20 principal components found by these four methods are here. There seems to be little difference between using the covariance and the correlation matrix, as one might expect, since, as seen earlier, the differences in variances between the variables are not huge (at least when logs are taken). Using logs of the expression levels produces a first principal component that accounts for a larger proportion of the variance.
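
The four variants could be computed with prcomp along the following lines; pc.log matches the object name seen in the regression output below, while the other names (and X, log.X) are assumptions:

  pc.nolog <- prcomp(X)                      # covariance, no logs
  pc.nolog.cor <- prcomp(X, scale.=TRUE)     # correlation, no logs
  pc.log <- prcomp(log.X)                    # covariance, logs
  pc.log.cor <- prcomp(log.X, scale.=TRUE)   # correlation, logs
  par(mfrow=c(2,2))
  screeplot(pc.log, npcs=20)                 # and similarly for the others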

Predicting cancer recurrence

To get an idea of whether recurrence can be predicted from the first few principal components, I looked at their correlations with the "recur" indicator. For principal components found from the covariance matrix of the log expression levels, the correlations are as follows:
PC1   0.31
PC2  -0.29
PC3   0.30
PC4   0.11
PC5   0.37
PC6   0.08
PC7   0.23
PC8   0.01
PC9   0.24
PC10 -0.03
PC1, PC2, PC3, and PC5 have the highest correlations in absolute value.
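
These correlations could be computed as follows, assuming recur is the 0/1 recurrence indicator:

  # Correlations of the first ten PC scores with recurrence.
  round(cor(pc.log$x[, 1:10], recur), 2)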

Pairwise scatterplots for PC1 to PC5, with recur marked by colour (1 = red), are shown here. Visually, there seems to be some information about recurrence in pairs such as PC1 and PC3, though the patients with recurrence are certainly not clearly separated.

(These pairwise scatterplots also show a couple of points that might be regarded as outliers - those with the smallest projections on PC1 and PC5 - but they do not seem extreme enough to justify removing them.)
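
A sketch of how such scatterplots could be produced:

  # Pairwise scatterplots of the first five PC scores, with
  # recurrences (recur == 1) shown in red.
  pairs(pc.log$x[, 1:5], col=ifelse(recur==1, "red", "black"))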

I did least squares regressions for recur with the projections on the first five principal components as covariates, using the principal components found in the four ways above. The adjusted R-squared values were as follows:

  covariance, no logs:  0.1645
  correlation, no logs: 0.1041
  covariance, logs:     0.3334
  correlation, logs:    0.3315
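
These values could be obtained along the following lines, using the (hypothetical) PC objects from the sketch above:

  # Adjusted R-squared for each of the four sets of components.
  for (p in list(pc.nolog, pc.nolog.cor, pc.log, pc.log.cor))
    print(summary(lm(recur ~ p$x[, 1:5]))$adj.r.squared)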

I therefore used the principal components found from the covariance matrix of the log expression levels for the remainder of the analysis.
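
The chosen regression could be fit as follows; keep matches the variable name seen in the output:

  keep <- 1:5
  summary(lm(recur ~ pc.log$x[, keep]))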

The output for this regression was as follows:

                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          0.589744   0.065148   9.052 1.84e-10 ***
pc.log$x[, keep]PC1  0.005793   0.002422   2.391   0.0226 *
pc.log$x[, keep]PC2 -0.018338   0.008927  -2.054   0.0479 *
pc.log$x[, keep]PC3  0.029741   0.012168   2.444   0.0200 *
pc.log$x[, keep]PC4  0.005701   0.012642   0.451   0.6549
pc.log$x[, keep]PC5  0.038643   0.013758   2.809   0.0083 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4068 on 33 degrees of freedom
Multiple R-squared: 0.4211,     Adjusted R-squared: 0.3334
F-statistic: 4.801 on 5 and 33 DF,  p-value: 0.002087

The overall p-value of 0.002087 indicates strong evidence that there is real predictive value in these covariates.

However, one might question the normality assumption behind this p-value. I therefore did regressions with 999 random permutations of the "recur" variable, to see how unusual the adjusted R-squared value of 0.3334 found above is when there is no real relationship. This provided a p-value of 0.002, confirming (without making a normality assumption) the p-value from the output above.
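
A sketch of this permutation test (sample(recur) randomly permutes the response, breaking any real relationship with the covariates):

  # Adjusted R-squared for 999 permuted versions of recur.
  r2 <- numeric(999)
  for (i in 1:999)
    r2[i] <- summary(lm(sample(recur) ~ pc.log$x[, keep]))$adj.r.squared
  (1 + sum(r2 >= 0.3334)) / (1 + 999)   # permutation p-value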

Since the p-value for PC4 in the output above is large, one might consider a regression model using only PC1, PC2, PC3, and PC5.
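
A sketch of the call for this reduced fit:

  summary(lm(recur ~ pc.log$x[, c(1, 2, 3, 5)]))

The output was as follows: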

                              Estimate Std. Error t value Pr(>|t|)
(Intercept)                   0.589744   0.064380   9.160 1.05e-10 ***
pc.log$x[, c(1, 2, 3, 5)]PC1  0.005793   0.002394   2.420  0.02101 *
pc.log$x[, c(1, 2, 3, 5)]PC2 -0.018338   0.008821  -2.079  0.04525 *
pc.log$x[, c(1, 2, 3, 5)]PC3  0.029741   0.012025   2.473  0.01854 *
pc.log$x[, c(1, 2, 3, 5)]PC5  0.038643   0.013596   2.842  0.00752 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4021 on 34 degrees of freedom
Multiple R-squared: 0.4175,     Adjusted R-squared: 0.349
F-statistic: 6.093 on 4 and 34 DF,  p-value: 0.0008277

Adjusted R-squared increases to 0.349. This seems like the best model obtainable with this approach.