From this, it seems reasonable to use the logs of the measurements, though below I also try using the measurements without taking logs.
I concluded that there is no compelling reason to delete any genes or subjects as outliers.
PC1 0.31 PC2 -0.29 PC3 0.30 PC4 0.11 PC5 0.37 PC6 0.08 PC7 0.23 PC8 0.01 PC9 0.24 PC10 -0.03PC1, PC2, PC3, and PC5 have the highest correlation.
Pairwise scatterplots for PC1 to PC5 with recur marked by colour (1=red) as shown here. Visually, there seems to be some information about recurrence in pairs such as PC1 and PC3, though the patients with recurrence are certainly not clearly separated.
(These pairwise scatterplots also show a couple points that might be regarded as outliers - those with the smallest projections on PC1 and PC5 - but their distance does not seem extreme enough to remove them.)
I did least squares regression for recur with the projections on the first five principal components as covariates, with principal components found in the four ways above. The adjusted R-Squared values were as follows:
covariance, no logs: 0.1645 correlation, no logs: 0.1041 covariance, logs: 0.3334 correlation, logs: 0.3315
I therefore looked at the principal components found from the covariance matrix of the log expression levels in the remainder of the analysis.
The output for this regression was as follows:
Estimate Std. Error t value Pr(>|t|) (Intercept) 0.589744 0.065148 9.052 1.84e-10 *** pc.log$x[, keep]PC1 0.005793 0.002422 2.391 0.0226 * pc.log$x[, keep]PC2 -0.018338 0.008927 -2.054 0.0479 * pc.log$x[, keep]PC3 0.029741 0.012168 2.444 0.0200 * pc.log$x[, keep]PC4 0.005701 0.012642 0.451 0.6549 pc.log$x[, keep]PC5 0.038643 0.013758 2.809 0.0083 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.4068 on 33 degrees of freedom Multiple R-squared: 0.4211, Adjusted R-squared: 0.3334 F-statistic: 4.801 on 5 and 33 DF, p-value: 0.002087The overall p-value of 0.002087 indicates strong evidence that there is real predictive value in these covariates.
However, one might question the normality assumption behind this p-value. I therefore did regressions with 999 random permutations of the "recur" variable, to see how unusual the adjusted R-squared value of 0.3334 found above is when there is no real relationship. This provided a p-value of 0.002, confirming (without making a normality assumption) the p-value from the output above.
Since the p-value for PC4 in the output above is large, one might consider a regression model using only PC1, PC2, PC3, and PC5. Here is the output for this:
Estimate Std. Error t value Pr(>|t|) (Intercept) 0.589744 0.064380 9.160 1.05e-10 *** pc.log$x[, c(1, 2, 3, 5)]PC1 0.005793 0.002394 2.420 0.02101 * pc.log$x[, c(1, 2, 3, 5)]PC2 -0.018338 0.008821 -2.079 0.04525 * pc.log$x[, c(1, 2, 3, 5)]PC3 0.029741 0.012025 2.473 0.01854 * pc.log$x[, c(1, 2, 3, 5)]PC5 0.038643 0.013596 2.842 0.00752 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.4021 on 34 degrees of freedom Multiple R-squared: 0.4175, Adjusted R-squared: 0.349 F-statistic: 6.093 on 4 and 34 DF, p-value: 0.0008277Adjusted R-squared increases to 0.349. This seems like the best model obtainable when using this approach.