This assignment relates to Part II of Assignment 1, a solution for which is here.

> print(summary(ls.fit))

                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          0.589744   0.065148   9.052 1.84e-10 ***
pc.log$x[, keep]PC1  0.005793   0.002422   2.391   0.0226 *
pc.log$x[, keep]PC2 -0.018338   0.008927  -2.054   0.0479 *
pc.log$x[, keep]PC3  0.029741   0.012168   2.444   0.0200 *
pc.log$x[, keep]PC4  0.005701   0.012642   0.451   0.6549
pc.log$x[, keep]PC5  0.038643   0.013758   2.809   0.0083 **

> print(summary(logistic.fit))

                     Estimate Std. Error z value Pr(>|z|)
(Intercept)          0.60063    0.50987   1.178   0.2388
pc.log$x[, keep]PC1  0.03939    0.02292   1.719   0.0856 .
pc.log$x[, keep]PC2 -0.13266    0.08022  -1.654   0.0982 .
pc.log$x[, keep]PC3  0.23607    0.13007   1.815   0.0695 .
pc.log$x[, keep]PC4  0.04506    0.10139   0.444   0.6567
pc.log$x[, keep]PC5  0.26632    0.13109   2.032   0.0422 *

    Null deviance: 52.802 on 38 degrees of freedom
Residual deviance: 31.903 on 33 degrees of freedom
AIC: 43.903

Both models agree that there is not significant evidence that PC4 is useful in the regression (given the other PCs used).

This plot shows how the predictions from least squares and logistic regression compare. They are quite similar, except that logistic regression "bends" them so that they never go outside the range of 0 to 1, which also changes less extreme predictions somewhat.

The regression coefficients for least squares are all (except the intercept) about 0.13 times the regression coefficients for logistic regression. The derivative of the function 1/(1+exp(-v)), used by logistic regression to translate a linear combination, v, of the covariates to a probability, is 0.25 at v=0 (corresponding to a probability of 0.5). If all the probabilities were near 0.5, one might expect the least squares regression coefficients to be smaller by this factor of 0.25. But it seems that since some probabilities are near 0 or 1 the least squares coefficients are even smaller than this, to reduce the degree to which least squares produces predictions much less than 0 or much greater than 1.

Eliminating PC4 as a covariate gives the following logistic regression model:
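The scaling argument above can be checked numerically. A minimal sketch in Python (used here just for illustration, since the analysis itself is in R), showing that 0.25 is the maximum slope of the logistic function and that the slope is much smaller at extreme probabilities:

```python
import math

def logistic(v):
    """Logistic function mapping a linear predictor v to a probability."""
    return 1.0 / (1.0 + math.exp(-v))

def logistic_slope(v):
    """Derivative of the logistic function, which equals p * (1 - p)."""
    p = logistic(v)
    return p * (1.0 - p)

# The slope is largest at v = 0 (probability 0.5), where it equals 0.25.
print(logistic_slope(0.0))   # 0.25
# At more extreme probabilities the slope is much smaller, which is why
# the least squares coefficients shrink below the 0.25 factor when some
# fitted probabilities are near 0 or 1.
print(logistic_slope(3.0))   # about 0.045
```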

> print(summary(glm(recur~pc.log$x[,c(1,2,3,5)],family=binomial)))

                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                   0.53991    0.47450   1.138   0.2552
pc.log$x[, c(1, 2, 3, 5)]PC1  0.03981    0.02203   1.807   0.0708 .
pc.log$x[, c(1, 2, 3, 5)]PC2 -0.12800    0.07844  -1.632   0.1027
pc.log$x[, c(1, 2, 3, 5)]PC3  0.22263    0.12025   1.851   0.0641 .
pc.log$x[, c(1, 2, 3, 5)]PC5  0.27815    0.13180   2.110   0.0348 *

    Null deviance: 52.802 on 38 degrees of freedom
Residual deviance: 32.106 on 34 degrees of freedom
AIC: 42.106

That the AIC is smaller than when using all of the first five PCs confirms that PC4 is better left out of the model. The p-value for PC2 is now 0.1027, which might be regarded as non-significant, so one might try eliminating it as well. However, the AIC then goes up to 43.021, so it seems that PC1, PC2, PC3, and PC5 may be the best subset.
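For these glm fits, R's AIC is just the residual deviance plus twice the number of estimated coefficients, so the comparison can be reproduced by hand from the output above. A small sketch (in Python, for illustration):

```python
def aic(residual_deviance, n_coefficients):
    """AIC for a glm fit: residual deviance plus 2 times the number of parameters."""
    return residual_deviance + 2 * n_coefficients

# Full model: intercept plus PC1..PC5, so 6 coefficients.
print(aic(31.903, 6))   # 43.903
# PC4 dropped: intercept plus PC1, PC2, PC3, PC5, so 5 coefficients.
print(aic(32.106, 5))   # 42.106
# The deviance rose slightly, but the penalty fell by 2, so the AIC is smaller.
```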

     level t-test rej Bonferroni rej
[1,] 0.100        989              0
[2,] 0.050        735              0
[3,] 0.020        384              0
[4,] 0.010        219              0
[5,] 0.005         89              0

The second column above is the number of rejections with no correction for multiple testing. The third column is with the Bonferroni correction (i.e., multiplying all p-values by 1644), which leads to no rejections at all, even at a significance level of 0.1 (the smallest adjusted p-value is 0.608). Clearly, the adjustment has drastically reduced our ability to identify genes associated with recurrence.
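The Bonferroni adjustment simply multiplies each p-value by the number of tests (capping the result at one) before comparing to the significance level. A minimal sketch, using made-up p-values rather than the real 1644 gene-wise t-test p-values:

```python
def bonferroni_rejections(p_values, level):
    """Count rejections after multiplying each p-value by the number of tests."""
    m = len(p_values)
    return sum(1 for p in p_values if min(1.0, m * p) <= level)

# Hypothetical example with 1644 tests: only p-values below about
# level / 1644 survive the correction.
p = [0.00001, 0.0004, 0.01] + [0.5] * 1641
print(bonferroni_rejections(p, 0.05))   # 1, since only 0.00001 * 1644 <= 0.05
```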

     FDR bound count est count
[1,]     0.100   612      1464
[2,]     0.050     0      1234
[3,]     0.020     0       676
[4,]     0.010     0         0
[5,]     0.005     0         0

Here, the second column contains the number of rejections when the FDR is bounded by the levels in the first column, using the total number of genes (1644) as the bound on the number of true null hypotheses. The third column uses the estimate of 313.3 true null hypotheses found by looking at the number of p-values greater than 0.7. Clearly, using this estimate for the number of true nulls allows many more genes to be identified as being possibly associated with recurrence. However, this estimate could be far from correct if there is dependence between the tests, so one should be cautious about results found using it. Even using the total number of genes as a bound on the number of true null hypotheses allows one to get more useful results than when using the Bonferroni correction, if one is willing to accept an FDR bound of 0.1, since 612 genes are then selected as being possibly associated with recurrence.
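Both columns can be produced with the Benjamini-Hochberg procedure; the only difference is whether the number of true nulls is taken to be the total number of tests or is estimated from the upper tail of the p-value distribution (the count of p-values above 0.7, divided by 0.3). A sketch of both steps, on deterministic hypothetical p-values rather than the real gene data:

```python
def estimate_true_nulls(p_values, threshold=0.7):
    """Estimate the number of true nulls from the p-values above a threshold,
    assuming true-null p-values are uniform on (0, 1)."""
    return sum(1 for p in p_values if p > threshold) / (1.0 - threshold)

def bh_rejections(p_values, fdr_bound, m0):
    """Benjamini-Hochberg: reject the k smallest p-values, where k is the
    largest index with p_(k) <= k * fdr_bound / m0."""
    sorted_p = sorted(p_values)
    k = 0
    for i, p in enumerate(sorted_p, start=1):
        if p <= i * fdr_bound / m0:
            k = i
    return k

# Hypothetical mixture: 60 strongly significant tests plus 100 evenly
# spread "true null" p-values.
p_signal = [1e-6] * 60
p_null = [(i + 0.5) / 100 for i in range(100)]
p = p_signal + p_null

m0_hat = estimate_true_nulls(p)     # 30 p-values above 0.7, so 30 / 0.3
print(round(m0_hat))                            # 100
print(bh_rejections(p, 0.05, len(p)))           # 62 rejections with m0 = 160
print(bh_rejections(p, 0.05, m0_hat))           # 63 rejections with m0 = 100
```

Using the smaller estimated null count loosens every threshold, so it can only reject as many or more tests; that is why the "est count" column dominates the "count" column.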

The results here are a bit puzzling, however. If one looks at the histogram of the p-values, one sees a peak near zero, as expected if some null hypotheses are false, but the peak actually drops in the vicinity of zero, as seen in these plots. A consequence is that the estimated FDR when rejecting only a few genes is actually higher than when rejecting somewhat more genes.

Logistic regression with the first five principal components found from this adjusted data gives a slightly smaller AIC than for the first five principal components of the unadjusted data:

> print(summary(glm(recur~norm.pc.log$x[,keep],family=binomial)))

                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                0.68338    0.53140   1.286   0.1984
norm.pc.log$x[, keep]PC1  -0.20775    0.09042  -2.298   0.0216 *
norm.pc.log$x[, keep]PC2  -0.01389    0.10770  -0.129   0.8974
norm.pc.log$x[, keep]PC3   0.31444    0.14848   2.118   0.0342 *
norm.pc.log$x[, keep]PC4  -0.28574    0.15398  -1.856   0.0635 .
norm.pc.log$x[, keep]PC5  -0.05351    0.16688  -0.321   0.7485

    Null deviance: 52.802 on 38 degrees of freedom
Residual deviance: 30.064 on 33 degrees of freedom
AIC: 42.064

With these PCs, it seems that PC2 and PC5 might be eliminated, which gives the following model:

> print(summary(glm(recur~norm.pc.log$x[,c(1,3,4)],family=binomial)))

                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                      0.64131    0.51571   1.244   0.2137
norm.pc.log$x[, c(1, 3, 4)]PC1  -0.19721    0.08112  -2.431   0.0151 *
norm.pc.log$x[, c(1, 3, 4)]PC3   0.29804    0.13861   2.150   0.0315 *
norm.pc.log$x[, c(1, 3, 4)]PC4  -0.27888    0.14847  -1.878   0.0603 .

    Null deviance: 52.802 on 38 degrees of freedom
Residual deviance: 30.168 on 35 degrees of freedom
AIC: 38.168

This produces a more substantial reduction in AIC compared to what seemed like the best model using the unadjusted data. This is evidence that the adjustment has improved the usefulness of the data.

The t tests for association of genes with recurrence using adjusted data come out as follows:

     level t-test rej Bonferroni rej
[1,] 0.100        517              4
[2,] 0.050        359              1
[3,] 0.020        239              0
[4,] 0.010        168              0
[5,] 0.005        103              0

There are now a few rejections even with the Bonferroni correction. Curiously, there are now fewer rejections at every level when no correction for multiple testing is applied, even though there are more rejections with the Bonferroni correction. The FDR results for the adjusted data are as follows:

     FDR bound count est count
[1,]     0.100   171       257
[2,]     0.050    33        97
[3,]     0.020     0         7
[4,]     0.010     0         0
[5,]     0.005     0         0

The number of genes identified with FDR bound of 0.1 has gone down from 612 to 171, but the number identified with FDR bound of 0.05 has gone up from 0 to 33. The numbers of genes identified using the estimated number of true null hypotheses have gone down at all FDR levels.

The previous results based on the estimate for the number of true null hypotheses using the unadjusted data seem to have been misleading, with too many hypotheses rejected at FDR levels of 0.02 and above (considering that after adjustment, the number of true null hypotheses is estimated at 1040, leaving only 604 false null hypotheses). Even the results on the unadjusted data using the FDR bound look like they may be inconsistent with the results on adjusted data.

This is not unexpected if the tests with the unadjusted data were not
independent. We can now look at a possible source of dependence by
seeing whether the *average* log expression level for each
patient (which we subtracted when adjusting the data) is itself
associated with recurrence. A two-sample t test of this gives a
p-value of 0.04, so there is a substantial observed association with
recurrence. If variation in this average results simply from
variation in the measurement process, one would not expect it to be
associated with recurrence. Perhaps there is something else going on
with this patient-to-patient variation that explains this association,
or perhaps there is no real association (a p-value of 0.04 is not
compelling evidence that the null hypothesis is false).
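The two-sample t test here compares the per-patient average log expression between the recurrence and non-recurrence groups. A sketch of the test statistic (Welch's form; in pure Python for illustration, with hypothetical averages; the p-value itself would come from the t distribution, as in R's t.test):

```python
import math

def welch_t(x, y):
    """Welch's two-sample t statistic: difference of means over its standard error."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)   # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)   # sample variance of y
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

# Hypothetical per-patient averages for the two groups.
print(round(welch_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]), 3))   # -3.674
```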

Inclusion of this possibly spurious source of association with recurrence in all the gene measurements explains why with the unadjusted data there were more genes that seemed significant at moderate significance levels. This is also consistent with the number of genes with highly significant associations with recurrence being larger in the adjusted data, with this source of random variation removed. The adjustment also removes the puzzling drop in numbers of p-values very near to zero.
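The quadratic models fitted next augment the design matrix with the squares of the PC scores (the I(x^2) terms in the R formulas). The feature construction itself is simple; a sketch with two hypothetical rows of PC scores:

```python
def add_squares(rows):
    """Append the square of each feature to every row of a design matrix."""
    return [row + [v ** 2 for v in row] for row in rows]

# Two hypothetical rows of PC scores.
print(add_squares([[1.0, -2.0], [0.5, 3.0]]))
# [[1.0, -2.0, 1.0, 4.0], [0.5, 3.0, 0.25, 9.0]]
```

Because the squared terms are symmetric about zero, they let the fitted probability depend on how far a score is from zero in either direction, which is the behavior speculated about at the end of this section.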

> print(summary(glm(recur~norm.pc.log$x[,keep]+I(norm.pc.log$x[,keep]^2),
+                   family=binomial)))

                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                    -0.213748   1.155742  -0.185   0.8533
norm.pc.log$x[, keep]PC1       -0.609584   0.313532  -1.944   0.0519 .
norm.pc.log$x[, keep]PC2       -0.127984   0.254836  -0.502   0.6155
norm.pc.log$x[, keep]PC3       -0.191525   0.293070  -0.654   0.5134
norm.pc.log$x[, keep]PC4       -0.186912   0.292449  -0.639   0.5227
norm.pc.log$x[, keep]PC5        0.156290   0.451152   0.346   0.7290
I(norm.pc.log$x[, keep]^2)PC1  -0.011171   0.026018  -0.429   0.6677
I(norm.pc.log$x[, keep]^2)PC2  -0.007251   0.023218  -0.312   0.7548
I(norm.pc.log$x[, keep]^2)PC3  -0.100200   0.069262  -1.447   0.1480
I(norm.pc.log$x[, keep]^2)PC4   0.005439   0.030953   0.176   0.8605
I(norm.pc.log$x[, keep]^2)PC5   0.362552   0.195782   1.852   0.0641 .

    Null deviance: 52.802 on 38 degrees of freedom
Residual deviance: 16.555 on 28 degrees of freedom
AIC: 38.555

The AIC is smaller than when just the five PCs were used, indicating that perhaps quadratic terms are useful. We had previously selected a subset of just PC1, PC3, and PC4 for use, so we can try including only these, along with all the squares:

> print(summary(glm(recur~norm.pc.log$x[,c(1,3,4)] + I(norm.pc.log$x[,keep]^2),
+                   family=binomial)))

                                 Estimate Std. Error z value Pr(>|z|)
(Intercept)                      0.020475   1.126388   0.018   0.9855
norm.pc.log$x[, c(1, 3, 4)]PC1  -0.574085   0.290241  -1.978   0.0479 *
norm.pc.log$x[, c(1, 3, 4)]PC3  -0.117326   0.274481  -0.427   0.6691
norm.pc.log$x[, c(1, 3, 4)]PC4  -0.159480   0.269057  -0.593   0.5534
I(norm.pc.log$x[, keep]^2)PC1   -0.014016   0.020094  -0.697   0.4855
I(norm.pc.log$x[, keep]^2)PC2   -0.009299   0.021097  -0.441   0.6594
I(norm.pc.log$x[, keep]^2)PC3   -0.092146   0.070892  -1.300   0.1937
I(norm.pc.log$x[, keep]^2)PC4    0.005241   0.027456   0.191   0.8486
I(norm.pc.log$x[, keep]^2)PC5    0.335011   0.170798   1.961   0.0498 *

    Null deviance: 52.802 on 38 degrees of freedom
Residual deviance: 17.050 on 30 degrees of freedom
AIC: 35.05

This gives a further reduction in AIC. The p-values suggest that perhaps we should retain only PC1 and the squares of PC3 and PC5:

> print(summary(glm(recur~norm.pc.log$x[,1] + I(norm.pc.log$x[,c(3,5)]^2),
+                   family=binomial)))

                                  Estimate Std. Error z value Pr(>|z|)
(Intercept)                        0.43047    0.81140   0.531  0.59575
norm.pc.log$x[, 1]                -0.51577    0.19569  -2.636  0.00840 **
I(norm.pc.log$x[, c(3, 5)]^2)PC3  -0.09265    0.03661  -2.531  0.01139 *
I(norm.pc.log$x[, c(3, 5)]^2)PC5   0.25330    0.09664   2.621  0.00876 **

    Null deviance: 52.802 on 38 degrees of freedom
Residual deviance: 18.824 on 35 degrees of freedom
AIC: 26.824

This gives a substantial reduction in AIC. It seems like this may be the best model we can get from this data.

One might speculate that the usefulness of quadratic terms could indicate that for some genes cancer recurrence is not related to whether the level of gene expression is high or low, but rather to whether it is different in either direction from a more normal level of expression, which could come about because normal mechanisms of gene regulation have been lost in these cells.