STA 437 Assignment 2, Fall 2010 Solution

See the assignment handout here. The R commands used are here (Unix/Linux/Mac) or here (MS Windows).

This assignment relates to Part II of Assignment 1, a solution for which is here.

Part 1: logistic regression

A logistic regression model using the first five principal components produces results that correspond quite well with the results using least squares, as can be seen from the (edited) output below:
> print(summary(ls.fit))
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)          0.589744   0.065148   9.052 1.84e-10 ***
pc.log$x[, keep]PC1  0.005793   0.002422   2.391   0.0226 *
pc.log$x[, keep]PC2 -0.018338   0.008927  -2.054   0.0479 *
pc.log$x[, keep]PC3  0.029741   0.012168   2.444   0.0200 *
pc.log$x[, keep]PC4  0.005701   0.012642   0.451   0.6549
pc.log$x[, keep]PC5  0.038643   0.013758   2.809   0.0083 **

> print(summary(logistic.fit))
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          0.60063    0.50987   1.178   0.2388
pc.log$x[, keep]PC1  0.03939    0.02292   1.719   0.0856 .
pc.log$x[, keep]PC2 -0.13266    0.08022  -1.654   0.0982 .
pc.log$x[, keep]PC3  0.23607    0.13007   1.815   0.0695 .
pc.log$x[, keep]PC4  0.04506    0.10139   0.444   0.6567
pc.log$x[, keep]PC5  0.26632    0.13109   2.032   0.0422 *

    Null deviance: 52.802  on 38  degrees of freedom
Residual deviance: 31.903  on 33  degrees of freedom
AIC: 43.903
Both models agree that there is no significant evidence that PC4 is useful in the regression (given the other PCs used).
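
For reference, here is a sketch of how such fits might be produced (hypothetical names: expr.log is the 1644-by-39 matrix of log expression levels, one row per gene, and recur the 0/1 recurrence indicator):

pc.log <- prcomp(t(expr.log))   # principal components over the 39 patients
keep <- 1:5                     # use the first five PCs
ls.fit <- lm(recur ~ pc.log$x[,keep])                          # least squares
logistic.fit <- glm(recur ~ pc.log$x[,keep], family=binomial)  # logistic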

This plot shows how the predictions from least squares and logistic regression compare. They are quite similar, except that logistic regression "bends" them so that they never go outside the range of 0 to 1, which also changes less extreme predictions somewhat.
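
A sketch of such a comparison plot (model objects as in the sketch above):

plot(fitted(ls.fit), fitted(logistic.fit),
     xlab="least squares predictions", ylab="logistic regression predictions")
abline(0, 1)   # line of equality, for reference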

The regression coefficients for least squares are all (except the intercept) about 0.13 times the regression coefficients for logistic regression. The derivative of the function 1/(1+exp(-v)), used by logistic regression to translate a linear combination, v, of the covariates to a probability, is 0.25 at v=0 (corresponding to a probability of 0.5). If all the probabilities were near 0.5, one might expect the least squares regression coefficients to be smaller by this factor of 0.25. But it seems that since some probabilities are near 0 or 1, the least squares coefficients are even smaller than this, to reduce the degree to which least squares produces predictions much less than 0 or much greater than 1.
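
As a quick check (a sketch, using the model objects from the sketch above; dlogis is the derivative of plogis, the logistic function):

dlogis(0)                                   # 0.25, the slope at probability 0.5
coef(ls.fit)[-1] / coef(logistic.fit)[-1]   # ratios of roughly 0.13 to 0.15
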
Eliminating PC4 as a covariate gives the following logistic regression model:
> print(summary(glm(recur~pc.log$x[,c(1,2,3,5)],family=binomial)))
                             Estimate Std. Error z value Pr(>|z|)
(Intercept)                   0.53991    0.47450   1.138   0.2552
pc.log$x[, c(1, 2, 3, 5)]PC1  0.03981    0.02203   1.807   0.0708 .
pc.log$x[, c(1, 2, 3, 5)]PC2 -0.12800    0.07844  -1.632   0.1027
pc.log$x[, c(1, 2, 3, 5)]PC3  0.22263    0.12025   1.851   0.0641 .
pc.log$x[, c(1, 2, 3, 5)]PC5  0.27815    0.13180   2.110   0.0348 *

    Null deviance: 52.802  on 38  degrees of freedom
Residual deviance: 32.106  on 34  degrees of freedom
AIC: 42.106
That the AIC is smaller than when using all of the first five PCs confirms that PC4 is better left out of the model. The p-value for PC2 is now 0.1027, which might be regarded as non-significant, so one might try eliminating it as well. However, the AIC then goes up to 43.021, so it seems that PC1, PC2, PC3, and PC5 may be the best subset.
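
This can be checked directly; for instance, dropping PC2 as well (a sketch, with names as above):

AIC(glm(recur ~ pc.log$x[,c(1,3,5)], family=binomial))   # goes up to 43.021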

Part 2: t tests with Bonferroni correction

Testing significance of all 1644 genes for association with cancer recurrence using two-sample univariate t tests (variance not assumed equal) gives the following numbers of rejections (ie, of significant genes) at various significance levels:
     level t-test rej Bonferroni rej
[1,] 0.100        989              0
[2,] 0.050        735              0
[3,] 0.020        384              0
[4,] 0.010        219              0
[5,] 0.005         89              0
The second column above is the number of rejections with no correction for multiple testing. The third column is with the Bonferroni correction (ie, multiplying all p-values by 1644), which leads to no rejections at all, even at a significance level of 0.1 (the smallest adjusted p-value is 0.608). Clearly, this correction has drastically reduced our ability to identify genes associated with recurrence.
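
A sketch of how such a scan might be done (names as in the earlier sketch; note that t.test uses the Welch test, with variances not assumed equal, by default):

pv <- apply(expr.log, 1, function(g)                 # p-value for each gene
        t.test(g[recur==1], g[recur==0])$p.value)
levels <- c(0.1, 0.05, 0.02, 0.01, 0.005)
sapply(levels, function(a) sum(pv < a))                          # uncorrected
sapply(levels, function(a) sum(p.adjust(pv,"bonferroni") < a))   # Bonferroni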

Part 3: False Discovery Rates for t tests

Looking instead at the number of genes flagged as being possibly associated with recurrence at different levels for False Discovery Rate (FDR) gives the following results:
       FDR bound count est count
[1,] 0.100         612      1464
[2,] 0.050           0      1234
[3,] 0.020           0       676
[4,] 0.010           0         0
[5,] 0.005           0         0
Here, the second column contains the number of rejections when the FDR is bounded by the levels in the first column, using the total number of genes (1644) as the bound on the number of true null hypotheses. The third column uses the estimate of 313.3 true null hypotheses found by looking at the number of p-values greater than 0.7. Clearly, using this estimate for the number of true nulls allows many more genes to be identified as being possibly associated with recurrence. However, this estimate could be far from correct if there is dependence between the tests, so one should be cautious about results found using it. Even using the total number of genes as a bound on the number of true null hypotheses allows one to get more useful results than when using the Bonferroni correction, if one is willing to accept an FDR bound of 0.1, since 612 genes are then selected as being possibly associated with recurrence.
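
A sketch of these calculations (pv as in the earlier sketch; the estimated number of true nulls divides the count of p-values above 0.7 by 0.3, giving about 313.3 here):

m <- length(pv)                              # 1644 genes
m0.est <- sum(pv > 0.7) / 0.3                # estimated number of true nulls
p.sorted <- sort(pv)
fdr.count <- function(m0, q)                 # largest k with m0*p_(k)/k <= q
  max(0, which(m0 * p.sorted / seq_along(p.sorted) <= q))
bounds <- c(0.1, 0.05, 0.02, 0.01, 0.005)
sapply(bounds, function(q) fdr.count(m, q))       # total genes as the bound
sapply(bounds, function(q) fdr.count(m0.est, q))  # using the estimate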

The results here are a bit puzzling, however. If one looks at the histogram of the p-values, one sees a peak near zero, as expected if some null hypotheses are false, but the counts actually drop in the immediate vicinity of zero, as seen in these plots. A consequence is that the estimated FDR when rejecting only a few genes is actually higher than when rejecting somewhat more genes.
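
A sketch of such a histogram (pv as above):

hist(pv, breaks=50, xlab="p-value", main="P-values from per-gene t tests")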

Part 4: adjustment of data for each patient

After adjusting the data to subtract the overall mean of the log expression level for each patient from that patient's measurements, I redid the analyses above.
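
A sketch of this adjustment and the subsequent PCA (expr.log as in the earlier sketches, with one column of log expression levels per patient):

norm.expr.log <- sweep(expr.log, 2, colMeans(expr.log))  # subtract patient means
norm.pc.log <- prcomp(t(norm.expr.log))                  # PCs of adjusted data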

Logistic regression with the first five principal components found from this adjusted data gives a slightly smaller AIC than for the first five principal components of the unadjusted data:

> print(summary(glm(recur~norm.pc.log$x[,keep],family=binomial)))
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)               0.68338    0.53140   1.286   0.1984
norm.pc.log$x[, keep]PC1 -0.20775    0.09042  -2.298   0.0216 *
norm.pc.log$x[, keep]PC2 -0.01389    0.10770  -0.129   0.8974
norm.pc.log$x[, keep]PC3  0.31444    0.14848   2.118   0.0342 *
norm.pc.log$x[, keep]PC4 -0.28574    0.15398  -1.856   0.0635 .
norm.pc.log$x[, keep]PC5 -0.05351    0.16688  -0.321   0.7485

    Null deviance: 52.802  on 38  degrees of freedom
Residual deviance: 30.064  on 33  degrees of freedom
AIC: 42.064
With these PCs, it seems that PC2 and PC5 might be eliminated, which gives the following model:
> print(summary(glm(recur~norm.pc.log$x[,c(1,3,4)],family=binomial)))
                               Estimate Std. Error z value Pr(>|z|)
(Intercept)                     0.64131    0.51571   1.244   0.2137
norm.pc.log$x[, c(1, 3, 4)]PC1 -0.19721    0.08112  -2.431   0.0151 *
norm.pc.log$x[, c(1, 3, 4)]PC3  0.29804    0.13861   2.150   0.0315 *
norm.pc.log$x[, c(1, 3, 4)]PC4 -0.27888    0.14847  -1.878   0.0603 .

    Null deviance: 52.802  on 38  degrees of freedom
Residual deviance: 30.168  on 35  degrees of freedom
AIC: 38.168
This produces a more substantial reduction in AIC compared to what seemed like the best model using the unadjusted data. This is evidence that the adjustment has improved the usefulness of the data.

The t tests for association of genes with recurrence using adjusted data come out as follows:

     level t-test rej Bonferroni rej
[1,] 0.100        517              4
[2,] 0.050        359              1
[3,] 0.020        239              0
[4,] 0.010        168              0
[5,] 0.005        103              0
There are now a few rejections even with the Bonferroni correction. Curiously, there are now fewer rejections without the Bonferroni correction at significance levels of 0.01 and higher, but there are more rejections at significance level 0.005. The estimated number of true null hypotheses from looking at p-values greater than 0.7 increases to 1040. The FDR results using the total number of genes as a bound on the number of true null hypotheses, and using this estimate of the number of true null hypotheses, are as follows:
       FDR bound count est count
[1,] 0.100         171       257
[2,] 0.050          33        97
[3,] 0.020           0         7
[4,] 0.010           0         0
[5,] 0.005           0         0
The number of genes identified with FDR bound of 0.1 has gone down from 612 to 171, but the number identified with FDR bound of 0.05 has gone up from 0 to 33. The numbers of genes identified using the estimated number of true null hypotheses has gone down at all FDR levels.

The previous results based on the estimate for the number of true null hypotheses using the unadjusted data seem to have been misleading, with too many hypotheses rejected at FDR levels of 0.02 and above (considering that after adjustment, the number of true null hypotheses is estimated at 1040, leaving only 604 false null hypotheses). Even the results on the unadjusted data using the FDR bound look like they may be inconsistent with the results on adjusted data.

This is not unexpected if the tests with the unadjusted data were not independent. We can now look at a possible source of dependence by seeing whether the average log expression level for each patient (which we subtracted when adjusting the data) is itself associated with recurrence. A two-sample t test of this gives a p-value of 0.04, so there is a substantial observed association with recurrence. If variation in this average results simply from variation in the measurement process, one would not expect it to be associated with recurrence. Perhaps there is something else going on with this patient-to-patient variation that explains this association, or perhaps there is no real association (a p-value of 0.04 is not compelling evidence that the null hypothesis is false).
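
A sketch of this test (names as in the earlier sketches):

avg <- colMeans(expr.log)                 # mean log expression for each patient
t.test(avg[recur==1], avg[recur==0])      # gives a p-value of about 0.04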

Inclusion of this possibly spurious source of association with recurrence in all the gene measurements explains why with the unadjusted data there were more genes that seemed significant at moderate significance levels. This is also consistent with the number of genes with highly significant associations with recurrence being larger in the adjusted data, with this source of random variation removed. The adjustment also removes the puzzling drop in numbers of p-values very near to zero.

Part 5: logistic regression with quadratic terms

I used the adjusted data when investigating models with quadratic terms, since the discussion above suggests that it is better. We can begin with a logistic regression model with both the first five PCs and their squares as covariates:
> print(summary(glm(recur~norm.pc.log$x[,keep]+I(norm.pc.log$x[,keep]^2),
+                   family=binomial)))
                               Estimate Std. Error z value Pr(>|z|)
(Intercept)                   -0.213748   1.155742  -0.185   0.8533
norm.pc.log$x[, keep]PC1      -0.609584   0.313532  -1.944   0.0519 .
norm.pc.log$x[, keep]PC2      -0.127984   0.254836  -0.502   0.6155
norm.pc.log$x[, keep]PC3      -0.191525   0.293070  -0.654   0.5134
norm.pc.log$x[, keep]PC4      -0.186912   0.292449  -0.639   0.5227
norm.pc.log$x[, keep]PC5       0.156290   0.451152   0.346   0.7290
I(norm.pc.log$x[, keep]^2)PC1 -0.011171   0.026018  -0.429   0.6677
I(norm.pc.log$x[, keep]^2)PC2 -0.007251   0.023218  -0.312   0.7548
I(norm.pc.log$x[, keep]^2)PC3 -0.100200   0.069262  -1.447   0.1480
I(norm.pc.log$x[, keep]^2)PC4  0.005439   0.030953   0.176   0.8605
I(norm.pc.log$x[, keep]^2)PC5  0.362552   0.195782   1.852   0.0641 .

    Null deviance: 52.802  on 38  degrees of freedom
Residual deviance: 16.555  on 28  degrees of freedom
AIC: 38.555
The AIC is smaller than when just the five PCs were used, indicating that perhaps quadratic terms are useful. We had previously selected a subset of just PC1, PC3, and PC4 for use, so we can try including only these, along with all the squares:
> print(summary(glm(recur~norm.pc.log$x[,c(1,3,4)] + I(norm.pc.log$x[,keep]^2),
+                   family=binomial)))
                                Estimate Std. Error z value Pr(>|z|)
(Intercept)                     0.020475   1.126388   0.018   0.9855
norm.pc.log$x[, c(1, 3, 4)]PC1 -0.574085   0.290241  -1.978   0.0479 *
norm.pc.log$x[, c(1, 3, 4)]PC3 -0.117326   0.274481  -0.427   0.6691
norm.pc.log$x[, c(1, 3, 4)]PC4 -0.159480   0.269057  -0.593   0.5534
I(norm.pc.log$x[, keep]^2)PC1  -0.014016   0.020094  -0.697   0.4855
I(norm.pc.log$x[, keep]^2)PC2  -0.009299   0.021097  -0.441   0.6594
I(norm.pc.log$x[, keep]^2)PC3  -0.092146   0.070892  -1.300   0.1937
I(norm.pc.log$x[, keep]^2)PC4   0.005241   0.027456   0.191   0.8486
I(norm.pc.log$x[, keep]^2)PC5   0.335011   0.170798   1.961   0.0498 *

    Null deviance: 52.802  on 38  degrees of freedom
Residual deviance: 17.050  on 30  degrees of freedom
AIC: 35.05
This gives a further reduction in AIC. The p-values suggest that perhaps we should retain only PC1 and the squares of PC3 and PC5:
> print(summary(glm(recur~norm.pc.log$x[,1] + I(norm.pc.log$x[,c(3,5)]^2),
+                   family=binomial)))
                                 Estimate Std. Error z value Pr(>|z|)
(Intercept)                       0.43047    0.81140   0.531  0.59575
norm.pc.log$x[, 1]               -0.51577    0.19569  -2.636  0.00840 **
I(norm.pc.log$x[, c(3, 5)]^2)PC3 -0.09265    0.03661  -2.531  0.01139 *
I(norm.pc.log$x[, c(3, 5)]^2)PC5  0.25330    0.09664   2.621  0.00876 **

    Null deviance: 52.802  on 38  degrees of freedom
Residual deviance: 18.824  on 35  degrees of freedom
AIC: 26.824
This gives a substantial reduction in AIC. It seems like this may be the best model we can get from this data.

One might speculate that the usefulness of quadratic terms could indicate that for some genes cancer recurrence is not related to whether the level of gene expression is high or low, but rather to whether it is different in either direction from a more normal level of expression, which could come about because normal mechanisms of gene regulation have been lost in these cells.
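
To illustrate, one might plot the fitted probability of recurrence against PC5 under the final model above (a sketch; PC1 and PC3 are held at zero, and the range shown for PC5 is hypothetical):

fit <- glm(recur ~ norm.pc.log$x[,1] + I(norm.pc.log$x[,c(3,5)]^2),
           family=binomial)
pc5 <- seq(-6, 6, length.out=100)
eta <- coef(fit)[1] + coef(fit)[4] * pc5^2   # linear predictor with PC1=PC3=0
plot(pc5, 1/(1+exp(-eta)), type="l",         # probability rises as PC5 moves
     xlab="PC5", ylab="fitted probability of recurrence")  # away from zero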