DISCUSSION FOR STA 414/2104, ASSIGNMENT 1, 2012

Limits for K and lambda.

As lambda goes to infinity, all the penalized least squares estimates of the beta coefficients are forced to zero, except for beta_0, the intercept, which is not included in the penalty. The estimate of beta_0 therefore goes to the mean of the responses in the training set, since this is the value that minimizes the squared error. The linear model fit is then a horizontal line, and the residuals are just the actual responses minus beta_0. When we average the residuals of the K nearest neighbors and then add in beta_0, we get the same result as the simple K nearest neighbor method.

Once K has increased to the size of the training set, the average of the residuals of the K nearest neighbors is the average of all the residuals, which is always zero. This is because the intercept (beta_0) adjusts to minimize the sum of squared errors: if the average residual were not zero, the sum of squared errors could be made smaller by shifting beta_0 until it is. (Note that this would not be true if beta_0 were penalized.) Since the average residual of the neighbors is zero when K is as big as the training set, the result is the same as simple penalized least squares estimation. Furthermore, if lambda is zero, the result is the same as standard least squares estimation, and if lambda is infinite, the result is the same as always predicting that the response for a test case equals the mean of the responses for the training cases. These limits are summarized below:

  any lambda    K=Inf    penalized least squares
  lambda=Inf    any K    K nearest neighbor
  lambda=0      K=Inf    simple unpenalized least squares
  lambda=Inf    K=Inf    just predict the mean of the training responses

Results on the artificial data.

On the artificial data, K=4 and lambda=0 was best according to the cross-validation assessment.
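The hybrid procedure discussed above — a penalized linear fit followed by averaging the residuals of the K nearest neighbors — can be sketched as below. This is a minimal sketch, not the assignment's actual code: the function and variable names are mine, and the ridge-style penalty on every coefficient except the intercept is assumed from the description.

```python
import numpy as np

def fit_penalized_ls(X, y, lam):
    """Penalized least squares with an unpenalized intercept (ridge-style sketch)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])   # design matrix with intercept column
    P = lam * np.eye(p + 1)
    P[0, 0] = 0.0                          # beta_0 is not included in the penalty
    return np.linalg.solve(A.T @ A + P, A.T @ y)

def hybrid_predict(X_train, y_train, X_test, K, lam):
    """Linear fit plus the average residual of the K nearest training cases."""
    beta = fit_penalized_ls(X_train, y_train, lam)
    linear = lambda X: beta[0] + X @ beta[1:]
    resid = y_train - linear(X_train)      # residuals average to zero (beta_0 unpenalized)
    preds = []
    for x in X_test:
        d = np.sum((X_train - x) ** 2, axis=1)   # squared Euclidean distances
        nearest = np.argsort(d)[:K]
        preds.append(linear(x[None, :])[0] + resid[nearest].mean())
    return np.array(preds)
```

With K equal to the training set size the residual average vanishes and the prediction is the penalized linear fit alone; with very large lambda the linear part collapses to the training mean and the prediction reduces to plain K nearest neighbors, matching the limits tabulated above.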
However, larger values of lambda with K=4 were almost as good, even out to lambda=Inf, which is equivalent to simple nearest neighbor. In contrast, K=Inf (i.e., K set to the number of training cases, equivalent to just using the linear model) did not perform well in the cross-validation assessment, as might be expected given that the relationship of y to x does not seem to be close to a straight line. Results on the test set were similar to the cross-validation assessment, except that K=8 was a bit better than K=4, and with K=8 large values of lambda were slightly better than small values rather than slightly worse. We can summarize the test set performance (lower is better) using the best combination of K and lambda according to cross validation, and using some limiting values of K and lambda, as follows:

  best CV (K=4, lambda=0):         0.03590
  least squares linear model:      0.04179
  best CV nearest neighbor (K=4):  0.03598
  mean of training responses:      0.06286

So on this artificial data set there seems to be no substantial benefit to using this hybrid of a linear model and nearest neighbor; nearest neighbor alone does about as well.

Results on the unscaled ozone data.

On the ozone data without rescaling of the input variables, K=16 and lambda=1000 was best according to the cross-validation assessment. Smaller values of lambda with K=16 also looked quite good, and K=32 with lambda=1000 or less was only slightly worse. Setting K=Inf and lambda=0 (equivalent to simple unpenalized least squares) was not too bad, but does seem to be a bit worse than K=16 and lambda=1000 according to cross validation. The test set results were similar: the best was K=8 and lambda=1000, but the chosen values of K=16 and lambda=1000 were not much worse. The advantage over K=Inf was greater than was seen in the cross-validation assessment.
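The cross-validation assessment referred to throughout can be sketched as a grid search over the tuning parameters. The sketch below uses plain K-nearest-neighbor regression as a stand-in predictor so it is self-contained; the fold scheme, function names, and grid are my own assumptions, not the assignment's actual procedure.

```python
import numpy as np

def knn_predict(X_tr, y_tr, X_te, K):
    """Plain K-nearest-neighbor regression (stand-in for the method being tuned)."""
    preds = []
    for x in X_te:
        d = np.sum((X_tr - x) ** 2, axis=1)
        preds.append(y_tr[np.argsort(d)[:K]].mean())
    return np.array(preds)

def cv_select(X, y, Ks, folds=5):
    """Return the K with smallest cross-validated squared error, and that error."""
    idx = np.arange(len(y))
    best = None
    for K in Ks:
        errs = []
        for f in range(folds):
            val = idx[f::folds]                 # every folds-th case held out
            tr = np.setdiff1d(idx, val)
            pred = knn_predict(X[tr], y[tr], X[val], K)
            errs.append(np.mean((y[val] - pred) ** 2))
        err = np.mean(errs)
        if best is None or err < best[1]:
            best = (K, err)
    return best
```

For the hybrid method, the same loop would simply run over a grid of (K, lambda) pairs rather than K alone.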
We can summarize the test set performance using the best combination of K and lambda according to cross validation, and using some limiting values of K and lambda, as follows:

  best CV (K=16, lambda=1000):     4.38482
  best CV with lambda=0 (K=16):    4.43127
  least squares linear model:      4.65152
  best CV nearest neighbor (K=8):  5.42133
  mean of training responses:      8.47231

Results on the scaled ozone data.

Scaling the inputs to all have mean zero and variance one eliminates any effects of arbitrary choices of units, and might be expected to improve the performance of both nearest neighbor methods and penalty methods (when the penalty isn't zero or infinite). The cross-validation assessment favours K=16 and lambda=100, though the combinations (K=16, lambda=10), (K=16, lambda=1000), (K=8, lambda=100), (K=8, lambda=1000), (K=32, lambda=10), and (K=32, lambda=100) also look good. None of the combinations with lambda=0 look good, so it seems that a non-zero penalty is a good thing. The cross-validation error for these combinations is substantially less than for any combination of K and lambda on the unscaled data, so if we were using cross validation to decide whether to scale the data, we would choose the scaled data. The test set results are similar to the cross-validation assessment: the best combination is K=8 and lambda=100, but the chosen combination of K=16 and lambda=100 is only slightly worse. We can summarize the test set performance using the best combination of K and lambda according to cross validation, and using some limiting values of K and lambda, as follows:

  best CV (K=16, lambda=100):      4.18255
  best CV with lambda=0 (K=16):    4.25541
  least squares linear model:      4.65152
  best CV nearest neighbor (K=8):  4.37762
  mean of training responses:      8.47231

The results for least squares and for the mean of training responses are the same as for the unscaled data, since these methods are not sensitive to scaling.
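The rescaling step described above is straightforward, but one detail matters: the shift and scale must be computed from the training set and then applied unchanged to the test inputs, so both are measured in the same units. A minimal sketch (function name is mine):

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale each input to mean zero, variance one, using training-set statistics.

    The training-set mean and standard deviation are reused for the test
    inputs, so train and test cases stay on a common scale.
    """
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    return (X_train - mu) / sd, (X_test - mu) / sd
```

As noted above, least squares predictions (and predictions of the training mean) are unaffected by such a linear rescaling; only the distance-based nearest-neighbor part and the penalty are sensitive to it.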
We can conclude that on the ozone data, there is a benefit to using this hybrid of a linear model and nearest neighbor. Estimating the linear model coefficients using a penalty also seems to provide a benefit.