STA 414/2104, Spring 2011, Assignment 2 discussion.

I modified the original set of R functions for training an MLP to allow for a quadratic penalty on the weights, as described in the assignment handout. This required changing the mlp.train function to take the two penalty magnitudes (lambda1 and lambda2) as arguments, and to add the gradient of the penalty to the gradient of the log likelihood when updating the weights by gradient ascent. The mlp.cv function was also changed to take these penalty magnitudes as arguments and pass them on to mlp.train. (I slightly shortened the messages output.) The modified functions are in a2-mlp.txt.

The final R commands using these functions for this assignment are in the file a2-script.txt. Some of the commands in this file are the result of looking at earlier, preliminary runs. The output of these final commands is in a2-script-out.txt.

After some trial and error, I chose learning rates of eta1=0.0001 and eta2=0.0002. These learning rates produce stable increases in the log likelihood for almost all of the runs done, but they are not necessarily the optimal choice.

I then did a number of training runs with different numbers of hidden units (M) and penalty magnitudes (lambda1 and lambda2), beginning with the combinations laid out in the assignment handout. Two runs were done for each combination, with weights initialized using random number seeds of 1 and 2. The number of gradient ascent iterations varied; it was chosen to be large enough that either the validation log probability appeared to have reached its maximum before the end of the run, and would not exceed it in future, or the validation log probability had almost stabilized, and would not increase by very much if the run were continued. The network weights at the iteration with the highest log probability for the validation set were used to make predictions for test cases.
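The penalized gradient ascent update described above can be sketched as follows. This is not the actual code in a2-mlp.txt; the function name penalized_step and the weight and gradient names are hypothetical. The penalty subtracted from the log likelihood is (lambda1/2)*sum(W1^2) + (lambda2/2)*sum(W2^2), so its gradient contributes -lambda1*W1 and -lambda2*W2 to the updates:

```r
# One gradient ascent step with a quadratic (L2) penalty on the weights.
# W1, W2: hypothetical input-to-hidden and hidden-to-output weight matrices.
# g1, g2: gradients of the log likelihood with respect to W1 and W2.
# eta1, eta2: learning rates; lambda1, lambda2: penalty magnitudes.
penalized_step <- function(W1, W2, g1, g2, eta1, eta2, lambda1, lambda2) {
  W1 <- W1 + eta1 * (g1 - lambda1 * W1)  # penalty gradient shrinks W1
  W2 <- W2 + eta2 * (g2 - lambda2 * W2)  # penalty gradient shrinks W2
  list(W1 = W1, W2 = W2)
}
```

With zero gradients of the log likelihood, each step just shrinks the weights toward zero, which is the effect of the penalty alone.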
This was done for all runs, since that was most convenient, but of course these test results cannot be used to pick the best network to use - only the results on the validation set can be used for that.

The results with no penalty were as follows (results for the two seeds on two lines):

  #units                     total       best     best val    test
    (M)  lambda1  lambda2  iterations  iteration  log prob   log prob

      2      0        0       3000       2999      -0.230     -0.244
             0        0       3000       3000      -0.231     -0.252
      4      0        0       3000        701      -0.216     -0.242
             0        0       3000        710      -0.212     -0.222
      8      0        0       3000        584      -0.207     -0.231
             0        0       3000        679      -0.203     -0.228
     16      0        0       3000        716      -0.203     -0.227
             0        0       3000        731      -0.195     -0.214

Based on validation log probability, I chose M=16 as the best number of hidden units, though M=8 is only slightly worse. A definitive conclusion would require doing runs with more than two random seeds. However, for runs with a non-zero penalty, one might expect a larger number of hidden units to perform better than when there is no penalty (since the penalty reduces overfitting, which is the potential problem with a large number of hidden units), so choosing M=16 over M=8 is probably the right decision.

The results with M=16 and various penalty magnitudes, as suggested in the handout, were as follows (results for the two seeds on two lines):

  #units                     total       best     best val    test
    (M)  lambda1  lambda2  iterations  iteration  log prob   log prob

     16      1        1       4000        781      -0.201     -0.226
             1        1       4000        775      -0.195     -0.213
     16      3        3       4000       1031      -0.197     -0.224
             3        3       4000        963      -0.193     -0.212
     16      9        9       4000       3573      -0.186     -0.210
             9        9       4000       2272      -0.179     -0.203
     16     27       27       6000       6000      -0.190     -0.210
            27       27       6000       6000      -0.189     -0.207

With a penalty, the best validation log probability is reached after a larger number of iterations. (For lambda1=lambda2=27, the maximum had not been reached after 6000 iterations, but the rate of increase was slow by that point, so further training would probably have led to only a small improvement.)
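The "log prob" columns above are average log probabilities per case. A minimal sketch of how such a figure could be computed for a binary response (the function name and arguments are hypothetical, not taken from a2-mlp.txt):

```r
# Average log probability of the observed responses.
# p: vector of predicted probabilities that the response is 1.
# y: vector of observed 0/1 responses (validation or test cases).
avg_log_prob <- function(p, y) {
  mean(log(ifelse(y == 1, p, 1 - p)))
}
```

For example, predicting probability 0.5 for every case gives an average log probability of log(0.5), about -0.693, which is the baseline any useful model should beat.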
The validation set log probability with M=16 and lambda1=lambda2=9 was quite a bit better than for the best runs with no penalty (-0.186 / -0.179 versus -0.203 / -0.195).

I did several more runs as well. First, since M=8 had also given good results with no penalty, I tried M=8 with lambda1=lambda2=9:

  #units                     total       best     best val    test
    (M)  lambda1  lambda2  iterations  iteration  log prob   log prob

      8      9        9       3000       1952      -0.195     -0.220
             9        9       3000       3000      -0.190     -0.190

The results are good, but not as good as with M=16.

I then tried (with M=16) using different penalty magnitudes for the input-to-hidden weights (lambda1) and the hidden-to-output weights (lambda2), varying them around the values lambda1=lambda2=9 that were best in the runs above:

  #units                     total       best     best val    test
    (M)  lambda1  lambda2  iterations  iteration  log prob   log prob

     16      6       12       4000       2409      -0.190     -0.212
             6       12       4000       2812      -0.192     -0.216
     16     12        6       4000       4000      -0.180     -0.202
            12        6       4000       4000      -0.176     -0.210
     16     16        4       4000       3903      -0.176     -0.207
            16        4       4000       1542      -0.189     -0.210

The runs with lambda1=12 and lambda2=6 seem best.

Based on validation results in the above runs, I would choose the run with seed 2 for M=16, lambda1=12, and lambda2=6. The test log probability with this choice is -0.210. This is slightly worse than the best test log probability from all runs above, which is -0.202, but there would be no apparent way of choosing that run based only on the training and validation results. If only runs with no penalty are considered, the best choice based on validation log probability would be the run with seed 2 for M=16, for which the test log probability is -0.214. So using a penalty has produced a slight improvement in final performance on the test set. From looking at all the results, it seems that the improvement from using a penalty might usually be a bit larger - the no-penalty runs seem to have been a bit "lucky" in getting a good test result from the best run according to validation results, and the runs with a penalty seem to have been a bit "unlucky" in that respect.
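The selection rule used above - pick the run with the highest validation log probability, then look at its test log probability - can be illustrated with the figures from the lambda1 != lambda2 table (rows in the same order as the table). Note that which.max breaks the tie at -0.176 between the seed-2 run with lambda1=12, lambda2=6 and the seed-1 run with lambda1=16, lambda2=4 by taking the first:

```r
# Validation and test log probabilities for the six M=16 runs with
# unequal penalty magnitudes, copied from the table above.
runs <- data.frame(
  lambda1 = c(6, 6, 12, 12, 16, 16),
  lambda2 = c(12, 12, 6, 6, 4, 4),
  seed    = c(1, 2, 1, 2, 1, 2),
  val.lp  = c(-0.190, -0.192, -0.180, -0.176, -0.176, -0.189),
  test.lp = c(-0.212, -0.216, -0.202, -0.210, -0.207, -0.210)
)
pick <- which.max(runs$val.lp)  # selection may use only val.lp
runs[pick, ]                    # seed 2, lambda1=12, lambda2=6, test -0.210
```

The test log probabilities are printed only after the selection has been made; they play no role in choosing the run.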
I also tried logistic regression, using glm, which produced a log probability for the test set of -0.262. So the ability of an MLP network to produce a non-linear function of the inputs substantially improves performance.
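A sketch of this comparison, run on synthetic stand-in data since the assignment's actual data is not reproduced here; the average test log probability is computed the same way as for the MLP runs:

```r
# Logistic regression baseline with glm, on hypothetical synthetic data.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, 1 / (1 + exp(-(x1 + x2))))  # true model is logistic
dat <- data.frame(x1, x2, y)
trn <- dat[1:100, ]     # training cases
tst <- dat[101:200, ]   # test cases

fit <- glm(y ~ x1 + x2, data = trn, family = binomial)
p   <- predict(fit, newdata = tst, type = "response")

# Average log probability of the test responses.
lp <- mean(log(ifelse(tst$y == 1, p, 1 - p)))
lp
```

Because glm fits a linear function of the inputs (on the logit scale), it cannot capture the non-linear structure that the MLP exploits, which is consistent with its worse test log probability of -0.262 on the assignment data.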