STA 414/2104, Spring 2014, Assignment 2 discussion.

With K=1, which is equivalent to a naive Bayes model, the classification error rate on test cases was 0.190.

With K=5, 80 iterations of EM seemed sufficient for all ten random initializations. The resulting models had the following error rates on the test cases:

  0.157  0.151  0.158  0.156  0.166  0.162  0.163  0.159  0.158  0.153

These are all better than the naive Bayes result, showing that using more than one mixture component for each digit is beneficial.

I used the "show_digit" function to display the theta parameters of the 50 mixture components as pictures (for the run started with the last random seed). It is clear that the five components for each digit have generally captured reasonable variations in writing style, except perhaps for a few with small mixing proportions (given as the number above each plot), such as the second "1" from the top.

Using the ensemble predictions (averaging the probabilities of digits over the ten runs above), the classification error rate on test cases was 0.139. This is substantially better than the error rate from every one of the individual runs, showing the benefit of using an ensemble when there is substantial random variation in the results.

Note that the individual run with the highest log likelihood (and also the highest log likelihood + penalty) was the sixth run, whose error rate of 0.162 was actually the third worst. So at least in this example, picking a single run based on log likelihood would certainly not do better than using the ensemble.
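
For concreteness, here is a minimal sketch of the kind of EM fitting described above, written in Python/NumPy (the course code itself may well be in another language, such as R). It fits a K-component Bernoulli mixture to the binary images of one digit class; the smoothing amount alpha stands in for a penalty/prior on theta, and is an assumption here rather than the assignment's actual penalty.

    import numpy as np

    def fit_bernoulli_mixture(X, K, iters=80, alpha=0.05, seed=None):
        """Fit a K-component Bernoulli mixture to binary images X (n x d) by EM.

        alpha is a small smoothing amount standing in for a penalty/prior on
        theta (an assumption, not necessarily the assignment's penalty)."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        pi = np.full(K, 1.0 / K)                      # mixing proportions
        theta = rng.uniform(0.25, 0.75, size=(K, d))  # pixel "on" probabilities

        for _ in range(iters):
            # E-step: responsibilities r[i,k] proportional to
            # pi_k * prod_j theta_kj^x_ij * (1-theta_kj)^(1-x_ij), in log space.
            log_r = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
            log_r -= log_r.max(axis=1, keepdims=True)
            r = np.exp(log_r)
            r /= r.sum(axis=1, keepdims=True)

            # M-step: re-estimate mixing proportions and (smoothed) pixel probabilities.
            nk = r.sum(axis=0)
            pi = (nk + 1e-8) / (n + K * 1e-8)         # tiny smoothing keeps log(pi) finite
            theta = (r.T @ X + alpha) / (nk[:, None] + 2 * alpha)

        return pi, theta

One such mixture would be fit separately to the training cases of each digit, and a test case classified by whichever digit's mixture (weighted by the class prior) gives it the highest probability.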
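
The show_digit function comes with the course materials; a rough Python analog for viewing one component's theta vector as a picture, with its mixing proportion shown above the plot, might look like the following (the image shape here is a placeholder assumption).

    import matplotlib.pyplot as plt

    def show_component(theta_k, mix_prop, shape=(16, 16)):
        """Display one component's theta vector as a grayscale image, with its
        mixing proportion as the title.  The image shape is a placeholder."""
        plt.imshow(theta_k.reshape(shape), cmap="gray", vmin=0, vmax=1)
        plt.title("%.3f" % mix_prop)
        plt.axis("off")
        plt.show()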
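
A sketch of how the ensemble prediction described above could be computed: each run gives posterior digit probabilities for the test cases, and these are averaged over the ten runs before taking the most probable digit. The names (predict_probs, runs, log_prior) are illustrative, not the assignment's actual code.

    import numpy as np

    def class_log_lik(X, pi, theta):
        """log p(x | digit) under one fitted mixture: log sum_k pi_k p(x | theta_k)."""
        log_p = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T + np.log(pi)
        m = log_p.max(axis=1, keepdims=True)
        return (m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))).ravel()

    def predict_probs(X, models, log_prior):
        """Posterior digit probabilities for one run; `models` holds one fitted
        (pi, theta) pair per digit 0-9, `log_prior` the log class frequencies."""
        ll = np.column_stack([class_log_lik(X, pi, th) for pi, th in models]) + log_prior
        ll -= ll.max(axis=1, keepdims=True)
        p = np.exp(ll)
        return p / p.sum(axis=1, keepdims=True)

    # Ensemble over the ten runs: average the posterior probabilities, then
    # take the most probable digit for each test case.
    # probs = np.mean([predict_probs(X_test, run, log_prior) for run in runs], axis=0)
    # guesses = probs.argmax(axis=1)

Averaging the probabilities, rather than taking a majority vote over the runs' hard guesses, is what "averaging probabilities of digits over the ten runs" describes: a run that is uncertain about a case pulls the ensemble less strongly than one that is confident.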