Data set 1. On this data set, I found that a learning rate of 0.0002 for 30000 iterations worked well. Results with two random seeds are plotted and are similar. The one-dimensional structure of the data is captured well: the auto-encoder output for each data point is near the closest point that the network could produce, though the reconstruction is not perfect in this respect, with some points reconstructed with larger error than seems necessary.

Data set 2. Although this data set is similar to data set 1, I found that a smaller learning rate of 0.00005 was needed for stability; I increased the number of iterations to 120000 to compensate for the smaller learning rate. Results with two random seeds are plotted. Both show that the one-dimensional structure is captured well on the right side of the plot (x1 greater than -0.5), but less well on the left side. The first seed produces worse results, with what looks like a spurious one-dimensional structure. The second seed gives better results, but some points are still reconstructed with large error.

Data set 3 (zip code images). For this data, I used a learning rate of 0.000025 for 50000 iterations, trying two random seeds. Some instability is still apparent in the later part of each run, but it does not look serious. For both runs, the values of the two bottleneck units usually suffice to distinguish the digits 0, 1, 2, and 6 from the other digits. However, 4 and 9 are mixed together, as are 3 and 8. The 5s are perhaps better separated from the other digits in the first run than in the second (even though the 5s are quite near other digits in the first run, they seem to overlap less). Some of the 7s are well separated from other digits, but not all of them. Using more than two bottleneck units might improve the digit separation. (The main reason for using two in this exercise is to allow for an easily interpretable 2D plot.)
It's also possible that longer training time and/or more hidden units in layers 1 and 3 might help. Finally, the images in the actual MNIST dataset have twice the resolution and there are 60000 training cases rather than 600, which ought to help.
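The kind of network discussed above (hidden layers 1 and 3 on either side of a small linear bottleneck, trained by gradient descent on squared reconstruction error) can be sketched as follows. This is a minimal illustration only: the hidden sizes, activation functions, learning rate, iteration count, and synthetic data below are my assumptions, not the settings or code actually used for these experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data lying near a one-dimensional curve (stand-in for
# data sets 1 and 2; the real data sets are not reproduced here).
t = rng.uniform(-1.0, 1.0, size=(200, 1))
X = np.hstack([t, np.sin(3 * t)]) + 0.02 * rng.normal(size=(200, 2))

n_in, n_h, n_b = 2, 20, 1           # input, hidden, and bottleneck sizes
W1 = 0.5 * rng.normal(size=(n_in, n_h)); b1 = np.zeros(n_h)
W2 = 0.5 * rng.normal(size=(n_h, n_b)); b2 = np.zeros(n_b)
W3 = 0.5 * rng.normal(size=(n_b, n_h)); b3 = np.zeros(n_h)
W4 = 0.5 * rng.normal(size=(n_h, n_in)); b4 = np.zeros(n_in)

def forward(X):
    h1 = np.tanh(X @ W1 + b1)       # hidden layer 1
    z = h1 @ W2 + b2                # linear bottleneck (layer 2)
    h3 = np.tanh(z @ W3 + b3)       # hidden layer 3
    out = h3 @ W4 + b4              # linear reconstruction
    return h1, z, h3, out

lr = 0.002
losses = []
for it in range(5000):
    h1, z, h3, out = forward(X)
    err = out - X
    losses.append((err ** 2).mean())
    # Backpropagate the squared-error gradient through the four layers.
    g4 = err / len(X)
    gh3 = (g4 @ W4.T) * (1 - h3 ** 2)
    gz = gh3 @ W3.T
    gh1 = (gz @ W2.T) * (1 - h1 ** 2)
    W4 -= lr * h3.T @ g4; b4 -= lr * g4.sum(0)
    W3 -= lr * z.T @ gh3; b3 -= lr * gh3.sum(0)
    W2 -= lr * h1.T @ gz; b2 -= lr * gz.sum(0)
    W1 -= lr * X.T @ gh1; b1 -= lr * gh1.sum(0)

print(losses[0], losses[-1])        # reconstruction error should drop
```

The bottleneck activations z give the low-dimensional representation plotted for each data point; for the zip code data the analogous network would use a larger input size and two bottleneck units instead of one.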