STA 437/1005, 2009, Solution to Assignment #2, Dataset #2

Radford M. Neal, 2009.

Here is a "model solution" for Assignment #2, Dataset #2. This assignment involves many judgement calls about how best to look at the data, and what to conclude from what you see. The methods and conclusions here are not the only reasonable ones, so if your solution is not close to mine, it is not necessarily wrong. However, there certainly are many wrong methods, and many wrong conclusions that one could come to - not everything is right!

The R commands used to analyse this dataset are here.

After reading the data, I separated the class variable from the other 36 variables, creating a vector with the class indicators, and a data frame with just the other 36 variables. I also created three data frames containing these 36 variables for observations in each class, separately.
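As an illustration only (the actual commands are in the linked R file), this splitting step might look roughly like the sketch below. It runs on made-up data, and the column names "class" and "V1" to "V36" are hypothetical stand-ins, not the names used in the real dataset.

```r
# Synthetic stand-in for the real data: a class column plus 36 integer-valued
# variables. The column names "class" and "V1".."V36" are hypothetical.
set.seed(1)
n <- 30
dat <- data.frame(class = rep(1:3, each = n / 3),
                  matrix(sample(0:255, n * 36, replace = TRUE), n, 36))
names(dat)[-1] <- paste0("V", 1:36)

cls <- dat$class                      # vector of class indicators
X   <- dat[, names(dat) != "class"]   # data frame with the other 36 variables
X_by_class <- split(X, cls)           # list of three data frames, one per class

sapply(X_by_class, dim)               # rows and columns in each class's data frame
```

Using split gives a list indexed by class label, which is one convenient alternative to creating three separately named data frames.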

I then looked at the histograms for the observations in each of the four spectral bands at the centre pixel (E1, E2, E3, E4) for each of the three classes. I set the range on the horizontal axis to be the same for each class, so that they can more easily be compared. One can see a very large degree of right skew in the distribution for E1 and E2 in class 2, and a large degree of left skew in the distribution for E4 in class 2. In class 1, E1, E3, and E4 appear to be a bit skewed to the left. Class 3 has distributions that look more symmetrical, but which appear to have somewhat heavy tails.
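One way to get histograms with a common horizontal axis is to fix both the axis limits and the bin boundaries before plotting each class. The sketch below does this for a single band, using simulated values in place of the real E1 column; the choice of 20 bins is arbitrary.

```r
# Simulated stand-in for E1 in three classes (not the real data).
set.seed(2)
cls <- rep(1:3, each = 50)
E1  <- pmin(255, pmax(0, round(rnorm(150, mean = c(60, 90, 120)[cls], sd = 10))))

rng    <- range(E1)                                          # range over all classes
breaks <- seq(rng[1] - 0.5, rng[2] + 0.5, length.out = 21)   # 20 bins shared by all classes

op <- par(mfrow = c(3, 1))            # one panel per class, stacked vertically
for (k in 1:3) {
  hist(E1[cls == k], breaks = breaks, xlim = rng,
       main = paste("E1, class", k), xlab = "E1")
}
par(op)                               # restore the previous plot layout
```

Stacking the panels vertically with identical bins makes differences in location and skewness between classes easy to see by eye.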

I tried to improve the normality of these distributions by power transformations, but this is difficult, since the same transformation needs to be used for all classes. After trying a few possibilities, raising E1 and E2 to the power 0.2, E3 to the power 1.2, and E4 to the power 2.0 seemed to produce the best results. However, with these transformations, some of the histograms still do not look very close to normal distributions. I conclude that it may not be possible to make the distributions within every class normal using a single transformation per variable. I kept the variables in their original form for the rest of the analysis.
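A rough sketch of applying these powers is given below, on simulated data rather than the real centre-pixel values. The per-class skewness figure is an extra numerical aid added here as an assumption; the assessment described above was made by looking at histograms, and the helper skew is hypothetical.

```r
set.seed(3)
cls <- rep(1:3, each = 100)
# Simulated stand-in values for the centre pixel (not the real data).
centre <- data.frame(E1 = rgamma(300, shape = 2, scale = 20),
                     E2 = rgamma(300, shape = 2, scale = 25),
                     E3 = rnorm(300, 100, 10),
                     E4 = rnorm(300, 90, 10))

powers <- c(E1 = 0.2, E2 = 0.2, E3 = 1.2, E4 = 2.0)   # powers quoted in the text
# The same power is applied to a variable in every class.
transformed <- as.data.frame(Map(function(x, p) x^p, centre, powers[names(centre)]))

# Sample skewness (hypothetical helper): near 0 for a symmetric distribution.
skew <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

# Skewness for each class (rows) and variable (columns), before and after.
before <- sapply(centre,      function(v) tapply(v, cls, skew))
after  <- sapply(transformed, function(v) tapply(v, cls, skew))
round(before, 2)
round(after, 2)
```

Because one power must serve all three classes, a value that reduces skewness in one class can increase it in another, which is why the comparison is tabulated class by class.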

I next looked at pairwise scatterplots of the centre pixel values (E1, E2, E3, E4), for all classes (with class identified by colour) and for each class separately. I jittered each point to avoid overlap, since the numbers are integers from 0 to 255, and sometimes more than one observation has exactly the same pair of values. These plots are here. One can see that the classes are fairly well separated, so one could expect to be able to classify future observations reasonably accurately. A small number of observations in class 1 appear to be outliers (e.g., looking at the plot of E2 versus E4). I did not remove these from the dataset, however. We might hope that looking at all nine pixels would reduce the effect of any strange value in one pixel.
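Jittered pairwise plots of this kind can be produced with the base R functions jitter and pairs. The following is only a sketch on simulated integer data; the jitter amount of 0.3 and the colours are arbitrary choices made here, not values taken from the original analysis.

```r
set.seed(4)
cls <- rep(1:3, each = 40)
# Simulated integer-valued stand-ins for E1..E4, clipped to 0..255 (not the real data).
centre <- as.data.frame(sapply(c(E1 = 60, E2 = 80, E3 = 100, E4 = 90), function(m)
  pmin(255, pmax(0, round(rnorm(120, m + 15 * cls, 8))))))

# Add uniform noise in (-0.3, 0.3) so tied integer points no longer overlap exactly.
jittered <- as.data.frame(lapply(centre, jitter, amount = 0.3))

# All classes together, with class identified by colour.
pairs(jittered, col = c("red", "green3", "blue")[cls], pch = 20)

# One set of plots per class, reusing the same jittered values.
for (k in 1:3) pairs(jittered[cls == k, ], main = paste("Class", k), pch = 20)
```

Keeping the jitter below 0.5 means every plotted point still rounds back to its original integer value, so the noise cannot move a point into a neighbouring value's position.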

Finally, I computed the means and medians over all nine pixels for each spectral band, and looked at pairwise scatterplots of the four means and the four medians, with class identified by colour. I had expected that the median might work better, since it would be insensitive to an extreme value in one pixel, but actually the means seem to be at least as good as the medians for separating the classes.
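The per-band summaries could be computed along the following lines. This sketch assumes a hypothetical column layout in which the nine pixels are labelled A to I (with E the centre, as in E1 to E4) and the band number follows the letter; the real file may be organised differently, and the data here are random.

```r
set.seed(5)
n <- 60
cls <- rep(1:3, each = n / 3)
# Hypothetical layout: pixels A..I (E = centre), bands 1..4, columns A1, A2, ..., I4.
nm <- paste0(rep(LETTERS[1:9], each = 4), 1:4)
X <- as.data.frame(matrix(sample(0:255, n * 36, replace = TRUE), n, 36,
                          dimnames = list(NULL, nm)))

# Summarise one band across its nine pixels, separately for each observation.
band_summary <- function(X, band, f) {
  cols <- paste0(LETTERS[1:9], band)     # the nine pixel columns for this band
  apply(as.matrix(X[, cols]), 1, f)      # apply f to each row
}

band_means   <- sapply(1:4, function(b) band_summary(X, b, mean))
band_medians <- sapply(1:4, function(b) band_summary(X, b, median))
colnames(band_means) <- colnames(band_medians) <- paste0("band", 1:4)

class_col <- c("red", "green3", "blue")[cls]
pairs(band_means,   col = class_col, main = "Means over nine pixels")
pairs(band_medians, col = class_col, main = "Medians over nine pixels")
```

Each result is a matrix with one row per observation and one column per band, so the two sets of pairwise plots can be compared panel by panel.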