STA 437/1005, 2009, Solution to Assignment #2, Dataset #1

Radford M. Neal, 2009.

Here is a "model solution" for Assignment #2, Dataset #1. This assignment involves many judgement calls about how best to look at the data, and what to conclude from what you see. The methods and conclusions here are not the only reasonable ones, so if your solution differs from mine, it is not necessarily wrong. However, there certainly are many wrong methods, and many wrong conclusions that one could come to; not everything is right!

The R commands used to analyse this dataset are here.

I began by looking at each variable separately, using histograms and normal QQ plots. Several suspicious observations were apparent:

Depending on what one is interested in, keeping observation 39 might be desirable (though it would be good to check that the measurements really are accurate). However, if we are not interested in such extreme individuals, we may wish to ignore this observation, which is what I will do here. Observations 31, 42, and 86 might be retained if the suspicious measurements of height and ankle are not crucial to what one is investigating. However, for simplicity, I will ignore these observations as well. I therefore created a new data frame without these four observations.
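Dropping observations by number can be sketched in R as follows. The data frame name `bodyfat`, its columns, and the simulated values are hypothetical stand-ins for the real data, and the sketch assumes that row i holds observation i.

```r
# Hypothetical stand-in for the real data: 100 rows, row i = observation i.
set.seed(1)
bodyfat <- data.frame(height = rnorm(100, 178, 7),
                      ankle  = rnorm(100, 23, 1.5))

# Observation numbers judged suspicious in the text above.
drop_obs <- c(31, 39, 42, 86)

# Negative indexing removes those rows. The original row names are kept,
# so the remaining observations can still be referred to by their old numbers.
bodyfat2 <- bodyfat[-drop_obs, ]

nrow(bodyfat2)   # 96
```

Keeping the original data frame untouched and working on a copy makes it easy to rerun the analysis with the observations restored.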

Ignoring these four observations, the histograms and QQ plots show that the variables have roughly normal distributions, except for age, which has a light left tail, and forearm, which perhaps has a heavy tail. Note that there is no reason to think that age would be normally distributed, and indeed it clearly cannot be: in the population, ages go down to 0 but are never negative, and the minimum age in the data is 22, presumably because younger men were not recruited. I therefore did not look at transforming any variables to improve normality.
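A per-variable check of this kind might look like the sketch below. The data frame `dat`, its two columns, and the helper `plot_marginal` are hypothetical, introduced only for illustration; the real analysis would loop over the cleaned data frame.

```r
# Hypothetical stand-in data: a uniform "age" (clearly non-normal) and a
# normal "weight", to show how the two cases differ in the plots.
set.seed(2)
dat <- data.frame(age    = round(runif(96, 22, 81)),
                  weight = rnorm(96, 80, 12))

# Draw a histogram and a normal QQ plot for one variable.
plot_marginal <- function(df, v) {
  hist(df[[v]], main = paste("Histogram of", v), xlab = v)
  qqnorm(df[[v]], main = paste("Normal QQ plot of", v))
  qqline(df[[v]])          # reference line through the quartiles
  invisible(v)
}

old_par <- par(mfrow = c(1, 2))   # two panels side by side
for (v in names(dat)) plot_marginal(dat, v)
par(old_par)
```

Light tails show up in the QQ plot as points bending toward the reference line's interior at the ends, heavy tails as points bending away from it.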

I next looked at pairwise scatterplots of the remaining observations.

In some of these pairwise plots, particularly those involving forearm, there are observations that appear not to follow the bivariate normal distribution that the rest of the observations seem to come from. One might regard these as outliers, and ignore these observations. However, from my own knowledge of human proportions, I am not confident that these measurements are errors, rather than just being the result of individuals with unusual characteristics.

The pairwise scatterplot of density versus pcfat also has outliers. A scatterplot of just these two variables is here. On this scatterplot, I have also plotted the line describing the documented relationship of pcfat to density, which is pcfat = 495/density - 450. Five observations clearly do not follow the relationship that the documentation claims was used to compute pcfat from density. It is not clear how this happened. Perhaps pcfat was computed by hand, sometimes incorrectly. Alternatively, perhaps pcfat was computed correctly, but both pcfat and density were then re-entered into the computer from a paper listing of the numbers, with some errors being made, in which case it could sometimes be the density number that is wrong, rather than the pcfat number. I decided to ignore these five observations, since the validity of the density and pcfat numbers cannot be established. I created a data frame that omits these five observations (as well as the four previously identified outliers), and used this data frame for the rest of the analysis.
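The comparison against the documented relationship pcfat = 495/density - 450 can be sketched as follows. The density values, the two simulated recording errors, and the tolerance `tol` are all assumptions made for illustration; they are not taken from the dataset.

```r
# Hypothetical stand-in data: pcfat computed from density and rounded,
# then two entries deliberately corrupted to mimic recording errors.
density <- c(1.0708, 1.0853, 1.0414, 1.0751, 1.0340, 1.0502)
pcfat   <- round(495 / density - 450, 1)
pcfat[c(2, 5)] <- c(17.0, 18.7)

# Value implied by the documented relationship.
expected <- 495 / density - 450

# Flag observations whose recorded pcfat differs from the implied value by
# more than a tolerance; 0.5 is an assumed allowance for rounding.
tol <- 0.5
bad <- which(abs(pcfat - expected) > tol)
bad   # 2 5

# Scatterplot with the documented curve and the flagged points in red.
plot(density, pcfat)
curve(495 / x - 450, add = TRUE)
points(density[bad], pcfat[bad], col = "red", pch = 19)
```

A check like this only shows that density and pcfat disagree; as noted above, it cannot say which of the two numbers is the wrong one.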

As a final check on whether the data is multivariate normal, I looked at the squared statistical distances of data points from the sample mean. I omitted the density variable (since it has the same information as pcfat) and the age variable (since it is clearly not normally distributed) when computing these squared distances. If the remaining variables have a multivariate normal distribution, the squared statistical distances should have (approximately) the chi-squared(p) distribution, where p is the number of variables looked at. I checked this by plotting the sorted values of the squared statistical distance from the sample mean versus the corresponding quantiles of the chi-squared(p) distribution, which gave this plot.
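A chi-squared plot of this kind can be sketched with R's built-in `mahalanobis` and `qchisq` functions. The matrix `X` is simulated stand-in data, and the plotting positions (i - 0.5)/n are an assumed, commonly used choice; the original analysis may have used different ones.

```r
# Hypothetical stand-in data: 200 observations on p = 3 normal variables.
set.seed(3)
X <- cbind(weight = rnorm(200, 80, 12),
           height = rnorm(200, 178, 7),
           ankle  = rnorm(200, 23, 1.5))
n <- nrow(X)
p <- ncol(X)

# Squared statistical (Mahalanobis) distances from the sample mean,
# using the sample covariance matrix.
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Chi-squared(p) quantiles at probabilities (i - 0.5)/n.
q <- qchisq((1:n - 0.5) / n, df = p)

plot(q, sort(d2),
     xlab = "Chi-squared quantile", ylab = "Sorted squared distance")
abline(0, 1)   # points near this line are consistent with normality
```

Points that rise above the line at the upper right indicate distances larger than the chi-squared(p) distribution would predict, that is, heavier tails than a multivariate normal.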

One can see from this plot that most of the points follow the chi-squared(p) distribution, but about 10 of the largest statistical distances are larger than would be expected from a chi-squared(p) distribution. This is evidence that the distribution is not actually multivariate normal, but has heavier tails (in at least one direction) than a normal. Since there are a substantial number of points in this tail, it seems unlikely that they are just erroneous measurements. Instead, they seem likely to be legitimate observations. As a further check on this, I looked at pairwise scatterplots with the observations with the 8 largest statistical distances from the mean plotted in red (here). These observations seem unexceptional in these scatterplots. I therefore did not delete these observations from the data set.
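Highlighting the most distant observations in pairwise scatterplots might be done along these lines. The matrix `X` is again simulated stand-in data, and the variable names are hypothetical.

```r
# Hypothetical stand-in data: 100 observations on 3 variables.
set.seed(4)
X <- cbind(weight = rnorm(100, 80, 12),
           height = rnorm(100, 178, 7),
           ankle  = rnorm(100, 23, 1.5))
d2 <- mahalanobis(X, colMeans(X), cov(X))

# Indices of the 8 observations with the largest squared distances.
far <- order(d2, decreasing = TRUE)[1:8]

# Colour those observations red and the rest black.
cols <- rep("black", nrow(X))
cols[far] <- "red"
pairs(X, col = cols, pch = 19)
```

If the red points sit inside the main cloud in every panel, their large distances come from unusual combinations of several variables rather than from one wild measurement.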

I then computed BMI, and eight variations on it (labelled A to H) of the form weight to the power p divided by height to the power q, with p = 0.8, 1.0, or 1.2 and q = 1.8, 2.0, or 2.2. For each version of BMI, I computed the correlation with pcfat. The results were as follows:

bmi    bmiA   bmiB   bmiC   bmiD   bmiE   bmiF   bmiG   bmiH
0.734  0.73   0.736  0.734  0.725  0.73   0.719  0.728  0.736

Only bmiB (p=1.0, q=2.2) and bmiH (p=0.8, q=1.8) seem (slightly) better than the standard version (p=1, q=2). I plotted pcfat versus bmi, bmiB, bmiC, and bmiH here, with the 8 observations with largest statistical distance from the sample mean in red. There seems to be little difference amongst these measures, with the standard BMI index seeming to be about as good as one can get (from just height and weight). There are two outlying points, with exceptionally large values for BMI, but not so exceptional (though large) values for pcfat. They do not depart markedly from a linear relationship of pcfat to BMI, but may still warrant further investigation to see whether BMI is a good measure of pcfat even for extreme points.
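The comparison of BMI variants can be sketched as a loop over the (p, q) grid. The simulated `height`, `weight`, and `pcfat` values and their units are assumptions for illustration, so the correlations produced will not match the table above. The variants are labelled by their (p, q) values rather than by the letters A to H, since the text gives the letter assignment only for bmiB and bmiH.

```r
# Hypothetical stand-in data; weight in kg, height in m, pcfat in percent.
set.seed(5)
n      <- 150
height <- rnorm(n, 1.78, 0.07)
weight <- rnorm(n, 80, 12)
pcfat  <- 19 + 0.9 * (weight / height^2 - 25) + rnorm(n, 0, 5)

# All nine (p, q) combinations, including the standard BMI (p = 1, q = 2).
grid <- expand.grid(p = c(0.8, 1.0, 1.2), q = c(1.8, 2.0, 2.2))

# Correlation of pcfat with weight^p / height^q for each combination.
grid$cor <- mapply(function(p, q) cor(pcfat, weight^p / height^q),
                   grid$p, grid$q)

# Show the combinations from highest to lowest correlation.
grid[order(-grid$cor), ]
```

Differences in the third decimal place, as in the table above, are well within what sampling variation alone would produce, which supports the conclusion that the standard BMI is about as good as any of the variants.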