STA 437/1005, 2009, Solution to Assignment #3, Dataset #1

Here is a "model solution" for Assignment #3, Data set #1. Note that the methods and conclusions here are not always the only reasonable ones. So if your solution is not the same as mine, it is not necessarily wrong. However, there certainly are many wrong methods, and many wrong conclusions that one could come to - not everything is right!

The R commands used to analyse this dataset are here. The textual output from these commands is here.

I performed PCA on three versions of the 12 variables other than pcfat: first, on the original variables; second, on the variables scaled by dividing by their sample standard deviations; and third, on the variables scaled by dividing by their sample means. The first two are standard methods (suggested in the handout), which can be thought of as looking at the sample covariance matrix (unscaled) or the sample correlation matrix (scaled). The third method of dividing by the sample means makes at least some sense because the data is all positive. Like dividing by the sample standard deviation, it eliminates dependence on the units of measurement, but unlike that scaling it will emphasize variables that have a large standard deviation compared to their means.
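As a rough illustration of the three scalings (a Python sketch, not the R commands actually used for this solution), the following applies each one to a single variable recorded in two different units. The weights are made-up numbers, and the names `scale_versions`, `weight_lb`, `weight_kg`, and `spread` are hypothetical.

```python
from statistics import mean, stdev

def scale_versions(column):
    """Return one variable unscaled, divided by its sample SD, and divided by its sample mean."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1 denominator)
    return {
        "unscaled": list(column),
        "sd_scaled": [x / s for x in column],
        "mean_scaled": [x / m for x in column],
    }

# Made-up weights, not values from the assignment dataset.
weight_lb = [154.0, 173.0, 184.0, 210.0, 191.0]
weight_kg = [w * 0.45359237 for w in weight_lb]  # same men, different unit

# Standard deviation of each version, for each choice of unit.
spread = {
    unit: {name: stdev(values) for name, values in scale_versions(col).items()}
    for unit, col in (("lb", weight_lb), ("kg", weight_kg))
}
for unit, sds in spread.items():
    print(unit, {name: round(sd, 4) for name, sd in sds.items()})
```

The unscaled standard deviation changes with the unit. After dividing by the sample standard deviation it is always 1. After dividing by the sample mean it equals the coefficient of variation (standard deviation over mean), which is the same in pounds and kilograms.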

The scree plots for all three versions of PCA show that the first PC captures a large amount of the variance, the second PC a much smaller amount, and later PCs successively smaller amounts, with no sharp drop except after the first PC. Keeping two PCs (as the assignment handout says to do later) is reasonable, but one might certainly decide to keep more (or perhaps only one).

Looking at the principal component directions (eigenvectors), one can see that when PCA is done without scaling, the first principal component has a large coefficient for weight, since it has a much larger sample standard deviation than the other variables. Weight is the only variable measured in pounds, so its relative standard deviation is affected by this arbitrary choice of unit (it would be much smaller if weight were in stones, and even bigger if weight were in grams). Height (in inches) is also arbitrarily different from the other variables (in cm). When the variables are scaled by either the sample standard deviation or the sample mean, the coefficient on weight has a magnitude closer to the coefficients of other variables.
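The effect of one large-variance variable can be seen in a two-variable toy case, where the first principal direction has a closed form. This is a Python sketch under assumed numbers: the standard deviations (30 and 1) and the correlation (0.7) are hypothetical, chosen only to mimic a variable like weight alongside a much less variable measurement, and `first_pc_2x2` is an illustrative helper.

```python
import math

def first_pc_2x2(a, b, c):
    """Unit-length first principal direction of the symmetric matrix [[a, b], [b, c]]."""
    if b == 0:
        # Uncorrelated case: the direction is the axis with the larger variance.
        return (1.0, 0.0) if a >= c else (0.0, 1.0)
    lam = (a + c) / 2 + math.hypot((a - c) / 2, b)  # largest eigenvalue
    x, y = lam - c, b                               # an eigenvector for lam
    norm = math.hypot(x, y)
    return (x / norm, y / norm)

# Hypothetical standard deviations and correlation, not estimates from the data.
sd_weight, sd_wrist, corr = 30.0, 1.0, 0.7

# PCA on the covariance matrix (unscaled variables).
cov_pc = first_pc_2x2(sd_weight ** 2, corr * sd_weight * sd_wrist, sd_wrist ** 2)
# PCA on the correlation matrix (variables divided by their standard deviations).
cor_pc = first_pc_2x2(1.0, corr, 1.0)

print("covariance PC1: ", cov_pc)
print("correlation PC1:", cor_pc)
```

With the covariance matrix, the first direction points almost entirely along the high-variance variable. With the correlation matrix, the two coefficients are equal.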

With all versions of PCA, the first PC has the same sign of coefficient for all variables (the signs happen to all be positive, but all negative would mean the same thing). This may reflect the variation among men in overall size. Generally larger men will tend to be larger in all the measured dimensions. The second PC seems to represent variation in body proportions. Without scaling, the second PC has negative coefficients on chest and abdom, and positive coefficients on weight, height, thigh, and most other variables. This may be because some men have a fat middle, while others do not. A similar pattern is seen when the variables are scaled by the sample mean. When the variables are scaled by the sample standard deviation, a somewhat different pattern is seen, in which height, knee, ankle, and wrist have positive coefficients, perhaps representing measurements that aren't much affected by obesity.

Factor analysis with one factor produces a model in which all the loadings have the same sign, which again is probably due to the variation in overall size of the men. With two hidden factors, the interpretation must take account of the arbitrary rotation of the factors, as discussed next.

To compare these PCA and factor analysis results, we can look at the linear combinations of unscaled (but centred) variables that would produce the projections on the two PCs, or the two factor scores. I plotted the coefficients of these two linear combinations as points in a plane, with each coefficient identified by a single letter abbreviation for its variable (W=weight, H=height, n=neck, c=chest, d=abdom, h=hip, t=thigh, k=knee, a=ankle, b=biceps, f=forearm, w=wrist). This results in these plots. Alternatively, we could find the linear combinations of the variables scaled by their standard deviations, which produces these plots.

Looking at the linear combinations that give the factor analysis scores, we can see that all the points except for weight (W) are almost on a line passing through the origin. A rotation therefore exists that would make one factor score depend almost only on weight. After this rotation, the other factor score would be strongly influenced by height (H), chest (c), and abdom (d), with height having opposite sign from chest and abdom. This second factor seems to be representing body proportions.
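The rotation argument can be checked on a toy configuration. In this Python sketch the coefficient pairs are invented so that every variable except W lies exactly on a line through the origin; they are not the coefficients from the fitted model, and `rotate` and `coefs` are hypothetical names. The angle of the line is read off one of the on-line points, whereas with real output one would have to estimate it.

```python
import math

def rotate(points, angle):
    """Rotate each (x, y) coefficient pair counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return {name: (c * x - s * y, s * x + c * y) for name, (x, y) in points.items()}

# Invented coefficients: all points except W lie on the line y = 0.5 * x.
coefs = {
    "W": (0.9, 0.1),
    "H": (-0.4, -0.2),
    "c": (0.6, 0.3),
    "d": (0.8, 0.4),
    "k": (0.1, 0.05),
}

# Angle of the line, taken here from the point for abdom (d).
theta = math.atan2(coefs["d"][1], coefs["d"][0])

# Rotating by -theta lays the line along the first axis.
rotated = rotate(coefs, -theta)
for name, (x, y) in rotated.items():
    print(name, round(x, 4), round(y, 4))
```

After the rotation, the second coordinate is zero for every variable on the line, so the second rotated score depends only on W. The first rotated score carries H with the opposite sign from c and d, which matches the body-proportions reading in the text.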

In these plots, the results of PCA without scaling seem somewhat similar to the results of FA, with weight far from the other variables, and height, chest, and abdom in similar positions. The results of PCA with the two forms of scaling seem rather different.

Linear regression for predicting pcfat from all 12 other variables gives an adjusted R-squared of 0.7283. Including BMI has little effect. When using two PCs, the adjusted R-squared values were 0.6758, 0.5700, and 0.6294 when the variables are unscaled, scaled by the sample standard deviations, and scaled by the sample means, respectively. So predictive performance is not as good as with all variables, but is only modestly affected when the unscaled variables are used. When factor scores from the two-factor model were used, the adjusted R-squared was 0.6447.
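For reference, adjusted R-squared penalizes the ordinary R-squared by the number of predictors, which is why models with 12 predictors and with 2 predictors can be compared on it. A minimal Python sketch of the usual formula follows. The R-squared value and the sample size in the example calls are hypothetical, since neither the raw R-squared values nor the sample size is stated here.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R-squared for a linear model with an intercept.

    r_squared: ordinary R-squared of the fit
    n: number of observations
    p: number of predictors (not counting the intercept)
    """
    if n - p - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical inputs: the same raw R-squared with 12 and with 2 predictors.
adj_12 = adjusted_r_squared(0.75, n=100, p=12)
adj_2 = adjusted_r_squared(0.75, n=100, p=2)
print(adj_12, adj_2)
```

For the same raw R-squared, the model with fewer predictors gets the higher adjusted value.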

It is perhaps a bit surprising that PCA on the unscaled variables does best here, since it is affected by the arbitrary choice of units, which make weight have a large effect. But this is certainly not impossible - weight may just be an important variable. We see that fairly good regression results can be obtained after reducing the data to two variables from 12. In this particular problem, however, it turns out that just selecting the two most significant variables in the regression on all 12 variables (abdom and wrist) gives an even better two-variable model, with adjusted R-squared of 0.7128.

Adding BMI to the regression turns out to be unhelpful in all these situations. The particular non-linear combination of height and weight that is BMI seems to not be useful when these other variables are available. (It helps slightly when only height and weight are available.)
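BMI is weight in kilograms divided by the square of height in metres, so with weight in pounds and height in inches (the units mentioned above) a unit conversion is needed. A small Python sketch follows; the function name `bmi_from_lb_in` and the example values are hypothetical.

```python
LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms
IN_TO_M = 0.0254       # exact definition of the inch in metres

def bmi_from_lb_in(weight_lb, height_in):
    """Body mass index (kg / m^2) from weight in pounds and height in inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = height_in * IN_TO_M
    return weight_kg / height_m ** 2

# Made-up example: a 154 lb man who is 70 inches tall.
example_bmi = bmi_from_lb_in(154.0, 70.0)
print(round(example_bmi, 2))
```

Because height enters as an inverse square, BMI is not a linear combination of height and weight, so adding it to the regression is not redundant in principle even though it turned out to be unhelpful here.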