STA 437 Assignment 1, Fall 2010, Solution to Part I

See the assignment handout here. The R commands used are here (Unix/Linux/Mac) or here (MS Windows).

Transformations and seasonal effect

Our first task is to model the seasonal effect on average wind speed on Sundays at the 12 locations, and to see which transformation (none, log, sqrt) of the average wind speed produces the best (normal, equal variance) distribution for the residuals after seasonal effect is removed.

I did a linear fit of the average wind speed on Sundays on the sin and cos of year angle, and the same for log of the average and square root of the average. The results are shown in this plot. The top graphs show the data and the predictions from the linear fit on sin and cos of year angle. The bottom graphs show the residuals.

With no transformation, the variance of the residuals seems visually to be larger in winter than in summer (and this is confirmed by R computations). With no transformation, the residuals also show some right (positive) skew. With a log transformation, the residuals seem to have nearly the same distribution in all seasons, but this distribution shows left (negative) skew. With a square root transformation, the residuals seem closer to having a normal distribution, but there still seems to be higher variance in winter than summer (though less so than with no transformation). The above conclusions regarding skewness or normality of the distributions are confirmed by the histograms and normal quantile-quantile plots here.

These results indicate that using the averages without transformation is not very good. One could argue for either a log transformation, to produce more equal variances in summer and winter, or for the square root transformation, to make the residuals have closer to a normal distribution. Since the sample size is quite large, exact normality may not be crucial (the Central Limit Theorem will probably correct any problems when looking at p-values based on normality), so perhaps the log transformation is most appropriate. However, as instructed in the assignment handout, I use the square root transformation below, as Haslett and Raftery did.

Correlations and scatterplots

The sample correlation matrix for square roots of wind speeds at the 12 locations minus fitted seasonal effect (based on square root of average wind speed) shows high correlations among all the locations:
     RPT  VAL  ROS  KIL  SHA  BIR  DUB  CLA  MUL  CLO  BEL  MAL
RPT 1.00 0.80 0.73 0.84 0.82 0.78 0.71 0.74 0.77 0.72 0.63 0.60
VAL 0.80 1.00 0.58 0.75 0.84 0.80 0.64 0.80 0.72 0.72 0.75 0.58
ROS 0.73 0.58 1.00 0.77 0.60 0.66 0.67 0.62 0.65 0.61 0.49 0.47
KIL 0.84 0.75 0.77 1.00 0.84 0.85 0.79 0.81 0.85 0.82 0.69 0.62
SHA 0.82 0.84 0.60 0.84 1.00 0.89 0.76 0.87 0.83 0.80 0.77 0.64
BIR 0.78 0.80 0.66 0.85 0.89 1.00 0.80 0.88 0.89 0.86 0.79 0.67
DUB 0.71 0.64 0.67 0.79 0.76 0.80 1.00 0.76 0.86 0.83 0.68 0.75
CLA 0.74 0.80 0.62 0.81 0.87 0.88 0.76 1.00 0.85 0.86 0.85 0.69
MUL 0.77 0.72 0.65 0.85 0.83 0.89 0.86 0.85 1.00 0.87 0.75 0.75
CLO 0.72 0.72 0.61 0.82 0.80 0.86 0.83 0.86 0.87 1.00 0.80 0.78
BEL 0.63 0.75 0.49 0.69 0.77 0.79 0.68 0.85 0.75 0.80 1.00 0.72
MAL 0.60 0.58 0.47 0.62 0.64 0.67 0.75 0.69 0.75 0.78 0.72 1.00
These positive correlations are also seen in the pairwise scatterplots here.

The pairwise scatterplots also show that the bivariate distributions are all approximately bivariate normal, though there are some departures from elliptical contours, such as with RPT and ROS. No outliers are apparent in these plots.

Principal components

I found the principal components from the covariance matrix of the square roots of Sunday wind speeds at the 12 locations, minus the fitted seasonal effect.

Pairwise scatterplots of the projections on the first four principal components are here. One observation with a high value on the fourth principal component might possibly be regarded as an outlier, but I decided that it was not extreme enough to warrant this.

The first six principal components have the following proportions of the variance:

     PC1    PC2   PC3    PC4    PC5    PC6 
   0.774 0.0601 0.048 0.0268 0.0235 0.0135 
The large proportion for the first principal component is expected, due to the strong correlation among all 12 locations. The variances for later components drop less dramatically. Three components seems a reasonable number to look at, though later ones might perhaps contain useful information as well.

The eigenvectors for the first three principal components are as follows:

      PC1   PC2   PC3
RPT -0.29 -0.41  0.01
VAL -0.29 -0.20 -0.50
ROS -0.22 -0.46  0.40
KIL -0.28 -0.26  0.13
SHA -0.30 -0.10 -0.25
BIR -0.32 -0.04 -0.08
DUB -0.29  0.10  0.45
CLA -0.31  0.09 -0.22
MUL -0.29  0.08  0.17
CLO -0.30  0.20  0.11
BEL -0.30  0.37 -0.36
MAL -0.27  0.55  0.30
As expected, the first principal component is more-or-less an average of the wind speed at all 12 locations (arbitrary given a negative sign), as expected given that they all have positive correlations. The second principal component weights the first six locations negatively and the second six positively, and in fact the coefficients increase from negative to positive values almost monotonically. Since the order of locations is roughly from South to North, this component seems to be picking up the difference between wind speed in the South of Ireland versus the North of Ireland. The third principal component is not readily interpetable, though perhaps it would be if one looked at more detailed information about the geography of Ireland.

Predicting average wind speed on Monday

The simplest model for predicting average wind speed on Monday (with square root transformation, minus seasonal effect fit for the preceding Sunday) is to use only the square root of the average wind speed on the preceding Sunday (minus seasonal effect). Here is part of the output from fitting such a model:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.00937    0.01808  -0.518    0.604
sws.ave      0.55311    0.02715  20.372   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5541 on 937 degrees of freedom
Multiple R-squared: 0.307,      Adjusted R-squared: 0.3062
The dependence on the single covariate is highly significant, so the adjusted R-squared of 0.3062 is certainly not due to chance. We can see how much better we can do with other models, however.

Using wind speeds at all 12 locations on the preceding Sunday (with square root transformation, minus fitted seasonal effect), we get the following model:

                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       -0.148100   0.069165  -2.141 0.032514 *
as.matrix(sws)RPT  0.065085   0.050533   1.288 0.198071
as.matrix(sws)VAL  0.177508   0.046869   3.787 0.000162 ***
as.matrix(sws)ROS -0.098517   0.042419  -2.322 0.020424 *
as.matrix(sws)KIL -0.087561   0.065004  -1.347 0.178310
as.matrix(sws)SHA  0.056055   0.063164   0.887 0.375067
as.matrix(sws)BIR  0.014955   0.064765   0.231 0.817437
as.matrix(sws)DUB  0.201114   0.049145   4.092 4.64e-05 ***
as.matrix(sws)CLA  0.060838   0.061526   0.989 0.323011
as.matrix(sws)MUL -0.032385   0.066317  -0.488 0.625433
as.matrix(sws)CLO -0.079055   0.061314  -1.289 0.197600
as.matrix(sws)BEL  0.221063   0.045985   4.807 1.78e-06 ***
as.matrix(sws)MAL -0.005927   0.039960  -0.148 0.882112
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5345 on 926 degrees of freedom
Multiple R-squared: 0.3626,     Adjusted R-squared: 0.3544
The adjusted R-squared is higher than when using just the average wind speed on Sunday. Without a formal test, it's not obvious whether this increase is statistically significant, but it's large enough that it seems intuitively that it probably is.

Rather than use wind speeds at all 12 locations, we can try reducing the data to projections on 1, 2, or 3 principal components, which gives the following three models:

             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.009370   0.018035   -0.52    0.604
pc$x[, 1]   -0.154812   0.007534  -20.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5527 on 937 degrees of freedom
Multiple R-squared: 0.3106,     Adjusted R-squared: 0.3099

                Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.009370   0.017951  -0.522  0.60183
pc$x[, 1:2]PC1 -0.154812   0.007499 -20.644  < 2e-16 ***
pc$x[, 1:2]PC2  0.084267   0.026923   3.130  0.00180 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5501 on 936 degrees of freedom
Multiple R-squared: 0.3178,     Adjusted R-squared: 0.3163

                Estimate Std. Error t value Pr(>|t|)
(Intercept)    -0.009370   0.017646  -0.531   0.5956
pc$x[, 1:3]PC1 -0.154812   0.007372 -21.001  < 2e-16 ***
pc$x[, 1:3]PC2  0.084267   0.026464   3.184   0.0015 **
pc$x[, 1:3]PC3 -0.171788   0.029596  -5.805 8.84e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5407 on 935 degrees of freedom
Multiple R-squared: 0.3415,     Adjusted R-squared: 0.3394
Using just one principal component gives an adjusted R-squared very close to using the average, which is not surprising since we saw above that the first principal component is very similar to the average. Adding the second and the third principal component improves adjusted R-squared, though it does not quite reach the value obtained when all 12 wind speeds are used. The p-values for the second and third PC indicate that they do actually improve the predictions.

The conclusion from this is that the data from the 12 locations can be reduced to just 3 principal components with only a small loss in predictive ability.