I did a linear fit of the average wind speed on Sundays on the sin and cos of year angle, and the same for log of the average and square root of the average. The results are shown in this plot. The top graphs show the data and the predictions from the linear fit on sin and cos of year angle. The bottom graphs show the residuals.

With no transformation, the variance of the residuals seems visually to be larger in winter than in summer (and this is confirmed by R computations). With no transformation, the residuals also show some right (positive) skew. With a log transformation, the residuals seem to have nearly the same distribution in all seasons, but this distribution shows left (negative) skew. With a square root transformation, the residuals seem closer to having a normal distribution, but there still seems to be higher variance in winter than summer (though less so than with no transformation). The above conclusions regarding skewness or normality of the distributions are confirmed by the histograms and normal quantile-quantile plots here.

These results indicate that using the averages without transformation is not very good. One could argue for either a log transformation, to produce more equal variances in summer and winter, or for the square root transformation, to make the residuals have closer to a normal distribution. Since the sample size is quite large, exact normality may not be crucial (the Central Limit Theorem will probably correct any problems when looking at p-values based on normality), so perhaps the log transformation is most appropriate. However, as instructed in the assignment handout, I use the square root transformation below, as Haslett and Raftery did.

RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL RPT 1.00 0.80 0.73 0.84 0.82 0.78 0.71 0.74 0.77 0.72 0.63 0.60 VAL 0.80 1.00 0.58 0.75 0.84 0.80 0.64 0.80 0.72 0.72 0.75 0.58 ROS 0.73 0.58 1.00 0.77 0.60 0.66 0.67 0.62 0.65 0.61 0.49 0.47 KIL 0.84 0.75 0.77 1.00 0.84 0.85 0.79 0.81 0.85 0.82 0.69 0.62 SHA 0.82 0.84 0.60 0.84 1.00 0.89 0.76 0.87 0.83 0.80 0.77 0.64 BIR 0.78 0.80 0.66 0.85 0.89 1.00 0.80 0.88 0.89 0.86 0.79 0.67 DUB 0.71 0.64 0.67 0.79 0.76 0.80 1.00 0.76 0.86 0.83 0.68 0.75 CLA 0.74 0.80 0.62 0.81 0.87 0.88 0.76 1.00 0.85 0.86 0.85 0.69 MUL 0.77 0.72 0.65 0.85 0.83 0.89 0.86 0.85 1.00 0.87 0.75 0.75 CLO 0.72 0.72 0.61 0.82 0.80 0.86 0.83 0.86 0.87 1.00 0.80 0.78 BEL 0.63 0.75 0.49 0.69 0.77 0.79 0.68 0.85 0.75 0.80 1.00 0.72 MAL 0.60 0.58 0.47 0.62 0.64 0.67 0.75 0.69 0.75 0.78 0.72 1.00These positive correlations are also seen in the pairwise scatterplots here.

The pairwise scatterplots also show that the bivariate distributions are all approximately bivariate normal, though there are some departures from elliptical contours, such as with RPT and ROS. No outliers are apparent in these plots.

Pairwise scatterplots of the projections on the first four principal components are here. One observation with a high value on the fourth principal component might possibly be regarded as an outlier, but I decided that it was not extreme enough to warrant this.

The first six principal components have the following proportions of the variance:

PC1 PC2 PC3 PC4 PC5 PC6 0.774 0.0601 0.048 0.0268 0.0235 0.0135The large proportion for the first principal component is expected, due to the strong correlation among all 12 locations. The variances for later components drop less dramatically. Three components seems a reasonable number to look at, though later ones might perhaps contain useful information as well.

The eigenvectors for the first three principal components are as follows:

PC1 PC2 PC3 RPT -0.29 -0.41 0.01 VAL -0.29 -0.20 -0.50 ROS -0.22 -0.46 0.40 KIL -0.28 -0.26 0.13 SHA -0.30 -0.10 -0.25 BIR -0.32 -0.04 -0.08 DUB -0.29 0.10 0.45 CLA -0.31 0.09 -0.22 MUL -0.29 0.08 0.17 CLO -0.30 0.20 0.11 BEL -0.30 0.37 -0.36 MAL -0.27 0.55 0.30As expected, the first principal component is more-or-less an average of the wind speed at all 12 locations (arbitrary given a negative sign), as expected given that they all have positive correlations. The second principal component weights the first six locations negatively and the second six positively, and in fact the coefficients increase from negative to positive values almost monotonically. Since the order of locations is roughly from South to North, this component seems to be picking up the difference between wind speed in the South of Ireland versus the North of Ireland. The third principal component is not readily interpetable, though perhaps it would be if one looked at more detailed information about the geography of Ireland.

Estimate Std. Error t value Pr(>|t|) (Intercept) -0.00937 0.01808 -0.518 0.604 sws.ave 0.55311 0.02715 20.372 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.5541 on 937 degrees of freedom Multiple R-squared: 0.307, Adjusted R-squared: 0.3062The dependence on the single covariate is highly significant, so the adjusted R-squared of 0.3062 is certainly not due to chance. We can see how much better we can do with other models, however.

Using wind speeds at all 12 locations on the preceding Sunday (with square root transformation, minus fitted seasonal effect), we get the following model:

Estimate Std. Error t value Pr(>|t|) (Intercept) -0.148100 0.069165 -2.141 0.032514 * as.matrix(sws)RPT 0.065085 0.050533 1.288 0.198071 as.matrix(sws)VAL 0.177508 0.046869 3.787 0.000162 *** as.matrix(sws)ROS -0.098517 0.042419 -2.322 0.020424 * as.matrix(sws)KIL -0.087561 0.065004 -1.347 0.178310 as.matrix(sws)SHA 0.056055 0.063164 0.887 0.375067 as.matrix(sws)BIR 0.014955 0.064765 0.231 0.817437 as.matrix(sws)DUB 0.201114 0.049145 4.092 4.64e-05 *** as.matrix(sws)CLA 0.060838 0.061526 0.989 0.323011 as.matrix(sws)MUL -0.032385 0.066317 -0.488 0.625433 as.matrix(sws)CLO -0.079055 0.061314 -1.289 0.197600 as.matrix(sws)BEL 0.221063 0.045985 4.807 1.78e-06 *** as.matrix(sws)MAL -0.005927 0.039960 -0.148 0.882112 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.5345 on 926 degrees of freedom Multiple R-squared: 0.3626, Adjusted R-squared: 0.3544The adjusted R-squared is higher than when using just the average wind speed on Sunday. Without a formal test, it's not obvious whether this increase is statistically significant, but it's large enough that it seems intuitively that it probably is.

Rather than use wind speeds at all 12 locations, we can try reducing the data to projections on 1, 2, or 3 principal components, which gives the following three models:

Estimate Std. Error t value Pr(>|t|) (Intercept) -0.009370 0.018035 -0.52 0.604 pc$x[, 1] -0.154812 0.007534 -20.55 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.5527 on 937 degrees of freedom Multiple R-squared: 0.3106, Adjusted R-squared: 0.3099 Estimate Std. Error t value Pr(>|t|) (Intercept) -0.009370 0.017951 -0.522 0.60183 pc$x[, 1:2]PC1 -0.154812 0.007499 -20.644 < 2e-16 *** pc$x[, 1:2]PC2 0.084267 0.026923 3.130 0.00180 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.5501 on 936 degrees of freedom Multiple R-squared: 0.3178, Adjusted R-squared: 0.3163 Estimate Std. Error t value Pr(>|t|) (Intercept) -0.009370 0.017646 -0.531 0.5956 pc$x[, 1:3]PC1 -0.154812 0.007372 -21.001 < 2e-16 *** pc$x[, 1:3]PC2 0.084267 0.026464 3.184 0.0015 ** pc$x[, 1:3]PC3 -0.171788 0.029596 -5.805 8.84e-09 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.5407 on 935 degrees of freedom Multiple R-squared: 0.3415, Adjusted R-squared: 0.3394Using just one principal component gives an adjusted R-squared very close to using the average, which is not surprising since we saw above that the first principal component is very similar to the average. Adding the second and the third principal component improves adjusted R-squared, though it does not quite reach the value obtained when all 12 wind speeds are used. The p-values for the second and third PC indicate that they do actually improve the predictions.

The conclusion from this is that the data from the 12 locations can be reduced to just 3 principal components with only a small loss in predictive ability.