When running statistical models, like multiple linear regression or generalized linear models, it is typically not a good idea to use multiple predictor variables that are highly correlated with one another, as it may result in an unstable final model. This guideline also applies to many of various flavors of species distribution models (SDMs), which take in two or more (usually climatic) predictor variables to model a species response to environmental gradients. Given modeled climate output, the SDM models can be used to estimate the probability of a species occurring in a different time period, say at the last glacial maximum – 22,000 years ago–, or in 2100, once humans have contributed several degrees to the earth’s temperature. While these models differ in their statistical techniques, most behave a little like multiple regression, where a set of predict variables are combined in some parametric or non-parametric way to estimate the response as a function of these inputs. Thus, just like in a standard regression, using highly correlated climatic predictor variables can contribute to instability in the modeled response. This post describes the methods I chose to identify correlated variables and choose the ones I wanted to remain in my study.
Collinearity can increase estimates of parameter variance; yield models in which no variable is statistically significant even though R2y is large; produce parameter estimates of the “incorrect sign” and of implausible magnitude; create situations in which small changes in the data produce wide swings in parameter estimates; and, in truly extreme cases, prevent the numerical solution of a model (O’BRIEN, 2007)
The data I am using here is a set of 19 bioclimatic variables, commonly used in SDMs, that describe potentially meaningful ecological parameters, such as average precipitation in the warmest quarter of the year, and are derived from monthly climatic GCM output. My data is derived from a 0.5x0.5 degree CCSM3 climate model run, and is in raster grid format.
The first natural thing to do when looking at collinearity between variables would be to look at the pairwise correlation between them. This will show us the correlation between each raster, and every other raster in the set. We’re going to first do this by taking a large random sampling of spatial coordinates from the raster set. I have my predictor variables stored as a RasterStack, so I can use the sampleRandom tool in the raster package to generate n random points over my grid. I now how a dataframe that shows the climatic values at each of the random points. I can scatter them together, and interrogate each one for pearson’s r correlation.
The figure here shows an example of scattering biovariable 2 with biovariable 7, resulting in an R2 value of ~0.45. Now, I can do this procedure for all 19 of my predictor variables, but that would be kind of tedious. We would be left with 361 R2 values to compare, and 361 scatter plots to interrogate. Sounds annoying. We can simplify the visual comparison somewhat by plotting the correlation matrix. You can derived the correlation matrix for pearson’s r by using the layerStats function in the raster package. Using the corrplot library in R, we can format the correlation matrix so that we can easily visualization the magnitude and direction of the correlations. But, pretty much all we get out of this is that some variables are very well correlated and others are less correlated. Not easy to compare, still annoying.
Variance Inflation Factor
The variance inflation factor (VIF) was conceived for exactly this purpose – comparing collinearity between regression predictors. The VIF quantified the expected amount of variance in a regression coefficient that is due to collinearity in the predictors. The VIF is bounded on the bottom by 1, and has no upper bound. A VIF of 1 indicates that that no extra variance is caused by collinearity in the factors. A higher VIF indicates higher amounts of variances caused by collinearity, for example, a VIF of 5.1 suggests 510% extra variance in the regression coefficient. VIFs are comparable so we can directly compare and choose variables to eliminate based on this index.
Calculating the VIF:
Calculating the VIF index is pretty simple, actually, and can be done very quickly. To calculate the index for layer i, run a regression of Xi in terms of all of the other predictors that you wish to use. The form of the equation is:
Xi = αjXj + αkXk + … + αnXn + εi
Note that this is not a regression of the response variable for the model, instead, its a regression of the layer in terms of linear combinations of the other predictors.
Now, calculate the R2i coefficient of determination value for Xi.
Finally, calculate the VIF for predictor i by using VIF=1/(1-R2i).
If R2i is high for this layer, then the denominator of the VIF equation will be very small. Once evaluated, if R2i is close to 1 (perfectly correlated) then the VIF will go to infinity. On the other hand, if R2i is not correlated with any of the other predictors at all (R2i = 0), then the denominator of the VIF will be 1-0, and the VIF term will evaluate to just 1. We can do this for each one of our raster predictor layers, and then compare apples-to-apples how collinear they are.
I calculated the inflation factor using the R package usdm, which made the raster calculations quite convenient. Since R is open source, you can check out the source to see exactly what it’s doing, or you could probably roll your own VIF script pretty easily. In any case, the vifcor function in usdm was designed for finding collinear variables for species distribution models, returns both the VIF and the R2 value, was easy to use, and all-in-all, seemed like the right tool for the job.
Using the VIF
So, now we have the variance inflation factor for all of our layers we can figure out which ones are the most collinear and eliminate the ones might cause instability in our model. Plotting out the results of the VIF calculation, we can see what the correlation matrix plot suggested earlier. Some variables, like #11 and #1, are super, super correlated with the other variables in the set. But now we know exactly how much that would affect the predictive model.
Scholars debate how much inflation should be allowed in the model. Some say 10 (which appears to the the internet’s rule of thumb), some say five, some say 2.5. In any case, variable 11’s VF of ~7000 is probably above the acceptable threshold. For my purposes, all of the VIFs seemed quite high (none under 2.5). So, perhaps I should use different predictor variables. Since this set of biovariables is common standard for SDM modeling, I think its worth noting just how correlated they are with each other and the potential for instability that this causes. However, since they are such a standard, I decided to just take the seven least correlated predictors and use those as my final predictor set.
|2||36.405||Mean Diurnal Range|
|7||9.597||Temperature Annual Range|
|8||26.479||Mean Temperature of Wettest Quarter|
|18||17.879||Precipitation of Warmest Quarter|
|17||12.981||Precipitation of Driest Quarter|
When we reduce our predictor to use these variables and re-run the VIF calculation, our variance inflation is much lower (since there are less correlated things to be collinear with), and the analysis falls within the accepted standards for variable collinearity. The maximum correlation that remains is between #2 and #15, at only R=0.51, which seems reasonable.
In most cases, it is good to reduce multicollinearity between predictors. However, it’s also important to remember to keep ecologically meaningful predictors in your set so that you’re not just producing models of statistical artifacts.