Selecting Variables for Species Distribution Models

When running statistical models, like multiple linear regression or generalized linear models, it is typically not a good idea to use multiple predictor variables that are highly correlated with one another, as it may result in an unstable final model. This guideline also applies to many of various flavors of species distribution models (SDMs), which take in two or more (usually climatic) predictor variables to model a species response to environmental gradients.  Given modeled climate output, the SDM models can be used to estimate the probability of a species occurring in a different time period, say at the last glacial maximum – 22,000 years ago–, or in 2100, once humans have contributed several degrees to the earth’s temperature.  While these models differ in their statistical techniques, most behave a little like multiple regression, where a set of predict variables are combined in some parametric or non-parametric way to estimate the response as a function of these inputs.  Thus, just like in a standard regression, using highly correlated climatic predictor variables can  contribute to instability in the modeled response.  This post describes the methods I chose to identify correlated variables and choose the ones I wanted to remain in my study.

Collinearity can increase estimates of parameter variance; yield models in which no variable is statistically significant even though R2y is large; produce parameter estimates of the “incorrect sign” and of implausible magnitude; create situations in which small changes in the data produce wide swings in parameter estimates; and, in truly extreme cases, prevent the numerical solution of a model (O’BRIEN, 2007)

The data

The data I am using here is a set of 19 bioclimatic variables, commonly used in SDMs, that describe potentially meaningful ecological parameters, such as average precipitation in the warmest quarter of the year, and are derived from monthly climatic GCM output.  My data is derived from a 0.5x0.5 degree CCSM3 climate model run, and is in raster grid format.


The first natural thing to do when looking at collinearity between variables would be to look at the pairwise correlation between them.  This will show us the correlation between each raster, and every other raster in the set.  We’re going to first do this by taking a large random sampling of spatial coordinates from the raster  set.  I have my predictor variables stored as a RasterStack, so I can use the sampleRandom tool in the raster package to generate n random points over my grid.  I now how a dataframe that shows the climatic values at each of the random points. I can scatter them together, and interrogate each one for pearson’s r correlation.  bio2_7

The figure here shows an example of scattering biovariable 2 with biovariable 7, resulting in an R2 value of ~0.45. Now, I can do this procedure for all 19 of my predictor variables, but that would be kind of tedious.  We would be left with 361 R2 values to compare, and 361 scatter plots to file-page1interrogate.  Sounds annoying.  We can simplify the visual comparison somewhat by plotting the correlation matrix. You can derived the correlation matrix for pearson’s r by using the layerStats function in the raster package.  Using the corrplot library in R, we can format the correlation matrix so that we can easily visualization the magnitude and direction of the correlations.  But, pretty much all we get out of this is that some variables are very well correlated and others are less correlated.  Not easy to compare, still annoying.

Variance Inflation Factor

The variance inflation factor (VIF) was conceived for exactly this purpose – comparing collinearity between regression predictors. The VIF quantified the expected amount of variance in a regression coefficient that is due to collinearity in the predictors. The VIF is bounded on the bottom by 1, and has no upper bound. A VIF of 1 indicates that that no extra variance is caused by collinearity in the factors.  A higher VIF indicates higher amounts of variances caused by collinearity, for example, a VIF of 5.1 suggests 510% extra variance in the regression coefficient. VIFs are comparable so we can directly compare and choose variables to eliminate based on this index.

Calculating the VIF:

Calculating the VIF index is pretty simple, actually, and can be done very quickly.  To calculate the index for layer i, run a regression of X in terms of all of the other predictors that you wish to use.  The form of the equation is:

Xi = αjXj + αkXk + … + αnXn + εi

Note that this is not a regression of the response variable for the model, instead, its a regression of the layer in terms of linear combinations of the other predictors.

Now, calculate the R2i coefficient of determination value for Xi.

Finally, calculate the VIF for predictor i by using VIF=1/(1-R2i).

If R2i is high for this layer, then the denominator of the VIF equation will be very small.  Once evaluated, if R2i is close to 1 (perfectly correlated) then the VIF will go to infinity. On the other hand, if R2i is not correlated with any of the other predictors at all (R2i = 0), then the denominator of the VIF will be 1-0, and the VIF term will evaluate to just 1.  We can do this for each one of our raster predictor layers, and then compare apples-to-apples how collinear they are.

I calculated the inflation factor using the R package usdm, which made the raster calculations quite convenient.  Since R is open source, you can check out the source to see exactly what it’s doing, or you could probably roll your own VIF script pretty easily. In any case, the vifcor function in usdm was designed for finding collinear variables for species distribution models, returns both the VIF and the R value, was easy to use, and all-in-all, seemed like the right tool for the job.

Using the VIF

So, now we have the variance inflation factor for all of our layers we can figure out which ones are the most collinear and eliminate the ones might cause instability in our model. Plotting out the results of the VIF calculation, we can see what the correlation matrix plot suggested earlier. Some variables, like #11 and #1, are super, super correlated with the other variables in the set.  But now we know exactly how much that would affect the predictive model. vif_plot-1

Scholars debate how much inflation should be allowed in the model.  Some say 10 (which appears to the the internet’s rule of thumb), some say five, some say 2.5.  In any case, variable 11’s VF of ~7000 is probably above the acceptable threshold. For my purposes, all of the VIFs seemed quite high (none under 2.5).  So, perhaps I should use different predictor variables.  Since this set of biovariables is common standard for SDM modeling, I think its worth noting just how correlated they are with each other and the potential for instability that this causes.  However, since they are such a standard, I decided to just take the seven least correlated predictors and use those as my final predictor set.

Biovariable VIF Interpretation
2 36.405 Mean Diurnal Range
7 9.597 Temperature Annual Range
8 26.479 Mean Temperature of Wettest Quarter
15 94.520456 Precipitation Seasonality
18 17.879 Precipitation of Warmest Quarter
17 12.981 Precipitation of Driest Quarter

When we reduce our predictor to use these variables and re-run the VIF calculation, our variance inflation is much lower (since there are less correlated things to be collinear with), and the analysis falls within the accepted standards for variable collinearity.  The maximum correlation that remains is between #2 and #15, at only R=0.51, which seems reasonable.

Variable VIF
bio2 2.349056
bio7 1.554106
bio8 1.660181
bio15 1.721449
bio18 1.434528
bio19 1.362137

In most cases, it is good to reduce multicollinearity between predictors.  However, it’s also important to remember to keep ecologically meaningful predictors in your set so that you’re not just producing models of statistical artifacts.