Modeling species distributions using deep convolutional neural networks

I’ve recently been tooling around with the deep learning frameworks TensorFlow and Caffe. I’ve worked some toy examples, and I have been thinking about ways to apply them to the problem domains I am most familiar with, in particular biodiversity modeling or, recently, infectious disease modeling. I think I have found an interesting and novel application of CNNs to the problem of modeling species distributions. This post is a quick attempt to outline these thoughts. If they still make sense in a couple of days, I will attempt to put a prototype together.

The Problem: Make a prediction of species presence given climatic covariates. It’s been done many times in many ways, and as I note in my thesis, it is extremely widely used in both management and the academy. However, there are known shortcomings of existing techniques. One drawback that comes to mind is that traditional methods for SDM (random forest, logistic regression, etc.) take in an n x p design matrix of training data: n labeled examples of presence/absence and the p climatic covariates found at each location. The model has no information about the climatic conditions in surrounding areas. Therefore, it may fail to capture information about climatic transitions, e.g., a species with a strong affinity for locations on a hot-to-cold gradient. This is particularly important when fitting the models, as we often do, with outputs from a general circulation model, since the particular gridcell that a training example falls into may not be fully representative of the broader climatic patterns surrounding the site. Finally, it seems that this could be very valuable for predicting the distribution of animals and other mobile species (including pollen distributions), by characterizing spatial patterns in climatic covariates in the regions surrounding the places where they have been found.
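
To make the limitation concrete, here is a minimal sketch in plain Python. The two 3 x 3 temperature patches and the west-minus-east filter are made-up illustrations, not real climate data or a trained model. Both patches have the same value in the center cell, so a point-based model sees identical inputs, while a simple spatial filter of the kind a CNN could learn tells them apart.

```python
def apply_kernel(patch, kernel):
    """Valid-mode 2-D cross-correlation of a patch with a kernel (lists of lists)."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(patch) - kh + 1):
        row = []
        for j in range(len(patch[0]) - kw + 1):
            row.append(sum(patch[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out


# Two hypothetical temperature patches (degrees C) sharing the same center cell.
uniform = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
gradient = [[16, 10, 4], [16, 10, 4], [16, 10, 4]]  # hot in the west, cold in the east

# A west-minus-east difference filter (illustrative, not learned).
west_east = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

print(uniform[1][1], gradient[1][1])      # the point values are identical
print(apply_kernel(uniform, west_east))   # [[0]]
print(apply_kernel(gradient, west_east))  # [[36]]
```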

The Idea: Turn the problem from a site-by-site classification problem into a computer vision problem where each training example is an ‘image’, with p channels, of the climatic space surrounding an observation. In this way, you’d be predicting whether or not the ‘image’ contained a species presence. The ‘image’ could be of configurable size, but I’m thinking it would be a 0.5 degree square box extending +/- 0.25 degrees from the location of the occurrence. If we’re working with GCM output with 800 m gridcells, like the Lorenz et al. data, this box would contain several thousand gridcells (about 70 x 70 at the equator, since half a degree of latitude is roughly 55 km). If this box is our training data, we (1) get more information to feed to the classifier, and (2) get information about how the climate space around a species presence (or absence) is arranged spatially. It’s a lot more data (in terms of training set bytes), and perhaps it would not be helpful (in terms of predictive accuracy), but it seems like it’d be a neat idea to try.
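
As a back-of-the-envelope check on the size of such an ‘image’, the sketch below estimates how many 800 m cells fit in a 0.5 degree box at a few latitudes. It assumes a spherical approximation of about 111.32 km per degree of latitude, and the function name is only illustrative. The box narrows in the east-west direction away from the equator, which is one reason the examples may need resizing or padding to a common shape.

```python
import math

KM_PER_DEGREE_LAT = 111.32  # approximate length of one degree of latitude


def cells_in_box(lat_deg, box_deg=0.5, cell_km=0.8):
    """Approximate (rows, cols) of grid cells inside a box_deg-wide box."""
    height_km = box_deg * KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of latitude.
    width_km = height_km * math.cos(math.radians(lat_deg))
    return round(height_km / cell_km), round(width_km / cell_km)


for lat in (0, 45, 60):
    rows, cols = cells_in_box(lat)
    print(lat, rows, cols, rows * cols)
```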

How would you do it?: I’ve got a rough plan for doing this. This is Neotoma-centric, but I think you could do the same thing with GBIF – it’d just be a lot more data. You could do it with any species, but here, I’m thinking we could try on Spruce (Picea) or Oak (Quercus) – they each have many thousands of occurrences and are well distributed, so they’d be good to start out on.

  1. Get the spatiotemporal location of all occurrences of terrestrial vascular species in Neotoma. Exclude aquatics, fungi, ostracodes, diatoms, and vertebrates. This would give us Latitude, Longitude, and Years BP for ~1 million observations. These are the negative training examples (P=0).

  2. Get the spatiotemporal location of all occurrences of the species of interest in Neotoma (e.g., Quercus). You’d have to do a bit of work to make sure that the set from (1) didn’t overlap with this set, but that’s no problem. These are the positive training examples (P=1).

  3. For both positive and negative examples, download the climate space in a box extending 0.25 degrees N/S/E/W of the documented observation, for the given time period. For example, if this occurrence happened at 2000 years BP at latitude 0, longitude 0, download all 800 meter grid cells from the 2000 years BP climatic predictor layers spanning longitudes 0.25 W to 0.25 E and latitudes 0.25 S to 0.25 N. This is a perfect application of my Climate Data Service. It would need another endpoint, but if I recall correctly, I already wrote the logic for serving arbitrary bounding boxes when making custom png tiles. If this were incorporated in the service, all you’d have to do would be (1) calculate the BBox from the spatial position, (2) hit the CDS, and (3) parse the results into a numpy array with p channels. Some resizing or zero padding might need to be done to ensure that every example has exactly the same dimensions.

  4. Define and train a deep convolutional network. This would seem to be the hard part, but with frameworks like Caffe and TensorFlow, I believe this could be accomplished relatively easily. I believe we could have ~1-1.5 million training examples with which to build the network, a reasonable size to train a CNN from scratch. The output layer would have two classes (0: species not present, 1: species present).

  5. Of course, we’d have a validation set on which we could evaluate accuracy on different hyperparameter values, etc.

  6. Prediction: Prediction with this algorithm might get a little messy. Prediction for a single location would, of course, be very easy. It would require picking a lat/long, calculating the bounding box, getting the climate snapshot, and then applying the model. It would get trickier if you wanted to apply the model to get spatial predictions, e.g., a range map. I haven’t done much looking at things like generative adversarial networks, but perhaps they could be of use here, in generating range maps given input snapshots. A brute force attempt, however, would be possible. Define a two dimensional grid of lat/long points, spaced at 800 meter intervals (or whatever resolution you used in training). For every grid cell, perform the steps above to get a prediction. That prediction would use data from the surrounding cells, but would only apply directly to the point in the center of the bounding box. Iterate over every point in the grid. Then use an interpolation algorithm to smooth the point estimates into a range map.
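
The data-handling parts of the steps above can be sketched in plain Python. This is only a rough sketch under stated assumptions: records are (lat, lon, years_bp) tuples, overlap between the positive and negative sets is removed by exact match, and `fetch_snapshot` and `model` are hypothetical callables standing in for the climate service endpoint and the trained network, neither of which exists yet. The grid spacing is given in degrees; 800 m is about 0.007 degrees of latitude.

```python
HALF_WIDTH_DEG = 0.25  # box extends this far N/S/E/W of each point


def label_occurrences(all_records, target_records):
    """Steps 1-2: label target-taxon records 1 and all remaining records 0.

    Records are (lat, lon, years_bp) tuples; matching on exact equality is an
    assumption about how overlap between the two sets should be detected.
    """
    positives = set(target_records)
    negatives = set(all_records) - positives
    return [(r, 1) for r in sorted(positives)] + [(r, 0) for r in sorted(negatives)]


def bounding_box(lat, lon, half_width=HALF_WIDTH_DEG):
    """Step 3: (west, south, east, north) in decimal degrees around a point."""
    return (lon - half_width, lat - half_width, lon + half_width, lat + half_width)


def pad_to_shape(layer, rows, cols, fill=0.0):
    """Step 3: crop or pad one 2-D layer (list of lists) to exactly rows x cols."""
    out = []
    for r in range(rows):
        src = layer[r] if r < len(layer) else []
        row = list(src[:cols])
        row += [fill] * (cols - len(row))
        out.append(row)
    return out


def prediction_grid(south, west, north, east, step_deg):
    """Step 6: list the (lat, lon) nodes of a regular grid over a region."""
    n_rows = int(round((north - south) / step_deg)) + 1
    n_cols = int(round((east - west) / step_deg)) + 1
    return [(south + i * step_deg, west + j * step_deg)
            for i in range(n_rows) for j in range(n_cols)]


def brute_force_predictions(points, years_bp, fetch_snapshot, model):
    """Step 6: score every grid point from the climate snapshot around it.

    fetch_snapshot(bbox, years_bp) and model(snapshot) are hypothetical
    stand-ins for the climate service and the trained CNN.
    """
    return {(lat, lon): model(fetch_snapshot(bounding_box(lat, lon), years_bp))
            for lat, lon in points}
```

The dictionary returned by `brute_force_predictions` holds one point estimate per grid node, which is the input the interpolation step would then smooth into a range map.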

I think this could be pretty interesting. Of course, CNNs are notorious for being black boxes – you don’t know what they’re doing, which often doesn’t sit well with ecologists. And ecological data is messy. And, this might be fundamentally flawed – perhaps including the surrounding climate will lead to worse predictions! But if it did improve predictions, it would be a novel application of deep learning and cutting-edge computer vision techniques to the ecological domain.

Side Note: ForestNet: It takes a lot to train a CNN from scratch with random initialization. It seems as though using pre-initialized models often leads to better predictions (and definitely faster training), particularly if the new dataset is rather small. If the technique described above works, it seems possible to build a general ForestNet that could be used (1) to predict land cover and (2) to initialize weights for future models. Here’s what I’m thinking:

ForestNet would be trained using the process described above. However, instead of an individual species of interest (e.g., Oak) it would target an ecosystem of interest (in this case, Forest). The positive examples would be sites in which one or more forest-type species were present (or n > q, where n is the number of forest-type species at the site and q is some threshold). The negative examples would be those sites where zero forest types (or n <= q) were found. The training would proceed as above. The resulting classifier would be general, and could be used to predict the presence or absence of a forest (or other ecosystem type). Once a trained network was available, future species-specific networks could be made by initializing a network with the ForestNet weights, and then fine-tuning the network to predict the specific species.
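
The ForestNet labeling rule could look something like the sketch below. The site names, the taxon lists, and the choice to treat a count exactly equal to the threshold as a negative example are all assumptions made for illustration.

```python
def label_forest_sites(site_taxa, forest_taxa, q=0):
    """Label a site 1 if it holds more than q forest-type taxa, otherwise 0."""
    forest = set(forest_taxa)
    labels = {}
    for site, taxa in site_taxa.items():
        n = len(set(taxa) & forest)  # number of distinct forest-type taxa at the site
        labels[site] = 1 if n > q else 0
    return labels


# Hypothetical sites and an illustrative (not authoritative) forest-type list.
forest_taxa = ["Picea", "Quercus", "Pinus"]
site_taxa = {
    "site_a": ["Picea", "Quercus", "Poaceae"],
    "site_b": ["Poaceae", "Ambrosia"],
    "site_c": ["Pinus", "Ambrosia"],
}

print(label_forest_sites(site_taxa, forest_taxa))       # any forest taxon counts
print(label_forest_sites(site_taxa, forest_taxa, q=1))  # require at least two
```

With q=0 this reduces to the ‘one or more’ versus ‘zero’ rule; raising q makes the positive class stricter.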