May 25, 2017
I’m writing a review paper that synthesizes some of the recent literature in Big Data and ecoinformatics and situates ecoinformatics as a Big Data science. I’m drawing on work I did for my master’s thesis examining the growth curves of the Neotoma Database and GBIF. Jack and I have wanted to see how the Paleobiology Database lines up with these allied resources, but its API is not as friendly for pulling stats as the other two. GBIF wins in the stats category, with a whole endpoint and an R function designed for getting custom stats reports.
Apr 18, 2017
Yesterday I successfully defended my master’s thesis project, both in a public talk in front of an audience of about 20 of my friends (and mother!) and in a closed-door session with my committee.
Mar 29, 2017
Last week I had the opportunity to present a short talk at the Digital Data in Paleontological Research conference, sponsored by iDigBio in Berkeley, CA. It was a great opportunity to get to know a community of people working with the same types of data as I am, but tackling totally different problems. The goals of my talk were (1) to cast Neotoma and PBDB as excellent additions to modeling pipelines and (2) to introduce my PhD research on reconstructing past land cover. Most of the folks were vertebrate paleontologists, and many use image, CT, or 3D databases rather than the paleobiodiversity databases I am familiar with, so I thought it was important to highlight the ability of these databases to serve a key role in modeling pipelines. Here are the slides from my talk.
Feb 21, 2017
I recently spent several days trying to solve a (seemingly) simple problem: vectorizing some rasters. I’m making an interactive visualization of ice sheet volume during deglaciation, and I needed the data in GeoJSON format to put into my Mapbox GL map. After finding the data, I encountered two main problems:
Feb 5, 2017
Sometimes, especially when working with climate data, it is necessary to work with NetCDFs (Network Common Data Form, a standard packaging format for scientific data). While NetCDF is good for data distribution, storage, and provenance, it is less easy to fit into standard raster data processing workflows. If you’re used to working with TIFFs or JPEGs, the following command might be helpful, assuming you have GDAL installed on your machine.
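As a sketch of the idea (the file name climate.nc and variable name tas are hypothetical stand-ins), GDAL’s gdal_translate can pull a single NetCDF variable out into a GeoTIFF. Here it’s wrapped in Python so it skips gracefully when GDAL or the input file isn’t present:

```python
import os
import shutil
import subprocess

# Hypothetical inputs: a file climate.nc containing a variable named "tas".
# Run `gdalinfo climate.nc` first to list the subdatasets a NetCDF exposes.
cmd = [
    "gdal_translate",          # GDAL's format-conversion tool
    "-of", "GTiff",            # output format: GeoTIFF
    "NETCDF:climate.nc:tas",   # one variable (subdataset) of the NetCDF
    "climate.tif",             # destination raster
]

# Only attempt the conversion if GDAL and the input file are actually present.
if shutil.which("gdal_translate") and os.path.exists("climate.nc"):
    subprocess.run(cmd, check=True)
```

From there, climate.tif drops straight into any workflow that already handles TIFFs.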
Dec 16, 2016
Several months back, I wrote about the niche API that I was putting together. I put that on hold for a hot minute while I was working on other projects, most notably my thesis. Since the Flyover Country team came to Madison in November, I’ve been pretty psyched up about it again, and have been working on it consistently for several weeks. I’m pretty confident in its current version, and I hope you’ll head over to the live demo page or the GitHub repo to check it out.
Nov 3, 2016
I’m psyched to announce that a map I worked on last spring won the North American Cartographic Information Society (NACIS) Best Dynamic Map award at their annual conference a couple weeks back. It was a blast working with Starr and Meghan on this project and well worth the late nights. Looking forward to more interactive mapping coming up. If you’ve got something you want to map – let me know!
Oct 22, 2016
A few weeks ago, I posted about Big Data in the field of ecology, specifically biodiversity informatics, where I looked at the holdings of the Neotoma Paleoecological Database and the Global Biodiversity Information Facility (GBIF). I made some comparisons between the two databases, though the units of scale were different. In Neotoma, I used the number of datasets that had been submitted, while for GBIF, I commented on the number of occurrences. These are two fundamentally different units, so I set out to resolve the issue and find out how many occurrences there are in Neotoma.
Sep 19, 2016
I’ve been thinking a lot about my PhD research, even as I’ve been working 12 hours a day to finish my master’s thesis. If you were recently thinking “I wonder what Scott will be doing in Wisconsin for the next 3-6 years”, wonder no more. This is the Cliff Notes version, expect a 150-200 page version in early 2021.
Aug 31, 2016
We’ve all heard the term ‘Big Data’, though it’s often thrown around as a techy buzzword, along with others like ‘The Cloud’, without a clear meaning. In the Williams Lab, we’re working with datasets that are sometimes called ‘Big Data’ in talks by @iceageecologist and others, housed in databases like Neotoma, the Global Biodiversity Information Facility, and the Paleobiology Database. Today, I ask: what characteristics of our data make it ‘Big Data’?
Aug 21, 2016
One of the features Rob suggested I add to Ice Age Mapper during our last meeting was a dynamic URL that would record the current state of the application, and could thus be shared between users. I took a stab at that last week, and got it working pretty well. I thought it would require a lot of re-coding from the ground up, but it turns out that most of what I had written previously could be easily converted to load from a URL string. My application only generates a shareable URL when the user clicks the ‘Share’ button, but in theory, the app could easily be modified to generate a new URL each time an action was taken. I think this would actually not be a good idea, because it would mean there would be an entry in the user’s web history for each action they took inside the application, meaning they would have to click the back button like a million times if they messed up. Good to know support exists for that, though.
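Ice Age Mapper itself is JavaScript, but the state-in-a-URL idea is language agnostic. Here’s a minimal round-trip sketch in Python (the base URL and state fields are hypothetical):

```python
from urllib.parse import urlencode, urlparse, parse_qs

BASE = "https://example.org/iceagemapper/"  # hypothetical app URL

def make_share_url(state):
    # Serialize the current application state into a query string
    # appended to the app's base URL.
    return BASE + "?" + urlencode(state)

def restore_state(url):
    # Parse the query string back into a state dict; parse_qs returns
    # a list per key, so unwrap the single values.
    return {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}

# Hypothetical state fields for a paleo-mapping app.
state = {"taxon": "Picea", "minAge": "10000", "maxAge": "21000", "zoom": "4"}
url = make_share_url(state)
assert restore_state(url) == state  # the state round-trips cleanly
```

Generating the URL only on ‘Share’, rather than on every action, is just a matter of when make_share_url gets called.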
Jul 19, 2016
Why the apt-get package manager doesn’t contain the latest version of R automatically, I’m not sure. I recently realized I have been downloading a 2+ year old distribution for all of my SDM timing runs by running the standard sudo apt-get install r-base command at the shell. For several weeks, this was fine, but today the package Rcpp, which wraps compiled C++ code in the R environment, failed to compile. I spent most of the afternoon trying to figure out what was going on. It didn’t even occur to me that the r-base package I was using was the root cause.
Jun 26, 2016
I’ve made it through 4,830 of the experiments I want to run for my thesis, so I’m taking this opportunity to reflect on my preliminary results, visually check what I have, and make any necessary changes before doing the more expensive portion of the experiments. So far, the results look okay, but definitely not what I expected. The effect of the computing configuration on computing time seems to be minimal; the effect of the different experimental parameters, on the other hand, is pretty significant.
Jun 26, 2016
I’ve spent some time over the past couple of weeks building out the Niche API, a set of web services that enable you to get global climate model (GCM) simulated climate data for specific points in space and time. For this project I’ve been mixing database design, backend web programming, and a bit of cloud computing. It’s been a fun process, and is turning into what I think will be a very useful tool. In this post, I put down a few thoughts about the decisions I made, the techniques I used, and the problems I faced.
Jun 16, 2016
I’ve made significant progress in getting a couple of Google’s computers to do my bidding (aka my thesis) in an automated way, so I thought I would share my experience setting up my cluster and, specifically, the configuration of computing nodes and database/control nodes. My setup draws a bit on the design of larger systems like Hadoop, which provide frameworks for massive, distributed, fault-tolerant computation. In short, I have one Master Node that hosts a database and a control script, and a pool of compute nodes that are fault-tolerant and designed only for computing. The compute nodes don’t have to know anything about the progress of the entire project and can handle being shut down mid-run, and the control node doesn’t have to know anything about the simulations being computed.
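To make that split concrete, here’s a toy sketch of the claim-a-job step using an in-memory SQLite table (the table layout and names are hypothetical; in my real setup the database lives on the Master Node and the compute nodes connect over the network):

```python
import sqlite3

# Toy version of the control-node database: a jobs table from which each
# compute node atomically claims one pending job at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT, worker TEXT)")
conn.executemany("INSERT INTO jobs (status) VALUES (?)", [("pending",)] * 3)

def claim_job(conn, worker_id):
    # Run the select-and-mark as one transaction so two workers can't
    # claim the same job. If a node dies mid-run, its job can later be
    # reset to 'pending' and picked up by another node.
    with conn:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # nothing left to do; the node can shut down
        conn.execute(
            "UPDATE jobs SET status = 'running', worker = ? WHERE id = ?",
            (worker_id, row[0]),
        )
        return row[0]

first = claim_job(conn, "node-1")  # claims the first pending job
```

The nice property is exactly the one described above: a worker only knows how to claim and run one job, and the control side only knows the jobs table.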
Jun 12, 2016
In my work, I have several times encountered the need to run a script for an extended period of time, or as a daemon (always running, as a service). Whether you’re on your own personal computer or SSHed into a virtual machine in the cloud, managing long-running processes can be annoying. If you finish your work day and close your laptop, you’re going to kill your script. In the cloud (or, I guess, on a desktop/personal server too) you can take a couple of steps to run scripts as services that will not stop when you end your work day. I’ve found a few ways of doing it; here are two that matched my needs.
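As a taste of the underlying idea, here’s a Python sketch that launches a job in its own session, so it keeps running after the terminal that started it goes away; this is similar in spirit to nohup, and the child command is a stand-in for a real script:

```python
import subprocess
import sys

# Launch a long-running job detached from the controlling terminal.
# start_new_session=True puts the child in its own session, so it won't
# receive the SIGHUP that normally kills your processes when an SSH
# connection closes.
with open("run.log", "ab") as log:
    proc = subprocess.Popen(
        [sys.executable, "-c", "print('job running')"],  # stand-in script
        stdout=log,
        stderr=subprocess.STDOUT,
        start_new_session=True,
    )
# The parent can now exit; the job keeps writing to run.log.
```

Logging to a file rather than the terminal is the other half of the trick: it lets you check on the job the next morning.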
Jun 5, 2016
Continuing my meditations on beginning to use the Google Cloud Computing platform, this post will describe the use of startup and shutdown scripts. If you want to start multiple instances that are all the same in terms of programs, data, etc. (but perhaps of different sizes), you have two options. First, you could save your fully configured machine as an image or, more likely, as a snapshot. Booting with this configuration is easy: just select the option from the menu when starting the new instance. Proceeding this way has several potential drawbacks, however. Most notably, it is very difficult to keep everything updated. Unless you manually update the snapshot pretty often, your software is going to be out of date. Moreover, if you decide to make a small change in the scripts or programs you’re running on the instance, you will need to update the snapshot.
Jun 2, 2016
Part of the reason that I am keeping this blog is to keep a record of the things I’ve done and my thought process in doing them, so that when it comes time to write up my thesis, I will have a better recollection of what was going through my head. The other reason is to perhaps help someone out there struggling with problems similar to the ones I went through. I think my adventures in the Google Cloud Platform are a good example of this. Google’s Cloud Platform is acknowledged to be slightly less mature than some of its competitors, like AWS. Because of this, there are fewer Stack Exchange questions, blog posts, etc. that can help guide basic setup. I do think that Google’s documentation and tutorials are better than Amazon’s – more accessible, better written – but it can be hard to figure out what you need to be doing if you’re not a cloud professional. So I’ll document some of the hard steps I encountered in this ongoing set of posts.
May 31, 2016
Using APIs and Data Services
This is the second installment in my series about finding data from new and different sources for use in your cartography or GIS projects. Last time I discussed looking through existing source code to find hidden datasets that might be useful. Today, I will walk through using an API service to tap into an organization’s database. As a simple Google search will reveal, there are other resources, blogs, and tutorials out there that talk about how to use an API as a data source, but I will focus particularly on converting data from an API into a useful spatial data format that can be used in mapping and analysis. Tons of APIs have spatial data (usually latitude and longitude) attached to their responses; it’s just a matter of finding the data service and massaging it into the right format.
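That massaging step is usually just reshaping JSON. Here’s a minimal sketch in Python that wraps hypothetical API records (each carrying a latitude and longitude) into a GeoJSON FeatureCollection:

```python
import json

# Hypothetical API response: many APIs return records like these, with
# latitude and longitude attached as plain attributes.
records = [
    {"name": "Madison", "lat": 43.07, "lon": -89.40},
    {"name": "Berkeley", "lat": 37.87, "lon": -122.27},
]

def to_geojson(rows):
    # Wrap each record as a GeoJSON Point feature. Note the coordinate
    # order: GeoJSON is [longitude, latitude], not [lat, lon].
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "Point", "coordinates": [r["lon"], r["lat"]]},
                "properties": {"name": r["name"]},
            }
            for r in rows
        ],
    }

geojson = to_geojson(records)
# json.dumps(geojson) is ready to load into QGIS, Leaflet, Mapbox GL, etc.
```

Mixing up the [lon, lat] order is the single most common bug in this kind of conversion, so it’s worth double-checking first.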
What is an API?
An API, which stands for Application Programming Interface, is a set of protocols and methods that define how two computers should talk to each other. An API is a documented set of building blocks (of code) that define how an existing application works. A programmer can put these blocks together to extend the existing program, or create a new app that uses portions of the existing program. Consider Twitter. Twitter is super popular, and a lot of people use it for various things – documenting every facet of their daily lives, reporting news, observing disasters and severe weather, etc. To build the platform, Twitter needed to make a whole bunch of computers talk to each other. When a user writes a tweet, it is sent to twitter’s central database, where it is stored, and then pushed back out to other clients. Multiply this by Twitter’s >310 million users, both reading and writing tweets, and you have a lot of clients that need to communicate with minimal friction.
May 22, 2016
I was recently tasked with writing a formal proposal for my thesis as the final paper in one of my courses. The final draft of the proposal was 21 single-spaced pages. I figured I would write a shorter and perhaps more accessible summary of the work I am starting on, so those of you who are curious about what I’m working on don’t have to wade through that. If you do want to see the real thing, references, equations, and all, you can find it here.
In One Sentence:
I am attempting to develop a well-performing predictive model that, given a species distribution model user’s goals and requirements, determines the optimal computing configuration for that modeling routine by balancing the time spent modeling with the cost of the computing equipment used.
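In toy form (all numbers hypothetical), the balancing act looks something like this: score each candidate computing configuration by a weighted combination of run time and dollar cost, and pick the minimum:

```python
# Toy sketch of the optimization idea: pick the machine configuration that
# minimizes a weighted sum of the hours an SDM run takes and the dollars
# it costs. Names, times, and prices are made up for illustration.
configs = [
    {"name": "1-core", "hours": 8.0, "usd_per_hour": 0.05},
    {"name": "4-core", "hours": 2.5, "usd_per_hour": 0.20},
    {"name": "16-core", "hours": 1.0, "usd_per_hour": 0.80},
]

def best_config(configs, time_weight=1.0, cost_weight=1.0):
    # The weights encode the user's goals: a deadline-driven user weights
    # time heavily; a budget-driven user weights cost heavily.
    def score(c):
        return time_weight * c["hours"] + cost_weight * c["hours"] * c["usd_per_hour"]
    return min(configs, key=score)

best_config(configs)                      # balance time and cost equally
best_config(configs, time_weight=0.0)     # only cost matters
```

The real model replaces the hard-coded hours with predictions of run time from the modeling routine’s parameters, which is the part that takes actual research.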
May 20, 2016
When running statistical models, like multiple linear regression or generalized linear models, it is typically not a good idea to use multiple predictor variables that are highly correlated with one another, as it may result in an unstable final model. This guideline also applies to many of the various flavors of species distribution models (SDMs), which take in two or more (usually climatic) predictor variables to model a species’ response to environmental gradients. Given modeled climate output, SDMs can be used to estimate the probability of a species occurring in a different time period, say at the last glacial maximum (22,000 years ago) or in 2100, once humans have contributed several degrees to the earth’s temperature. While these models differ in their statistical techniques, most behave a little like multiple regression, where a set of predictor variables is combined in some parametric or non-parametric way to estimate the response as a function of these inputs. Thus, just as in a standard regression, using highly correlated climatic predictor variables can contribute to instability in the modeled response. This post describes the methods I chose to identify correlated variables and pick the ones I wanted to retain in my study.
May 8, 2016
Sometimes, it can be hard to find the data we want. We spend hours looking in all our normal places. We cruise the Census Bureau, hit the EPA’s data portal, and browse UW’s collection of geospatial resources. If you have a topic or a storyline in mind, it can be really frustrating when you can’t find the right data. In most cases, the data you seek is actually out there; it might just be a little harder to find than you might hope. I’ll discuss a couple of techniques I use from time to time when I find myself in this situation. This is my first blog post, so stay with me.