Reproducible plumbing

by Miklós Koren

“I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis.” (interviewee in the seminal Kandel, Paepcke, Hellerstein and Heer interview study of business analytics practices)

In fact, my estimate is that about 80 percent of the work I do in an empirical research project is about getting, transforming, merging, or otherwise preparing data for the actual analysis.

This part of the research, which, for lack of a better word, I will call plumbing, should also be reproducible. Journal referees, editors and readers have come to expect that if I make a theoretical statement, I offer a proof. If I make a statistical claim, I back it up with a discussion of the methodology and offer software code for replication. The reproducibility of plumbing, however, hinges on author statements like “we use the 2013 wave of the World Development Indicators” or “data comes from Penn World Tables 7.”

Most authors don’t make their data plumbing reproducible because reproducibility is hard. Very hard. Data comes in various formats, some of the files are huge, and most researchers don’t speak a general-purpose programming language that could be used to automate the data transformation process. In fact, most data transformation is still ad hoc: pointing and clicking in Excel, copying and pasting, and doing a bunch of VLOOKUPs. (For the record, VLOOKUPs are great.)

I have just finished preparing a fully programmatic replication package for a recent study that is about to be published. Let me give you examples of the challenges this has brought up.

The movements of “reproducible research” and “open data” need standardized data APIs that can be programmatically queried (the World Bank Data API is the best example I have seen so far), and data manipulation tools that can ingest a variety of formats from a variety of sources. So far, we are still looking.
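To make "programmatically queried" concrete, here is a minimal Python sketch of what replacing a statement like “we use the 2013 wave of the World Development Indicators” with code might look like. It assumes the World Bank API's v2 URL layout and JSON field names as I understand them (`countryiso3code`, `date`, `value`), so check the official documentation before relying on it. The function names are my own, and the sample payload at the bottom uses a placeholder value, not a real figure.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the World Bank API (version 2).
BASE_URL = "https://api.worldbank.org/v2"


def build_indicator_url(country, indicator, year, per_page=1000):
    """Build the query URL for one indicator, one country, one year."""
    query = urlencode({"format": "json", "date": str(year), "per_page": per_page})
    return f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"


def parse_indicator_response(payload):
    """Flatten the decoded JSON into a list of plain rows.

    Assumption: a successful response is a two-element list,
    [metadata, records]. Anything else is treated as "no data".
    """
    if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
        return []
    return [
        {
            "country": record["countryiso3code"],
            "year": int(record["date"]),
            "value": record["value"],  # may be None when the series has a gap
        }
        for record in payload[1]
    ]


def fetch_indicator(country, indicator, year):
    """Download and parse one indicator (this is the only step that needs network access)."""
    with urlopen(build_indicator_url(country, indicator, year), timeout=30) as response:
        return parse_indicator_response(json.load(response))


# A hand-written payload in the assumed shape; 12345.0 is a placeholder.
sample_payload = [
    {"page": 1, "pages": 1, "per_page": 1000, "total": 1},
    [{"countryiso3code": "HUN", "date": "2013", "value": 12345.0}],
]
sample_rows = parse_indicator_response(sample_payload)
```

With something like this in the replication package, the data source, the indicator code and the year are recorded in the script itself, for example `fetch_indicator("HUN", "NY.GDP.PCAP.CD", 2013)`, rather than in a sentence of the paper.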

