Reproducible plumbing

by Miklós Koren

“I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis.” (interviewee in the seminal Kandel, Paepcke, Hellerstein and Heer interview study of business analytics practices)

In fact, my estimate is that about 80 percent of the work I do in an empirical research project is about getting, transforming, merging, or otherwise preparing data for the actual analysis.

This part of the research, which, for lack of a better word, I will call plumbing, should also be reproducible. Journal referees, editors and readers have come to expect that if I make a theoretical statement, I offer a proof. If I make a statistical claim, I back it up by a discussion of the methodology and offer software code for replication. The reproducability of plumbing, however, hinges on author statements like “we use the 2013 wave of the World Development Indicators” or “data comes from Penn World Tables 7.”

Most authors don’t make their data plumbing reproducible because reproducability is hard. Very hard. Data comes in various formats, some of the files are huge, and most researchers don’t speak a general-purpose programming language that could be used to automate the data transformation process. In fact, most data transformation is still ad hoc, pointing and clicking in Excel, copying and pasting and doing a bunch of VLOOKUPs. (For the record, VLOOKUPs are great.)

I have just finished preparing a fully programmatic replication package for recent study that is about to be published. Let me give you examples of the challenges this has brought up.

The movements of “reproducible research” and “open data” need standardized data APIs that can be programmatically queried (the World Bank Data API is the best example I have seen so far), and data manipulation tools that can ingest a variety of formats from a variety of sources. So far, we are still looking.

 
9
Kudos
 
9
Kudos

Now read this

Which binary classification model is better?

by János K. Divényi Receiver Operating Characteristic curve is a great tool to visually illustrate the performance of a binary classifier. It plots the true positive rate (TPR) or the sensitivity against the false positive rate (FPR) or... Continue →