by Miklós Koren
“I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis.” (interviewee in the seminal Kandel, Paepcke, Hellerstein and Heer interview study of business analytics practices)
In fact, my estimate is that about 80 percent of the work I do in an empirical research project is about getting, transforming, merging, or otherwise preparing data for the actual analysis.
This part of the research, which, for lack of a better word, I will call plumbing, should also be reproducible. Journal referees, editors and readers have come to expect that if I make a theoretical statement, I offer a proof. If I make a statistical claim, I back it up with a discussion of the methodology and offer software code for replication. The reproducibility of plumbing, however, hinges on author statements like “we use the 2013 wave of the World Development Indicators” or “data comes from Penn World Tables 7.”
Most authors don’t make their data plumbing reproducible because reproducibility is hard. Very hard. Data comes in various formats, some of the files are huge, and most researchers don’t speak a general-purpose programming language that could be used to automate the data transformation process. In fact, most data transformation is still ad hoc: pointing and clicking in Excel, copying and pasting, and doing a bunch of VLOOKUPs. (For the record, VLOOKUPs are great.)
I have just finished preparing a fully programmatic replication package for a recent study that is about to be published. Let me give you some examples of the challenges this has brought up.
- Large datasets. The originals of the datasets I use are dozens of GB in size. By the end of my plumbing, I end up with a few hundred MBs, but if I want to make the whole process transparent and reproducible, I also need to show the original data.
- Inconsistent URLs and schema. The Spanish Agencia Tributaria is very helpful in publishing all their trade data online. There is a lot of structure in how they store the files and in what they contain, but every year there are a few inconsistencies to make me cringe and debug for hours. (For example, find the odd one out among the links here.)
- Country names. This is a special case of inconsistent schema. Every single data source uses its own codebook for identifying countries. In the best case, you get the 3-letter ISO-3166 code of the country, like USA. These are great because they are a standard and quite human readable, right? Not so fast. Did you know that the 3-letter code changes when the country changes name? When Zaire became the Democratic Republic of the Congo, its code changed from ZAR to COD. The best would be to use the numeric codes of ISO-3166, which are fairly stable over time, but almost nobody uses them.
- Undocumented and unsupported data on websites. The Doing Business project of the World Bank provides one of the greatest resources on cross-country data. But when they offer to “get all data,” they don’t actually mean it. They have much more detailed data on their website which you cannot download and which is not archived. For example, they report the detailed costs of importing in Afghanistan in 2014, but the website does not publish this data for earlier years. Luckily, web.archive.org comes to the rescue.
- Big boxes of data. There is an 18MB .xls file I use from an 860MB .zip file an author helpfully published on their website. The objective is laudable (as I said above, make everything available in the replication package), but I would prefer the option to download just what I need.
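The country-code problem above lends itself to a small programmatic fix. Here is a minimal sketch in Python: a lookup table from alpha-3 codes (including retired ones) to the stable numeric codes. The table covers only the codes mentioned in this post and is illustrative, not a complete ISO-3166 mapping.

```python
# Map ISO-3166 alpha-3 codes, including retired ones, to numeric codes,
# which survive country renamings. Illustrative, not a complete table.
ALPHA3_TO_NUMERIC = {
    "USA": 840,
    "ZAR": 180,  # Zaire: alpha-3 code retired after the renaming
    "COD": 180,  # Democratic Republic of the Congo: new name, same numeric code
}

def to_numeric(alpha3):
    """Translate an alpha-3 code, old or new, to the stable numeric code."""
    return ALPHA3_TO_NUMERIC[alpha3]
```

Because both ZAR and COD map to 180, merging a 1990s dataset with a current one on the numeric code works where a merge on the alpha-3 code would silently drop the country.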
The movements of “reproducible research” and “open data” need standardized data APIs that can be programmatically queried (the World Bank Data API is the best example I have seen so far), and data manipulation tools that can ingest a variety of formats from a variety of sources. So far, we are still looking.
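As an illustration of what a queryable API buys you, here is a sketch that builds a request URL for the World Bank's v2 data API. The endpoint pattern and the indicator code are taken from the API's published conventions, but treat the exact values as assumptions to check against the documentation.

```python
def wdi_url(country, indicator, date, fmt="json"):
    """Build a World Bank v2 API request URL for one indicator and country."""
    return (
        f"https://api.worldbank.org/v2/country/{country}"
        f"/indicator/{indicator}?date={date}&format={fmt}"
    )

# GDP in current US dollars for Spain, 2013 -- one line, no pointing and clicking.
url = wdi_url("ESP", "NY.GDP.MKTP.CD", "2013")
```

Every parameter of the query is explicit in the code, so the plumbing step documents itself: a reader can rerun the exact request years later.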