<h1>CEU MicroData</h1>
<h2>Which binary classification model is better?</h2>
<p>by <em>János K. Divényi</em></p>
<p><a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic" rel="nofollow">Receiver Operating Characteristic curve</a> <br>
is a great tool to visually illustrate the performance of a binary classifier. <br>
It plots the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity). Usually, the algorithm gives you a probability (e.g. simple <a href="http://en.wikipedia.org/wiki/Logistic_regression" rel="nofollow">logistic regression</a>), so for classification you need to choose a cutoff point. The FPR-TPR pairs for different values of the cutoff give you the ROC curve. Non-informative algorithms lie on the 45-degree line, as they classify the same fraction of positives and negatives as positives, that is, TPR = TP/P = FP/N = FPR.</p>
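<p>A minimal sketch of this construction in Python (the scores and labels are made-up values):</p>
<pre><code class="prettyprint"># trace a ROC curve by sweeping the cutoff over made-up scores
scores = [0.1, 0.3, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]
labels = [0, 0, 1, 1, 0, 0, 1, 1]  # 1 = positive

P = sum(labels)      # number of positives
N = len(labels) - P  # number of negatives

for cutoff in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    print(fp / N, tp / P)  # one FPR-TPR point per cutoff
</code></pre>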
<p>But what if you want to compare two algorithms which give a direct classification, i.e. you have only two points in the plot? How do you decide whether algorithm (2) is better than algorithm (1)?</p>
<p><a href="https://svbtleusercontent.com/xhwbv02bo2prq.png" rel="nofollow"><img src="https://svbtleusercontent.com/xhwbv02bo2prq_small.png" alt="roc.png"></a></p>
<p>It is clear that algorithm (2) classifies more items as positive than algorithm (1), and that this results in both more true positives (higher sensitivity) and more false positives (lower specificity). Is this higher rate of positive classification informative? What would happen if we took the items labeled negative by algorithm (1) and reclassified them randomly as positive? How would we move from algorithm (1) on the graph?</p>
<p>The items labeled negative by algorithm (1) comprise both true negatives and false negatives. A random classifier would turn the same share of each group into positives. Let’s say we reclassify each negatively labeled item as positive with probability <em>k</em>. Relative to algorithm (1), we then gain <em>k</em> TN new false positives and <em>k</em> FN new true positives. This results in the following measures for the ROC curve:</p>
<p>Sensitivity’ = Sensitivity + <em>k</em> FN/P <br>
1 - Specificity’ = 1 - Specificity + <em>k</em> TN/N</p>
<p>Thus, the slope of the movement is (<em>k</em> FN/P) / (<em>k</em> TN/N), where <em>k</em> cancels; since FN/P = 1 - Sensitivity and TN/N = Specificity, the slope is just</p>
<p>(1 - Sensitivity)/Specificity</p>
<p>Where we end up along this line depends on <em>k</em>. As the figure below shows, algorithm (2) lies above this line; therefore, it does not just randomly add more positively classified items but also contains some information relative to algorithm (1).</p>
<p><a href="https://svbtleusercontent.com/ibd4keu77t3ojw.png" rel="nofollow"><img src="https://svbtleusercontent.com/ibd4keu77t3ojw_small.png" alt="roc_random.png"></a></p>
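<p>A minimal sketch of this check in Python, with made-up FPR-TPR points: it asks whether algorithm (2)’s point lies above the line that random relabeling would trace from algorithm (1).</p>
<pre><code class="prettyprint">def random_extension_slope(sensitivity, specificity):
    # slope of the line traced by randomly relabeling the negatives
    return (1 - sensitivity) / specificity

fpr1, tpr1 = 0.2, 0.6  # algorithm (1), made-up point
fpr2, tpr2 = 0.5, 0.9  # algorithm (2), made-up point

slope = random_extension_slope(sensitivity=tpr1, specificity=1 - fpr1)
tpr_random = tpr1 + slope * (fpr2 - fpr1)  # TPR reachable by pure chance

print("informative" if tpr2 > tpr_random else "no better than random")
</code></pre>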
<p>We can arrive at the same conclusion with less calculation. One possible random classification is to classify every item as positive (<em>k</em> = 1). If we take all the negatively labeled items and reclassify them “randomly” as positive, we end up with only positively classified items, which leads to extreme values of specificity (0) and sensitivity (1), i.e. the point (1, 1). To connect algorithm (1)’s point with the point (1, 1), we have to draw a line with slope (1 - Sensitivity)/Specificity.</p>
<h2>Even Amazon needs economists</h2>
<p><em>by <a href="http://miklos.koren.hu" rel="nofollow">Miklós Koren</a></em></p>
<p><a href="https://svbtleusercontent.com/ecb57knfeuoeeg.png" rel="nofollow"><img src="https://svbtleusercontent.com/ecb57knfeuoeeg_small.png" alt="amazon-drone.png"></a></p>
<p>Suppose you’re Jeff Bezos contemplating a radical new thing for Amazon customers. How would you decide whether it’s worth it? Your engineers might put together a working prototype, roll it out to a bunch of randomly selected customers and do some A/B testing. After your data science team has crunched the numbers, you have your answer: the average customer will spend $X more on Amazon. That is more than your cost of $Y per customer, so you decide to scale up.</p>
<p>Not so fast. Without a model, no amount of A/B testing (or, in the scientific community, randomized controlled trials) and no amount of observational data are going to tell you how your customers will behave when they encounter something radically new. Models are useful for thinking about changes that have not happened before. A model might be a simple mathematical relationship summarized in a few formulas, a statistical model making assumptions about probabilities and distributions, or a theory of consumer behavior based on insights from economics and psychology.</p>
<p>How would economics, in particular, help you make a better decision? It brings two fundamental ideas to the table. First, that <a href="http://en.wikiversity.org/wiki/10_Principles_of_Economics#People_respond_to_incentives" rel="nofollow">people respond to incentives</a>. When you change the monetary, psychological or social incentives of your customers, they’re going to change their behavior. When shipping is free and fast, I may become a more impulsive shopper. Second, it offers a useful systems view of the world. Surely, when drones zigzag the sky with Amazon shipments, the entire shipping and retail industries will change, with potentially profound implications for manufacturing, as well. The way the <a href="http://press.princeton.edu/titles/9383.html" rel="nofollow">shipping container</a> transformed the world could not have been foreseen by A/B testing a prototype.</p>
<p>None of this is new. The Nobel Laureate <a href="http://en.wikipedia.org/wiki/Lucas_critique" rel="nofollow">Robert Lucas</a> said as much in 1976, when macroeconomists were getting overly excited about predicting the future. Back in those days, <a href="http://papers.nber.org/books/hick72-1" rel="nofollow">large-scale forecasting models</a> worked with around 50 variables, and it was easy to get carried away by the newfound computing power. Lucas cautioned that such atheoretical forecasts would not be useful for economic policy analysis.</p>
<blockquote class="short">
<p>Using big data will sometimes mean forgoing the quest for why in return for knowing what. (<a href="http://www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/the-rise-of-big-data" rel="nofollow">The Rise of Big Data</a>)</p>
</blockquote>
<p>Amidst all the enthusiasm for big data, Lucas’s warnings are once again relevant. And yes, even Amazon <a href="https://www.aeaweb.org/joe/listing.php?JOE_ID=2014-02_111451682" rel="nofollow">needs economists</a>.</p>
<p><em>This blog post is based on <a href="https://speakerdeck.com/korenmiklos/what-role-for-economics-in-predictive-analytics" rel="nofollow">my presentation</a> at the Budapest BI Forum 2014.</em></p>
<h2>Python in service of reproducibility</h2>
<p>by <em>Rita Zágoni</em></p>
<p>The degree to which certain aspects of a research project are reproducible depends on many factors. For instance, in a complex research workflow it is hard to automate all the steps, ranging from <a href="http://blog.microdata.io/reproducible-plumbing" rel="nofollow">data preparation</a> through analysis to <a href="http://blog.microdata.io/224" rel="nofollow">packaging of data and code</a>. Reducing the amount of tacit knowledge and ad hoc manual work can make the workflow easier to access and reproduce. </p>
<p>This is more realistic to achieve if, for each job, we have tools which are accessible, preferably open source, and which support the various operations throughout the data pipeline in a more or less integrated manner. Our weapon of choice, the Python ecosystem, provides many of these tools, offering a wide range of data manipulation and general purpose libraries which cover many steps of the data preparation and transformation process.</p>
<p>To get an overview of a Pythonic data analysis workflow, let us take a look at a subproject for structuring textual input. It is but one node in the <a href="http://blog.microdata.io/224" rel="nofollow">dependency graph</a> but, as happens in our fractalesque world of input-operation-output, it follows the same pattern.</p>
<p>As raw data we have biographical information in semi-structured form (organized in fields which often contain free text) scattered in HTML files. We would like to extract the fields and values, combine them into a dataset, clean the data, then process the text, perform some statistical analysis and, incidentally, display the resulting summaries. In short, we would like to get to this:</p>
<p><a href="https://svbtleusercontent.com/qafseyj5jnyq6a.png" rel="nofollow"><img src="https://svbtleusercontent.com/qafseyj5jnyq6a_small.png" alt="Children and respondents birthyear.png"></a></p>
<p>from a lot of this:</p>
<p><a href="https://svbtleusercontent.com/npoubbju5ifaw.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/npoubbju5ifaw_small.jpg" alt="html.jpg"></a></p>
<p>A sample process in Python (a code sketch follows the list):</p>
<ol>
<li>To extract data from HTML, we use the <a href="https://pypi.python.org/pypi/beautifulsoup4/4.3.2" rel="nofollow">beautifulsoup4</a> third-party library.</li>
<li>We perform a health check of the data with <a href="https://pypi.python.org/pypi/pandas" rel="nofollow">pandas</a>, a powerful data analysis library, which offers data cleaning tools for removing duplicates and handling missing values. </li>
<li>Once the data is in shape, we move on to processing it using Python’s text analysis tools. This mainly involves slicing and gluing strings, exact pattern matching with regular expressions, and fuzzy matching with difflib to tame the ubiquitous typos. These are all contained in the Python standard library.</li>
<li>Once we have some structured data we can turn back to pandas for analysis, such as getting distributions of values or other descriptive statistics. </li>
<li>If we need visualization, 2D figures can be dynamically generated with the <a href="https://pypi.python.org/pypi/matplotlib/1.4.0" rel="nofollow">matplotlib</a> library. Matplotlib is also partially integrated into pandas. </li>
</ol>
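<p>Here is a minimal sketch of the whole pipeline. The file paths, field names and HTML markup are hypothetical; the real extraction logic depends on the structure of the source files.</p>
<pre><code class="prettyprint">import difflib
from pathlib import Path

from bs4 import BeautifulSoup
import pandas as pd

# 1. extract field-value pairs from the HTML files (assumes dt/dd markup)
records = []
for path in Path("html").glob("*.html"):
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    records.append({dt.get_text(strip=True): dd.get_text(strip=True)
                    for dt, dd in zip(soup.find_all("dt"), soup.find_all("dd"))})

# 2. health check: combine, de-duplicate, inspect missing values
df = pd.DataFrame(records).drop_duplicates()
print(df.isna().sum())

# 3. text processing: exact matching with a regex, fuzzy matching with difflib
df["birthyear"] = df["Born"].str.extract(r"(\d{4})", expand=False).astype(float)
towns = ["Budapest", "Debrecen", "Szeged"]
df["town"] = df["Birthplace"].fillna("").map(
    lambda s: next(iter(difflib.get_close_matches(s, towns, n=1)), None))

# 4. descriptive statistics with pandas
print(df["birthyear"].describe())

# 5. visualization through the pandas-matplotlib integration
df["birthyear"].hist()
</code></pre>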
<h2>Reproducible plumbing</h2>
<p>by <em><a href="http://miklos.koren.hu/" rel="nofollow">Miklós Koren</a></em></p>
<blockquote>
<p>“I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis.” (interviewee in the <a href="http://db.cs.berkeley.edu/papers/vast12-interview.pdf" rel="nofollow">seminal Kandel, Paepcke, Hellerstein and Heer interview study</a> of business analytics practices)</p>
</blockquote>
<p>In fact, my estimate is that about 80 percent of the work I do in an empirical research project is about getting, transforming, merging, or otherwise preparing data for the actual analysis. </p>
<p>This part of the research, which, for lack of a better word, I will call <em>plumbing</em>, should also be reproducible. Journal referees, editors and readers have come to expect that if I make a theoretical statement, I offer a proof. If I make a statistical claim, I back it up by a discussion of the methodology and offer software code for replication. The reproducibility of plumbing, however, hinges on author statements like “we use the 2013 wave of the World Development Indicators” or “data comes from Penn World Tables 7.”</p>
<p>Most authors don’t make their data plumbing reproducible because <a href="/224" rel="nofollow">reproducibility is hard</a>. Very hard. Data comes in various formats, some of the files are huge, and most researchers don’t speak a general-purpose programming language that could be used to automate the data transformation process. In fact, most data transformation is still <em>ad hoc</em>: pointing and clicking in Excel, copying and pasting, and doing a bunch of VLOOKUPs. (For the record, VLOOKUPs are great.)</p>
<p>I have just finished preparing a fully programmatic replication package for a <a href="http://miklos.koren.hu/papers/peer_reviewed_publications/administrative_barriers_to_trade/" rel="nofollow">recent study</a> that is about to be published. Let me give you examples of the challenges this has brought up.</p>
<ul>
<li>Large datasets. The originals of the datasets I use are dozens of GB in size. By the end of my plumbing, I end up with a few hundred MBs, but if I want to make the whole process transparent and reproducible, I also need to show the original data.</li>
<li>Inconsistent URLs and schema. The Spanish <em>Agencia Tributaria</em> is very helpful in publishing <em>all</em> their trade data online. There is a lot of structure in how they store the files and what they contain, but every year there are a few inconsistencies to make me cringe and debug for hours. (For example, find the odd one out among the <a href="http://www.agenciatributaria.es/AEAT.internet/Inicio_es_ES/La_Agencia_Tributaria/Memorias_y_estadisticas_tributarias/Estadisticas/Comercio_exterior/Datos_estadisticos/Descarga_de_Datos_Estadisticos/Descarga_de_datos_mensuales_maxima_desagregacion_en_Euros__centimos_/2009/Enero/Enero.shtml" rel="nofollow">links here</a>.)<br>
</li>
<li>Country names. This is a special case of inconsistent schema. Every single data source uses their own codebook for identifying countries. In the best case, you get the 3-letter ISO-3166 code of the country, like <code class="prettyprint">HUN</code> and <code class="prettyprint">USA</code>. These are great because they are a standard and quite human readable, right? Not so fast. Did you know that the 3-letter code changes when the country changes name? When Zaire became the Democratic Republic of the Congo, its <a href="https://www.iso.org/obp/ui/#iso:code:3166:ZR" rel="nofollow">code changed from <code class="prettyprint">ZAR</code> to <code class="prettyprint">COD</code></a>. The best would be to use the <a href="http://en.wikipedia.org/wiki/ISO_3166-1_numeric" rel="nofollow"><em>numeric codes</em> of ISO-3166</a>, which are fairly stable over time, but almost nobody uses these. (A sketch of a code crosswalk follows this list.)</li>
<li><p>Undocumented and unsupported data on websites. The <a href="http://doingbusiness.org/" rel="nofollow">Doing Business</a> project of the World Bank provides one of the greatest resources on cross-country data. But when they offer to “get all data,” they don’t actually mean it.<br>
<a href="https://svbtleusercontent.com/raok17cgymh1ag.png" rel="nofollow"><img src="https://svbtleusercontent.com/raok17cgymh1ag_small.png" alt="get-all-data.png"></a><br>
They have much more detailed data on their website which you cannot download and is not archived. These are, for example, the detailed costs of importing in Afghanistan in 2014, but the website doesn’t publish this data for earlier years. Luckily, <a href="http://web.archive.org/web/20091003023159/http://www.doingbusiness.org/ExploreTopics/TradingAcrossBorders/Details.aspx?economyid=2" rel="nofollow">web.archive.org</a> comes to the rescue.<br>
<a href="https://svbtleusercontent.com/2v9xxqn9u2w.png" rel="nofollow"><img src="https://svbtleusercontent.com/2v9xxqn9u2w_small.png" alt="detailed.png"></a></p></li>
<li><p>Big boxes of data. There is an 18MB .xls file I use from the 860MB .zip-file an author helpfully published on their website. The objective is laudable (like I said above, make everything available in the replication package), but I would prefer the option to download just what I need.</p></li>
</ul>
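<p>One way to tame the country-code problem is a hand-maintained crosswalk from whatever codes the sources use to the stable ISO-3166 numeric codes. A minimal sketch (the mapping is illustrative, not complete):</p>
<pre><code class="prettyprint"># illustrative crosswalk from 3-letter codes (including legacy ones)
# to stable ISO-3166 numeric codes; extend it for your own sources
ALPHA3_TO_NUMERIC = {
    "HUN": 348,
    "USA": 840,
    "COD": 180,  # Democratic Republic of the Congo
    "ZAR": 180,  # legacy code for the same country (formerly Zaire)
}

def country_number(code):
    try:
        return ALPHA3_TO_NUMERIC[code.upper()]
    except KeyError:
        raise ValueError("unknown country code: %r" % code)

assert country_number("ZAR") == country_number("COD") == 180
</code></pre>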
<p>The movements of “reproducible research” and “open data” need standardized data APIs that can be programmatically queried (the <a href="http://data.worldbank.org/developers/api-overview" rel="nofollow">World Bank Data API</a> is the best example I have seen so far), and data manipulation tools that can ingest a variety of formats from a variety of sources. So far, we are still looking.</p>
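<p>For illustration, a programmatic query against the World Bank Data API can be as short as the sketch below (endpoint shape and indicator code as documented at the link above; check the current API docs before relying on this):</p>
<pre><code class="prettyprint"># fetch GDP (current US$) for Hungary from the World Bank Data API
import requests

url = "https://api.worldbank.org/v2/country/HUN/indicator/NY.GDP.MKTP.CD"
resp = requests.get(url, params={"format": "json", "date": "2000:2013"})
resp.raise_for_status()
meta, observations = resp.json()  # the first element holds paging metadata

for obs in observations:
    print(obs["date"], obs["value"])
</code></pre>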
<h2>2+2<4</h2>
<p><em>by Krisztián Fekete</em></p>
<p>Being precise in describing how we achieve a result is hard. </p>
<p>It is not optional, though. Let’s look at the inequality <code class="prettyprint">2+2<4</code>. Since we have learnt <code class="prettyprint">2+2=4</code> in primary school, the inequality looks strange. It can become true, however, if we extend it with further information: what the units are and what we mean by the + operation.</p>
<p><code class="prettyprint">2+2<4</code> might be a shorthand among mud-making specialists for <code class="prettyprint">2 liters of sand mixed with 2 liters of water gives less than 4 liters of mud</code>, but no one except the mud-makers would understand it (we get less than 4 liters of mud, as the sand drinks up some of the water).</p>
<p>For a result to be meaningful and reproducible, both the <em>input</em> (such as 2 liters of sand and 2 liters of water) and the <em>operation</em> (such as mixing the two substances) have to be given <em>explicitly</em>; otherwise we will not have the expected <em>result</em> (mud with a total volume of less than 4 liters).</p>
<hr>
<p>In our work we do a lot of data transformations and calculations. Our <em>result</em> is data, our <em>input</em> is also data, our <em>operation</em> is a computer program.<br>
Input, code and output are strongly related to each other (as in the mud example above), so it makes sense to pack them together.</p>
<p>Our input data is usually also a result of some calculations, so we end up with this model:<br>
<a href="https://svbtleusercontent.com/zupg4o3jb5hbkw.png" rel="nofollow"><img src="https://svbtleusercontent.com/zupg4o3jb5hbkw_small.png" alt="datapackage.png"></a><br>
This is a simple dependency graph, where the data is packed together with the code producing it.</p>
<p>When there is no intermediate data (there is only one program), this setup occurs naturally, as it is the easiest thing to do.</p>
<p>For more complex workflows, we have abandoned this model and separated code (version control, GitHub) from data (file servers), and thus have to maintain links between the two (problematic!). As if code likes to live with code and data with data.</p>
<p>To be able to return to sanity (with packages like in the picture) we need the following (a code sketch of the idea follows the list):</p>
<ul>
<li>tools for working with packages consisting of
<ul>
<li>data</li>
<li>code that produces it</li>
<li>input data references</li>
</ul>
</li>
<li>a way to easily reference other packages (thus accessing their data as input)</li>
<li>a way to easily re-create a package with updated inputs</li>
<li>a tool to verify packages, i.e. to check that the included code indeed produces the included output</li>
<li>a tool to export a whole workflow as code, so that we can run it in one step</li>
<li>potentially, all of this working offline, without a server</li>
</ul>
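<p>To make the idea concrete, here is a minimal sketch of such a package and its verification step. The manifest layout and field names are hypothetical; no such tool exists yet, which is the point of this post.</p>
<pre><code class="prettyprint"># a sketch: a package = data + the code that produced it + input references,
# verified by re-running the code and comparing output checksums
import hashlib
import json
import subprocess
from pathlib import Path

def sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def verify(package_dir):
    """Re-run the package's code and check it reproduces the stored output."""
    pkg = Path(package_dir)
    manifest = json.loads((pkg / "package.json").read_text())
    subprocess.run(["python", manifest["code"]], cwd=pkg, check=True)
    return all(sha256(pkg / name) == digest
               for name, digest in manifest["outputs"].items())

# a hypothetical manifest (package.json):
# {"inputs": ["../raw-trade-data"],
#  "code": "transform.py",
#  "outputs": {"clean.csv": "9f86d081..."}}
</code></pre>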
<p>We have been thinking about this problem for some time now, and will keep you posted about our solutions.</p>
<h2>Looking for a post office? How about 150,000 of them?</h2>
<p><em>by <a href="http://miklos.koren.hu/" rel="nofollow">Miklós Koren</a></em></p>
<p>In a <a href="http://miklos.koren.hu/papers/working_papers/bridges/" rel="nofollow">recent project</a>, we proxy 19th-century local economic development with the number of post offices in the neighborhood. We have downloaded all historical post office names and dates from <a href="http://www.postalhistory.com/postoffices.asp" rel="nofollow">Jim Forte’s Postal History</a> (with his permission) to create a <a href="https://ceumicrodata.cartodb.com/tables/postoffices/public" rel="nofollow">geospatial database</a>.</p>
<p>How many post offices were there in Kings County, NY, in 1880? In answering this question, we faced a computational challenge: spatially querying 150,000 points is not trivial. The naive algorithm would go through all the points and make a spatial comparison with the polygon: “Is this point inside the polygon?” This is clearly an inferior solution. When we asked <a href="http://www.qgis.org/en/site/" rel="nofollow">QGIS</a> to calculate the number of post offices by county, it hadn’t finished in three hours. Maybe I am impatient, but this seemed unacceptable to me. So I started researching the <a href="http://en.wikipedia.org/wiki/Quadtree" rel="nofollow">quadtree</a> algorithm.</p>
<p>The basic idea of a quadtree is that it splits the plane into four quadrants. Much like binary search, this makes searching faster: <code class="prettyprint">O(log N)</code> fast, to be precise. </p>
<p>Finding points within a polygon is trivial if the bounding box of all points is inside.<br>
<a href="https://svbtleusercontent.com/ydvuvu4ojngzxw.png" rel="nofollow"><img src="https://svbtleusercontent.com/ydvuvu4ojngzxw_small.png" alt="easy.png"></a><br>
Much less so if some of the points are outside.<br>
<a href="https://svbtleusercontent.com/jq8k3upzn05fha.png" rel="nofollow"><img src="https://svbtleusercontent.com/jq8k3upzn05fha_small.png" alt="hard.png"></a><br>
However, if the points are stored in a quadtree, we will find that the bottom-right quadrant of the rectangle is inside the polygon, and so are the two post offices within it.<br>
<a href="https://svbtleusercontent.com/uouq04aojpbueg.png" rel="nofollow"><img src="https://svbtleusercontent.com/uouq04aojpbueg_small.png" alt="quadtree.png"></a> <br>
To find the remaining points, we keep splitting the quadrants into further and further quadrants. </p>
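<p>To illustrate the idea, here is a toy sketch (not the module linked below): a quadrant splits once it holds more than a fixed number of points, and a query only descends into quadrants that intersect the query box. A real implementation also needs point-in-polygon tests for partially covered quadrants; the sketch only shows the pruning logic with a bounding-box query.</p>
<pre><code class="prettyprint">MAX_POINTS = 11  # leaf capacity, matching the module described below

class Node(object):
    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.points = []
        self.children = None

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
        else:
            self.points.append(p)
            if len(self.points) > MAX_POINTS:
                self._split()

    def _split(self):
        mx, my = (self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0
        self.children = [Node(self.x0, self.y0, mx, my),  # bottom left
                         Node(mx, self.y0, self.x1, my),  # bottom right
                         Node(self.x0, my, mx, self.y1),  # top left
                         Node(mx, my, self.x1, self.y1)]  # top right
        points, self.points = self.points, []
        for p in points:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        mx, my = (self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0
        return self.children[(p[0] >= mx) + 2 * (p[1] >= my)]

    def count_in_box(self, x0, y0, x1, y1):
        # prune: skip quadrants that do not intersect the query box
        if x1 < self.x0 or self.x1 < x0 or y1 < self.y0 or self.y1 < y0:
            return 0
        if self.children is not None:
            return sum(c.count_in_box(x0, y0, x1, y1) for c in self.children)
        return sum(1 for px, py in self.points
                   if x0 <= px <= x1 and y0 <= py <= y1)
</code></pre>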
<p>I implemented this algorithm in a <a href="https://github.com/ceumicrodata/quadtree" rel="nofollow">Python module</a>. You instantiate the <code class="prettyprint">QuadTree</code> class with a list of points (which may come as pairs of coordinates or as a <a href="http://geojson.org/" rel="nofollow">GeoJSON</a> string), which builds a quadtree with at most 11 points in each quadrant. You can then query this quadtree with any polygon like this:</p>
<pre><code class="prettyprint"># import paths assumed; QuadTree comes from the module linked above
from quadtree import QuadTree
from geojson import Feature

post_offices = QuadTree([...])
kings_county = Feature(...)
print(post_offices.count_overlapping_points(kings_county))
</code></pre>
<p>If post offices were equally distributed in space, the quadtree would be at most 7 layers deep: 150,000 points at 11 per leaf means roughly 14,000 leaves, and 7 rounds of splitting already yield 4<sup>7</sup> ≈ 16,000 quadrants. This means drastically fewer spatial comparisons.</p>
<p>As a result, I could count and list post offices in a matter of seconds instead of hours.</p>