<h1>CEU MicroData</h1>
<h2>Which binary classification model is better?</h2>
<p>by <em>János K. Divényi</em></p>
<p><a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic" rel="nofollow">Receiver Operating Characteristic curve</a> <br>
is a great tool to visually illustrate the performance of a binary classifier. <br>
It plots the true positive rate (TPR, or sensitivity) against the false positive rate (FPR, or 1 - specificity). Usually, the algorithm gives you a probability (e.g. simple <a href="http://en.wikipedia.org/wiki/Logistic_regression" rel="nofollow">logistic regression</a>), so for classification you need to choose a cutoff point. The FPR-TPR pairs for different values of the cutoff give you the ROC curve. Non-informative algorithms lie on the 45-degree line, as they classify the same fraction of positives and negatives as positives, that is, TPR = TP/P = FP/N = FPR.</p>
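<p>A minimal sketch of this construction in Python (the scores and labels are made-up values):</p>
<pre><code class="prettyprint"># trace a ROC curve by sweeping the cutoff over made-up scores
scores = [0.1, 0.3, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6]
labels = [0, 0, 1, 1, 0, 0, 1, 1]  # 1 = positive

P = sum(labels)      # number of positives
N = len(labels) - P  # number of negatives

for cutoff in sorted(set(scores), reverse=True):
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    print(fp / N, tp / P)  # one FPR-TPR point per cutoff
</code></pre>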
<p>But what if you want to compare two algorithms which give a direct classification, i.e. you have only two points in the plot? How do you decide whether algorithm (2) is better than algorithm (1)?</p>
<p><a href="https://svbtleusercontent.com/xhwbv02bo2prq.png" rel="nofollow"><img src="https://svbtleusercontent.com/xhwbv02bo2prq_small.png" alt="roc.png"></a></p>
<p>It is clear that algorithm (2) classifies more items as positive than algorithm (1), and that this results in both more true positives (higher sensitivity) and more false positives (lower specificity). Is this higher rate of positive classification informative? What would happen if we took the items labeled negative by algorithm (1) and reclassified them randomly as positive? How would we move from algorithm (1) on the graph?</p>
<p>The items labeled negative by algorithm (1) comprise both true negatives and false negatives. A random classifier would turn the same share of each group into positives. Let’s say we reclassify each negatively labeled item as positive with probability <em>k</em>. Relative to algorithm (1), we then gain <em>k</em> TN new false positives and <em>k</em> FN new true positives. This results in the following measures for the ROC curve:</p>
<p>Sensitivity’ = Sensitivity + <em>k</em> FN/P <br>
1 - Specificity’ = 1 - Specificity + <em>k</em> TN/N</p>
<p>Thus, the slope of the movement is (<em>k</em> FN/P) / (<em>k</em> TN/N), where <em>k</em> cancels; since FN/P = 1 - Sensitivity and TN/N = Specificity, the slope is just</p>
<p>(1 - Sensitivity)/Specificity</p>
<p>Where we end up along this line depends on <em>k</em>. As the figure below shows, algorithm (2) lies above this line; therefore, it does not just randomly add more positively classified items but also contains some information relative to algorithm (1).</p>
<p><a href="https://svbtleusercontent.com/ibd4keu77t3ojw.png" rel="nofollow"><img src="https://svbtleusercontent.com/ibd4keu77t3ojw_small.png" alt="roc_random.png"></a></p>
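<p>A minimal sketch of this check in Python, with made-up FPR-TPR points: it asks whether algorithm (2)’s point lies above the line that random relabeling would trace from algorithm (1).</p>
<pre><code class="prettyprint">def random_extension_slope(sensitivity, specificity):
    # slope of the line traced by randomly relabeling the negatives
    return (1 - sensitivity) / specificity

fpr1, tpr1 = 0.2, 0.6  # algorithm (1), made-up point
fpr2, tpr2 = 0.5, 0.9  # algorithm (2), made-up point

slope = random_extension_slope(sensitivity=tpr1, specificity=1 - fpr1)
tpr_random = tpr1 + slope * (fpr2 - fpr1)  # TPR reachable by pure chance

print("informative" if tpr2 > tpr_random else "no better than random")
</code></pre>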
<p>We can arrive at the same conclusion with less calculation. One possible random classification is to classify every item as positive (<em>k</em> = 1). If we take all the negatively labeled items and reclassify them “randomly” as positive, we end up with only positively classified items, which leads to extreme values of specificity (0) and sensitivity (1), i.e. the point (1, 1). To connect algorithm (1)’s point with the point (1, 1), we have to draw a line with slope (1 - Sensitivity)/Specificity.</p>
<h2>Even Amazon needs economists</h2>
<p><em>by <a href="http://miklos.koren.hu" rel="nofollow">Miklós Koren</a></em></p>
<p><a href="https://svbtleusercontent.com/ecb57knfeuoeeg.png" rel="nofollow"><img src="https://svbtleusercontent.com/ecb57knfeuoeeg_small.png" alt="amazon-drone.png"></a></p>
<p>Suppose you’re Jeff Bezos contemplating a radical new thing for Amazon customers. How would you decide whether it’s worth it? Your engineers might put together a working prototype, roll it out to a bunch of randomly selected customers and do some A/B testing. After your data science team has crunched the numbers, you have your answer: the average customer will spend $X more on Amazon. That is more than your cost of $Y per customer, so you decide to scale up.</p>
<p>Not so fast. Without a model, no amount of A/B testing (or, in the scientific community, randomized controlled trials) and no amount of observational data are going to tell you how your customers will behave when they encounter something radically new. Models are useful for thinking about changes that have not happened before. A model might be a simple mathematical relationship summarized in a few formulas, a statistical model making assumptions about probabilities and distributions, or a theory of consumer behavior based on insights from economics and psychology.</p>
<p>How would economics, in particular, help you make a better decision? It brings two fundamental ideas to the table. First, that <a href="http://en.wikiversity.org/wiki/10_Principles_of_Economics#People_respond_to_incentives" rel="nofollow">people respond to incentives</a>. When you change the monetary, psychological or social incentives of your customers, they’re going to change their behavior. When shipping is free and fast, I may become a more impulsive shopper. Second, it offers a useful systems view of the world. Surely, when drones zigzag the sky with Amazon shipments, the entire shipping and retail industries will change, with potentially profound implications for manufacturing, as well. The way the <a href="http://press.princeton.edu/titles/9383.html" rel="nofollow">shipping container</a> transformed the world could not have been foreseen by A/B testing a prototype.</p>
<p>None of this is new. The Nobel Laureate <a href="http://en.wikipedia.org/wiki/Lucas_critique" rel="nofollow">Robert Lucas</a> said as much in 1976, when macroeconomists were getting overly excited about predicting the future. Back in those days, <a href="http://papers.nber.org/books/hick72-1" rel="nofollow">large-scale forecasting models</a> worked with around 50 variables, and it was easy to get carried away by the newfound computing power. Lucas cautioned that such atheoretical forecasts would not be useful for economic policy analysis.</p>
<blockquote class="short">
<p>Using big data will sometimes mean forgoing the quest for why in return for knowing what. (<a href="http://www.foreignaffairs.com/articles/139104/kenneth-neil-cukier-and-viktor-mayer-schoenberger/the-rise-of-big-data" rel="nofollow">The Rise of Big Data</a>)</p>
</blockquote>
<p>Amidst all the enthusiasm for big data, Lucas’s warnings are once again relevant. And yes, even Amazon <a href="https://www.aeaweb.org/joe/listing.php?JOE_ID=2014-02_111451682" rel="nofollow">needs economists</a>.</p>
<p><em>This blog post is based on <a href="https://speakerdeck.com/korenmiklos/what-role-for-economics-in-predictive-analytics" rel="nofollow">my presentation</a> at the Budapest BI Forum 2014.</em></p>
<h2>Python in service of reproducibility</h2>
<p>by <em>Rita Zágoni</em></p>
<p>The degree to which certain aspects of a research project are reproducible depends on many factors. For instance, in a complex research workflow it is hard to automate all the steps, ranging from <a href="http://blog.microdata.io/reproducible-plumbing" rel="nofollow">data preparation</a> through analysis to <a href="http://blog.microdata.io/224" rel="nofollow">packaging of data and code</a>. Reducing the amount of tacit knowledge and ad hoc manual work can make the workflow easier to access and reproduce. </p>
<p>This is more realistic to achieve if, for each job, we have tools which are accessible, preferably open source, and which support the various operations throughout the data pipeline in a more or less integrated manner. Our weapon of choice, the Python ecosystem, provides many of these tools, offering a wide range of data manipulation and general purpose libraries which cover many steps of the data preparation and transformation process.</p>
<p>To get an overview of a Pythonic data analysis workflow, let us take a look at a subproject for structuring textual input. It is but one node in the <a href="http://blog.microdata.io/224" rel="nofollow">dependency graph</a> but, as happens in our fractalesque world of input-operation-output, it follows the same pattern.</p>
<p>As raw data we have biographical information in semi-structured form (organized in fields which often contain free text) scattered in HTML files. We would like to extract the fields and values, combine them into a dataset, clean the data, then process the text, perform some statistical analysis and, incidentally, display the resulting summaries. In short, we would like to get to this:</p>
<p><a href="https://svbtleusercontent.com/qafseyj5jnyq6a.png" rel="nofollow"><img src="https://svbtleusercontent.com/qafseyj5jnyq6a_small.png" alt="Children and respondents birthyear.png"></a></p>
<p>from a lot of this:</p>
<p><a href="https://svbtleusercontent.com/npoubbju5ifaw.jpg" rel="nofollow"><img src="https://svbtleusercontent.com/npoubbju5ifaw_small.jpg" alt="html.jpg"></a></p>
<p>A sample process in Python (a code sketch follows the list):</p>
<ol>
<li>To extract data from HTML, we use the <a href="https://pypi.python.org/pypi/beautifulsoup4/4.3.2" rel="nofollow">beautifulsoup4</a> third-party library.</li>
<li>We perform a health check of the data with <a href="https://pypi.python.org/pypi/pandas" rel="nofollow">pandas</a>, a powerful data analysis library, which offers data cleaning tools for removing duplicates and handling missing values. </li>
<li>Once the data is in shape, we move on to processing it using Python’s text analysis tools. This mainly involves slicing and gluing strings, exact pattern matching with regular expressions, and fuzzy matching with difflib to tame the ubiquitous typos. These are all contained in the Python standard library.</li>
<li>Once we have some structured data we can turn back to pandas for analysis, such as getting distributions of values or other descriptive statistics. </li>
<li>If we need visualization, 2D figures can be dynamically generated with the <a href="https://pypi.python.org/pypi/matplotlib/1.4.0" rel="nofollow">matplotlib</a> library. Matplotlib is also partially integrated into pandas. </li>
</ol>
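<p>Here is a minimal sketch of the whole pipeline. The file paths, field names and HTML markup are hypothetical; the real extraction logic depends on the structure of the source files.</p>
<pre><code class="prettyprint">import difflib
from pathlib import Path

from bs4 import BeautifulSoup
import pandas as pd

# 1. extract field-value pairs from the HTML files (assumes dt/dd markup)
records = []
for path in Path("html").glob("*.html"):
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
    records.append({dt.get_text(strip=True): dd.get_text(strip=True)
                    for dt, dd in zip(soup.find_all("dt"), soup.find_all("dd"))})

# 2. health check: combine, de-duplicate, inspect missing values
df = pd.DataFrame(records).drop_duplicates()
print(df.isna().sum())

# 3. text processing: exact matching with a regex, fuzzy matching with difflib
df["birthyear"] = df["Born"].str.extract(r"(\d{4})", expand=False).astype(float)
towns = ["Budapest", "Debrecen", "Szeged"]
df["town"] = df["Birthplace"].fillna("").map(
    lambda s: next(iter(difflib.get_close_matches(s, towns, n=1)), None))

# 4. descriptive statistics with pandas
print(df["birthyear"].describe())

# 5. visualization through the pandas-matplotlib integration
df["birthyear"].hist()
</code></pre>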
<h2>Reproducible plumbing</h2>
<p>by <em><a href="http://miklos.koren.hu/" rel="nofollow">Miklós Koren</a></em></p>
<blockquote>
<p>“I spend more than half of my time integrating, cleansing and transforming data without doing any actual analysis.” (interviewee in the <a href="http://db.cs.berkeley.edu/papers/vast12-interview.pdf" rel="nofollow">seminal Kandel, Paepcke, Hellerstein and Heer interview study</a> of business analytics practices)</p>
</blockquote>
<p>In fact, my estimate is that about 80 percent of the work I do in an empirical research project is about getting, transforming, merging, or otherwise preparing data for the actual analysis. </p>
<p>This part of the research, which, for lack of a better word, I will call <em>plumbing</em>, should also be reproducible. Journal referees, editors and readers have come to expect that if I make a theoretical statement, I offer a proof. If I make a statistical claim, I back it up by a discussion of the methodology and offer software code for replication. The reproducibility of plumbing, however, hinges on author statements like “we use the 2013 wave of the World Development Indicators” or “data comes from Penn World Tables 7.”</p>
<p>Most authors don’t make their data plumbing reproducible because <a href="/224" rel="nofollow">reproducibility is hard</a>. Very hard. Data comes in various formats, some of the files are huge, and most researchers don’t speak a general-purpose programming language that could be used to automate the data transformation process. In fact, most data transformation is still <em>ad hoc</em>: pointing and clicking in Excel, copying and pasting, and doing a bunch of VLOOKUPs. (For the record, VLOOKUPs are great.)</p>
<p>I have just finished preparing a fully programmatic replication package for a <a href="http://miklos.koren.hu/papers/peer_reviewed_publications/administrative_barriers_to_trade/" rel="nofollow">recent study</a> that is about to be published. Let me give you examples of the challenges this has brought up.</p>
<ul>
<li>Large datasets. The originals of the datasets I use are dozens of GB in size. By the end of my plumbing, I end up with a few hundred MBs, but if I want to make the whole process transparent and reproducible, I also need to show the original data.</li>
<li>Inconsistent URLs and schema. The Spanish <em>Agencia Tributaria</em> is very helpful in publishing <em>all</em> their trade data online. There is a lot of structure in how they store the files and what they contain, but every year there are a few inconsistencies to make me cringe and debug for hours. (For example, find the odd one out among the <a href="http://www.agenciatributaria.es/AEAT.internet/Inicio_es_ES/La_Agencia_Tributaria/Memorias_y_estadisticas_tributarias/Estadisticas/Comercio_exterior/Datos_estadisticos/Descarga_de_Datos_Estadisticos/Descarga_de_datos_mensuales_maxima_desagregacion_en_Euros__centimos_/2009/Enero/Enero.shtml" rel="nofollow">links here</a>.)<br>
</li>
<li>Country names. This is a special case of inconsistent schema. Every single data source uses their own codebook for identifying countries. In the best case, you get the 3-letter ISO-3166 code of the country, like <code class="prettyprint">HUN</code> and <code class="prettyprint">USA</code>. These are great because they are a standard and quite human readable, right? Not so fast. Did you know that the 3-letter code changes when the country changes name? When Zaire became the Democratic Republic of the Congo, its <a href="https://www.iso.org/obp/ui/#iso:code:3166:ZR" rel="nofollow">code changed from <code class="prettyprint">ZAR</code> to <code class="prettyprint">COD</code></a>. The best would be to use the <a href="http://en.wikipedia.org/wiki/ISO_3166-1_numeric" rel="nofollow"><em>numeric codes</em> of ISO-3166</a>, which are fairly stable over time, but almost nobody uses these. (A sketch of a code crosswalk follows this list.)</li>
<li><p>Undocumented and unsupported data on websites. The <a href="http://doingbusiness.org/" rel="nofollow">Doing Business</a> project of the World Bank provides one of the greatest resources on cross-country data. But when they offer to “get all data,” they don’t actually mean it.<br>
<a href="https://svbtleusercontent.com/raok17cgymh1ag.png" rel="nofollow"><img src="https://svbtleusercontent.com/raok17cgymh1ag_small.png" alt="get-all-data.png"></a><br>
They have much more detailed data on their website which you cannot download and is not archived. These are, for example, the detailed costs of importing in Afghanistan in 2014, but the website doesn’t publish this data for earlier years. Luckily, <a href="http://web.archive.org/web/20091003023159/http://www.doingbusiness.org/ExploreTopics/TradingAcrossBorders/Details.aspx?economyid=2" rel="nofollow">web.archive.org</a> comes to the rescue.<br>
<a href="https://svbtleusercontent.com/2v9xxqn9u2w.png" rel="nofollow"><img src="https://svbtleusercontent.com/2v9xxqn9u2w_small.png" alt="detailed.png"></a></p></li>
<li><p>Big boxes of data. There is an 18MB .xls file I use from the 860MB .zip-file an author helpfully published on their website. The objective is laudable (like I said above, make everything available in the replication package), but I would prefer the option to download just what I need.</p></li>
</ul>
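<p>One way to tame the country-code problem is a hand-maintained crosswalk from whatever codes the sources use to the stable ISO-3166 numeric codes. A minimal sketch (the mapping is illustrative, not complete):</p>
<pre><code class="prettyprint"># illustrative crosswalk from 3-letter codes (including legacy ones)
# to stable ISO-3166 numeric codes; extend it for your own sources
ALPHA3_TO_NUMERIC = {
    "HUN": 348,
    "USA": 840,
    "COD": 180,  # Democratic Republic of the Congo
    "ZAR": 180,  # legacy code for the same country (formerly Zaire)
}

def country_number(code):
    try:
        return ALPHA3_TO_NUMERIC[code.upper()]
    except KeyError:
        raise ValueError("unknown country code: %r" % code)

assert country_number("ZAR") == country_number("COD") == 180
</code></pre>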
<p>The movements of “reproducible research” and “open data” need standardized data APIs that can be programmatically queried (the <a href="http://data.worldbank.org/developers/api-overview" rel="nofollow">World Bank Data API</a> is the best example I have seen so far), and data manipulation tools that can ingest a variety of formats from a variety of sources. So far, we are still looking.</p>
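<p>For illustration, a programmatic query against the World Bank Data API can be as short as the sketch below (endpoint shape and indicator code as documented at the link above; check the current API docs before relying on this):</p>
<pre><code class="prettyprint"># fetch GDP (current US$) for Hungary from the World Bank Data API
import requests

url = "https://api.worldbank.org/v2/country/HUN/indicator/NY.GDP.MKTP.CD"
resp = requests.get(url, params={"format": "json", "date": "2000:2013"})
resp.raise_for_status()
meta, observations = resp.json()  # the first element holds paging metadata

for obs in observations:
    print(obs["date"], obs["value"])
</code></pre>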
<h2>2+2<4</h2>
<p><em>by Krisztián Fekete</em></p>
<p>Being precise in describing how we achieve a result is hard. </p>
<p>It is not optional, though. Let’s look at the inequality <code class="prettyprint">2+2<4</code>. Since we have learnt <code class="prettyprint">2+2=4</code> in primary school, the inequality looks strange. It can become true, however, if we extend it with further information: what the units are and what we mean by the + operation.</p>
<p><code class="prettyprint">2+2<4</code> might be a shorthand among mud-making specialists for <code class="prettyprint">2 liters of sand mixed with 2 liters of water gives less than 4 liters of mud</code>, but no one except the mud-makers would understand it (we get less than 4 liters of mud, as the sand drinks up some of the water).</p>
<p>For a result to be meaningful and reproducible, both the <em>input</em> (such as 2 liters of sand and 2 liters of water) and the <em>operation</em> (such as mixing the two substances) have to be given <em>explicitly</em>; otherwise we will not have the expected <em>result</em> (mud with a total volume of less than 4 liters).</p>
<hr>
<p>In our work we do a lot of data transformations and calculations. Our <em>result</em> is data, our <em>input</em> is also data, our <em>operation</em> is a computer program.<br>
Input, code and output are strongly related to each other (as in the mud example above), so it makes sense to pack them together.</p>
<p>Our input data is usually also a result of some calculations, so we end up with this model:<br>
<a href="https://svbtleusercontent.com/zupg4o3jb5hbkw.png" rel="nofollow"><img src="https://svbtleusercontent.com/zupg4o3jb5hbkw_small.png" alt="datapackage.png"></a><br>
This is a simple dependency graph, where the data is packed together with the code producing it.</p>
<p>When there is no intermediate data (there is only one program), this setup occurs naturally, as it is the easiest thing to do.</p>
<p>For more complex workflows, we have abandoned this model and separated code (version control, GitHub) from data (file servers), and thus have to maintain links between the two (problematic!). As if code likes to live with code and data with data.</p>
<p>To be able to return to sanity (with packages like in the picture) we need the following (a code sketch of the idea follows the list):</p>
<ul>
<li>tools for working with packages consisting of
<ul>
<li>data</li>
<li>code that produces it</li>
<li>input data references</li>
</ul>
</li>
<li>a way to easily reference other packages (thus accessing their data as input)</li>
<li>a way to easily re-create a package with updated inputs</li>
<li>a tool to verify packages, i.e. to check that the included code indeed produces the included output</li>
<li>a tool to export a whole workflow as code, so that we can run it in one step</li>
<li>potentially, all of this working offline, without a server</li>
</ul>
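<p>To make the idea concrete, here is a minimal sketch of such a package and its verification step. The manifest layout and field names are hypothetical; no such tool exists yet, which is the point of this post.</p>
<pre><code class="prettyprint"># a sketch: a package = data + the code that produced it + input references,
# verified by re-running the code and comparing output checksums
import hashlib
import json
import subprocess
from pathlib import Path

def sha256(path):
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def verify(package_dir):
    """Re-run the package's code and check it reproduces the stored output."""
    pkg = Path(package_dir)
    manifest = json.loads((pkg / "package.json").read_text())
    subprocess.run(["python", manifest["code"]], cwd=pkg, check=True)
    return all(sha256(pkg / name) == digest
               for name, digest in manifest["outputs"].items())

# a hypothetical manifest (package.json):
# {"inputs": ["../raw-trade-data"],
#  "code": "transform.py",
#  "outputs": {"clean.csv": "9f86d081..."}}
</code></pre>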
<p>We have been thinking about this problem for some time now, and will keep you posted about our solutions.</p>
<h2>Looking for a post office? How about 150,000 of them?</h2>
<p><em>by <a href="http://miklos.koren.hu/" rel="nofollow">Miklós Koren</a></em></p>
<p>In a <a href="http://miklos.koren.hu/papers/working_papers/bridges/" rel="nofollow">recent project</a>, we proxy 19th-century local economic development with the number of post offices in the neighborhood. We have downloaded all historical post office names and dates from <a href="http://www.postalhistory.com/postoffices.asp" rel="nofollow">Jim Forte’s Postal History</a> (with his permission) to create a <a href="https://ceumicrodata.cartodb.com/tables/postoffices/public" rel="nofollow">geospatial database</a>.</p>
<p>How many post offices were there in Kings County, NY, in 1880? In answering this question, we faced a computational challenge: spatially querying 150,000 points is not trivial. The naive algorithm would go through all the points and make a spatial comparison with the polygon: “Is this point inside the polygon?” This is clearly an inferior solution. When we asked <a href="http://www.qgis.org/en/site/" rel="nofollow">QGIS</a> to calculate the number of post offices by county, it hadn’t finished in three hours. Maybe I am impatient, but this seemed unacceptable to me. So I started researching the <a href="http://en.wikipedia.org/wiki/Quadtree" rel="nofollow">quadtree</a> algorithm.</p>
<p>The basic idea of a quadtree is that it splits the plane into four quadrants. Much like binary search, this makes searching faster: <code class="prettyprint">O(log N)</code> fast, to be precise. </p>
<p>Finding points within a polygon is trivial if the bounding box of all points is inside.<br>
<a href="https://svbtleusercontent.com/ydvuvu4ojngzxw.png" rel="nofollow"><img src="https://svbtleusercontent.com/ydvuvu4ojngzxw_small.png" alt="easy.png"></a><br>
Much less so if some of the points are outside.<br>
<a href="https://svbtleusercontent.com/jq8k3upzn05fha.png" rel="nofollow"><img src="https://svbtleusercontent.com/jq8k3upzn05fha_small.png" alt="hard.png"></a><br>
However, if the points are stored in a quadtree, we will find that the bottom-right quadrant of the rectangle is inside the polygon, and so are the two post offices within it.<br>
<a href="https://svbtleusercontent.com/uouq04aojpbueg.png" rel="nofollow"><img src="https://svbtleusercontent.com/uouq04aojpbueg_small.png" alt="quadtree.png"></a> <br>
To find the remaining points, we keep splitting the quadrants into further and further quadrants. </p>
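<p>To illustrate the idea, here is a toy sketch (not the module linked below): a quadrant splits once it holds more than a fixed number of points, and a query only descends into quadrants that intersect the query box. A real implementation also needs point-in-polygon tests for partially covered quadrants; the sketch only shows the pruning logic with a bounding-box query.</p>
<pre><code class="prettyprint">MAX_POINTS = 11  # leaf capacity, matching the module described below

class Node(object):
    def __init__(self, x0, y0, x1, y1):
        self.x0, self.y0, self.x1, self.y1 = x0, y0, x1, y1
        self.points = []
        self.children = None

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
        else:
            self.points.append(p)
            if len(self.points) > MAX_POINTS:
                self._split()

    def _split(self):
        mx, my = (self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0
        self.children = [Node(self.x0, self.y0, mx, my),  # bottom left
                         Node(mx, self.y0, self.x1, my),  # bottom right
                         Node(self.x0, my, mx, self.y1),  # top left
                         Node(mx, my, self.x1, self.y1)]  # top right
        points, self.points = self.points, []
        for p in points:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        mx, my = (self.x0 + self.x1) / 2.0, (self.y0 + self.y1) / 2.0
        return self.children[(p[0] >= mx) + 2 * (p[1] >= my)]

    def count_in_box(self, x0, y0, x1, y1):
        # prune: skip quadrants that do not intersect the query box
        if x1 < self.x0 or self.x1 < x0 or y1 < self.y0 or self.y1 < y0:
            return 0
        if self.children is not None:
            return sum(c.count_in_box(x0, y0, x1, y1) for c in self.children)
        return sum(1 for px, py in self.points
                   if x0 <= px <= x1 and y0 <= py <= y1)
</code></pre>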
<p>I implemented this algorithm in a <a href="https://github.com/ceumicrodata/quadtree" rel="nofollow">Python module</a>. You instantiate the <code class="prettyprint">QuadTree</code> class with a list of points (which may come as pairs of coordinates or as a <a href="http://geojson.org/" rel="nofollow">GeoJSON</a> string), which builds a quadtree with at most 11 points in each quadrant. You can then query this quadtree with any polygon like this:</p>
<pre><code class="prettyprint"># import paths assumed; QuadTree comes from the module linked above
from quadtree import QuadTree
from geojson import Feature

post_offices = QuadTree([...])
kings_county = Feature(...)
print(post_offices.count_overlapping_points(kings_county))
</code></pre>
<p>If post offices were equally distributed in space, the quadtree would be at most 7 layers deep: 150,000 points at 11 per leaf means roughly 14,000 leaves, and 7 rounds of splitting already yield 4<sup>7</sup> ≈ 16,000 quadrants. This means drastically fewer spatial comparisons.</p>
<p>As a result, I could count and list post offices in a matter of seconds instead of hours.</p>