Problems with measuring performance in Machine Learning

With today's advancements in AI we often see media reports of superhuman performance in some task. These often quite dramatic announcements should, however, be treated with a dose of skepticism, as many of them may result purely from pathologies in the measures applied to the problem. In this post I'd like to show what I mean by a "measurement pathology". I therefore constructed a simple example, which hopefully will get the point across.

Example: measuring lemons

Imagine somebody came to your machine learning lab/company with the following problem: identify lemons in a photo. This problem sounds clear enough, but in order to build an actual machine learning system that will accomplish such a task, we have to formalize what this means in the form of a measure (of performance). The way this typically begins is that some student laboriously labels the dataset. For the sake of this example, my dataset consists of a single image with approximately 50 lemons in it:

As mentioned the picture was carefully labeled:

With human labeled mask here:

Now that there is a ground truth label we can establish a measurement. One way to formally express the desire to identify lemons in this picture is to use the so-called intersection over union (sometimes referred to as the Jaccard index).

IoU = \frac{|L \cap R|}{|L \cup R|}

Where L is the label mask, R is the algorithm result (mask) and |·| is the area of a mask, i.e. the number of set pixels.

This commonly used measure for object detection and semantic segmentation takes two binary masks: one is the ground truth human labeling and the other is the result of an algorithm. It calculates the surface area of the intersection of these masks (pixels that are set in both the human labeling and the algorithm result) and divides this number by the surface area of the union of the two masks (pixels that are set in either one or the other). As much as this may sound complicated, it is in fact fairly intuitive. Very similar masks will have a large overlap (intersection). In the extreme case of two identical masks, their intersection will be identical to their union and therefore the measure will achieve the perfect value of 1.0. The less overlap there is, the worse the result should be, but we also need to penalize false positives, therefore we divide by the surface of the entire union. E.g. a mask that is set everywhere will have a very good overlap with the ground truth, but the union (the whole picture) will be huge and the result will be some small value. So for any practical purpose this seems like a great measure. So we begin constructing algorithms:
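The definition above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used to produce the numbers in this post: the function name iou, the tiny hand-made masks and the convention for two empty masks are all assumptions made for the example.

```python
def iou(label, result):
    """Intersection over union of two binary masks.

    Masks are equally sized 2D lists of 0/1 values (a toy stand-in
    for real image masks).
    """
    pairs = [(l, r) for label_row, result_row in zip(label, result)
             for l, r in zip(label_row, result_row)]
    intersection = sum(1 for l, r in pairs if l and r)  # set in both masks
    union = sum(1 for l, r in pairs if l or r)          # set in either mask
    # Assumed convention: two completely empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 3x4 "image" with a single 2x2 labeled object.
label = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
identical = [row[:] for row in label]      # a perfect result
everything = [[1] * 4 for _ in label]      # a mask that is set everywhere

print(iou(label, identical))   # 1.0
print(iou(label, everything))  # 4 / 12, about 0.333
```

The mask that is set everywhere covers the whole label, yet it is penalized because the union grows to the full picture.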

Algorithm 1 returns this mask:

with the value of intersection over union (IoU) equal to 0.909. We are happy with this result. Algorithm 2 returns this mask:

with an intersection over union (IoU) result of 0.119 (these are actual numbers which you can verify by downloading these images and measuring everything yourself). We are not happy with this result, so we throw algorithm 2 into the garbage bin and proceed to writing a NIPS paper on algorithm 1. In 90% of cases such machine learning exercises would end here.

Deeper inspection

Some weeks later the person who originally stated the problem gets back to our lab to ask for the result. We happily hand them algorithm 1, show them the accepted NIPS paper and how great it works. They take it and deploy it, and come back to us saying it is actually crap. Turns out the original application that the person had in mind was to count lemons on the lemon tree in order to estimate the harvest. Now this changes everything - turns out intersection over union is not a very good measure for this application! Actually it is a very bad measure!

Let us come back to the results of our algorithms once again from this new perspective: algorithm 1 had a very good intersection over union, but of the 51 lemons present in the human labeled data it only detected 39. Algorithm 2, even though it was not very precise in outlining each lemon (it only generated a fixed-size spot inside each one it detected), actually detected 47 lemons. Turns out that for the application that was originally intended, algorithm 2 is much better!
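The same reversal can be reproduced on a toy example. The sketch below is not the procedure used to obtain the 39 and 47 counts above. It assumes, purely for illustration, that a labeled lemon counts as "detected" when at least one result pixel falls inside it, and it finds lemons in the label with a simple 4-connected flood fill. All function names and masks are hypothetical.

```python
def iou(label, result):
    """Intersection over union of two equally sized 2D lists of 0/1."""
    pairs = [(l, r) for lr, rr in zip(label, result) for l, r in zip(lr, rr)]
    intersection = sum(1 for l, r in pairs if l and r)
    union = sum(1 for l, r in pairs if l or r)
    return intersection / union if union else 1.0


def components(mask):
    """Connected blobs of set pixels (4-connectivity), as sets of (row, col)."""
    seen, blobs = set(), []
    height, width = len(mask), len(mask[0])
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or (r, c) in seen:
                continue
            stack, blob = [(r, c)], set()
            seen.add((r, c))
            while stack:  # iterative flood fill
                y, x = stack.pop()
                blob.add((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            blobs.append(blob)
    return blobs


def detected(label, result):
    """Assumed criterion: a labeled blob is detected if any result pixel hits it."""
    return sum(1 for blob in components(label)
               if any(result[y][x] for y, x in blob))


# Three hypothetical 2x2 "lemons".
label = [[1, 1, 0, 1, 1, 0, 1, 1],
         [1, 1, 0, 1, 1, 0, 1, 1]]
# Outlines two lemons perfectly, misses the third.
outline_two = [[1, 1, 0, 1, 1, 0, 0, 0],
               [1, 1, 0, 1, 1, 0, 0, 0]]
# Puts a single-pixel spot inside every lemon.
dots = [[1, 0, 0, 1, 0, 0, 1, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]]

for name, result in (("outline_two", outline_two), ("dots", dots)):
    print(name, round(iou(label, result), 3), detected(label, result))
# outline_two 0.667 2
# dots 0.25 3
```

The mask with the better IoU finds fewer lemons, which is exactly the situation described above, just at a much smaller scale.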

In reality, who knows how many algorithms with potentially great applications shared the fate of our algorithm 2, just because the established metrics or datasets did not favor them.

When a measure becomes a target, it ceases to be a good measure. - Goodhart's law

Other cases

There are numerous other cases where we may fall into similar traps (arguably my example above is a bit exaggerated to make it easy to see the problem). In fact in many (if not all) cases where somebody reports superhuman performance in some perceptual task, it only means that the algorithm over-optimized for the particular formal measure, whereas humans understand the objective of the experiment differently and therefore underperform. E.g. there have been numerous claims of algorithms achieving superhuman performance in image classification on ImageNet. These claims fade entirely if we apply a different measurement methodology. Walter Scheirer from the University of Notre Dame, for example, applied measurements typically used in psychophysics and got very much sub-human performance on state of the art deep nets. See his talk below:

Also take a look at my various posts such as "Just How Close we are to Solving Vision" and "Can a Deep Net See a Cat" among others. Similar problems plague natural language processing and pretty much any machine learning application. The paper "Machine Learning That Matters" by Kiri Wagstaff summarizes these and other problems. We also spend quite some time on that in the introduction to the PVM paper.

Measuring long tails

Things get additionally difficult when the application we deal with produces data with a heavy-tailed distribution. This means that, unlike with e.g. Gaussian distributions, outliers cannot be ignored. Typically for a normal distribution an event that is 5-6 standard deviations away from the mean is so improbable that it can be safely deemed impossible. However, with long tail distributions, such as the Pareto distribution, such a statement cannot be made. Outliers are infrequent, but not nonexistent. As much as this may seem irrelevant, take the following example:
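To put rough numbers on this, the sketch below compares the probability of landing more than 5 standard deviations above the mean for a normal distribution and for a Pareto distribution. The Pareto parameters (shape alpha = 3, scale x_min = 1) are an arbitrary assumption chosen only so that the variance is finite; other heavy-tailed choices would give different figures.

```python
import math


def gaussian_tail(k):
    """P(X > mean + k * sd) for a normal distribution."""
    return 0.5 * math.erfc(k / math.sqrt(2))


def pareto_tail(k, alpha=3.0, x_min=1.0):
    """P(X > mean + k * sd) for a Pareto distribution.

    Requires alpha > 2 so that the standard deviation is finite.
    alpha and x_min are illustrative values, not taken from any dataset.
    """
    mean = alpha * x_min / (alpha - 1)
    sd = math.sqrt(x_min ** 2 * alpha / ((alpha - 1) ** 2 * (alpha - 2)))
    threshold = mean + k * sd
    # Pareto survival function: P(X > x) = (x_min / x) ** alpha for x >= x_min
    return (x_min / threshold) ** alpha


print(f"normal: {gaussian_tail(5):.1e}")  # roughly 2.9e-07
print(f"pareto: {pareto_tail(5):.1e}")    # roughly 5.0e-03
```

Under these assumptions the "5 sigma" event is about four orders of magnitude more likely for the Pareto distribution than for the normal one.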

A self driving car uses an algorithm to semantically segment what it sees in front. A measure typically used for evaluation of performance in this case is our familiar intersection over union (over labels). Great.

Now imagine the car drives 100 miles, 99 of which are nice weather but the last mile is rain, fog and whatever bad conditions you could possibly imagine. Say you are being driven inside this car: would you rather have:

  1. An algorithm that achieves great overlaps (and therefore very high IoU) on the 99 perfect condition miles and fails completely on the last mile, ultimately crashing into a truck.
  2. An algorithm that gives somewhat rough outlines (though enough for safe driving) - consequently performs poorly on IoU, but deteriorates only mildly over the last mile and gets you home safely?

I suppose the answer is obvious. Avoiding rare but dangerous situations is clearly the more important objective here, but at the same time it is horribly difficult to measure, because rare and dangerous situations are above everything else rare (but not nonexistent!). Such measures are also very hard to optimize for, because the main objective to optimize against lies beyond where the vast majority of the distribution of data is concentrated. The only way to do it as of today is to hand-rebalance the data, which is not great since it essentially is like babysitting an algorithm.
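The 100-mile scenario can be turned into a small calculation. All per-mile IoU values below are invented for illustration (nothing here was measured), and summarize is a hypothetical helper. The point is only that an average over all miles and a worst-case view rank the two algorithms in opposite order.

```python
def summarize(per_mile_iou):
    """Return (mean IoU, worst single-mile IoU) for a list of per-mile scores."""
    mean = sum(per_mile_iou) / len(per_mile_iou)
    return mean, min(per_mile_iou)


# Made-up scores: 99 miles of nice weather followed by 1 mile of rain and fog.
algorithm_1 = [0.95] * 99 + [0.05]  # great outlines, collapses in bad weather
algorithm_2 = [0.70] * 99 + [0.60]  # rough outlines, degrades only mildly

mean_1, worst_1 = summarize(algorithm_1)
mean_2, worst_2 = summarize(algorithm_2)

print(f"algorithm 1: mean {mean_1:.3f}, worst mile {worst_1:.2f}")
print(f"algorithm 2: mean {mean_2:.3f}, worst mile {worst_2:.2f}")
# algorithm 1: mean 0.941, worst mile 0.05
# algorithm 2: mean 0.699, worst mile 0.60
```

The mean barely registers the one catastrophic mile, while the worst-case figure is dominated by it. Which of the two summaries matters depends entirely on the application.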


Incidentally, there is one application where the measure of performance is clearly and unambiguously defined, and that is also where AI achieves its greatest successes - games. In games, the evaluation metric is part of the rules of the game - optimizing an AI agent for that objective will ultimately yield a system that will win the game and therefore we can safely bet on victory (given enough GPUs and time). However, even in those highly synthetic environments, objectives may have flaws and AI agents are often able to exploit them, as nicely summarized in this OpenAI post.

Measurement conundrum

A notable but often overlooked problem of machine learning is that evaluation metrics are treated as some form of taboo. Since the scientific approach tells us clearly that eyeballing data is not an objective measurement (which it obviously is not!), we are forced to come up with formal measures. But at the same time formal measures suffer from the problems discussed in this post. However, merely stating this fact often triggers panic: criticizing a formal measure is sometimes taken as a critique of the scientific method. This is obviously a misunderstanding. In fact the measures we use should be subject to just as thorough scrutiny as the models we measure with them. When some measure proves irrelevant, it should be abandoned, even if it means we have to throw some amazing "superhuman AI" claim into the garbage. The panic among researchers is obviously caused by the fact that people often spend their entire careers optimizing performance on a given measure, and therefore dismissing the measure will obliterate all those results in one shot.


This little post was meant to highlight a problem: the ultimate result of our machine learning efforts is only worth as much as the measure we optimize for. If the measure is wrong, any amount of optimization will eventually be futile. There is no single good way of measuring performance, and pretty much every measure we apply will have its own pathologies. Optimizing beyond human level performance in one specific measure may get us a paper published but might not actually be an achievement of any practical relevance - unless that specific measure is exactly what is needed for an application. The whole measurement obsession is a charade that is supposed to make this field look more scientific than it actually is. In reality those who come up with good measures to optimize win, and there is no strict and systematic methodology on how to do that - it is more an art than a science.

