The paradox of reproducibility in statistical research


One of the hallmarks of science is the reproducibility of results. It lies at the very foundation of our epistemology that the objectivity of a result can only be assured if others are able to independently reproduce the experiment.

One could argue that science today actually has various issues with reproducibility, e.g. results obtained with a unique instrument (such as the LHC, the Large Hadron Collider) cannot be reproduced anywhere else, simply because nobody has another such instrument. At least in this case the results are reproducible in principle, and aside from the lack of another instrument, the basic scientific methodology can remain intact. Things get a bit more hairy with AI.

Determinism, reproducibility and randomness

The one hidden assumption with reproducibility is that reality is roughly deterministic, and that the results of the experiment depend deterministically on the experimental setup. After carefully tuning the initial conditions we expect the same experimental result. But things get more complex when our experiment itself is statistical in nature and relies on a random sample.

Consider, for example, the experiment called elections: once the experiment is performed it cannot be reproduced, since the outcome of the first experiment substantially affects the system studied (society).

Such complex cases are often modeled by statistics, where we assume that the object we study is random, and that each realization of the experiment could be completely unreproducible (since we have no control over the random variable), but given enough properties of the underlying process we can still make some statements about the probability of particular events. Such statistical derivations are arguably weaker than deterministic statements - they can only assert things "with probability", nothing is ever certain.

Reproducibility of statistical results

In statistics reproducibility has to be relaxed. Imagine we test a hypothesis that data obtained from some process has the property P. We conclude that our hypothesis is valid at a significance level of p=0.05. This means that we estimate the probability of obtaining data like ours under the assumption that our hypothesis is false (the null hypothesis) to be small (at most 0.05). This however still means that if the hypothesis is false, roughly one in every twenty experiments will produce data that agrees with it anyway. Imagine now we reproduce the result with new data, and the results agree (the new data has property P). Does this mean the hypothesis is valid? It may, or maybe we simply got by chance another sample that happens to have the property P, yet the hypothesis is false. Depending on the type of hypothesis we test, we may also get a sample that violates property P and yet statistically the hypothesis may still be valid. Things get very subtle.
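
The "one in twenty" figure can be checked with a small simulation. The sketch below is only an illustration under assumptions that are not part of the argument above: the data is standard normal, the null hypothesis is "the mean is 0" (true by construction), and the test is a two-sided z-test with known standard deviation. The seed and the number of repetitions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

n_experiments = 10_000  # hypothetical number of repeated experiments
n = 20                  # observations per experiment
z_crit = 1.96           # two-sided critical value for a 0.05 significance level

# Every experiment draws from N(0, 1), so the null hypothesis "mean = 0" is true.
samples = rng.standard_normal((n_experiments, n))

# z statistic for a known standard deviation of 1: sample mean times sqrt(n).
z = samples.mean(axis=1) * np.sqrt(n)

# Fraction of experiments that look "significant" purely by chance.
false_positive_rate = float(np.mean(np.abs(z) > z_crit))
print(f"rejected a true null in {false_positive_rate:.1%} of experiments")
```

The printed fraction should land close to 5%: each of those runs, taken alone, would look like a legitimate significant result.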

New data, old data

Let's perform a simple experiment and choose some random numbers in Python:

>>> import numpy as np
>>> S = np.random.randn(20)
>>> S
array([-1.90045341, -0.93594812, -0.64594106,  1.62681839, -1.55754457,
        0.70857714, -0.05837571,  1.36144584,  0.56760058,  1.09709778,
        1.49888013,  0.7241183 , -1.25662035, -0.70185234, -0.3946222 ,
       -0.51820878,  1.7173939 , -0.79465654,  1.53630409,  1.23098428])

For the moment let's set aside the discussion of whether these numbers are truly random or pseudo-random. S is a sample of data from some process, in this case the process is called np.random.randn(). Imagine that data from this source is actually expensive and hard to get, hence the single sample S we have is valuable. We can now perform data analysis/mining/machine learning on it. We can also simulate another sample by taking a random permutation of S:

>>> np.random.permutation(S)
array([ 1.7173939 , -1.90045341,  0.70857714, -0.51820878, -1.55754457,
       -0.79465654,  1.23098428, -0.05837571, -0.93594812,  1.53630409,
       -0.70185234,  0.7241183 , -0.3946222 ,  0.56760058, -0.64594106,
       -1.25662035,  1.09709778,  1.62681839,  1.49888013,  1.36144584])

Is this a valid sample from np.random.randn()? Well, if we know that the process generates independent numbers from the same distribution throughout the course of sampling, and hence that the distribution of the sample is invariant with respect to permutation, then our new sample certainly looks valid. But we can clearly see that even though it looks like a new valid sample, it is not independent of our previous sample (since it contains the same elements), and this limits what we can do with it.
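
A quick way to see the dependence is to compare permutation-invariant statistics. This is a minimal sketch using NumPy's newer Generator API with an arbitrary seed instead of the np.random.randn() calls above; the names P and T are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

S = rng.standard_normal(20)  # the original, "expensive" sample
P = rng.permutation(S)       # a "new" sample obtained by permuting S
T = rng.standard_normal(20)  # a genuinely new sample from the same process

# The permuted sample contains exactly the same elements as S.
same_elements = np.array_equal(np.sort(S), np.sort(P))

# Any statistic that ignores ordering is therefore (numerically) identical.
print(same_elements, np.isclose(S.mean(), P.mean()), S.max() == P.max())

# A truly fresh sample differs, so it carries new information.
print(np.array_equal(np.sort(S), np.sort(T)), S.max() == T.max())
```

Whatever order-independent quantity a model learns from S, the permuted sample can only confirm it.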

Anyway, we can test some hypothesis here, e.g. that with some fairly high probability negative samples appear in the result of np.random.randn() as frequently as positive samples, and so on. In machine learning, however, the hypothesis (model) could be very complex, so it might be very difficult to estimate the p-value (particularly because we typically don't know the source distribution). In such a case we can use another methodology, widely accepted in machine learning: cross-validation, i.e. validation on a held-out test set. So we split our sample S into S_train and S_test, use S_train to establish the model and S_test to evaluate it. We can then choose our partition randomly and test the model in a multitude of ways.

But this methodology has important limits: we are ultimately testing on a finite sample, while we are really interested in how the model will perform on new data from the original process, which as we remember is difficult to obtain. To illustrate how easy it is to fall victim to a finite sample, imagine that our model establishes the following hypothesis: no value in the data ever exceeds 2.0. This hypothesis happens to be valid in our finite sample, and we can cross-validate all we want and the result will always be the same: the hypothesis is 100% valid. Obviously it is enough to repeat the original sampling once or twice to brutally find out that the hypothesis was absolutely wrong:

>>> S = np.random.randn(20)
>>> S
array([-0.96941176, -0.91026703,  0.04308561, -1.06180971, -0.76098517,
       -0.31288216,  0.60411325,  0.17335574,  0.7683286 ,  0.30599797,
        0.05474424,  2.1120005 ,  1.58270061,  2.2077597 , -0.00660874,
        0.23291533, -1.17122443,  0.47076028, -1.33029871, -0.65331696])

I hope this simple example helps to show clearly that the 20 numbers we chose are insufficient to represent the hidden (Gaussian in this case) distribution. Working with finite samples (e.g. established datasets) allows for full reproducibility of results, but hides the danger of supporting flawed models that work only on that finite sample. This is nothing else but overfitting.
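
The whole argument can be replayed in a few lines. The sketch below reuses the first sample S printed above; the 15/5 split, the number of repetitions, the seed and the helper name hypothesis_holds are assumptions made only for this illustration.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# The first sample S printed earlier in the post.
S = np.array([-1.90045341, -0.93594812, -0.64594106,  1.62681839, -1.55754457,
               0.70857714, -0.05837571,  1.36144584,  0.56760058,  1.09709778,
               1.49888013,  0.7241183 , -1.25662035, -0.70185234, -0.3946222 ,
              -0.51820878,  1.7173939 , -0.79465654,  1.53630409,  1.23098428])

def hypothesis_holds(x, threshold=2.0):
    """The toy 'model': no value in the data ever exceeds the threshold."""
    return bool(np.all(x <= threshold))

# "Cross-validation": many random splits of the same finite sample.
# The first 15 shuffled values would be S_train, the last 5 are S_test.
cv_pass_rate = float(np.mean(
    [hypothesis_holds(rng.permutation(S)[15:]) for _ in range(1_000)]
))

# The same hypothesis checked on fresh samples of 20 from the real process.
fresh_pass_rate = float(np.mean(
    [hypothesis_holds(rng.standard_normal(20)) for _ in range(10_000)]
))

# Analytic value for comparison: P(all 20 values <= 2) = Phi(2) ** 20.
phi_2 = 0.5 * (1.0 + math.erf(2.0 / math.sqrt(2.0)))
expected_pass_rate = phi_2 ** 20

print(f"passes on splits of S:    {cv_pass_rate:.0%}")
print(f"passes on fresh samples:  {fresh_pass_rate:.1%}")
print(f"analytic expectation:     {expected_pass_rate:.1%}")
```

Every split of S confirms the hypothesis because no split can contain a value that S itself lacks, while a substantial fraction of fresh samples of the same size refute it.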

The paradox

The scientific approach in the case of machine learning dictates that the results should be reproducible, therefore a set of canonical datasets can be used to test things on.

But here comes the problem - the models are inherently statistical. The training data is assumed to be a sample from some distribution, and we are concerned with performance on a new, yet unknown sample from that distribution (if we are actually interested in practical applications and not just producing papers). Since analytic estimates with complex neural networks are often impossible, we use an empirical approach and test the algorithms on samples of data. In order to perform an unbiased test, the test data should be new, unseen by the algorithm (and in fact even by the researcher). Unless new data can somehow be generated, we have to use a trick and set aside some amount of data for testing. This is the typical practice: thousands of papers report the performance of statistical models on set-aside portions of canonical datasets (such as the famous MNIST). Every researcher can download the dataset, implement the algorithm and reproduce the claim of the paper; everything seems scientifically valid and happy.

However, much like a random permutation of 20 numbers (originally chosen from a Gaussian distribution) is not the same as having a new sample of 20 numbers, a subset of handwritten digits chosen from MNIST is not the same as having a new sample of handwritten digits.

The performance on the same sample of data very quickly becomes irrelevant. In fact, repeatedly evaluating models on the same data is "statistically illegal" and introduces biases, as we saw above with our flawed hypothesis. In statistics such a situation is referred to as multiple hypothesis testing and requires corrections to keep the results "unbiased". One example of such a correction is the Bonferroni correction, which tightens the significance threshold required of each individual test according to the number of hypotheses tested. Suffice it to say that every additional hypothesis test on the same data reduces statistical significance quite drastically.
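
To give a feeling for the numbers, here is a small worked example. The values of alpha and m are hypothetical, and the family-wise error formula assumes independent tests purely to keep the arithmetic simple (the Bonferroni bound itself does not need independence).

```python
alpha = 0.05  # significance level we would like to keep overall
m = 20        # hypothetical number of hypotheses tested on the same data

# Chance of at least one false positive when every null hypothesis is true
# and each of the m tests is run at level alpha (independence assumed).
fwer_uncorrected = 1 - (1 - alpha) ** m

# Bonferroni correction: run each individual test at level alpha / m instead.
alpha_bonferroni = alpha / m
fwer_corrected = 1 - (1 - alpha_bonferroni) ** m

print(f"uncorrected family-wise error rate: {fwer_uncorrected:.3f}")
print(f"per-test threshold after Bonferroni: {alpha_bonferroni:.4f}")
print(f"corrected family-wise error rate:   {fwer_corrected:.3f}")
```

With twenty tests the chance of at least one spurious "discovery" is well above one half, and keeping it near 5% requires each single test to clear a threshold twenty times stricter.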

Sadly, few people in the machine learning/AI space have even heard of this. In fact I won a bet (2 Australian dollars) that people approached at random at the ICML (International Conference on Machine Learning) 2017 closing banquet would never have heard of the "multiple hypothesis testing problem".

We cannot really test statistical models on fixed datasets, because it is statistically illegal. But science demands reproducibility, and testing statistical models on a new random sample (new data) is by definition not reproducible. The absolute performance of such models therefore cannot be reproduced; what can be reproduced is a statistical test showing that with a certain probability their performance lies in some interval.
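
As an illustration of what such an interval statement could look like, the sketch below computes a normal-approximation (Wald) confidence interval for accuracy. The counts are hypothetical, the helper accuracy_interval is introduced only for this example, and the interval is meaningful only if the test set is a fresh sample that was not used for any earlier model selection.

```python
import math

def accuracy_interval(n_correct, n_total, z=1.96):
    """Normal-approximation (Wald) interval for a proportion.

    z=1.96 corresponds to roughly 95% coverage.
    """
    p_hat = n_correct / n_total
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n_total)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical result: 9,900 of 10,000 fresh test examples classified correctly.
low, high = accuracy_interval(9_900, 10_000)
print(f"accuracy 99.00%, approximate 95% interval: [{low:.4f}, {high:.4f}]")
```

Another researcher drawing a different fresh sample will not get the same accuracy, but should usually get a number compatible with an interval of this kind.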

Somehow this simple fact is a big surprise to a large number of machine learning researchers obsessed with measuring their models on canonical datasets. Ultimately almost all of these results lack any statistical significance (and are therefore invalid under statistical methodology). They are obviously reproducible, but that does not really matter - it is like having a sorting algorithm that can sort one particular sequence of numbers (and fails on others). Obviously such a result is reproducible, but useless at the same time.

In machine learning this is the proverbial model that achieves 99% performance on MNIST and fails on any new sample of digit data.
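
The bias that comes from picking a winner on a fixed dataset can be simulated directly. In this sketch every number is a hypothetical choice: 50 models that all have exactly the same true accuracy of 90%, scored on the same 1,000 examples, with the winner then re-scored on fresh data. Scores are modeled as independent binomial draws, which is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

true_accuracy = 0.90  # hypothetical: every model is equally good
n_models = 50         # models compared on the same fixed dataset
n_examples = 1_000    # size of the fixed dataset and of the fresh one
n_repeats = 200       # repetitions of the whole comparison

gaps = []
for _ in range(n_repeats):
    # Measured accuracy of each model on the shared, fixed dataset.
    fixed_scores = rng.binomial(n_examples, true_accuracy, size=n_models) / n_examples
    best_fixed = fixed_scores.max()  # the score the "winner" reports
    # The winner's accuracy on a fresh sample is just another independent draw.
    fresh_score = rng.binomial(n_examples, true_accuracy) / n_examples
    gaps.append(best_fixed - fresh_score)

mean_gap = float(np.mean(gaps))
print(f"winner's score on the fixed data exceeds its fresh score by {mean_gap:.3f} on average")
```

None of the models is actually better than the others, yet the reported winning score is systematically optimistic.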


This reproducibility paradox, compounded with the general sensitivity of measures on high dimensional input discussed in the previous post, makes all that measurement obsession look a lot more bleak. Not only are there multiple ways to formally measure a particular "performance", each of them varying substantially in applicability, but most of these models are not even measured according to a proper statistical methodology. Not that proper hypothesis testing on new samples is immune to errors or cheating: the simplest way of data dredging is to keep sampling data until we happen to get a positive result that "looks significant". But that aside, in machine learning no one even bothers going through those motions.

That said, on many occasions such "illegal" evaluations correlate with real performance. Ultimately, ImageNet-winning deep learning models are actually better than their more classical competitors. Also, data competitions typically keep a set-aside secret dataset for unbiased testing to actually make the measurements significant. It is also often noted that the order of competition winners on the validation dataset is different than the final order on this set-aside test set. I am generally fine with that: it is not perfect but it will do, and in fact it is quite sufficient for many applications. What I'm not OK with is acting as if all the problems discussed above didn't exist and the measurements reported in these papers were some God-given unquestionable truths and the only possible methodology. It is just one, imperfect methodology, prone to fatal flaws.
