
In principle, here is how it works in statistics: you assume there is a distribution which yields samples. In the case of semantic segmentation, you assume there is a process that generates data, and that the data can be annotated in a particular way. To simulate sampling from that distribution, a dataset is created and labeled. This dataset is a finite sample from that distribution. By measuring the performance of a model, we are effectively testing a statistical hypothesis. The hypothesis is that, given a NEW sample of data, the model will perform at a set level with some high probability. It is rarely formulated this way, because these models and distributions are extremely complex, so there is no way of formally proving hypotheses of this type. Instead, an empirical cheat is employed: the model is evaluated on a set-aside portion of the data, and from that it is assumed it will work roughly the same on any other unseen data. Strictly speaking that is a violation of statistics, but in practical (engineering) terms it is assumed to be fine. And it is fine, as long as everyone remembers that this is just an approximation, a rough estimate of the actual performance of the model.
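
To sketch what that rough estimate actually buys us (all numbers below are invented for illustration, not from any real benchmark): the held-out test accuracy is just a point estimate, and a concentration bound such as Hoeffding's inequality quantifies how far it can plausibly stray from the true accuracy on the distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a model's per-example correctness on a held-out test
# set, simulated here as Bernoulli draws with an assumed true accuracy of 0.85.
true_accuracy = 0.85
n_test = 1000
correct = rng.random(n_test) < true_accuracy

# The test-set accuracy is only an *estimate* of the true accuracy.
test_accuracy = correct.mean()

# Hoeffding's inequality: with probability >= 1 - delta,
# |estimate - truth| <= sqrt(ln(2/delta) / (2 * n_test)).
delta = 0.05
margin = np.sqrt(np.log(2 / delta) / (2 * n_test))

print(f"test accuracy: {test_accuracy:.3f} +/- {margin:.3f} (95% Hoeffding bound)")
```

Note the bound only holds for a test set used once, on a model chosen independently of it; reuse, as discussed below, quietly voids it.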

Things get more pathological when we keep reusing that single sample of data and claim valuable scientific progress by marginally improving some measure of performance defined on that single finite sample.

First of all, due to the inherently finite nature of the sample, knowledge about the test portion leaks into the parameters of the model. This happens either directly (explicit hyper-parameter search) or indirectly (researchers re-running their experiments and tweaking models to obtain the best performance). This leads to a situation in which the model works well on, say, Cityscapes (trained on one portion of that dataset and evaluated on another), but fares much worse on NEW data which may resemble Cityscapes and arguably comes from the same underlying distribution, but is a new, as-yet-unseen sample.
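
That selection effect is easy to simulate (a toy sketch with made-up numbers, not any real benchmark): give many "tweaks" of a model identical true accuracy, pick the one that scores best on a fixed, reused test set, and the apparent gain vanishes on fresh data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulation: 200 tweaks of a model that all have the SAME true
# accuracy (0.70) on the underlying distribution. Each row is one tweak's
# per-example correctness on a fixed, finite test set.
true_acc, n_test, n_tweaks = 0.70, 500, 200
reused_test = rng.random((n_tweaks, n_test)) < true_acc

# Researchers keep the tweak that scores best on the reused test set...
scores = reused_test.mean(axis=1)
best = scores.argmax()
print(f"best tweak on reused test set: {scores[best]:.3f}")  # inflated above 0.70

# ...but on a NEW sample from the same distribution, the edge evaporates.
fresh_test = rng.random(n_test) < true_acc
print(f"same tweak on fresh data:      {fresh_test.mean():.3f}")  # back near 0.70
```

The inflation is pure selection bias: the "best" tweak is best partly because it got lucky on that particular finite sample.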

Again going back to the random paper I cited: they found that adding batch norm here and there improves things on Cityscapes, but that might be bogus; it might only work on that finite sample and fail to generalize to new data. The literature is full of examples of generalization failures, e.g. from ImageNet to Pascal and so on. The paper https://arxiv.org/pdf/1806.00451.pdf shows this explicitly, by simulating a NEW sample from a CIFAR-10-like distribution.

In the most extreme case, a model can simply memorize the entire finite sample (dataset), achieve 100% performance on it, and be a complete failure on any new data. This is particularly a problem for large models with millions of parameters, such as deep networks, which have been shown to be capable of memorizing a LOT (https://arxiv.org/abs/1611.03530).
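
A minimal illustration of pure memorization, using a lookup table as the limit case of an over-parameterized model and entirely synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical extreme: the inputs carry NO information about their labels
# (labels are pure noise), yet a high-capacity "model" -- here literally a
# lookup table -- fits the training set perfectly.
X_train = rng.random((1000, 10))
y_train = rng.integers(0, 2, 1000)          # random labels: nothing to learn

memorized = {x.tobytes(): y for x, y in zip(X_train, y_train)}

def predict(x):
    # Return the memorized label if we have seen this exact input, else guess 0.
    return memorized.get(x.tobytes(), 0)

train_acc = np.mean([predict(x) == y for x, y in zip(X_train, y_train)])

X_new = rng.random((1000, 10))              # fresh samples, same distribution
y_new = rng.integers(0, 2, 1000)
new_acc = np.mean([predict(x) == y for x, y in zip(X_new, y_new)])

print(f"train accuracy:    {train_acc:.2f}")  # perfect on the memorized sample
print(f"new-data accuracy: {new_acc:.2f}")    # roughly chance level
```

Deep networks are of course not lookup tables, but the cited paper shows they can behave like one when given random labels, which is the point.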

So the bottom line: using a test set is fine as long as we treat it as a rough approximation of actual performance, not an ultimate oracle. Datasets should not be beaten to death for tens of years, because knowledge inherently leaks into models in the form of architecture decisions and hyper-parameters. Chasing an extra 1% on some well-known benchmark makes no sense from a scientific (statistical) point of view, and makes little sense from a practical one. In both cases, performance on new, unseen data is what really matters. Benchmark chasing is not science; it more resembles a sports competition, which I guess is also one of the points of this post.

Having no controlled variables is generally impossible. The open question is which variables should and should not be controlled in any given experiment, and that should be assessed on a case-by-case basis according to our scientific knowledge.

In the paper you posted, they have batch normalization as a controlled variable when comparing several image segmentation techniques. Now we need to ask: is there a reasonable expectation that turning batch normalization off would change the comparison of these methods? Since they are quite similar and even use the same optimization algorithm, it is unlikely. Why should we test for this variable then? What is the point? How would it help us compare the segmentation techniques any better?

You see, if the performance on the validation and test portions is too different, a new dataset split is generated and a new hypothesis is tested. Only the ones that work well get reported, and that is what you read in papers and take to be legitimate (the bad experiments, of which there are plenty, are not talked about). This entire field is data snooping at a massive scale. Again, the paper https://arxiv.org/abs/1806.00451 shows where that leads.
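
A toy simulation of that reporting filter (all numbers invented): if every method has the same true accuracy and only benchmark scores that beat the current "state of the art" get written up, the published numbers are biased upward even though no method is actually better:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical: 100 groups each test an idea with NO real effect -- every
# method's true accuracy on the distribution is 0.75. Each benchmark score
# is that method's accuracy on the same finite test set.
true_acc, n_test, n_groups = 0.75, 400, 100
benchmark_scores = (rng.random((n_groups, n_test)) < true_acc).mean(axis=1)

# Only results that beat the assumed state of the art get reported.
sota = 0.77
published = benchmark_scores[benchmark_scores > sota]

print(f"true accuracy of every method: {true_acc:.3f}")
print(f"mean of published results:     {published.mean():.3f}")  # biased upward
```

The filter alone manufactures "progress": the literature shows only the lucky draws from the same finite sample.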

Why was batch normalization incorporated? Why is the number of groups used in ShuffleNet 3? Why was there L2 regularization of 5e-4? Why not 5e-5? Do you think a fairy came to them and gave them those numbers? No. They ran many versions and this is what they chose. Or perhaps not them; perhaps somebody they cite did. Bottom line: all of these models, as they evolve, see the datasets they are being optimized on, and the whole split into validation and testing is a red herring. One of the few papers that explores the result of this meta-overfitting is https://arxiv.org/abs/1806.00451.
