Deep learning - the "why" question.

There are many, many deep learning models out there doing various things. Depending on the exact task they are solving, they may be constructed differently. Some will use convolution followed by pooling. Some will use several convolutional layers before there is any pooling layer. Some will use max-pooling. Some will use mean-pooling. Some will have dropout added. Some will have a batch-norm layer here and there. Some will use sigmoid neurons, some will use half-rectifiers. Some will classify and therefore optimize for cross-entropy. Others will minimize mean-squared error. Some will use unpooling layers. Some will use deconvolutional layers. Some will use stochastic gradient descent with momentum. Some will use Adam. Some will have ResNet layers, some will use Inception. The choices are plentiful (see e.g. here).

Reading any of these particular papers, one is faced with a set of choices the authors have made, followed by an evaluation on the dataset of their choice. The discussion of the choices typically leans heavily on the papers where the given techniques were first introduced, whereas the results section typically discusses the previous state of the art in detail. The shape of the architecture can usually be broken down into obvious and non-obvious decisions. The obvious ones are dictated by the particular task the authors are trying to solve (e.g. when they have an autoencoding-like task, they obviously use some form of autoencoder).

The non-obvious choices raise questions such as these: Why did they use 3x3 conv followed by 1x1 conv and only then pooling? Why did they replace only the 3 middle layers with MobileNet layers (ridiculous name BTW)? Why did they slap batch-norm only on the middle two layers and not on all of them? Why did they use max-pooling in the first two layers and no pooling whatsoever in the following three?

The obvious stuff is not discussed because it is obvious; the non-obvious stuff is not discussed because ... let me get back to that in a moment.

In my opinion, discussing these questions is what separates a paper that is at least shallowly scientific from complete charlatanry, even if the charlatanry appears to improve the results on the given dataset.

The sad truth, which few even talk about, is that in the vast majority of cases the answers to the why questions are purely empirical: they tried a bunch of models and these worked best. It is called "hyperparameter tuning" (or meta-parameter tuning). What does that tell us? A few things. First, the authors are completely ignoring the danger of multiple hypothesis testing and generally piss on any statistical foundations of their "research". Second, they probably have more GPUs accessible than they know what to do with (very often the case in big companies these days). Third, they just want to stamp their names on some new record-breaking benchmark result, which will obviously be broken two weeks later by somebody who takes their model and does some extra blind tweaking, utilizing even more GPU power.
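To make the multiple-testing point concrete, here is a minimal sketch in Python. It is a toy simulation, not taken from any paper: every "configuration" has exactly the same true accuracy, and the only thing that differs between them is the noise of scoring them on a finite test set. The numbers (a true accuracy of 0.70, a test set of 1000 examples) and all function names are assumptions picked purely for illustration.

```python
import random


def measured_accuracy(rng, true_accuracy, test_size):
    """Accuracy observed on a finite test set for a model whose true
    accuracy is fixed. Each test example is an independent coin flip."""
    correct = sum(rng.random() < true_accuracy for _ in range(test_size))
    return correct / test_size


def best_of_k(rng, k, true_accuracy=0.70, test_size=1000):
    """Score k equally good configurations on the same-sized test set
    and report only the best one, as a blind search would.
    The default values are illustrative assumptions."""
    return max(measured_accuracy(rng, true_accuracy, test_size) for _ in range(k))


def average_best(k, trials=50, seed=0):
    """Average the reported 'best' score over many repeated searches."""
    rng = random.Random(seed)
    return sum(best_of_k(rng, k) for _ in range(trials)) / trials


if __name__ == "__main__":
    # No configuration is actually better than 0.70 here; any gain in
    # the reported score comes from picking the luckiest draw.
    for k in (1, 10, 100):
        print(f"configs tried: {k:3d}  average reported best: {average_best(k):.4f}")
```

With these assumed numbers the standard error of a single measurement is sqrt(0.7 * 0.3 / 1000), roughly 1.4 percentage points, so the best of many tries tends to land noticeably above 0.70 even though nothing has improved. Real tuning differs from this toy setup (the configurations are not equally good, and their scores are correlated), but the selection effect is the same in kind.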

This is not science. It has more in common with people who build beefy PCs and submit their 3DMark results to hold a record for a few days. It is a craft, no doubt, but it is not science. The PC builders don't pretend that what they do is science. The deep learning people do. They write what appear to be research papers, just to describe their GPU rig and the result of their random meta-parameter search, with perhaps some tiny shreds of real scientific discussion. Benchmark results provide a nice cover for claiming that the paper is in some way "novel" and interesting, but the truth of the matter is, they just overfitted that dataset some more. They might just as well memorize the entire dataset in their model and achieve 100% accuracy, who cares? (Read my AI winter addendum post for some interesting literature on the subject.)

Similarly to the difference between chemistry and alchemy, the scientific discussion is about building a concept, a theory that will enable one to make accurate predictions. Something to guide their experimental actions. Science does not need to make gold out of lead every time, or in the case of machine learning, a real scientific paper in this field does not need to beat some current benchmark. A scientific paper does not even need to answer any questions, if it happens to ask some good ones.

Now obviously there are exceptions: a small fraction of papers have interesting stuff in them. These are mostly the ones which try to show the deficits of deep learning and engage in a discussion as to why that might be the case.

So next time you read a deep learning paper, try to contemplate these quiet and never explained choices the authors have made. You'll be shocked to see how many of those are hidden between the lines.
