While rereading my recent post [the meta-parameter slot machine], as well as a few papers suggested by readers in the comments, I realized several things.
On the one hand we have Occam's Razor: prefer the simplest model that explains the data. On the other hand we know that in order to build intelligence, we need to create a very complex artifact (namely something like a brain) that has to contain lots of memories (parameters). There is an inherent conflict between these two constraints.
Many faces of overfitting
If we have a model too complex for the task, we often find it will overfit, since it has the capacity to "remember the training set". But things may not be so obvious in reality. For example, there is another, counterintuitive situation where overfitting may hit us: the case where the model is clearly too simple to solve the task we have in mind, but the task as specified by the dataset is actually much simpler than what we had originally thought (and intended).
Let me explain this counterintuitive case with an example (an actual anecdote I heard, as far as I remember, from Simon Thorpe):
Figure 1. Example of bokeh around a picture of an animal.
Researchers were building a biologically inspired system for image classification. They used a dataset with pictures of animals and non-animals for binary classification. After training, it turned out that their relatively simple classifier achieved 100% performance. Clearly overfitting, but how was this possible, given that the task (perceiving an animal) requires fairly advanced cognition? Eventually it occurred to them that there was a bias in the dataset which the classifier had exploited: all the pictures of animals were taken with a telephoto lens and centered. A simple feature, a smeared image in the surround (telephoto bokeh) and a sharp center, was sufficient to separate the classes and allowed the classifier to overfit on that spurious artifact. The task it actually solved, distinguishing images taken with a telephoto lens from those taken with a regular lens, had nothing to do with the task originally envisioned by the researchers.
I think this anecdote (I believe it is a true story, I just can't find any printed reference to it) is very telling. It illustrates that it is equally important, if not more so, to consider carefully the task we ask a machine learning system to solve as it is to consider the model itself. How many such spurious correlations are available in the ImageNet dataset? Since the input dimension is huge, there could be plenty of them, and perhaps more complex combinations of spurious relations. Obviously there is a fuzzy border: some of these spurious, coincidental relations blend into actual semantic relations, such as "a bird is an object often seen against a blue background (sky)". Consequently it is not obvious that all such relations are wrong per se (such as the discovery of bokeh in the anecdote above). However, they do not represent the semantics we wanted to convey by creating the dataset, or the ones we would wish a system to discover (such as "an animal is something with a pair of eyes and a mouth", etc.). As a result, such systems may generalize in bizarre directions: deep convolutional nets generalize over funny colorful patterns while failing to recognize a sketch of a cat. Somewhat analogously, reinforcement learning agents may exploit the reward function if it is not defined really carefully (see e.g. a post by OpenAI).
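The failure mode in the anecdote can be sketched as a toy experiment. Everything here is hypothetical: a single scalar "surround blur" value stands in for the bokeh cue, and a fixed threshold stands in for the classifier:

```python
import numpy as np

rng = np.random.default_rng(0)

def surround_blur(mean, n):
    # Hypothetical scalar feature: how blurred the image surround is.
    # Telephoto shots get a high value, wide-angle shots a low one.
    return rng.normal(loc=mean, scale=0.1, size=n)

# Biased training set: every animal (label 1) was shot with a telephoto lens.
X_train = np.concatenate([surround_blur(1.0, 100),   # animals: blurry surround
                          surround_blur(0.0, 100)])  # non-animals: sharp surround
y_train = np.concatenate([np.ones(100), np.zeros(100)])

# "Classifier": a threshold on the single spurious feature.
threshold = 0.5
train_acc = np.mean((X_train > threshold) == y_train)

# De-biased test set: wide-angle animals and telephoto non-animals,
# so the bokeh cue no longer tracks the label.
X_test = np.concatenate([surround_blur(0.0, 100),   # wide-angle animals
                         surround_blur(1.0, 100)])  # telephoto non-animals
y_test = np.concatenate([np.ones(100), np.zeros(100)])
test_acc = np.mean((X_test > threshold) == y_test)

print(train_acc)  # 1.0 on the biased set: the spurious cue separates it perfectly
print(test_acc)   # 0.0 once the bias is broken: the cue now points the wrong way
```

The point is not the numbers but the mechanism: 100% accuracy on the biased set tells us nothing about whether the intended task was learned.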
Figure 2. A conceptual quad chart roughly mapping out task specificity and model complexity.
Simple ways to defeat complexity
How can we defeat such spurious discoveries? Well, clearly more data would help (in the case of the anecdote: add telephoto pictures of non-animals and wide-angle shots of animals to break the bias). But in the case of vision (and any other task where the input dimension is huge) this may not be sufficient. The problem is the huge imbalance between the dimensionality of the input and that of the label. The label simply does not provide enough constraints on the parameters, leaving many ways to separate the dataset, and the system chooses the separation that happens to be most statistically significant. Nothing else should be expected from a system optimized on a statistical measure. In addition, biases are often introduced unconsciously by the people who create the dataset, through various selections, e.g. by removing misfocused or motion-blurred images.
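The imbalance is easy to demonstrate with a small sketch (the numbers are illustrative): a linear model with far more parameters than labeled examples can fit even completely random labels exactly, because the labels impose too few constraints to pin the parameters down.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 50, 1000                        # 50 labeled examples, 1000-dimensional input
X = rng.normal(size=(n, d))            # random "images"
y = rng.choice([-1.0, 1.0], size=n)    # random labels: no real structure to learn

# Least-squares linear model: d parameters constrained by only n scalar targets.
# The system is under-determined, so an exact fit (almost surely) exists.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

train_error = np.max(np.abs(X @ w - y))
print(train_error)  # ~0: the under-constrained model "explains" pure noise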
However, by solving the more general problem of physical prediction (to distinguish it from statistical prediction), the input and the label become completely balanced and the problem of human selection disappears altogether. The label in that case is just a time-shifted version of the raw input signal. More data means more signal, which means a better approximation of the actual data manifold. And since that manifold originated in physical reality (no, it has not been sampled from a set of independent and identically distributed Gaussians), it is no wonder that using physics as the training paradigm may help to unravel it correctly. Moreover, adding parameters can be balanced out by adding more constraints (more training signal). This is exactly what happens in a hierarchical predictive system such as the Predictive Vision Model (PVM).
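A minimal sketch of this idea, assuming a toy 1-D "video" whose frames translate at constant speed (this only illustrates time-shifted self-supervision, not the PVM itself): the target is the next frame, so every output dimension constrains the model at every time step.

```python
import numpy as np

# Toy raw signal: frames of a 1-D "video" generated by simple physics,
# here a periodic wave translating one pixel per step.
T, D = 200, 64
t = np.arange(T)[:, None]
x = np.arange(D)[None, :]
frames = np.sin(2 * np.pi * (x - t) / D)

# Self-supervised targets: the label is just the time-shifted input itself,
# so input and label dimensionality are identical (D constraints per step).
inputs, targets = frames[:-1], frames[1:]

# Minimal linear next-frame predictor (a stand-in for a real predictive model).
W, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
pred_error = np.mean((inputs @ W - targets) ** 2)

print(inputs.shape, targets.shape)  # same shape: label is as rich as the input
print(pred_error)                   # ~0: the dynamics here are linear, so a
                                    # linear predictor suffices for this toy
```

No human labeling or curation enters anywhere: the training signal is manufactured from the raw stream itself, and it grows in proportion to the data.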
That way, we should be able to build a very complex system with billions of parameters (memories) that nevertheless operates on a very simple and powerful principle. The complexity of the real signal and the wealth of high-dimensional training data may prevent it from ever finding "cheap", spurious solutions. But the price we have to pay is that we need to solve a more general and complex task, one which may not easily and directly translate into anything of practical importance, at least not instantly. Consequently there may be a time when such approaches do not achieve results as good as the brittle, overfitting, but sexy franken-models (think of a crappy but general dot-matrix printer vs. the printing press, for an arguably imperfect analogy). This is a longer-term game, but a game we have to make bets on if we are really serious about solving AI, at least enough to build smart and robust robots.