Yann LeCun, the inventor of convolutional networks, gave a talk at the CMU Robotics Institute, which was conveniently recorded and made available to the general public here:
Although the talk is over 1h long, it is certainly worth watching and I strongly recommend doing that before you read any of the following text.
After the lecture
Yann LeCun is a rather colourful character and certainly has strong opinions on many subjects. I find myself at any given time either strongly agreeing or strongly disagreeing with him, and it's no surprise it is the same this time around. Anyway, he makes several points in his talk which I think are relevant to our published work on PVM (see the PVM paper for details) and worth a more detailed comment.
- After a brief overview of the state of the art in machine learning and AI, LeCun goes on to talk about more cutting edge stuff. He notes that the next important frontier for AI is learning "Forward Models" via prediction, learning "folk physics" so to speak (a.k.a. common sense). He observes that reinforcement learning has a very weak learning signal when rewards are sparse, which effectively restricts its use to games, where one can generate all the data needed. He concludes that direct use of reinforcement learning in the real world, which cannot be run "faster than real time", is not feasible.
- He presents the work done at Facebook on predicting frames of video. He notes that when the predictive system is trained to minimise the mean square error, the resulting predictions get blurry very quickly because of the inherent unpredictability of reality. He therefore proposes adversarial training (yes, it is related to adversarial examples), in which the network is not only trained to predict future frames but also to fool another network that tries to guess whether an image is real or generated by the first network. This skews the optimisation criterion (adversarial loss) towards generating more "real looking" images.
Now there are a few important things to note here:
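The blurring effect is easy to reproduce on a toy example. The sketch below is my own illustration, not anything from the talk or the Facebook paper: it assumes a hypothetical 9-pixel 1-D "frame" in which an object is equally likely to move one pixel left or one pixel right. Under mean square error, the best single prediction is the pixel-wise average of the two futures, i.e. a smeared object that matches neither.

```python
import numpy as np

def make_frame(position, width=9):
    """Return a 1-D 'frame' with a single bright pixel at `position`."""
    frame = np.zeros(width)
    frame[position] = 1.0
    return frame

def expected_mse(prediction, futures):
    """Mean squared error averaged over all (equally likely) futures."""
    return float(np.mean((futures - prediction) ** 2))

# Assumption: two equally likely futures, object moves left or right.
future_left = make_frame(3)
future_right = make_frame(5)
futures = np.stack([future_left, future_right])

# The prediction that minimises expected MSE is the pixel-wise mean.
blurry = futures.mean(axis=0)

print("blurry prediction:      ", blurry)
print("MSE of blurry prediction:", expected_mse(blurry, futures))
print("MSE of committing left:  ", expected_mse(future_left, futures))
```

The averaged frame scores a lower error than committing to either sharp outcome, which is exactly the incentive that produces blurry video predictions.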
The mean square optimisation of the prediction collapses into blurry stuff because of the curse of dimensionality - mean square error distance in the space of pixels loses any meaning in a very high dimensional space. Doing it all at once is just too much for a mere feedforward mapping.
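One way to get a feel for this is the well-known concentration of distances in high dimensions. The sketch below rests on a strong assumption: the points are independent uniform random vectors, which real images certainly are not. The function name `relative_contrast` is just my own label. It measures how much further the farthest point is from a query than the nearest one, using root mean square error as the distance.

```python
import numpy as np

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest RMS distance from a random query.

    Assumption: points and query are i.i.d. uniform in [0, 1]^dim.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    # Root mean square error between the query and every point.
    dists = np.sqrt(np.mean((points - query) ** 2, axis=1))
    return float((dists.max() - dists.min()) / dists.min())

for dim in (2, 100, 10_000):
    print(f"dim={dim:>6}  relative contrast={relative_contrast(dim):.3f}")
```

As the dimension grows the contrast shrinks towards zero: every point ends up at roughly the same distance from the query, so "closest in the mean square sense" carries less and less information.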
The approach we took in PVM is qualitatively different - the prediction (not just representation) is, itself, hierarchical. The reason (hypothesised) why this works much better is the following: the world is regular at distinct scales but the interaction between scales is highly irregular. In other words: elementary visual features are relatively predictable. The fate of an object is relatively predictable. But the fate of objects represented as elementary visual features appears highly unpredictable for learning algorithms, exactly because they are made to jump between multiple scales and across representations (increasing the dimensionality of the representation massively). Consequently the data manifold becomes incomprehensible and the training signal is too weak to uncover its true shape. Further consideration of this multi-scale semi-discontinuity suggests that the data manifold is a fractal with a reasonably regular structure at any individual scale (like e.g. a Mandelbrot set), but in its full complexity it is obviously horrible to approximate.
LeCun and colleagues, in their paper, use a multi-scale network and reconstruct the image from a Laplacian pyramid, so in some sense this is multi-scale. But in contrast with PVM the higher scales must still operate in the pixel space (only down-sampled). In PVM, prediction is done at many scales, but each level predicts the re-encoded representations of the underlying level, not the down-sampled pixels. Consequently each scale has its relevant representation, dynamics and training signal. Each next layer takes advantage of the already digested prediction from underlying layers and operates on relevant abstractions. For example a higher level feature can represent an object moving in a particular direction without the need to maintain all the details of its illumination and other low level features. Rather, this higher level can focus on the stuff relevant to its scale (not just visual scale but more abstract scale). Now obviously in order to recover PVM's prediction at the pixel level we have to use signal from the first layer only, but because that first layer has feedback from many other scales it can reconstruct the "higher level" picture and does a very good job of it (I'll put out some examples soon, just need to regenerate them).
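To make the "still in pixel space" point concrete, here is a minimal sketch of a Laplacian-style pyramid. It is a simplification and not the code from their paper: I assume 2x2 block averaging for down-sampling and pixel repetition for up-sampling, where a real Laplacian pyramid would use Gaussian filtering. All function names are my own.

```python
import numpy as np

def downsample(image):
    """Halve resolution by averaging non-overlapping 2x2 blocks."""
    h, w = image.shape
    return image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(image):
    """Double resolution by repeating each pixel (nearest neighbour)."""
    return image.repeat(2, axis=0).repeat(2, axis=1)

def build_laplacian_pyramid(image, levels):
    """Return [detail_0, ..., detail_{levels-1}, coarse_residual]."""
    pyramid = []
    current = image
    for _ in range(levels):
        coarse = downsample(current)
        # Detail band: what is lost when going one scale coarser.
        pyramid.append(current - upsample(coarse))
        current = coarse
    pyramid.append(current)  # coarsest image, still plain pixels
    return pyramid

def reconstruct(pyramid):
    """Invert the pyramid by up-sampling and adding detail bands back."""
    current = pyramid[-1]
    for detail in reversed(pyramid[:-1]):
        current = upsample(current) + detail
    return current

rng = np.random.default_rng(0)
image = rng.random((16, 16))  # stand-in for a video frame
pyramid = build_laplacian_pyramid(image, levels=3)
restored = reconstruct(pyramid)

print([level.shape for level in pyramid])
print("exact reconstruction:", np.allclose(restored, image))
```

Every level here is an array of pixel intensities at a lower resolution. Nothing is re-encoded into a different representation as you go up, which is the difference from PVM that I am pointing at.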
Yann LeCun is very excited about adversarial training. I'm much less so, since I think it is just "patching holes" in an approach that is fundamentally flawed for general purpose vision - effectively tweaking the evaluation metric (loss function) to punish blurry predictions and force the system into making up signal that is not there to begin with. This is a bit like inserting the prior: "does not have to be physical but has to look feasible".
There are several other important comments worth making on this work.
- Evaluating predictive power is tricky, as a model trained in a simple and restricted environment will be good at predicting that environment. Put that same model in a new environment and it will fare much worse. A model trained on a very broad and complex environment will suck at predicting both that environment and many others. So there is a subtle game to play here to graduate these models into more complex worlds without putting them straight into deep water and expecting miracles - much like you don't ask a child to operate a 747 - evaluation criteria have to progress along with the models and their capacity.
- LeCun mentions the use of a forward model to facilitate reinforcement learning, while commenting that the perception part is solved. I dare to disagree with that: perception is far from being solved. In fact, having a forward model is crucial for perception, e.g. perception of things which are temporarily hidden, ambiguous (e.g. mirrors) etc. Forward models are an important part of motor control in robotics (typically they represent the robot itself, not the entire world). There are far more uses and advantages in having a good forward model than just simulating paths to reward in reinforcement learning.
- Another thing that was not addressed at all in this talk is the issue of scalability of learning. Gradient-based methods with a single source of error (end-to-end training) are inherently unscalable. And I'm not talking about scaling them to a huge GPU. GPUs themselves are not massively scalable. I'm talking about scaling the single instance model by a few orders of magnitude.
- I like the answer to the final question about how far we are from a general AI. Yann states clearly that unsupervised, predictive forward model building is the hill we are facing right now and there may be hills behind it that we cannot yet see. I'd like to add to that: this is the hill we've always been seeing (at least since Moravec stated his paradox and the Nouvelle AI movement). What we've done so far is merely climb a few boulders, so we could see that hill more clearly.
So in general I think Yann LeCun definitely notices the right problems and I strongly agree with his motivations. At the same time I strongly disagree with his solutions, particularly that adversarial training will get him very far. Just my typical reaction to Yann LeCun's stuff.