Caution: due to a large number of animations this post may take a while to load (depending on your connection speed), please be patient and don't reload unless necessary. The animations will likely load before you read the text.
Scalability in Machine Learning
Scalability is a word with many meanings and can be confusing, particularly when applied to machine learning. For me the meaning of scalability is the answer to this question:
Can an instance of the algorithm be practically scaled to larger/parallel hardware and achieve better results in approximately the same (physical) time?
That is different from the typical understanding of data parallelism, in which multiple instances of an algorithm are deployed in parallel to process chunks of data simultaneously. An example of instance scalability in the sense defined above is computational fluid dynamics (CFD). Aside from the need to obtain better initial conditions, one can run the fluid dynamics on a finer grid and achieve better (more accurate) results. Obviously this requires more compute, but generally the increase in complexity can be offset by adding more processors (there are some subtleties related to Amdahl's law and synchronisation). For that reason, most of the world's giant supercomputers are doing fluid dynamics (simulating either nuclear explosions or weather).
There are very few algorithms that have such strong scaling properties. Deep learning in the form of end-to-end perceptron training is certainly not one of them. As a general rule of thumb: the algorithm needs to be divisible into small blocks, each working mainly with local data and communicating very little with the others. The algorithm should not rely on any global signal, unless that signal can be delivered with a delay. Otherwise bottlenecks are created in the execution time, which contribute to the sequential part in Amdahl's law, quickly limiting the number of processors on which the algorithm can be efficiently run.
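To make the Amdahl's law argument concrete, here is a minimal sketch of the standard formula, speedup = 1 / ((1 - p) + p / N), where p is the fraction of the work that can run in parallel and N is the number of processors. The function name and the 95% parallel fraction are illustrative assumptions, not measurements of any particular algorithm.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a fixed-size problem."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be >= 1")
    sequential_fraction = 1.0 - parallel_fraction
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


# Illustrative value only: with a 5% sequential part the speedup can never
# exceed 1 / 0.05 = 20x, no matter how many processors are added.
for n in (10, 100, 1000, 10000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

Even a small sequential part (for instance a global signal that every block has to wait for) puts a hard ceiling on the useful number of processors.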
Aside from that, the semantics of the algorithm itself may contribute to the lack of scalability. Take multilayer perceptron training again, for instance: the larger/deeper the network, the weaker the gradients. Consequently, by adding more neurons/layers one extends the time it takes for the network as a whole to converge to any reasonable solution. In the case of multilayer perceptrons that extension is approximately exponential in the number of layers, since the decay of the gradient is itself exponential. As a result, even with powerful GPUs, the deepest practically trainable networks are on the order of 20 layers deep.
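A toy calculation shows how quickly an exponential decay bites. This is only a sketch under a strong simplifying assumption: every layer scales the backpropagated gradient by the same constant factor. The value 0.25 is the maximum derivative of the logistic sigmoid and is used here purely as an illustration; in a real network the factor depends on the weights and activations.

```python
def gradient_scale(per_layer_factor: float, num_layers: int) -> float:
    """Magnitude of a unit gradient after passing back through num_layers
    layers, assuming each layer multiplies it by the same constant factor."""
    return per_layer_factor ** num_layers


# Assumed factor of 0.25 per layer (illustrative, not measured).
for depth in (5, 10, 20, 40):
    print(depth, gradient_scale(0.25, depth))
```

Under this assumption, doubling the depth squares the attenuation, which is why adding layers costs far more than a proportional amount of extra training.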
I should make a note here on the RESNET architecture (residual networks), where instances are claimed to have 150 layers or so. These networks are actually not that deep, since the additional layers are added as bypass connections. Ultimately, the great majority of the gradient in those networks originates from the short connections. See this paper for interesting insights into the contribution of gradients in RESNETs. In summary, the following is a short piece from the conclusions:
(...) Finally, we show that the paths through the network that contribute gradient during training are shorter than expected. In fact, deep paths are not required during training as they do not contribute any gradient. Thus, residual networks do not resolve the vanishing gradient problem by preserving gradient flow throughout the entire depth of the network. This insight reveals that depth is still an open research question. (...)
So deep learning is not scalable in the sense of my strong definition above, even with the RESNET trick. There obviously remains data parallelism, in which case one can process separate batches of data on instances running on different GPUs and speed up training, but it is by no means possible to take, say, the VGG16 network, multiply the number of layers by 10x, multiply the number of feature maps by 10x, multiply the size of each layer by 10x, throw it at 1000 GPUs and expect the thing to train successfully. Aside from the problems of efficiently implementing the parallel execution, the thing would need ~exp(1000) more data and training iterations, which is not possible.
This is part of the reason why the AlexNet and VGG16 architectures, which are now a few years old, are still the base architectures for many applications, while the biggest deep learning instances trained today are at most one order of magnitude larger.
Cortex is different
Whatever the brain is doing appears to be different from existing machine learning. Whether we assume that the basic unit of computation is a neuron or not (I'm frankly not convinced of that), the solution that biology has found is certainly scalable. The volume of the mouse brain is roughly 500 cubic millimetres. The volume of the human brain is roughly 1 500 000 cubic millimetres. Volume-wise that is 3000x larger! If we believe it is the cortical area that matters (since white matter is mostly fibres), then the rat's brain has ~6 cm2 of it, while the human brain has ~2500 cm2, so more than a 400x factor. By a simple extrapolation, if we assume that VGG16 with its 138M parameters is approximately a mouse brain (which it, functionally speaking, isn't; a mouse is a lot more capable), then to reach a human-like size we would need to train between 55 and 400 billion parameters. A rat lives for a few years and learns the majority of what it needs in, say, its first 3 months; for a human let's say it is the first 20 years, so we have roughly 80x the training time. This is linear in the number of parameters! We are doing something wrong with our end-to-end training!
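The back-of-the-envelope numbers above can be reproduced in a few lines. The sketch uses only the rounded figures quoted in the text (a 400x factor by cortical area, a 3000x factor by volume, 138M parameters for VGG16); the variable names are mine.

```python
VGG16_PARAMS = 138e6   # parameter count quoted in the text
AREA_FACTOR = 400      # rounded cortical-area ratio (human vs rat)
VOLUME_FACTOR = 3000   # brain-volume ratio (1 500 000 mm^3 / 500 mm^3)

params_by_area = VGG16_PARAMS * AREA_FACTOR      # lower estimate
params_by_volume = VGG16_PARAMS * VOLUME_FACTOR  # upper estimate

# Training time: 20 years for a human vs 3 months for a rat, in months.
training_time_ratio = (20 * 12) / 3

print(f"by area:   {params_by_area / 1e9:.1f} billion parameters")
print(f"by volume: {params_by_volume / 1e9:.1f} billion parameters")
print(f"training time ratio: {training_time_ratio:.0f}x")
```

So a 400x to 3000x increase in size is matched by only about an 80x increase in training time.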
PVM is also different
Similarly to the cortex, the Predictive Vision Model (PVM) is also scalable in the strong sense. One can vary the size of the instance, number of layers etc. and, aside from the need for more compute resources, the convergence time in the number of training steps remains in the same range. The scaling seems roughly linear in the number of layers (but that is just from my observation and needs to be more thoroughly validated), and it is certainly below exponential. Importantly, aside from tweaking the scale parameters, there is typically no need to modify any other meta-parameters of the model, unlike with deep nets, which require a massive amount of tuning. To substantiate my claims, take a look at the examples below:
In the previous post I've shown a small instance of PVM doing a sequence recall (dreaming) and filling in of an occluder. Below is a video of that small instance running at ~2mln steps into training.
Video 1. Small PVM model at ~2mln steps. The frames in the top row are, from left: input signal (visual), internal compressed activations of the subsequent layers. Second row: consecutive predictions; the first layer predicts the visual input, the second layer predicts the first layer activations and so on. Third row is the error (difference between the signal and prediction). Fourth row is the supervised object heatmap (here unused). Fifth row: prediction of one frame ahead in the future. Rightmost column: various tracking visualisations (here unused).
The PVM model used in the above video has the following structure (json file with description available here):
- Input layer 12x12 PVM units, each processing 5x5 RGB patch, consequently input image is 60x60 pixels RGB
- Second layer 8x8 PVM units, each processing 14x14 patch from the previous layer (2x2 units each with 7x7 internal representation).
- Third layer 6x6 PVM units, each processing 14x14 patch from the previous layer
- Fourth layer 4x4 PVM units
- Fifth layer 2x2 PVM units
- Sixth layer 1 PVM unit
All together 6 layers with 265 PVM units (each is essentially a perceptron shaped in an auto-encoder like hourglass, as described here)
Let us now try to scale this up. I'll now make a model with the following parameters (json file with description available here):
- Input layer of 20x20 PVM units
- Second layer 16x16 PVM units
- Third layer 12x12 PVM units
- Fourth layer 8x8 PVM units
- Fifth layer 6x6 PVM units
- Sixth layer 4x4 PVM units
- Seventh layer 2x2 PVM units
- Eighth layer 1 PVM unit
As mentioned above, I only scaled the model (added layers). Everything else is unchanged, all the PVM units execute the same code, uniformly over the entire model structure.
Note that with the json files I provide you can run those simulations yourself by installing PVM code from the github repo as described in the readme and then running e.g.:
python PVM_run.py -S small.json -D
It may take a few days to reach several million steps. For me it took ~4 days to reach 7mln steps for the small model and ~10 days for the large model on a ten-core Intel i7-6950X (the speed scales nearly linearly with #cores * clock frequency, so if you happen to have a 50-core Xeon machine it will be ~4x faster). Also, run without the -D option (no display) for faster execution.
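If you want a rough idea of how long a run will take on your own machine, the "nearly linear in #cores * clock frequency" rule can be turned into a one-line estimate. This is only a sketch: the function is hypothetical, and the clock speeds in the example (3.0 GHz for the baseline, 2.4 GHz for the 50-core machine) are my assumptions, chosen so that the result matches the ~4x figure quoted above.

```python
def estimated_days(baseline_days, baseline_cores, baseline_ghz, cores, ghz):
    """Scale a measured training time by the ratio of cores * clock frequency.

    Assumes perfectly linear scaling, which the post only claims is
    approximately true.
    """
    baseline_throughput = baseline_cores * baseline_ghz
    target_throughput = cores * ghz
    return baseline_days * baseline_throughput / target_throughput


# Large model: ~10 days on a ten-core machine (assumed 3.0 GHz),
# estimated for a hypothetical 50-core machine at an assumed 2.4 GHz.
print(estimated_days(10, 10, 3.0, 50, 2.4))
```

Treat the result as an optimistic lower bound, since memory bandwidth and synchronisation overheads are ignored.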
So back to our larger model, all together 921 PVM units (almost 4x the number of parameters of the small model) and two layers deeper. I trained it on the same sequence, but now the input image is 100x100 so nearly twice the resolution. All in all, although it is not exactly correct to compare the values of MSE (mean squared error), since the larger model actually operates at a higher resolution image, it appears that both models are roughly at the same stage of convergence (small model approx 0.0022 globally averaged MSE, while the larger one 0.0024).
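The unit counts quoted for both models follow directly from the layer grids listed earlier, and can be checked with a short sketch. The lists below just transcribe the grid sizes from the two model descriptions; the 5x5 input patch is stated for the small model and is implied for the large one by its 100x100 input over 20x20 units. Note that the ratio computed here is for PVM units, which is only a proxy for the number of parameters.

```python
# Side length of the square grid of PVM units in each layer, input layer first.
SMALL_LAYERS = [12, 8, 6, 4, 2, 1]
LARGE_LAYERS = [20, 16, 12, 8, 6, 4, 2, 1]
INPUT_PATCH = 5  # each input-layer unit sees a 5x5 RGB patch


def total_units(grid_sides):
    """Total number of PVM units in a model with square layers."""
    return sum(side * side for side in grid_sides)


def input_image_side(grid_sides, patch=INPUT_PATCH):
    """Side of the square input image covered by the input layer."""
    return grid_sides[0] * patch


small_units = total_units(SMALL_LAYERS)
large_units = total_units(LARGE_LAYERS)

print("small:", small_units, "units,", input_image_side(SMALL_LAYERS), "px input")
print("large:", large_units, "units,", input_image_side(LARGE_LAYERS), "px input")
print("unit ratio:", round(large_units / small_units, 2))
```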
Below is the actual view of the model running at 2mln steps, to compare directly with the small model above. The video is compressed, but one can get the idea that both models are already reasonably trained. Note that 2mln steps is an absolutely tiny amount of training for a model that is 8 layers deep. Any conventional end-to-end deep net would not even begin to converge.
Video 2. Large PVM model at ~2mln steps. The frames in the top row are, from left: input signal (visual), internal compressed activations of the subsequent layers. Second row: consecutive predictions; the first layer predicts the visual input, the second layer predicts the first layer activations and so on. Third row is the error (difference between the signal and prediction). Fourth row is the supervised object heatmap (here unused). Fifth row: prediction of one frame ahead in the future. Rightmost column: various tracking visualisations (here unused).
We can also compare side by side the dream sequences that these models produce (see the description of the procedure here). The recalls made by the larger model are more detailed, but relatively quickly end up in a cycle attractor. Notice, however, how in one of the sequences (third from the top on the right) the model is able to correctly recall a car driving away while the camera was moving. All sequences below were generated after approximately 7mln steps of training.
|Dreams of the small model||Dreams of the large model|
Similarly we can compare the ability to fill in behind an occluder as shown below:
Blindspot filling in of the large model (below):
Blindspot filling in of the small model (below for comparison):
Again as expected the larger model is able to capture a lot more detail.
If you want to try this yourself, while running a PVM simulation on your computer (as described above), do the following:
nc localhost 9000
The debug shell has a few more useful commands, just type help to get them or refer to the documents on github.
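For scripting, the same connection that nc opens can also be made from Python. The sketch below is only an assumption about how one might wrap it: send_command is a hypothetical helper, not part of the PVM code, and it assumes the debug shell accepts newline-terminated text commands on port 9000 (the port used in the nc call above). The only command confirmed in this post is help.

```python
import socket


def send_command(command, host="localhost", port=9000, timeout=2.0):
    """Send one text command to a TCP shell and return whatever it replies.

    Hypothetical helper: assumes newline-terminated commands, and stops
    reading when the peer closes the connection or stays silent for
    `timeout` seconds.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall((command + "\n").encode("utf-8"))
        while True:
            try:
                data = conn.recv(4096)
            except socket.timeout:
                break  # no more output within the timeout
            if not data:
                break  # peer closed the connection
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

With a simulation running, print(send_command("help")) should then show the same text you would get by typing help in the nc session, provided the assumptions above hold.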
It turns out that following the predictive paradigm and constructing a system from the ground up with that paradigm in mind results in an architecture that is vastly more scalable than the end-to-end trained models. It is hard to believe that this is just a coincidence. In fact, with online operation and learning, PVM could likely be implemented on a loosely synchronised system (without a global clock), which would make it infinitely scalable (no sequential part in Amdahl's law).
There are many things that remain unresolved with PVM-like models, namely how to best use the forward models created in their training and, even before that, how to evaluate the quality of these models. What in my honest opinion remains clear (and if you think I'm insane then beware that Yann LeCun claims the same thing) is that building such systems is a necessary step towards real, general intelligence with the ability to correctly generalise in real-world situations.