Caution: due to the large number of animations, a fair amount of traffic, and the tiny size of my web hosting machine, this post may take a while to load. Please be patient and don't reload unless necessary.
There has recently been a fair amount of deep learning work on video prediction and generative models that focuses on infusing motion into static pictures. One such paper is available here: http://web.mit.edu/vondrick/tinyvideo/
The approach taken in that paper was to train a model on a huge amount of data and explicitly separate the task of prediction into (1) generating a static background and (2) generating a moving object. As impressive as this work is, the separation into background and foreground prediction seems a bit unnatural. Given, however, the mesmerising quality of video (and the importance of prediction), I decided to play a little with our Predictive Vision Model (PVM), which is also capable of generating such "dreams". For the sake of this post I only trained a very small instance of PVM on a single, relatively short video, so the results shown here are mainly illustrative and this is by no means a full-blown scientific study.
You can experiment with PVM yourself: check out the source code available on GitHub.
Feedback and lateral connectivity matter in PVM
First of all, let's take a look at the trained model. I trained it for ~7 million steps, which is a tiny fraction of the amount of training described in the MIT paper. Below we can see the model running at ~2 million steps of training:
Video 1. The frames in the top row are, from left: the input signal (visual) and the internal compressed activations of the subsequent layers. Second row: consecutive predictions; the first layer predicts the visual input, the second layer predicts the first layer's activations, and so on. Third row: the error (difference between the signal and the prediction). Fourth row: the supervised object heatmap (here unused). Fifth row: prediction one frame ahead into the future. Rightmost column: various tracking visualisations (here unused).
The videos included in this post are highly compressed, but one can see that the system works quite well at 2 million steps (and even better at 7 million steps, which is where I generated some of the samples below). Before we go any further, we can perform a simple experiment: we can study the importance of feedback connectivity in the system through a "lesion study". We freeze learning and disable feedback/lateral connectivity. What we get is this:
Video 2. The system with disabled feedback and lateral connectivity. Legend as above.
The prediction gets much worse (note the pattern of activity in the leftmost square of the second row from the top). This indicates that the feedback and lateral connections were indeed important, and severing them does substantial damage to the system's dynamics and predictions. The system could be retrained to rely less on feedback and the predictions would improve somewhat, but the point is clear: when feedback and lateral connectivity are available, they get integrated into the recurrent model of the input.
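To make the lesion concrete, here is a minimal sketch of how one might silence the context inputs of a single unit. This is only an illustration under my own assumptions: the function name `assemble_unit_input`, the flag `context_enabled`, and the choice to replace the context with zeros are all hypothetical and not taken from the PVM source code.

```python
def assemble_unit_input(signal, lateral, feedback, context_enabled=True):
    """Concatenate a unit's feedforward signal with its context inputs.

    Hypothetical helper: with context_enabled=False the lateral and
    feedback parts are replaced by zeros, so the unit has to predict
    from its feedforward signal alone.
    """
    if not context_enabled:
        lateral = [0.0] * len(lateral)
        feedback = [0.0] * len(feedback)
    return list(signal) + list(lateral) + list(feedback)


signal = [0.2, 0.8]    # feedforward input (e.g. a small image patch)
lateral = [0.5, 0.1]   # activations of neighbouring units
feedback = [0.9]       # activations from the layer above

intact = assemble_unit_input(signal, lateral, feedback)
lesioned = assemble_unit_input(signal, lateral, feedback, context_enabled=False)
```

The input keeps the same length in both cases, so the trained weights still apply; only the information carried by the context is removed.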
Now let's have some fun. We can make PVM generate a sequence by enabling the "dream" mode. Here is how that works: the prediction made by the PVM at the input level gets copied over as an input frame for the next step. So we allow the system to follow its own predicted trajectory. The system keeps generating new predictions until it hits a fixed point. A few examples are shown below:
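The closed loop described above can be sketched in a few lines. This is a toy illustration, not the PVM code: `dream`, `predict_next`, `max_steps` and `tol` are names I made up, and `toy_predictor` is a hypothetical stand-in for a trained model (a real PVM also carries recurrent state between steps, which this stateless toy ignores).

```python
from typing import Callable, List, Sequence

Frame = List[float]


def dream(predict_next: Callable[[Frame], Frame],
          first_frame: Sequence[float],
          max_steps: int = 1000,
          tol: float = 1e-6) -> List[Frame]:
    """Feed each prediction back in as the next input frame.

    Stops when consecutive frames differ by less than `tol` in every
    pixel (an approximate fixed point) or after `max_steps` steps.
    """
    frames = [list(first_frame)]
    for _ in range(max_steps):
        current = frames[-1]
        predicted = list(predict_next(current))
        frames.append(predicted)
        change = max(abs(p - c) for p, c in zip(predicted, current))
        if change < tol:
            break
    return frames


def toy_predictor(frame: Frame) -> Frame:
    # Hypothetical stand-in for a trained model: every pixel moves
    # halfway towards 0.5, so a uniform grey frame is the attractor.
    return [0.5 + 0.5 * (x - 0.5) for x in frame]


sequence = dream(toy_predictor, [0.0, 1.0, 0.25])
```

In this toy the "dream" ends after a couple of dozen steps in the uniform grey frame, which plays the role that the red curb plays in the sequences shown here.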
As you can see, the sequences above appear to converge to a picture of the red curb, which seems to be an attracting point of the predictive dynamics. I'll elaborate later on my little working theory on how this may relate to sleep. Nevertheless, some of these dream sequences can be pretty lengthy; take a look below:
Blind spot and contextual fill-in
Now we are ready for another experiment, namely to see if the network can guess what is behind an occluder. For this we present the network with part of the visual field covered by a grey spot. Since the network was never exposed to such input during training, it will use the surround to construct a sensible scene, and in doing so it will try to guess what is in place of the occluder.
As we might expect, these fill-ins are mediated entirely by lateral and feedback connections. If we disable those, the entire prediction collapses and the occluder also becomes completely static (a bit hard to see):
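Preparing the occluded input is simple; a minimal sketch follows. The helper `occlude`, its parameters and the grey value of 0.5 are my own assumptions for illustration (frames are represented as nested lists of pixel values in [0, 1], and `top` and `left` are assumed to be non-negative).

```python
def occlude(frame, top, left, height, width, grey=0.5):
    """Return a copy of a 2-D frame with a grey rectangle painted over it.

    The original frame is left untouched; the rectangle is clipped at
    the bottom and right edges of the frame.
    """
    out = [row[:] for row in frame]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = grey
    return out


# A tiny 4x4 "frame" with pixel values rising from 0 to 1.
frame = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
masked = occlude(frame, top=1, left=1, height=2, width=2)
```

The masked frame is what the network receives as input; whatever its prediction shows inside the grey rectangle is the fill-in.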
We can also make the system reconstruct the surround based on the centre:
Dreams from static frames
The work mentioned above (http://web.mit.edu/vondrick/tinyvideo/) attempts to create motion sequences from single static frames, much like a human, who typically sees a story or some dynamics in a static photo. In the previous experiments the dream mode was enabled while the model was going through the video, so some aspects of motion were remembered in the state of the recurrent connectivity. In the experiment below we remove all that motion memory from the system and allow it to start dreaming from a static frame. The frame is presented for a long time (~100 steps) to erase any trace of recent motion, and only then does the dream mode proceed. If the system generates any motion, that motion is conditioned only on that single frame and not on the past sequence. Let's see what that looks like (conditioning frame on the left, sequence on the right):
As expected, the system is capable of generating sequences from static frames as well. Granted, these are only toy examples from the actual training set, but the system will generate similar-looking sequences from an arbitrary initial frame.
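The protocol (hold one frame long enough to flush the motion memory, then close the loop) can be sketched as follows. Everything here is a hypothetical toy: `ToyRecurrentPredictor` keeps a single leaky trace of recent motion and has no learned dynamics, so it only demonstrates the flushing step, not the motion a trained PVM would go on to generate.

```python
class ToyRecurrentPredictor:
    """Hypothetical stand-in whose only memory is a leaky motion trace."""

    def __init__(self, size, leak=0.9):
        self.leak = leak
        self.state = [0.0] * size   # running trace of recent motion
        self.prev = None            # previous input frame

    def step(self, frame):
        if self.prev is None:
            motion = [0.0] * len(frame)
        else:
            motion = [f - p for f, p in zip(frame, self.prev)]
        self.state = [self.leak * s + (1.0 - self.leak) * m
                      for s, m in zip(self.state, motion)]
        self.prev = list(frame)
        # Predict the next frame by extrapolating the remembered motion.
        return [f + s for f, s in zip(frame, self.state)]


def dream_from_static(model, frame, hold_steps=100, dream_steps=20):
    """Show one frame for hold_steps, then let the model dream."""
    prediction = list(frame)
    for _ in range(hold_steps):
        prediction = model.step(frame)        # input never changes
    frames = []
    for _ in range(dream_steps):
        frames.append(prediction)
        prediction = model.step(prediction)   # prediction becomes input
    return frames


model = ToyRecurrentPredictor(size=2)
model.step([0.0, 0.0])
model.step([1.0, 1.0])                        # a sudden change leaves a trace
trace_before = max(abs(s) for s in model.state)
frames = dream_from_static(model, [1.0, 1.0])
trace_after = max(abs(s) for s in model.state)
```

After the hold phase the motion trace has decayed by several orders of magnitude, so anything the model does afterwards depends on the held frame rather than on what came before it.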
Theory on why animals need to sleep
Now, as I noted above, the dreams usually end up in some attractor state, which for this particular model and data seems to be the picture of the red curb. My hypothesis is that, in general, a complete, well-functioning system should avoid having such fixed points, because once the trajectory falls into one of them, all subsequent prediction is useless. Having the ability to imagine as far into the future as possible is likely advantageous. Therefore the brain should have some mechanism to eliminate such attractors. In that formulation, "sleep" would not be so much about consolidation of memories as about the removal of fixed points of the predictive dynamics.
How could we remove them in PVM? Well, simply run the "dream" mode and, when the system hits an attractor, turn on the learning but with a negative sign. In this way the attractor should become unstable and the dynamics should be able to escape it. I'm currently reviewing the neuroscience literature for any indication that some phases of sleep trigger plasticity reversal in the cortex. If such evidence is found, that would be another indication that PVM is on the right track towards general cortical computation.