In this post I will explore the capabilities of contemporary deep learning models on the vitally important task of detecting a cat. Not an ordinary cat, though, but a sketch of an abstract cat. This task matters because success tells us something about whether a visual system has learned generalization and abstraction, at least on par with a 2-year-old. This post is inspired by my former co-worker Peter O'Connor, who tried similar experiments on LeNet several years ago. It is also a continuation of this blog's highly popular "Just how close are we to solving vision?", which to date has amassed nearly 15,000 hits. Let's begin by introducing my menagerie:
Figure 1. The cat menagerie. From left to right (top to bottom): "abstract cat", rough sketch of a real cat, less rough sketch of a real cat, the "best cat" I could draw, "best cat" inverted.
I made these sketches myself, based on a photo of a cat. NOTE: Whenever you test a deep net (or any other machine learning model), always use new data. Anything you find on the Internet is either already in the training set or soon will be.
Caution: due to the large number of animations, this post may take a while to load (depending on your connection speed). Please be patient and don't reload unless necessary. The animations will likely load before you finish reading the text.
Scalability in Machine Learning
Scalability is a word with many meanings and can be confusing, particularly when applied to machine learning. For me the meaning of scalability is the answer to this question:
Can an instance of the algorithm be practically scaled to larger/parallel hardware and achieve better results in approximately the same (physical) time?
That is different from the typical understanding of data parallelism, in which multiple instances of an algorithm are deployed in parallel to process chunks of data simultaneously. Computational fluid dynamics (CFD) is an example of instance scalability as defined above. Aside from the need to obtain better initial conditions, one can run the fluid dynamics on a finer grid and achieve better (more accurate) results. Obviously this requires more compute, but generally the increase in complexity can be offset by adding more processors (there are some subtleties related to Amdahl's law and synchronisation). For that reason, most of the world's giant supercomputers are …
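Here is a minimal Python sketch that puts a number on the Amdahl's law subtlety mentioned above. It assumes a hypothetical workload in which 95% of each time step parallelises. Under Amdahl's law (fixed problem size), the speedup can never exceed 1/(1 - 0.95) = 20. Gustafson's law describes the grow-the-grid-with-the-machine regime of the definition above, and under it the scaled speedup keeps rising with the processor count. The 0.95 figure is illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Amdahl's law: speedup of a FIXED-size job on n processors."""
    _check(parallel_fraction, n_processors)
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)


def gustafson_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Gustafson's law: scaled speedup when the job grows with the machine
    (e.g. a finer CFD grid) while wall-clock time stays roughly fixed.
    Here parallel_fraction is measured on the parallel machine."""
    _check(parallel_fraction, n_processors)
    return (1.0 - parallel_fraction) + parallel_fraction * n_processors


def _check(parallel_fraction: float, n_processors: int) -> None:
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be >= 1")


# Hypothetical workload: 95% of each time step parallelises.
for n in (1, 10, 100, 1000):
    print(f"{n:5d} processors: Amdahl {amdahl_speedup(0.95, n):6.2f}x, "
          f"Gustafson {gustafson_speedup(0.95, n):7.2f}x")
```

The contrast is the point: a fixed-size job saturates quickly, while a job that grows with the hardware can keep using it. In practice, synchronisation costs eat into both numbers.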
Caution: due to the large number of animations, a fair amount of traffic and the tiny size of my web hosting machine, this post may take a while to load. Please be patient and don't reload unless necessary.
There has recently been a fair amount of deep learning work on video prediction and generative models that focuses on infusing motion into static pictures. One such paper is available here:
The approach taken in that paper was to train a model on a huge amount of data and to explicitly separate the task of prediction into (1) the generation of a static background and (2) a moving object. Impressive as this work is, the separation into background and foreground prediction seems a bit unnatural. However, given the mesmerising quality of such video (and the importance of prediction), I decided to play a little with our Predictive Vision Model (PVM), which is also capable of generating such "dreams". For the sake of this post I trained only a very small instance of PVM on a single, relatively short video, so the results shown here are mainly illustrative and this is by no means a full-blown scientific study.…
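The post does not spell out how the "dreams" are produced. What follows is therefore an assumption about the general mechanism, not a description of PVM's code. A common way to get a dream out of any next-frame predictor is to close the loop: after the first frame, feed the model its own prediction instead of the real video. In this minimal Python sketch, `toy_predictor` is a hypothetical hand-written stand-in for a trained model. It "knows" that the scene drifts one pixel per frame, and it is slightly imperfect. There is no separate background/foreground machinery, just repeated prediction.

```python
import numpy as np


def dream(predict, first_frame, n_steps):
    """Closed-loop rollout: after the first frame, the predictor only ever
    sees its own previous output, never the real video."""
    frames = [first_frame]
    for _ in range(n_steps):
        frames.append(predict(frames[-1]))
    return np.stack(frames)


def toy_predictor(frame):
    """Hypothetical stand-in for a trained next-frame model: content drifts
    one pixel to the right per frame, plus a small symmetric blur that
    mimics an imperfect prediction (the weights sum to 1)."""
    shifted = np.roll(frame, 1)
    return 0.9 * shifted + 0.05 * (np.roll(shifted, 1) + np.roll(shifted, -1))


start = np.zeros(32)
start[5] = 1.0  # a single bright "object"
frames = dream(toy_predictor, start, n_steps=10)
for t in (0, 5, 10):
    print(t, int(frames[t].argmax()), round(float(frames[t].max()), 3))
```

In this toy the bright object keeps moving to the right and smears out a little more with every step. This is a simple picture of how small prediction errors compound when a model feeds on its own output.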
In many of my posts I'm directly or indirectly postulating learning of physics as a way to create a "real AI". The point I'm trying to make is so obvious that it actually is not obvious at all, and it took me some time to realise it. As with many such obvious/non-obvious things, it takes multiple angles before the essence can be captured, which is why I write this blog. I keep trying to express myself in different ways until I hit the explanation that everyone simply gets. So let me try again in this post:
The world around us is complex. Everything interacts to some degree with everything else; there are lots of regularities, but there is also a fair amount of chaos. No two trees look identical, yet we manage to categorise them. In the language of physics, a good chunk of our reality appears to be a "mixing system" at the "edge of chaos" (or otherwise critical). We therefore cannot predict very well what will happen. Yet I'm postulating prediction as a training paradigm. Does this make any sense?
It does, and here is why: even in a chaotic world, there are numerous aspects of …
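A standard toy illustration of why prediction can still be a sensible objective in a chaotic system is the logistic map at r = 4, which is fully chaotic. This is my example, not one from the post. In this minimal Python sketch, two trajectories that start 10^-9 apart are practically indistinguishable for the first several steps and only become uncorrelated after a few dozen. Long-range forecasting fails, but short-range prediction remains highly accurate.

```python
def logistic_map(x0: float, r: float = 4.0, n_steps: int = 60) -> list[float]:
    """Iterate x -> r * x * (1 - x); r = 4 is the fully chaotic regime."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


a = logistic_map(0.2)
b = logistic_map(0.2 + 1e-9)  # a tiny perturbation of the initial condition
for t in (0, 5, 10, 20, 30, 40, 50, 60):
    print(f"step {t:2d}: |difference| = {abs(a[t] - b[t]):.2e}")
```

At r = 4 the error roughly doubles per step on average. How far ahead a model can usefully predict therefore depends on how fast errors grow and on how precisely the present state is known.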
Recently Tesla showed a teaser video of their "self driving car" project, which immediately drew media attention and prompted swarms of self driving "enthusiasts" to announce once again that this is a done deal (which it is not). Here is the video in question:
Note: above video has been subsequently taken down, I'm now linking to a mirror.
Now, this looks very impressive as a demo, but there are a few details I'd like to point out before we start saying again that the self driving car is a done deal from a technological point of view. Disclaimer: I do like Tesla and I think some of their ideas are great, but their self driving seems a bit premature, somewhat overpromised and overhyped.
The lighting conditions in the video are perfect from a computer vision point of view. Although it is a bit foggy, the illumination is uniform and diffuse. There are no hard shadows, flares or ghosts.
The lane markings are all clearly painted and visible everywhere.
There are no "unusual situations" (see below what I mean by that).
Just a reminder that a self driving car was demoed as a research project in the mid-1980s at CMU …
Although the talk is over an hour long, it is certainly worth watching, and I strongly recommend doing so before you read any of the following text.
After the lecture
Yann LeCun is a rather colourful character and certainly has strong opinions on many subjects. At any given time I find myself either strongly agreeing or strongly disagreeing with him, and it's no surprise that the same is true this time around. Anyway, he makes several points in his talk which I think are relevant to our published work on PVM (see the PVM paper for details) and worth a more detailed comment.
AI must be close to being solved, since recent progress shows that the technological singularity is inevitable and near?
The singularity may or may not happen. As with any reasoning that extrapolates certain trends, there could be barriers that prevent these prophecies from ever materialising. If we took the distance travelled by humans in space between the late 1940s and the early 1970s and fitted it with an exponential curve, we would have had to send astronauts to Jupiter by now. That clearly did not happen. The same goes for Moore's law and progress in computing. Although there was a period when computing power would double every 20 months or so, it is not clear whether this still applies (comparing contemporary computers with those from say 10 years …
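Here is the arithmetic behind a doubling trend, as a short Python sketch that makes the extrapolation argument concrete. Doubling every 20 months means a 2^(120/20) = 64-fold increase over 10 years. If the doubling period has quietly stretched to a hypothetical 36 months, the same decade yields only about a 10-fold increase. The formula is trivial; the assumption that the trend keeps holding is not.

```python
def extrapolate_doubling(initial: float, doubling_period: float, horizon: float) -> float:
    """Value reached after `horizon` if the quantity keeps doubling every
    `doubling_period` (both in the same time unit, here months)."""
    return initial * 2.0 ** (horizon / doubling_period)


# Doubling every 20 months, extrapolated over 10 years (120 months): 2**6
print(extrapolate_doubling(1.0, 20, 120))
# The same decade if the doubling period has (hypothetically) stretched to 36 months
print(round(extrapolate_doubling(1.0, 36, 120), 2))
```

A modest change in the assumed rate swings the 10-year conclusion more than sixfold.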
I enjoyed Nassim Nicholas Taleb's books and like his style of calling out some of the (let's put it mildly) misconceptions in the theoretical approach to economics. One of his key ideas is the Ludic Fallacy, that is, the use (abuse) of game analogies for real world situations. This fallacy stems from the fact that, since reality is incomprehensibly complex, we typically restrict the scope of research (or any other mental activity) to some model world, a game, where the rules are all known (assumed). We then derive conclusions about some aspect of reality, forgetting that the conclusions were derived in the model world and that any uncertainty about whether that model world was accurate is inherited by those conclusions. For example: if I assume, based on previous cases, that given poll results indicate a particular candidate will win the election, I silently assume that nothing else fundamental has changed since the "previous cases" and that the analogy can be drawn. But if something has changed outside of the model, then my prediction may just as well be completely useless (even if it has a nice "confidence level" derived within the model). Recent US elections …
I elaborated in my previous post on why I think predictive capability is crucial for an intelligent agent, and on how we get fooled when a purely reactive system gets 90% of the motor commands right. This also relates to two ways of thinking about the problem: in terms of statistics or in terms of dynamics. The current mainstream (the statistical majority) is focused on statistics, and that statistically works. However, much as with guiding behavior, the statistical majority may omit important outliers: important information is often hidden in the tail of the distribution.
I've mentioned the Predictive Vision Model, which is our way (mine and that of a few like-minded colleagues) of introducing the predictive paradigm into machine learning. It is described in a lengthy paper, but not everyone has the time to go through it, so I will briefly describe the principles here:
The idea is to create a predictive model of the sensory input (in this case visual). Since we don't know the equations of motion of the sensory values, the way to do it is via machine learning: simply associate the values of inputs now with those same values in the future (think of something like an autoencoder …
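The principle of associating inputs now with the same inputs in the future can be sketched in a few lines of Python. This is a deliberately minimal stand-in, not PVM itself (the real model is considerably richer). The model is a single linear map fitted by least squares. The sensory stream is a hypothetical 1-D "image" that drifts one pixel per frame, with fresh noise added at each step. The key difference from a plain autoencoder is the training target: the next frame rather than the current one.

```python
import numpy as np


def make_stream(n_frames, width=16, decay=0.95, noise=0.3, seed=0):
    """Hypothetical sensory stream: a 1-D 'image' that drifts one pixel to the
    right per frame, with fresh noise injected at every step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    frames = [x]
    for _ in range(n_frames - 1):
        x = decay * np.roll(x, 1) + noise * rng.standard_normal(width)
        frames.append(x)
    return np.stack(frames)


def fit_predictor(frames):
    """Least-squares linear map W with frames[t] @ W ~= frames[t + 1].
    An autoencoder would use frames[t] itself as the target; here the target
    is the same signal one step in the future."""
    now, later = frames[:-1], frames[1:]
    W, *_ = np.linalg.lstsq(now, later, rcond=None)
    return W


train = make_stream(2000, seed=0)
test = make_stream(500, seed=1)
W = fit_predictor(train)

pred_mse = np.mean((test[:-1] @ W - test[1:]) ** 2)
copy_mse = np.mean((test[:-1] - test[1:]) ** 2)  # baseline: "nothing moves"
print(f"learned predictor MSE: {pred_mse:.3f}")
print(f"copy-last-frame MSE:   {copy_mse:.3f}")
```

In this toy the learned `W` should come out close to 0.95 times a one-pixel shift matrix. In other words, the model picks up the motion rather than the appearance of any particular frame, and it clearly beats the "nothing moves" baseline.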
This post is an extension of my previous post on Statistics vs Dynamics in machine learning. I'll try to expand here on what I think is the key missing ingredient (possibly not the only one) for efforts such as the self driving car or other robotic projects aimed at unrestricted environments.
The way the problem of control in machine learning is approached today is by end-to-end training of motor commands based on sensory input (see e.g. here). The authors argue that the optimisation algorithm will do a better job than explicitly breaking the task down into perceptual/planning submodules, because it can do everything at once. This logic is influenced by behaviourism and by the observation that humans appear to do essentially the same thing: map sensory input onto motor commands.
This approach is flawed, as I will try to explain in the paragraphs that follow.
What looks like a direct sensory-to-motor mapping may be a lot more complex
Looking naively at a human performing a task such as driving a car, one may think that the human simply maps what he currently sees onto a motor command. It certainly looks like that, …
There are two very important branches of mathematics relevant for building intelligent systems: statistics and dynamics. The rationale is the following:
data has regularities and patterns that repeat, therefore an intelligent system should analyse them statistically
things in the world are in motion and that motion has regularities, therefore the intelligent system should build models of those dynamics
Although these approaches seem very compatible, it is important to understand their different modes of thinking. Statistics tries to find a pattern from many samples, given that we know nothing else about the system (often assuming that things come from a known distribution). Dynamics tries to write down the equations of motion of the system given very few samples. Statistics wants to estimate the expected value and variance of things. Dynamics wants to predict the exact value of something with a strict error estimate.
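Here is a caricature of the two modes of thinking, as a minimal Python sketch. The example is hypothetical: an object falling from rest, x(t) = g t^2 / 2. The statistical view summarises 101 observations from the first second by their mean and variance, which says little about where the object will be at t = 2 s. The dynamical view assumes the form of the equation of motion, x(t) = a t^2. It estimates the single parameter from one observation and then predicts the future position exactly, which it can do only because this toy is noise-free.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2


def position(t):
    """Hypothetical ground truth: distance fallen from rest after t seconds."""
    return 0.5 * G * t ** 2


# Statistics: many samples, summarised by expected value and variance.
t_observed = np.linspace(0.0, 1.0, 101)
samples = position(t_observed)
print(f"statistics: mean {samples.mean():.2f} m, variance {samples.var():.2f} m^2")

# Dynamics: assume x(t) = a * t**2 and fit the single parameter from ONE sample...
t0 = 0.5
a = position(t0) / t0 ** 2
# ...then extrapolate far outside the observed time window.
t_future = 2.0
dyn_prediction = a * t_future ** 2
print(f"dynamics: predicted {dyn_prediction:.2f} m at t = {t_future} s, "
      f"true {position(t_future):.2f} m")
```

With noisy observations, the estimate of `a` would carry an error bar that propagates into the prediction. That propagated error bar is the "strict error estimate" side of the dynamical view.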
Current machine learning is heavily biased towards statistics. Although some priors are inserted into the models, the general approach is to throw more data and compute power at a system and expect miracles, rather than to build a system that could intelligently infer based on the dynamics (see e.g. ImageNet and similar purely statistical approaches to understanding images …
Everybody is trying to build a self driving car today. Google has been testing their solution for the past ten years or so, Tesla just announced they'd be putting "self driving hardware" into their newly manufactured cars, Uber has a big effort with Volvo in Pittsburgh, comma.ai is trying to ship a box for outfitting certain cars with a self driving mode, etc. Naturally the car manufacturers are following, with Ford making announcements recently, BMW working quietly and so on and so on. Some of these efforts are explicitly cautious about what they promise (e.g. Toyota, which talks about driver assist technology rather than full autonomy). But many voices, particularly VCs from the Bay Area, are hyperactively announcing how great life will be and how the self driving car (in the sense of full autonomy) is a done deal.
Well, I would not be a sceptic if I did not cast doubt on all those hyper-optimistic statements. Let me go through a few claims about self driving cars one by one and put my sceptical comment next to each. To be frank: I'm not against the technology, I'm against the hype.
There is an ancient argument in the field of AI called the Chinese room experiment. The thought experiment proposed by John Searle in the early eighties goes as follows:
You put somebody who does not know Chinese in a room
You give them a lengthy instruction (a program) on how to respond to given Chinese symbols
Finally you run the experiment by feeding Chinese sentences in at the input and getting sentences at the output. The Chinese speakers outside are convinced they are conversing with a sentient being, but the poor guy inside just shuffles symbols and has no idea what he is conversing about
The conclusion is that even though the external observers assume (by the Turing test) that they are observing intelligence, the guy inside is clearly unaware of what is going on, and therefore the intelligence is somehow unreal.
Personally, I have several issues with that experiment. First of all, it is a thought experiment, and it assumes we can have externally recognised intelligence implemented by a guy with a book of symbol transformations. Although a computational input/output relation like that should be implementable by a "computer", the size of the necessary derivations could be enormous. In …
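To get a feel for how enormous, consider the crudest possible rule book: a plain lookup table with one precomputed reply per possible input. This Python sketch counts the entries under hypothetical numbers: 3,000 distinct characters and inputs of up to 20 characters. It also assumes no dependence on earlier turns of the conversation, which would make things far worse. A real program could compress this massively, so treat the result as an illustration of scale, not as a claim about what Searle's rule book must contain.

```python
def lookup_table_entries(vocabulary_size: int, max_length: int) -> int:
    """Number of distinct inputs of length 1..max_length over a vocabulary,
    i.e. the size of a rule book that tabulates one reply per input."""
    return sum(vocabulary_size ** k for k in range(1, max_length + 1))


# Hypothetical figures: 3,000 characters, single inputs of up to 20 characters.
entries = lookup_table_entries(3000, 20)
print(f"roughly 10^{len(str(entries)) - 1} entries")
```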
In the previous posts I've been investigating how current state-of-the-art deep nets do on a casual vision application: telling what is in an image taken in an average office or on an average, boring street. I've also played a bit with adversarial examples to show how deep nets can be fooled. These failure modes tell us something important about the level of perception we are dealing with: a very basic one. In this post I will discuss why I think perception is such an elusive problem. Let's begin with vision.
Each of us is born with a blind spot in our visual field: the place where nerve fibres from the retina exit the eyeball. However, unless somebody tells us how to discover it, we are completely unaware of its existence. In some sense it could be classified as an example of anosognosia, a condition in which humans are not aware of a defect in their perception. A more extreme case of this is known as Anton-Babinski syndrome, typically occurring after brain damage, in which the patient claims to see even though he is technically blind! As unbelievable as this seems, patients will confabulate …
In the previous post I applied an off-the-shelf deep net to get an idea of how it performs on average street/office video. The purpose of this exercise was to critically examine and reveal what these award-winning models are actually like. The results were a mixed bag. The network was able to capture the gist of the scene, but made serious mistakes every once in a while. Granted, the model I used for that experiment was trained on ImageNet, which has a few biases and is probably not the best set for testing "visual capabilities in the real world". In the current post I will discuss another problem plaguing deep learning models: adversarial stimuli.
Deep nets can be made to fail on purpose. This was first shown in  and there have been quite a few papers since then with different methods to construct stimuli that fool deep models. In the simplest case one can derive these stimuli directly from the network itself. Since ConvNets are purely feedforward systems (most of them at least), we can trace back the gradients. Typically gradients are used to modify the weights such that they better fit the given …
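As a minimal illustration of tracing gradients back to the input, here is a Python sketch of the fast gradient sign method (FGSM) from Goodfellow et al. It is one of the simplest such techniques and not necessarily the one used in this post. To keep the sketch self-contained, the "network" is a logistic-regression model with hypothetical random weights rather than a ConvNet. The clean input is constructed so that the model assigns it to class 1 with moderate confidence. Instead of adjusting the weights, we take the gradient of the loss with respect to the input and nudge every pixel by at most epsilon in the direction that increases the loss.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_grad_wrt_input(w, b, x, label):
    """Gradient of binary cross-entropy w.r.t. the INPUT x; the weights are
    held fixed, unlike in training where the gradient w.r.t. w is used."""
    p = sigmoid(w @ x + b)
    return (p - label) * w


def fgsm(w, b, x, label, epsilon):
    """Fast gradient sign method: move each input value by +/- epsilon
    in the direction that increases the loss for the true label."""
    return x + epsilon * np.sign(loss_grad_wrt_input(w, b, x, label))


rng = np.random.default_rng(0)
dim = 28 * 28                                # e.g. a flattened 28x28 image
w = rng.standard_normal(dim) / np.sqrt(dim)  # hypothetical "trained" weights
b = 0.0
x = 0.05 * np.sign(w)                        # clean input, classified as 1

x_adv = fgsm(w, b, x, label=1, epsilon=0.1)
print(f"clean:       p(class 1) = {sigmoid(w @ x + b):.3f}")
print(f"adversarial: p(class 1) = {sigmoid(w @ x_adv + b):.3f}")
print(f"largest per-pixel change: {np.max(np.abs(x_adv - x)):.3f}")
```

Each pixel moves by only 0.1, but all 784 small changes push the score the same way. The total shift in the logit is therefore epsilon times the L1 norm of the weights. Here that is exactly twice the clean input's logit, so the decision flips.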