Many people these days are fascinated by deep learning, as it has enabled new capabilities in many areas, particularly in computer vision. Deep nets are, however, black boxes, and most people have no idea how they work (and frankly most of us, scientists trained in the field, can't tell exactly how they work either). But the success of deep learning and a set of its surprising failure modes teach us a valuable lesson about the data we process.
In this post I will present a perspective on what deep learning actually enables, how it relates to classical computer vision (which is far from dead) and what the potential dangers are of relying on DL for critical applications.
The vision problem
First of all, some things need to be said about the problem of vision/computer vision. In principle it could be formulated as follows: given an image from a camera, allow the computer to answer questions about the contents of that image. Such questions can range from "is there a triangle in the image" or "is there a human face in the image" to more complex instances such as "is there a dog chasing a cat in the image". Although many of these questions may seem alike, and pretty much all of them are trivial for a human to answer, it turns out there is huge variability in the complexity of the problems posed by such questions. While it is relatively easy to answer questions such as "is there a red circle in the image" or "how many bright spots are in the image", other seemingly simple questions such as "is there a cat in the image" can be way more complex to solve. And by way more I seriously mean way more. The separation between "simple" vision problems and "complex" vision problems seems almost endless.
This is important to note, since for human beings - highly visual animals - pretty much all the questions I have stated so far are trivial. And it is not at all difficult to state visual questions, simple even for children, that currently have no solution at all, even with the deep learning revolution.
Traditional Computer Vision
Traditional computer vision is a broad collection of algorithms that allow one to extract information from images (typically represented as arrays of pixel values). There is a wide range of methods used for various applications, such as denoising, enhancement and detection of various objects. Some methods aim at finding simple geometric primitives, e.g. edge detection followed by morphological analysis, the Hough transform, blob detection, corner detection, various techniques of image thresholding etc. There are also feature representation techniques and transformations, such as histograms of oriented gradients, Haar cascades etc., which can be used as a front end to a machine learning classifier to build a more complex detector.
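To make the "simple" end of the spectrum concrete, here is a minimal sketch of how the earlier question "how many bright spots are in the image" could be answered with two of the classical tools listed above: global thresholding followed by a basic form of blob detection (connected regions found by flood fill). The function name `count_bright_spots`, the threshold of 200 and the tiny test image are assumptions made for illustration; a real pipeline would more likely rely on a library such as OpenCV.

```python
from collections import deque


def count_bright_spots(image, threshold=200):
    """Count 4-connected regions of pixels brighter than `threshold`.

    `image` is a list of rows of grayscale values in the range 0..255.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    # Step 1: global thresholding turns the image into a binary mask.
    mask = [[pixel > threshold for pixel in row] for row in image]
    seen = [[False] * width for _ in range(height)]
    spots = 0
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Step 2: flood-fill one connected bright region (a "blob").
            spots += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                neighbours = ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1))
                for ny, nx in neighbours:
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return spots


# Illustrative 5x5 image with two separate bright regions.
image = [
    [0,   0,   0,   0, 0],
    [0, 255, 250,   0, 0],
    [0,   0,   0,   0, 0],
    [0,   0,   0, 230, 0],
    [40,  0,   0, 240, 0],
]
print(count_bright_spots(image))  # prints 2
```

Every step here can be read, tested and reasoned about: the detector fails exactly when a spot is dimmer than the threshold or when two spots touch, and nowhere else.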
Contrary to popular belief, the tools discussed above, combined together, can form very powerful and efficient detectors for particular objects. One can construct a face detector, a car detector or a street sign detector, and they are very likely to outperform deep learning solutions for these particular objects both in accuracy and in computational complexity. But the problem is that each detector needs to be constructed from scratch by competent people, which is inefficient, expensive and not scalable.
Hence great detectors were historically only available for stuff that had to be detected frequently and could justify the upfront investment. Many of these detectors are proprietary and not available to the general public. There are great face detectors, license plate readers etc. But no one in their right mind would invest money into writing a dog detector or a classifier that could classify a breed of dog from an image. It would simply be expensive and impractical. And this is where deep learning comes in.
The story of a savant student
Imagine you are running a class in computer vision. In the first few lectures you go over the abundance of techniques (such as the ones discussed above) and then the remaining hours are left for student projects. The projects are expressed as datasets of images and tasks (formulated as questions like the ones above). The tasks begin simple, e.g. asking whether there is a circle or a square in the image, and go all the way to more complex tasks such as distinguishing a cat from a dog. The students write a computer program every week to solve the next task. You review the code and run it on some samples to see how well they did.
A new student joined your class this semester. He does not talk much or socialize. He never asks questions. When he submits his first solution you are a bit baffled. It is a bunch of really unreadable code, not like anything you have ever seen. It seems he is convolving each image with some random-looking filters, and then some very strange logic of many, many "if then else" statements with a huge number of magic parameters is used to obtain the final answer. You run this code on samples and it works perfectly fine. You think to yourself: well that is certainly an unusual way of solving this problem but as long as it works it is fine.
Weeks pass, more complex tasks are given out and you get progressively more complex code from this guy. You can't really understand anything from it, but it works surprisingly well even for some sophisticated tasks.
Near the end of the semester a final task is given - in a set of real images, distinguish cats from dogs. No student could ever achieve more than 65% accuracy on this task, but the savant student gives you the code and it rips 95% accuracy. You are stunned.
At this point you spend the next few days deeply analyzing the unreadable code. You give it new samples. You modify the samples, try to find out what affects the decision of the program, reverse engineer it. Eventually you reach a very surprising conclusion: the code detects dog tags. If it is able to detect a tag (which is a simple silver ellipsoid), it then figures out whether the lower part of the image is brown. If it is, it returns "cat", otherwise "dog". If it cannot detect a tag, it checks whether the left part of the image is more yellow than the right part. If so, it returns "dog", otherwise it returns "cat". You invite the student to your office and present him with your findings. You ask if he thinks he really solved the problem. After a long silence he eventually mumbles that he solved the task as presented by the dataset, and that he does not really know what a dog looks like or how it is different from a cat...
Did he cheat? Yes, clearly the task he is solving has nothing to do with the task you had in mind. And no, because as you stated the task by the dataset, the solution is valid. None of the other students came even close in terms of raw performance. However, they were trying to solve the task as stated by the question, not by the raw dataset. Their programs don't work as well, but they also don't make bizarre mistakes. E.g. by pasting an image of a tag into a photo of a cat, you can easily make the savant's program detect a dog. The other students' programs cannot be fooled by this, but they also do not work nearly as well on the data given.
The blessing and the curse of deep learning
Deep learning is a technique that uses an optimization procedure (gradient descent, with the gradients computed by back-propagation) to generate "programs" (also known as "neural networks") such as those written by the savant student in the tale above. Neither these "programs" nor the optimization procedure know anything about the world; all the optimization cares about is building a set of transformations and conditions that will assign the correct label to the correct image in a dataset. If there is some cheap trick available (such as all pictures of dogs in the dataset having a blue hue in the top left corner), it will use it without blinking, by design.
These spurious biases can be removed by adding more data to the training set, but the "programs" generated by back-propagation can be very large and very complex, with millions of parameters and thousands of condition checks, hence they can latch on to combinations of combinations of much finer biases. Anything that allows the correct label to be assigned, and so statistically optimizes the objective function, will do, whether or not it has anything to do with the "semantic spirit" of the task.
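The "cheap trick" effect can be reproduced on a toy problem. The sketch below is a deliberately tiny stand-in for a deep net: a logistic regression trained by gradient descent on two made-up features. `shape_cue` is a noisy but genuine signal, while `corner_hue` is a dataset artifact that matches the label perfectly in the training set only. All names, noise levels and hyperparameters are assumptions chosen for illustration, not measurements from any real dataset.

```python
import math
import random


def make_dataset(n, hue_tracks_label, rng):
    """Return (features, labels); each feature vector is [shape_cue, corner_hue]."""
    features, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        # Genuine cue: points the right way on average, but is noisy.
        shape_cue = sign + rng.gauss(0.0, 1.0)
        # Artifact: follows the label only if the dataset happens to be biased.
        hue_sign = sign if hue_tracks_label else rng.choice((-1.0, 1.0))
        corner_hue = hue_sign + rng.gauss(0.0, 0.1)
        features.append([shape_cue, corner_hue])
        labels.append(label)
    return features, labels


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(features, labels, used, steps=500, lr=0.5):
    """Fit logistic regression by full-batch gradient descent.

    `used` lists the feature indices the model is allowed to see.
    """
    weights = [0.0] * len(used)
    bias = 0.0
    n = len(labels)
    for _ in range(steps):
        grad_w = [0.0] * len(used)
        grad_b = 0.0
        for x, y in zip(features, labels):
            z = bias + sum(w * x[i] for w, i in zip(weights, used))
            err = sigmoid(z) - y
            for k, i in enumerate(used):
                grad_w[k] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def accuracy(model, features, labels, used):
    weights, bias = model
    hits = 0
    for x, y in zip(features, labels):
        z = bias + sum(w * x[i] for w, i in zip(weights, used))
        hits += int((z > 0) == (y == 1))
    return hits / len(labels)


rng = random.Random(0)
train_x, train_y = make_dataset(300, hue_tracks_label=True, rng=rng)
shifted_x, shifted_y = make_dataset(300, hue_tracks_label=False, rng=rng)

both = train(train_x, train_y, used=[0, 1])   # the "savant": sees everything
shape_only = train(train_x, train_y, used=[0])  # restricted to the genuine cue

print("both cues, training data:    ", accuracy(both, train_x, train_y, [0, 1]))
print("both cues, hue decorrelated: ", accuracy(both, shifted_x, shifted_y, [0, 1]))
print("shape only, hue decorrelated:", accuracy(shape_only, shifted_x, shifted_y, [0]))
```

Under these assumptions the model that sees both features should lean mostly on the hue artifact: it looks excellent on the training data and falls to roughly chance level once the hue stops tracking the label, while the model restricted to the genuine cue is less impressive on paper but keeps working. Nothing in the optimization tells the two kinds of feature apart.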
Can these networks eventually latch onto things that are "semantically correct" priors? Surely yes. But there is now a mountain of evidence that this is not what they actually do. Adversarial examples show how very tiny, imperceptible modifications to the image can completely throw off the results. Studies of new samples resembling previously trained datasets show that generalization beyond the original dataset is much weaker than generalization within the dataset, indicating that the network latched onto some particular low level feature of the given dataset. In certain cases it is enough to modify a single pixel to throw off a deep network classifier.
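One way to see why tiny modifications can matter so much is the following sketch. It uses a plain linear classifier with hypothetical weights rather than a deep net, so it is an intuition aid and not a reproduction of any published attack. Each pixel is nudged by at most `epsilon` against the sign of its weight; the score then moves by `epsilon` times the sum of the absolute weights, which grows with the number of pixels. Against real networks the analogous attack nudges pixels along the sign of the gradient of the loss with respect to the input.

```python
def score(weights, pixels):
    """Linear classifier score: positive means "dog", negative means "cat"."""
    return sum(w * p for w, p in zip(weights, pixels))


def sign(value):
    return (value > 0) - (value < 0)


def adversarial_shift(weights, pixels, epsilon):
    """Move every pixel by at most `epsilon` in the direction that lowers the score."""
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]


n_pixels = 10_000
# Hypothetical weights, alternating in sign so that they sum to zero.
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(n_pixels)]
# A mid-gray "image" (values in 0..1) with a faint pattern the model reads as "dog".
pixels = [0.5 + 0.01 * w for w in weights]

perturbed = adversarial_shift(weights, pixels, epsilon=0.02)
largest_change = max(abs(a - b) for a, b in zip(pixels, perturbed))

print(score(weights, pixels))     # approximately 100: confidently "dog"
print(score(weights, perturbed))  # approximately -100: confidently "cat"
print(largest_change)             # approximately 0.02 on a 0..1 intensity scale
```

No single pixel changed by more than 2% of the intensity range, yet ten thousand coordinated nudges are enough to flip the decision with the same confidence as before.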
In some ways, the biggest strength of deep learning - ability to automatically create features that no human would have thought of - is simultaneously its biggest weakness, since most of these features appear semantically "questionable" to say the least.
When does it make sense and when does it not?
Deep learning is certainly an interesting addition to the computer vision toolbox. We can now relatively easily "train" detectors for objects that otherwise would be too expensive and impractical to implement. We can also, to a certain degree, scale these detectors to use more compute power. But the price we pay for that luxury is high: we don't know how they make their decisions, and we do know that most likely the basis on which classification is made has nothing to do with the "semantic spirit" of the task. Hence the detector will unexpectedly fail any time the low level biases present in the training set are violated in the input data. These failure conditions are practically impossible to characterize.
So in practice deep learning is great for applications in which errors are not critical and where inputs are guaranteed not to vary a lot from the training dataset. Problems where 95% performance is enough. That includes e.g. image search, maybe surveillance, automated retail (something I've been working on for the past few months) and pretty much everything that is not "mission critical" and can be reviewed and corrected later. Ironically, most people view deep learning as a revolution in the application space where decisions are made in real time and errors are critical and lead to fatal outcomes, such as self driving cars and autonomous robotics (e.g. a recent study shows that autonomous driving solutions based on deep neural nets are indeed susceptible to adversarial attack in real life). I can only describe such a belief as an "unfortunate" misconception.
Some have high hopes for the application of deep learning in medicine and diagnosis. However, in this space there are also some concerning findings, e.g. the failure of models trained on one institution's data to validate well on another institution's data. This again is consistent with the view that these models take a much shallower scoop of the data than many researchers hoped for.
Data is shallower than we thought
Deep learning has, surprisingly, taught us something very interesting about visual data (and high dimensional data in general): in some ways it is much "shallower" than we believed in the past. There seem to be many more ways to statistically separate a visual dataset labeled with high level human categories than there are ways to separate such a dataset that are "semantically correct". In other words, the set of low level image features is a lot more "statistically" potent than we imagined. This is the great discovery of deep learning. The question of how to generate models that would find "semantically sound" ways of separating visual datasets remains open and in fact now seems even harder to answer than before.
Deep learning is here to stay and is now a crucial part of the computer vision toolbox. But traditional computer vision is not going anywhere and can still be used to build very powerful detectors. These hand crafted detectors may not achieve as high performance on some particular dataset metric, but they can be guaranteed to rely on a "semantically relevant" set of features of the input. Therefore their failure modes can be much better characterized and anticipated. Deep learning provides statistically powerful detectors without the expense of feature engineering, though one still has to have a lot of labeled data, a lot of GPUs and a deep learning expert onsite. However, these powerful detectors will fail unexpectedly and their range of applicability cannot be easily characterized (or to put it more strongly - cannot be characterized at all, period). In applications where rare but catastrophic failure is acceptable, deep learning will work fine. For applications in which guaranteed performance within a given set of conditions and computational complexity are more important, classical machine vision pipelines will be in use for many years to come.
PS: A word about AI
Note, none of the discussion above has anything to do with AI as in "artificial intelligence". My view on AI is that it will for sure not be solved by "programmed features" or "logical reasoning", but rather by some form of connectionist structure interacting with a physical environment and "learning" from it, as in the Predictive Vision Model I'm proposing. I don't think deep learning as practiced right now has anything to do with solving AI. I do think, though, that a combination of deep learning, feature engineering and logical reasoning can accomplish very interesting and useful technical capabilities in the broad space of automation.