In this post I will explore the capabilities of contemporary deep learning models on the vitally important task of detecting a cat. Not an ordinary cat, though, but a sketch of an abstract cat. This task matters because success tells us something about whether a visual system has learned generalisation and abstraction -- at least on par with a 2-year-old. This post is inspired by my former co-worker Peter O'Connor, who tried similar experiments on LeNet several years ago. It is also a continuation of this blog's highly popular "Just how close are we to solving vision?", which to date has amassed nearly 15,000 hits. Let's begin by introducing my menagerie:
|Figure 1. The cat menagerie. From left to right (top to bottom): "abstract cat", rough sketch of a real cat, less rough sketch of a real cat, the "best cat" I could draw, "best cat" inverted.|
I made these sketches myself, based on a photo of a cat. NOTE: Whenever you test a deep net (or any other machine learning model), always use new data. Anything you find on the Internet is either already in the training set or soon will be.
Now it's time to have some fun. First I took VGG16 trained on ImageNet. ImageNet does not have an explicit category for a generic cat, but it has several kinds of cats (category number in parentheses): tabby (281), tiger cat (282), Persian cat (283), Siamese cat (284) and Egyptian cat (285). I'm not picky: I will count it a success if any of my cats gets classified as any of these classes (I'm not even sure what kind of cat I have drawn here, probably a tabby). As before, I used Keras with pre-trained models downloaded from the web. Let's see the results:
|Figure 2. Cat results from VGG16. Labels and confidence in top left corner of each image.|
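For concreteness, here is roughly what that classification setup looks like in Keras. The cat-class indices come from the list above; the random image array is just a stand-in for a loaded sketch, and `weights=None` builds the network with random weights so this sketch runs offline (use `weights="imagenet"` to reproduce the actual experiment):

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

# ImageNet's five cat categories (class indices 281-285):
# tabby, tiger cat, Persian cat, Siamese cat, Egyptian cat.
CAT_CLASSES = {281, 282, 283, 284, 285}

def is_cat(preds, top_k=5):
    """Success criterion: any of the top-k predicted classes is a cat."""
    top = np.argsort(preds[0])[::-1][:top_k]
    return any(int(c) in CAT_CLASSES for c in top)

# weights=None builds the architecture with random weights so this runs
# without downloading; the experiment used weights="imagenet".
model = VGG16(weights=None)

# Stand-in for a loaded 224x224 RGB sketch.
image = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")
preds = model.predict(preprocess_input(image), verbose=0)
print(preds.shape)  # (1, 1000)
```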
OK, my first sketch, the "abstract cat," is not recognised (if anything, it's considered a "fire screen"). This outcome is entirely expected (or unexpected, depending on your knowledge of how deep nets work). It seems that only humans (age 2 and above) have the ability to connect rough abstract sketches with a particular animal (my 2-year-old daughter had no doubts).
The subsequent sketches aren't classified all that well either. Ultimately, my "best cat" sketch is classified as "Egyptian cat", but with extremely low confidence (0.05), and the model is just as confident that it sees a zebra or a triceratops. Oh well, I would not call this "superhuman vision" capability. Let's move on.
VGG16, used above, is a previous-generation model, and there has been some improvement since, notably with the introduction of residual networks. This time I will use RESNET-50, again pre-trained on ImageNet and downloaded via Keras. Same procedure as above:
|Figure 3. Results with RESNET-50. Labels and confidence in top left corner of each image.|
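Swapping architectures in Keras amounts to changing the constructor and the matching `preprocess_input`; everything else stays the same. As before, `weights=None` keeps this sketch runnable offline, whereas the experiment used `weights="imagenet"`:

```python
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# Same 224x224 input and predict call as with VGG16; only the constructor
# and the matching preprocessing change. weights=None = random weights.
model = ResNet50(weights=None)

image = np.random.uniform(0, 255, size=(1, 224, 224, 3)).astype("float32")
preds = model.predict(preprocess_input(image), verbose=0)
print(preds.shape)  # (1, 1000)
```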
Alright, again the "abstract cat" is completely misclassified, as are the more detailed sketches. The "best cat" is quite confidently classified as a dalmatian (0.74) and only with 0.04 confidence as an "Egyptian cat". Interestingly, the inverted "best cat" is recognised with medium-low confidence (0.28) as a jaguar.
One might argue that I'm doing something wrong: my networks are somehow messed up, or I have a bug and my code transposes the picture somewhere, or whatnot. So as a sanity check I tried the same experiment with clarifai.com (through their demo site), which is IMHO the best deep-net-based image recognition engine out there. Results are below:
|Figure 4. Results from ClarifAI (direct print screen from their website).|
Much like VGG16 and RESNET-50, ClarifAI is unable to correctly classify the abstract cat or the rough sketches. It does classify my "best cat" as a cat (success!), though "cat" is only the seventh-ranked keyword returned. And all it takes is inverting the picture (the white cat drawing on a black background) for the ClarifAI engine to get confused again. BTW: since I uploaded these cat images to ClarifAI, they may become part of their training set, so future results from them might differ from those presented here.
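The inversion itself is trivial; here is a minimal sketch with PIL (the black square is just a stand-in for a black-on-white drawing):

```python
import numpy as np
from PIL import Image, ImageDraw, ImageOps

# Stand-in for a black-on-white sketch: white canvas, black square "drawing".
sketch = Image.new("L", (224, 224), color=255)
ImageDraw.Draw(sketch).rectangle([80, 80, 143, 143], fill=0)

# The "best cat inverted" variant is just this: white drawing on black.
inverted = ImageOps.invert(sketch)

a, b = np.asarray(sketch), np.asarray(inverted)
print(int(a[0, 0]), int(b[0, 0]))  # 255 0
```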
So overall, three different deep nets and none are able to really confidently deal with any of these sketches. Is this surprising?
Why is it so?
These results are actually not very surprising. Deep nets do not see the world the same way humans do (as I argue in many posts in this blog). To a deep net, a canonical (Egyptian) cat is this, as can be determined by probing the network:
|Figure 5. An adversarial example constructed from the category "Egyptian cat", classified with 98% confidence.|
Human vision generalises in completely different directions than deep learning models. We are able to recognise objects in rough sketches even without shading - somehow a contour alone is often sufficient for us. When shading is available, we recognise things with even more confidence.
In contrast, deep nets need textures, and frankly textures are the only thing they seem to care about. See the result in the following figure:
|Figure 6. A cat picture from the internet (left), and the same picture chopped into pieces and displaced. Both get correctly and confidently classified as a tabby (RESNET-50).|
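The chopping in Figure 6 can be reproduced with a few lines of numpy. The 4x4 grid and the fixed seed here are my assumptions; the original tiling may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def chop_and_displace(img, tiles=4):
    """Split an image into a tiles x tiles grid and shuffle the pieces."""
    h, w = img.shape[:2]
    th, tw = h // tiles, w // tiles
    pieces = [img[i * th:(i + 1) * th, j * tw:(j + 1) * tw]
              for i in range(tiles) for j in range(tiles)]
    order = rng.permutation(len(pieces))
    rows = [np.concatenate([pieces[order[i * tiles + j]] for j in range(tiles)], axis=1)
            for i in range(tiles)]
    return np.concatenate(rows, axis=0)

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
scrambled = chop_and_displace(img)
print(scrambled.shape)  # (224, 224, 3)
```

Every local texture patch survives the shuffle unchanged; only the global arrangement is destroyed, which is exactly the information the network apparently does not rely on.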
A disjoint collection of textures is more convincing to a deep net than even the best contour (as shown in Figure 6 above). A cat chopped into pieces and displaced is still a 91% cat to a deep net, while even my best sketch never exceeded 5% confidence. And this, in my opinion, is a substantial warning sign that our contemporary approach to vision (even though it celebrates many successes, particularly compared to earlier solutions) is likely fundamentally flawed. My proposed direction for vision is the predictive vision model (PVM). Although I cannot yet show the sort of generalisation required for the above experiment to succeed with PVM, I'm certain that PVM generalises in different directions than feedforward deep nets.
For clarity: the point is not that we could not build a deep net that would do better at sketches. Although I predict it would be rather hard given how few features a sketch offers, I suspect that with a focused training set one could do much, much better than the models I've tested. The point is that humans simply and effortlessly generalise to perceive sketches, whereas deep nets effortlessly generalise to amorphous blobs with the right textures. And that is telling.
In summary, the thesis of this post is that visual perception is NOT a solved problem and requires more work, likely a fundamental shift. PVM is one step in a new direction, and likely more steps need to be taken.