Just how close are we to solving vision?

There is a lot of hype today about deep learning, a class of multilayer perceptrons with some 5-20 layers featuring convolutional and pooling layers. Many blogs [1,2,3] discuss the structure of these networks and plenty of code has been published, so I won't go into much detail here. Several tech companies have invested a lot of money in this research, and everyone has very high expectations of these models' performance. Indeed, they have been winning image classification competitions for several years now, and once in a while the media report superhuman performance on some visual classification tasks.

Just looking at the numbers from the ImageNet competition does not tell us much about how good these models really are; at most we can confirm that they are much better than whatever came before them (on that benchmark at least). With the media reporting superhuman abilities and high ImageNet numbers, and big CEOs pumping hype and showing sexy movies of a car tracking other cars on the road (a 2-minute video looped X times, which seems a bit suspicious), one can get the impression that vision is a solved problem.

In this blog post (and a few others coming in the next days and weeks) I'll try to convince you that this is not the case. We are barely scratching the surface, and perhaps the current ConvNet models are not even the right way to do vision.

Vision is NOT a solved problem

At least not as of 2016. To make this discussion a bit more concrete, I took a state-of-the-art deep net pretrained on ImageNet (ResNet-50 in this case, but I tried VGG-16 as well; I used a recent version of Keras, which makes such an experiment really simple), went out of my office, took some movies on the street and in the corridors, and ran them through the networks to see what they can figure out. I classified the entire image (classes appear at the top left) and then three crops from left to right to get a finer idea of what the network "sees". I cut off predictions with confidence below 0.2 in one set of videos, to get an idea of what the network is "thinking" even when it is not confident, and below 0.6 in another, to isolate the cases where the network is more confident. The results can be seen in the exhibits below.
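For readers who want to try something similar, the procedure above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not necessarily the script used for the exhibits: it assumes the three crops are equal-width vertical thirds of the frame, that each view is simply resized to 224x224, and that frames arrive as RGB arrays. The helper names `crop_boxes`, `filter_predictions` and `classify_views` are made up for illustration; the Keras calls (`ResNet50`, `preprocess_input`, `decode_predictions`) are from `tensorflow.keras.applications`.

```python
def crop_boxes(width, n=3):
    """Split the frame width into n side-by-side column ranges (x0, x1).

    Assumption: the "three crops from left to right" are equal thirds.
    """
    return [(i * width // n, (i + 1) * width // n) for i in range(n)]


def filter_predictions(decoded, threshold):
    """Keep (label, score) pairs whose score is at or above the threshold.

    `decoded` is a list of (class_id, label, score) tuples, the per-image
    format returned by Keras' decode_predictions.
    """
    return [(label, score) for _, label, score in decoded if score >= threshold]


def classify_views(model, frame, threshold=0.2, top=5):
    """Classify the full frame plus three crops; returns {view: [(label, score)]}.

    `frame` is assumed to be an RGB uint8 array of shape (height, width, 3).
    Heavy dependencies are imported here so the helpers above stay lightweight.
    """
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.applications.resnet50 import (
        decode_predictions,
        preprocess_input,
    )

    width = frame.shape[1]
    views = {"full": frame}
    for name, (x0, x1) in zip(("left", "centre", "right"), crop_boxes(width)):
        views[name] = frame[:, x0:x1]

    results = {}
    for name, view in views.items():
        # Naive resize to the network's input size (ignores aspect ratio).
        resized = tf.image.resize(view, (224, 224)).numpy()
        batch = preprocess_input(resized[np.newaxis, ...])
        decoded = decode_predictions(model.predict(batch), top=top)[0]
        results[name] = filter_predictions(decoded, threshold)
    return results
```

With `model = ResNet50(weights="imagenet")` (imported from `tensorflow.keras.applications.resnet50`), calling `classify_views(model, frame, threshold=0.2)` on each video frame gives the permissive labels, and `threshold=0.6` gives only the confident ones.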

Exhibit 1. ResNet-50, confidence level 0.2 and more. Street view.

Exhibit 2. ResNet-50, confidence level 0.6 and more. Street view.

Exhibit 3. ResNet-50, confidence level 0.2 and more. Office view 1.

Exhibit 4. ResNet-50, confidence level 0.6 and more. Office view 1.

Exhibit 5. ResNet-50, confidence level 0.2 and more. Office view 2.

Exhibit 6. ResNet-50, confidence level 0.6 and more. Office view 2.

So what does this tell us? Several immediate things:

  1. ImageNet has a somewhat ridiculous selection of labels. There are many breeds of dogs, strange animals and models of cars, but few descriptions of things of general practical relevance.
  2. Even the award-winning networks trained for weeks on the most powerful GPUs often make completely bogus judgements about what they see in an average, ordinary street view.
  3. Such a visual system has limited (if any) use for any autonomous device.
  4. The deep net is often able to capture the gist of the scene, but the exact descriptions are often horribly wrong.

Now an enthusiast would say that I just have to train it on the right set of relevant categories, e.g. train it to classify street, curb, street signs and so on. This is certainly a good point, and a network specialised for such items will undoubtedly be better (e.g. check the excellent demo from ClarifAI, which uses much broader categories that capture the gist of the picture and therefore appears to be a lot more accurate). My point here is different: notice that the mistakes those models make are completely ridiculous to humans. They are not off by some minor degree, they are just totally off. So much for superhuman vision.

So where do the reports of superhuman abilities come from? Well, since there are many breeds of dogs in ImageNet, an average human (like me) will not be able to distinguish half of them (say Staffordshire bullterrier from Irish terrier or English foxhound; yes, these are real categories in ImageNet, believe it or not). A network that was "trained to death" on this dataset will obviously be better at that. In all practical respects, an average human (even a child) is orders of magnitude better at understanding and describing scenes than the best deep nets (as of late 2015) trained on ImageNet.

In my next post, in a few days, I will go deeper into the problems of deep nets and analyse the so-called adversarial examples. These special stimuli reveal a lot about how convolutional nets work and what their limitations are.
