In the previous posts I investigated current state-of-the-art deep nets for casual vision applications - telling what is in an image taken in an average office or on an average boring street. I've also played a bit with adversarial examples to show how deep nets can be fooled. These failure modes tell us something important about the level of perception we are dealing with - a very basic level. In this post I will discuss why I think perception is such an elusive problem. Let's begin with vision.
Each of us is born with a blind spot in our visual field - the place where nerve fibres from the retina exit the eyeball. However, unless somebody tells us how to discover it, we are completely ignorant of its existence. In some sense it could be qualified as an example of anosognosia - a condition in which a person is not aware of a defect in their perception. A more extreme case of this is known as Anton-Babinski syndrome, typically occurring after brain damage, in which the patient claims to see even though he is technically blind! As unbelievable as this seems, patients will confabulate logical narrations to explain their apparent inability to solve visual tests and go into a strong denial of their condition. These conditions are very telling: our conscious access to our visual experience is very limited. What our "consciousness" is presented with is not the visual input that goes through our eyes, but an already very refined, internally generated model of surrounding reality, with many holes already patched, many ambiguities resolved and many hypotheses made up. In the extreme case of Anton-Babinski syndrome this model appears to be visual even though it clearly cannot be (due to evident cortical blindness; it is likely constructed from the information available in other modalities), and it is convincing enough to cause confabulation and denial. Similarly, in the neglect syndrome patients may live for many years without ever realising that they cannot perceive a substantial part of their visual field. To them, that inability is just as concealed as the blind spot is to us all. In fact some of you, dear readers, may realise for the first time in your life while reading this text that you suffer from one of these conditions (unlikely, as it typically takes a massive amount of unquestionable evidence before anybody is able to admit that their perception is broken).
So what we perceive as visual experience is already a cross-modal, disambiguated construct that we consciously explore and "understand", while the primary part of our vision is a completely subconscious process. We are then faced with the problem of conveying such a level of understanding of the world to a computer, and that is when we realise how hard it is and that something is missing. Getting a computer to visually perceive anything is a big problem, and maybe that is why even the rudimentary capabilities of current deep nets trigger such excitement.
The problem of introspection
Many studies in AI are in essence inspired by insights gained from introspection. We are, for example, able to perform verbal logical derivations, so for many years people thought that intelligence is about logic. We are able to seamlessly classify objects in our visual field, so people think that vision is about classifying objects. We are able to hear sounds that make up words, so people think that understanding language is all about classifying pieces of the audio spectrum. In my opinion all of these statements are equally flawed, since all of them originate at the conscious level of our minds, where things are already "classifiable", "logical", "predictable" and separated into functional entities. Realise this: a picture of a coke can on a table has no information in it that would allow you to figure out that those are two separate objects. The only reason you perceive them as separate objects is that you bring that in as a prior from previous interactions with those objects. A neural network that has never interacted with the world cannot derive that sense of "objectness" of things, particularly when it is trained like contemporary deep nets, by flashing millions of random pictures and associating them with human conscious-level functional/cultural categories.
In fact we may not realise how much the entire western culture is affected by the Greek ideas of entities and reductionism, and that there are other, more holistic ways of looking at the world, in which objects are no longer as clearly separated from the rest of reality as they are in this train of thought. But that is a subject for a whole other post or a book.
Vision and common sense
The problem with vision is similar to many other problems with AI - the stuff we need to figure out is quite basic. So basic that many people won't even acknowledge it is a problem to begin with, and hardly any funding agency or company will fund it. These problems lie exactly in our blind spots, which is why we cannot see them. So instead people try to simulate higher levels of cognition (exemplified by the race to solve one game after another, recently Go, see my other posts). Now these attempts obviously end up as giant hacks that kind of work (for the particular narrow application) but not really (in general). Some niche applications are found on the way (which might occasionally be big from the market perspective), but the general marvellous AI never materialises.
This led me to coin the following definition of contemporary AI studies:
"Artificial Intelligence is when intelligent people use artificial tricks to make some stuff kinda work".
Figure 1. One of the cycles in AI research. Or perhaps the current research agenda?
What we really need to do is to develop perception as a basic understanding of surrounding reality (sounds trivial, right?). Roboticists have been building explicit models of the robot body and its surroundings by inserting and solving physical equations. This is not scalable, hence the progress in robotics has been slow. What I think needs to be done is for such models to be learned from interaction with the environment. These models need to learn all the stuff we call "common sense" and never spend much time talking about (since it's so uninteresting). There are likely various aspects of "sense", ranging from illumination coherence in the scene up to perspective and reasonable sizing of objects (that is in the visual domain; other domains may have equivalents). To learn all these things we need a massive number of parameters and a recurrent system. To train such a system one needs a paradigm - a learning method that would allow for this in a scalable way. I argue that online prediction can be such a paradigm (see below).
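To make "online prediction" a little more concrete, here is a toy sketch of the general idea: a model predicts the next observation from the recent past and corrects itself after every step, using nothing but its own prediction error as the training signal. This is not the model from our paper and is nowhere near the scale argued for above; it is a tiny linear predictor with a least-mean-squares update, chosen only because it fits in a few lines. The function name `online_predict`, the parameters `history` and `lr`, and the sine-wave "environment" are all assumptions made for illustration.

```python
import math


def online_predict(signal, history=4, lr=0.05):
    """Predict each next sample from the previous `history` samples.

    The weights are updated after every single step (least-mean-squares
    rule), so learning happens online, with no separate training phase.
    """
    weights = [0.0] * history
    errors = []
    for t in range(history, len(signal)):
        context = signal[t - history:t]
        prediction = sum(w * x for w, x in zip(weights, context))
        # The "label" is simply the next observation, not a human category.
        error = signal[t] - prediction
        weights = [w + lr * error * x for w, x in zip(weights, context)]
        errors.append(abs(error))
    return weights, errors


# Hypothetical "environment": a plain sine wave standing in for sensory input.
signal = [math.sin(0.2 * t) for t in range(2000)]
weights, errors = online_predict(signal)

early = sum(errors[:50]) / 50   # mean error over the first 50 predictions
late = sum(errors[-50:]) / 50   # mean error over the last 50 predictions
print("early error:", early, "late error:", late)
```

The point of the sketch is only that the supervision comes for free from the stream itself: the error shrinks as the predictor picks up the regularity of its input, and nobody ever has to label anything.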
Humanity needs to develop the equivalent of a mouse brain, then a dog's brain, a monkey's brain etc. before we can announce true AI (in fact, by then it will likely be real intelligence, just in an "artificial substrate"). The problem is that every step of this effort will require enormous resources and none of them shows clear applications (until we get to, say, primate level). Imagine you can replicate the cognitive abilities of a mouse: what would you use that for? It can go after cheese and bugs, avoid cats and will generally survive for several months in average environments. Does it have any big applications? Hard to tell. What if you had a dog's brain? It can do "fetch" on the beach pretty well and guide blind people, but is this application big enough to justify the large development costs? It is much easier to sell people on the story "today it can play Chess and Go better than the world champion, tomorrow it will be able to predict the future of mankind and solve all our problems", which is obvious bullshit.
We have done some initial work which I hope will eventually gain momentum as a new approach to AI, explained in detail in our paper about the Predictive Vision Model (PVM) (code on GitHub). This is just the beginning and it will likely take years before we have something that is able to "see" in the broader sense of that word. But the direction is a lot clearer now than it was just a few years ago, and there is a clear economic force that drives progress in this direction (which is visible in the hype around deep learning). The only question is whether the hype will crash and cause another AI winter (well deserved in some sense), slowing down the development of PVM and other such endeavours as a side effect, or whether it will more gradually move into something that makes more sense than current feedforward approaches... Time will tell.