Deep nets are the hot thing these days in machine learning research. So hot that institutes are being established to study the social consequences of AI overtaking humanity, and the White House has concerns about AI. Now every self-respecting sceptic should ask a question: is humanity really that close to cracking the secret of intelligence? Or maybe this is just hype, like in the '50s and '80s?
This is a long discussion, and I hope to post many articles on it in the future. Here, let's dissect a few popular myths:
- Convolutional deep nets solve perception. It is true that these systems have won ImageNet by a substantial margin and can often classify the content of an image accurately. It is also known that they get fooled by inputs that would certainly not fool a human, which indicates that something is missing. I think we have a somewhat shallow understanding of what perception really is. Vision is not just about categorising what we see. In fact, more often than not we ignore the class of what we see. Humans and animals are more interested in affordances, namely "can I perform an action on what I see?". Is it a chair? Do we decide that based on a million chairs we have seen, or based on the affordance? I can sit on it, it is of the right size and orientation, therefore it's a chair. I'd say there are a few more things to solve before we can confidently say that perception has been solved. I'm looking particularly at robotics, where you really need perception. Judging by the recent DARPA Robotics Challenge, we may be looking at many years of research...
- Convolution and feature maps are a biologically sound mechanism. Well, not really. From all we know about biology, there is no "convolutional operator" in the cortex and no known mechanism of weight sharing. In fact, receptive fields vary enormously with eccentricity from the fovea. Convolution, and the clumping of lookalike features by pooling operators, is at best inspired by certain biological observations, namely complex receptive fields in primary visual cortex (V1). Researchers don't really know whether this applies in higher cortical areas, nor do they know how such clustering is accomplished in V1. But almost certainly it is not rigid weight sharing and a max operator.
- Translational invariance is universal. To a certain degree it holds for rectilinearly projected photos of objects at a distance, where perspective change is not substantial (namely, the object retains the same appearance as it translates from one part of the image to another). But this is not universal across all signals, even in the visual domain. For example, a highly distorting fisheye lens will generate a different group of invariances (by this I mean that the same object will look different in the center and at the edge). Translational invariance may be applicable to spectrograms in the auditory domain (which is why deep nets work for speech recognition), but that is about it. Other kinds of data may not have translational invariance at all or, in less severe cases, translational invariance will only be a rough approximation to the real mapping.
- Convnets are a good model of the primate visual system. Well, not really. We don't know much concrete stuff from biology, but what we do know is that there is massive feedback in all cortical areas. In fact, there are more feedback connections than feedforward ones. We also know that the general behavioural context greatly affects the responses, even in the primary visual cortex. In other words, that feedback is not there just to look neat. It is constantly in use. One could argue that convnets have feedback in the form of back-propagated error, but this is not what I'm talking about. The feedback connectivity causes instantaneous effects. It affects not just the learning dynamics but the whole perceptual process. So, granted that we may have nothing better, convnets are not really a good model of the biological visual hierarchy. Let's be honest about it.
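To make the weight sharing and translation points above a bit more concrete, here is a minimal NumPy sketch on a toy 1D signal. Everything in it (the function names, the signal, the pool width) is made up for illustration and is not taken from any particular convnet library. It compares one shared kernel slid across the input with a layer in which every position has its own weights, and then checks how far max pooling tolerates a shift.

```python
import numpy as np

def shared_conv(signal, kernel):
    """Slide ONE kernel over the signal: the weight sharing used by convnets."""
    k = len(kernel)
    return np.array([signal[i:i + k] @ kernel for i in range(len(signal) - k + 1)])

def local_layer(signal, kernels):
    """Same windows, but each position has its own weights (no sharing)."""
    k = kernels.shape[1]
    return np.array([signal[i:i + k] @ kernels[i] for i in range(len(kernels))])

def max_pool(x, width):
    """Non-overlapping max pooling; any leftover tail is dropped."""
    n = len(x) // width
    return x[:n * width].reshape(n, width).max(axis=1)

# Toy input: a small pattern embedded in an otherwise empty signal.
pattern = np.array([1.0, 3.0, 1.0])
signal = np.zeros(16)
signal[4:7] = pattern
shift1 = np.roll(signal, 1)  # same pattern, moved by one sample
shift4 = np.roll(signal, 4)  # same pattern, moved by four samples

# Shared weights: shifting the input just shifts the feature map.
out = shared_conv(signal, pattern)
out1 = shared_conv(shift1, pattern)
conv_equivariant = np.allclose(out1[1:], out[:-1])

# Position-specific weights (random, purely illustrative): no such guarantee.
rng = np.random.default_rng(0)
kernels = rng.normal(size=(len(out), len(pattern)))
loc = local_layer(signal, kernels)
loc1 = local_layer(shift1, kernels)
local_equivariant = np.allclose(loc1[1:], loc[:-1])

# Max pooling: the strongest response stays in the same cell for a small
# shift, but moves to another cell once the shift is large enough.
cell = int(max_pool(out, 4).argmax())
cell1 = int(max_pool(out1, 4).argmax())
cell4 = int(max_pool(shared_conv(shift4, pattern), 4).argmax())

print("shared kernel shifts with the input:", conv_equivariant)
print("per-position weights shift with the input:", local_equivariant)
print("pooled peak cell (shift 0, 1, 4):", cell, cell1, cell4)
```

With this toy setup the shared kernel's feature map simply moves along with the input, while the layer with per-position weights gives unrelated responses. Pooling keeps the peak in the same cell for the one-sample shift but not for the four-sample one. So the invariance is something the architecture builds in by assumption, and only locally. Whether the data actually has that symmetry is a separate question.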
So, as much as the deep learning field has some impressive accomplishments, I don't think we are at the stage where all the pieces of the puzzle are on the table and just need some organising and scaling up. I think we (researchers in the field) are lacking many fundamental insights, and it will take some serious research to get those models into shape (possibly the whole deep convnet architecture is actually a dead end). Some of that research we are doing at BrainCorp, and I hope to be able to say more about it soon. This is a long subject, so let me stop here for now.