Deep learning has been at the forefront of the so-called AI revolution for quite a few years now, and many people believed that it was the silver bullet that would take us to the world of wonders of technological singularity (general AI). Many bets were made in 2014, 2015 and 2016, when new boundaries were still being pushed, such as AlphaGo. Companies such as Tesla were announcing through the mouths of their CEOs that a fully self-driving car was very close, to the point that Tesla even started selling that option to customers [to be enabled by a future software update].
It is now mid-2018 and things have changed. Not on the surface yet: the NIPS conference is still oversold, corporate PR still has AI all over its press releases, Elon Musk still keeps promising self-driving cars and Google's CEO keeps repeating Andrew Ng's slogan that AI is bigger than electricity. But this narrative is beginning to crack. And as I predicted in my older post, the place where the cracks are most visible is autonomous driving - an actual application of the technology in the real world.
The dust settled on deep learning
When ImageNet was effectively solved (note this does not mean that vision is solved), many prominent researchers in the field (including even the typically quiet Geoff Hinton) were actively giving press interviews and publicizing stuff on social media (e.g. Yann LeCun, Andrew Ng and Fei-Fei Li, to name a few). The general tone was that we were facing a gigantic revolution and from now on things could only accelerate. Well, years have passed and the Twitter feeds of those people have become less active, as exemplified by Andrew Ng below:
2013 - 0.413 tweets per day
2014 - 0.605 tweets per day
2015 - 0.320 tweets per day
2016 - 0.802 tweets per day
2017 - 0.668 tweets per day
2018 - 0.263 tweets per day (until 24 May)
Perhaps this is because Andrew's outrageous claims are now subjected to more scrutiny by the community, as indicated in a tweet below:
Visibly the sentiment has declined quite considerably: there are far fewer tweets praising deep learning as the ultimate algorithm, and the papers are becoming less "revolutionary" and much more "evolutionary". DeepMind hasn't shown anything breathtaking since AlphaGo Zero [and even that wasn't that exciting, given the obscene amount of compute necessary and applicability to games only - see Moravec's paradox]. OpenAI has been rather quiet, with its last media outburst being the Dota 2 playing agent [which I suppose was meant to create as much buzz as AlphaGo, but fizzled out rather quickly]. In fact, articles began showing up suggesting that even Google does not know what to do with DeepMind, as its results are apparently not as practical as originally expected... As for the prominent researchers, they've generally been touring around meeting with government officials in Canada or France to secure their future grants, and Yann LeCun even stepped down (rather symbolically) from Head of Research to Chief AI Scientist at Facebook. This gradual shift from rich, big corporations to government-sponsored institutes suggests to me that the interest in this kind of research within these corporations (I am thinking of Google and Facebook) is actually slowly winding down. Again, these are all early signs, nothing spoken out loud, just the body language.
Deep learning (does not) scale
One of the key slogans repeated about deep learning is that it scales almost effortlessly. We had AlexNet in 2012, which had ~60M parameters; we probably now have models with at least 1000x that number, right? Well, probably we do. The question, however, is: are these things 1000x as capable? Or even 100x as capable? A study by OpenAI comes in handy:
So in terms of applications for vision, we see that VGG and ResNets saturated at roughly one order of magnitude of compute resources applied (in terms of number of parameters it is actually less). Xception is a variation of Google's Inception architecture and only slightly outperforms Inception on ImageNet, and arguably only slightly outperforms everyone else, because essentially AlexNet solved ImageNet. So at 100 times more compute than AlexNet we pretty much saturated architectures in terms of vision, or image classification to be precise. Neural machine translation is a big effort by all the big web search players, and no wonder it takes all the compute it can get (and yet Google Translate still sucks, though it has arguably gotten better). The latest three points on that graph, interestingly, show reinforcement learning projects applied to games by DeepMind and OpenAI. In particular, AlphaGo Zero and the slightly more general AlphaZero take a ridiculous amount of compute, but are not applicable to real-world applications because much of that compute is needed to simulate and generate the data these data-hungry models need. OK, so we can now train AlexNet in minutes rather than days, but can we train a 1000x bigger AlexNet in days and get qualitatively better results? Apparently not...
So in fact this graph, which was meant to show how well deep learning scales, indicates the exact opposite. We can't just scale up AlexNet and get proportionally better results - we have to fiddle with specific architectures, and additional compute effectively does not buy much without an order of magnitude more data samples, which in practice are only available in simulated game environments.
Self driving crashes
By far the biggest blow to deep learning's fame is the domain of self-driving vehicles (something I anticipated for a long time, see e.g. this post from 2016). Initially it was thought that end-to-end deep learning could somehow solve this problem, a premise particularly heavily promoted by Nvidia. I don't think there is a single person on Earth who still believes that, though I could be wrong. Looking at last year's California DMV disengagement reports, the Nvidia car could not drive literally ten miles without a disengagement. In a separate post I discuss the general state of that development and the comparison to human driver safety, which (spoiler alert) is not looking good. Since 2016 there have also been several Tesla Autopilot incidents [1, 2, 3], a few of which were fatal [1, 2]. Arguably Tesla Autopilot should not be confused with self-driving (it is), but at least at its core it relies on the same kind of technology. As of today, aside from occasional spectacular errors, it still cannot stop at an intersection, recognize a traffic light, or even navigate through a roundabout. That is in May 2018, several months after the promised coast-to-coast Tesla autonomous drive (which did not happen, although the rumor is they tried but could not get it to work without ~30 disengagements). Several months ago (in February 2018), when asked about the coast-to-coast drive on a conference call, Elon Musk said:
"We could have done the coast-to-coast drive, but it would have required too much specialized code to effectively game it or make it somewhat brittle and that it would work for one particular route, but not the general solution. So I think we would be able to repeat it, but if it’s just not any other route, which is not really a true solution. (…)
I am pretty excited about how much progress we’re making on the neural net front. And it’s a little – it’s also one of those things that’s kind of exponential where the progress doesn’t seem – it doesn’t seem like much progress, it doesn’t seem like much progress, and suddenly wow. It will feel like well this is a lame driver, lame driver. Like okay, that’s a pretty good driver. Like ‘Holy Cow!’ this driver’s good. It’ll be like that,”
Well, looking at the graph above (from OpenAI), I don't seem to be seeing that exponential progress. Neither is it visible in miles before disengagement for pretty much any big player in this field. In essence, the above statement should be interpreted as: "We currently don't have the technology that could safely drive us coast to coast, though we could have faked it if we really wanted to (maybe...). We deeply hope that some sort of exponential jump in the capabilities of neural networks will soon happen and save us from disgrace and massive lawsuits".
But by far the biggest pin puncturing the AI bubble was the accident in which an Uber self-driving car killed a pedestrian in Arizona. From the preliminary report by the NTSB we can read some astonishing statements:
Aside from the general system design failure apparent in this report, it is striking that the system spent long seconds trying to decide what exactly it was seeing in front of it (whether that be a pedestrian, bike, vehicle or whatever else) rather than making the only logical decision in these circumstances, which was to make sure not to hit it. There are several reasons for this. First, people will often verbalize their decisions post factum. So a human will typically say: "I saw a cyclist, therefore I veered to the left to avoid him". A huge amount of psychophysical literature suggests a quite different explanation: a human saw something which was very quickly interpreted as an obstacle by the fast perceptual loops of his nervous system, and he performed a rapid action to avoid it, only long seconds later realizing what had happened and providing a verbal explanation. There are tons of decisions we make every day that are not verbalized, and driving includes many of them. Verbalization is costly and takes time, and reality often does not provide that time. These mechanisms have evolved over a billion years to keep us safe, and the driving context (although modern) makes use of many such reflexes. And since these reflexes have not evolved specifically for driving, they may induce mistakes. A knee-jerk reaction to a wasp buzzing in a car may have caused many crashes and deaths. But our general understanding of 3D space and speed, and our ability to predict the behavior of agents and of physical objects crossing our path, are primitive skills that were just as useful 100 million years ago as they are today, and they've been honed really sharp by evolution.
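To make the "avoid first, name it later" point concrete, here is a minimal sketch in Python. It is purely illustrative and does not describe how any real vehicle stack works: the Detection fields, the 2-second threshold and the numbers in the usage lines are all assumptions of mine. The braking decision depends only on time to collision, so a label that keeps flipping between frames cannot delay it.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One tracked object in front of the car (hypothetical structure)."""
    label: str                # classifier output; may flip from frame to frame
    distance_m: float         # distance to the object along our path, in meters
    closing_speed_mps: float  # positive when the gap is shrinking, in m/s


def time_to_collision(det: Detection) -> float:
    """Seconds until impact if nothing changes; infinite if the gap is not shrinking."""
    if det.closing_speed_mps <= 0:
        return float("inf")
    return det.distance_m / det.closing_speed_mps


def should_brake(det: Detection, ttc_threshold_s: float = 2.0) -> bool:
    """Reflex-style rule: the label is deliberately ignored."""
    return time_to_collision(det) < ttc_threshold_s


# Made-up frames: the label changes every frame, the decision does not.
frames = [
    Detection("unknown", 30.0, 17.0),
    Detection("vehicle", 28.0, 17.0),
    Detection("bike", 26.0, 17.0),
]
decisions = [should_brake(f) for f in frames]
```

A real system would obviously need far more than this (tracking, uncertainty, false positive handling); the only thing the sketch shows is that "is something in my path and getting closer?" can be answered without first settling on what that something is.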
But because most of these things are not easily verbalizable, they are hard to measure, and consequently we don't optimize our machine learning systems on these aspects at all [see my earlier post for benchmark proposals that would address some of these capabilities]. Now this would speak in favor of Nvidia's end-to-end approach - learn the image->action mapping, skipping any verbalization - and in some ways this is the right way to do it, but... the problem is that the input space is incredibly high dimensional, while the action space is very low dimensional. Hence the "amount" of "label" (readout) is extremely small compared to the amount of information coming in. In such a situation it is easy to learn spurious relations, as exemplified by adversarial examples in deep learning. A different paradigm is needed, and I postulate prediction of the entire perceptual input along with the action as a first step towards making a system able to extract the semantics of the world rather than spurious correlations [read more about my first proposed architecture, called the Predictive Vision Model].
In fact, if there is anything at all we have learned from the outburst of deep learning, it is that the (10k+ dimensional) image space has enough spurious patterns in it that they actually generalize across many images and give the impression that our classifiers actually understand what they are seeing. Nothing could be further from the truth, as admitted even by top researchers who are heavily invested in this field. In fact, many top researchers should not be too outraged by my observations: Yann LeCun has warned about overexcitement and an AI winter for a while, and even Geoffrey Hinton - the father of the current outburst of backpropagation - admitted in an interview that this is likely all a dead end and that we need to start over. At this point, though, the hype is so strong that nobody will listen, even to the founding fathers of the field.
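The claim that a high dimensional input combined with a tiny readout invites spurious relations can be illustrated with a toy experiment, sketched below in Python. This is my own illustration with arbitrarily chosen sizes, not an experiment from any of the posts mentioned here. Inputs and labels are independent coin flips, so there is nothing real to learn, yet simply picking the input bit that best matches a handful of training labels looks like a decent classifier on the training set and behaves like a coin toss on fresh data.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

N_FEATURES = 2000  # stand-in for a high dimensional input (assumed value)
N_TRAIN = 12       # very few labeled examples (assumed value)
N_TEST = 500       # fresh samples used to check the "learned" rule


def make_dataset(n_samples):
    """Inputs and labels are independent coin flips: there is nothing real to learn."""
    inputs = [[random.getrandbits(1) for _ in range(N_FEATURES)]
              for _ in range(n_samples)]
    labels = [random.getrandbits(1) for _ in range(n_samples)]
    return inputs, labels


def accuracy(inputs, labels, feature):
    """Fraction of samples where the chosen input bit equals the label."""
    hits = sum(1 for x, y in zip(inputs, labels) if x[feature] == y)
    return hits / len(labels)


train_x, train_y = make_dataset(N_TRAIN)
test_x, test_y = make_dataset(N_TEST)

# "Training": pick the single input bit that best matches the training labels.
best_feature = max(range(N_FEATURES),
                   key=lambda f: accuracy(train_x, train_y, f))

train_acc = accuracy(train_x, train_y, best_feature)
test_acc = accuracy(test_x, test_y, best_feature)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

With 2000 candidate bits and only 12 labels, some bit almost always agrees with nearly all of the labels by pure luck. The toy is simpler than the image case in one important respect: here the lucky bit fails immediately on new data, whereas spurious patterns in images can hold up across many images, which makes them much harder to notice.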
Gary Marcus and his quest against the hype
I should mention that more top-tier people are recognizing the hubris and have the courage to openly call it out. One of the most active in that space is Gary Marcus. Although I don't think I agree with everything that Gary proposes in terms of AI, we certainly agree that it is not yet as powerful as painted by the deep learning hype-propaganda. In fact, it is not even close. For those who missed it, he has excellent blog posts/papers, "Deep learning: A critical appraisal" and "In defense of skepticism about deep learning", where he very meticulously deconstructs the deep learning hype. I respect Gary a lot; he behaves like a real scientist should, while most so-called "deep learning stars" just behave like cheap celebrities.
Predicting the A.I. winter is like predicting a stock market crash - it is impossible to tell precisely when it will happen, but almost certain that it will at some point. Much like before a stock market crash, there are signs of the impending collapse, but the narrative is so strong that it is very easy to ignore them, even if they are in plain sight. In my opinion, such signs of a huge decline in deep learning (and probably in AI in general, as this term has been abused ad nauseam by corporate propaganda) are already visible in plain sight, yet hidden from the majority by the increasingly intense narrative. How "deep" will that winter be? I have no idea. What will come next? I have no idea. But I'm pretty positive it is coming, perhaps sooner rather than later.