Intro
Since many of my posts over the last 2-3 years were mostly critical and arguably somewhat cynical [1], [2], [3], I decided to switch gears a little and let my audience know that I'm actually a very constructive person, busy building stuff most of the time, while the ranting on this blog is mostly a side project to vent, since above everything I'm allergic to naive hype and nonsense.
Nevertheless, I've worked in so-called AI/robotics/perception in industry for at least ten years now (and prior to that had a PhD and a rather satisfying academic career). I've had a slightly different path than many working in the same area and hence a slightly unusual point of view. For those who never bothered to read the bio: I was excited about connectionism way before it was cool, got slightly bored by it and got drawn into more bio-realistic/neuroscience-based models, worked on that for a few years and got disillusioned, then worked on robotics for a few years and got disillusioned by that too, went on to get a DARPA grant to build the predictive vision model (which summarized all I had learned about vision at the time), then moved more into machine vision, and now I'm working on cashier-less stores. I still spend a good amount of time thinking about what would be a good way to actually move AI beyond the silly benchmark-beating blood sport it has become. In this post I'd like to share some of that agenda, in what I call the "no bullshit" approach to AI.
The "A.I."
The first thing that needs to be emphasized is that all of what we currently call AI consists of computer programs. There are different ways in which such programs are constructed, but nevertheless all of it, including the deepest of neural nets, is just code running on a digital machine, a collection of arithmetic and branching operations. One might ask: why make that statement, isn't it obvious? Well, it is and it isn't. A lot of people think that when something is called a neural net, it is some fundamentally different kind of computation. It isn't. This also helps us not to expect miracles out of a neural net utilizing a certain number of FLOPS. It is hard to get an intuitive sense of how FLOPS translate into "intelligence", but at the very least we can expect that a network utilizing the same compute resources as, say, GTA V rendering a full-HD image may be impressive to some degree, but will certainly not have enough power to understand and reason about a complex visual scene (since rendering a scene is highly optimized and MUCH simpler than doing the reverse).
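To make the compute argument a bit more tangible, here is a back-of-envelope sketch. The FLOP formula for a convolutional layer is standard; the layer sizes below are purely illustrative assumptions, not measurements of any particular network or game engine.

```python
# Rough FLOP count for a single convolutional layer:
# ~2 FLOPs (multiply + add) per weight per output position.
def conv_layer_flops(h_out, w_out, c_in, c_out, k=3):
    return 2 * k * k * c_in * c_out * h_out * w_out

# A hypothetical early layer operating on a full-HD frame (illustrative sizes):
flops = conv_layer_flops(h_out=1080, w_out=1920, c_in=64, c_out=64, k=3)
print(f"{flops / 1e9:.1f} GFLOPs for a single 3x3, 64->64 layer at 1080p")
```

Even a single layer at full HD costs on the order of a hundred GFLOPs per frame, which gives a feel for how quickly a "scene understanding" network would eat through a real-time rendering budget.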
There are three basic ways of constructing computer programs:
- Manually type in the program using a programming language. This approach is preferable when the problem at hand can be expressed as an algorithm - a clear series of operations and data structures to perform the desired task. Most industrial automation falls here, and obviously most of the applications we use daily, ranging from text editors, through graphics software, to web browsers. The drawback of this approach is, obviously, that the code needs to be written by somebody competent, and that labor is typically expensive. Also, even when programmers are competent, there are no guarantees that the program will always work. Getting to the point where code can be guaranteed to work in a critical application often requires 10x the effort, including rigorous testing and formal verification, and even then it is sometimes impossible to prove that the program will always work.
- In the 80's a new concept matured in which a program would be generated from a high-level logical specification. This gave rise to expert systems and logic/declarative programming, in which the "programmer" does not necessarily state how to solve a problem, but rather expresses the problem in some very formal logical language (such as e.g. Prolog); a complex inference engine behind it then converts that into machine-level code, carries out the computation and returns an answer. This approach has a few potential benefits: the programmer in principle does not have to know how to solve the problem, just formally express it. Assuming the language implementation itself has been formally verified, guarantees can sometimes be made as to the correctness of the answer and the scope of operation. And finally, this is an example of meta-programming, in which the computer itself generates a potentially complex program based on the high-level description. The promise of expert systems was that at some point any lay person would be able to verbally express to the computer what they want, and the computer would be able to interpret that and execute it. That vision never materialized. First of all, the assumption that it is easier to express a problem in formal logic than in a high-level programming language is flawed. In practice, problems are often not even fully specified until a lot of code is already written and a certain number of corner cases explored. In typical programming practice, the specification is constructed along with the program itself (akin to composing music and playing it at the same time - perhaps some composers can write an opera using only their head and a pencil without ever touching a piano, but most mortals cannot). And finally, the number of problems that effectively lend themselves to solution using expert systems turned out to be very limited. Nevertheless, there are a few things that the expert systems era brought us which are still in active use, such as Lisp, Clojure, Prolog, and computer algebra systems such as Maple and Mathematica.
- Programs can also be generated based on data. This approach is generally called machine learning and is a mixture of extreme statistics and some clever optimization. It has been around for many years, at least since the 60's: Rosenblatt's perceptron, Vapnik's SVMs, and backpropagation, which is attributed to Rumelhart, Hinton and Williams but was detailed by Paul Werbos more than a decade earlier and used by others even earlier than that. The idea is often related to connectionism - a movement in which complex computation was to emerge from the collective properties of a large number of small, connected, simple units. A notion largely inspired by how we assume the brain does it, but certainly related in spirit to Conway's Game of Life, cellular automata and Hopfield networks (which later evolved into so-called Boltzmann machines). In the machine learning paradigm, some sort of malleable computational substrate (essentially a data structure of some kind) is transformed by a meta-algorithm, and those transformations are influenced by data. For example, a classifier composed of a dot product of a data vector with a weight vector, followed by some nonlinear operation, can be adjusted in order to classify input vectors according to known, assigned labels (a minimal sketch of exactly this is shown below). Or, e.g., a neural network can be exposed to millions of images and made to classify these images based on pre-labeled classes.
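Here is a minimal sketch of that "dot product plus nonlinearity" classifier, shaped by data via gradient descent. The synthetic data and all constants are illustrative assumptions made for the example, nothing more.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: two Gaussian blobs in 10 dimensions.
n, d = 200, 10
X = np.vstack([rng.normal(-1.0, 1.0, (n, d)), rng.normal(+1.0, 1.0, (n, d))])
y = np.concatenate([np.zeros(n), np.ones(n)])

w = np.zeros(d)  # the weight vector that the data will shape
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(100):
    p = sigmoid(X @ w + b)           # dot product followed by a nonlinearity
    grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"training accuracy: {acc:.2f}")
```

Nothing here is specific to "neural nets": it is ordinary code whose parameters were nudged by data rather than typed in by a programmer.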
This approach has many benefits, although for many years it was deemed impractical and non-scalable. The benefit, of course, is that technically we don't need a programmer at all. All we need is data, and that appears to be much easier and cheaper to obtain (even if it needs to be manually labeled). The second benefit is that, as it turned out, this approach is somewhat scalable (though there are severe limits, more on that later); the entire story of the last decade and deep learning is a result of that realization about scalability and the ability to train somewhat deeper models. However, there is a price to pay as well: when the data structure being shaped by the learning algorithm is large, it becomes nearly impossible to understand how the resulting network works when it works, and more importantly how it fails when it fails. So in many ways these "neural networks" are the polar opposites of expert systems:
| Neural networks | Expert systems | Hand-written programs |
| --- | --- | --- |
| Work with a lot of potentially dirty data | Work with crystalline formal specifications | Work with somewhat dirty specifications and small sample data |
| Are very opaque; there are no guarantees as to the mode of failure and range of operation | Can be opaque, but often do give very strict and strong limitations and conditions of operation | Can be understood and fixed; strong guarantees are often impossible, but reasonable "soft" guarantees about subparts of the computation plus extensive unit testing are often enough for most applications |
| Lend themselves to a medium level of parallelization, mostly in the SIMD model using GPUs | Can sometimes be parallelized, but the complexity of the input spec typically precludes large scaling | Are most of the time rather serial, since most of what can be described as an algorithm is sequential; humans are generally not very good at writing parallel programs beyond the most trivial cases |
| Seem to be good for low-level perception; not very good at generating symbolic relations and solving symbolic problems | Great at solving symbolic problems once they've been formulated in the input spec; terrible at anything else | Great at anything in between, which covers most well-defined problems solvable by a computer |
| Require a PhD, or at least a grad student in machine learning, since getting these things to learn anything beyond the available examples is quite tricky, plus a lot of labeling | Require a mathematician trained in formal logic and able to program in logic/functional languages, a species nearly extinct | Require a skilled programmer, often hard to find |
| Expensive if you want to do anything beyond the most basic examples | Expensive, since people skilled in this art are nearly impossible to find these days | Expensive, unless you settle for a script kiddie who will screw you over and write some unsustainable spaghetti code |
In practice what does all this mean?
In practice we have a certain set of puzzle pieces from which we can build more complex programs, but do we have all the pieces? Is anything missing?
In my opinion there is. Current-generation machine learning and expert systems stand in stark contrast to each other - Gary Marcus argues that they are complementary. This is a somewhat attractive idea (more on that later). But in my personal opinion there is a giant hole between the two: a connectionist-like architecture that could develop high-level symbols and operate on them much like expert systems do. This is similar to how we seem to operate, based on introspection of our own conscious experience - we are able to acquire low-level, dirty input data, extract and abstract the relevant symbols in that data, and manipulate these symbols in a more or less formal way to accomplish a task. We don't know how to do that right now. Gluing deep nets to expert systems seems like the first thing we could try, and in fact many arguably successful things, such as AlphaGo as well as many robotics control systems, including pretty much every so-called self-driving car out there, are to a large degree such hybrid systems, no matter how much connectionists want to argue otherwise.
Deep learning is used to do the perception, and any symbolic relationships between these percepts (such as what to do when a stop sign is detected, etc.) are handled via mostly symbolic, rule-based means. But these examples are in many ways imperfect and clumsy, and they try to merge two worlds which are just too far apart in abstraction. In my opinion there are possibly many levels of conceptual mixed connectionist-symbolic systems between these two extremes, capable of progressively more abstract understanding of input data and of developing progressively more abstract symbols. I anticipate the span of tasks such systems would perform will generally fit under the "common sense" label - things the current deep nets are too dumb to figure out on their own, and we are too "dumb" to quantify in any meaningful manner for an expert system to work with (see the failure of CYC). In other words, deep learning front ends are too basic to perform any non-trivial reasoning, while expert systems are too high-level and crystalline to provide for any common-sense reasoning. Something else is needed in between.
Problems with stitching deep learning and symbolic approaches
There are two major problems I see in merging deep learning with symbolic methods:
- Whatever we call symbols are just the tip of the iceberg. The things that form verbalizable symbols at our conscious level, and hence the symbols we can express easily with language, sit at a very high level of abstraction. There are very likely a lot more "lower-level symbols" which never make it to our conscious perception, yet are still a large part of how we process information. In the visual domain there could be "symbols" for representing the coherence of shadows in a visual scene or shadow ownership. For expressing the dynamic stability of observed scenes. For expressing the coherence of motion in observed scenes. For expressing border ownership (that a given contour belongs to a particular larger object). None of these is something we have words for, and therefore we don't perceive them as symbolic. And since we don't see these things, we don't label datasets with them, and hence these "symbols" never make it into AI, neither via the symbolic approach nor via the machine learning approach. Some of these "symbols" might look like the high-level features developed in deep nets, but the problem is that deep nets never treat those features as symbols and utilize only the most basic, statistical relationships between them.
- The symbolic operations don't feed back into the lower-level perception. E.g., when a low-level perception system detects two objects while the rule-based high-level symbol manipulator concludes that these two objects cannot be detected at the same time (e.g. two intersection signals facing the same way with conflicting states), that information is not used to improve the lower level, or even to disambiguate/modulate it to find out where the error originated. Our brains do this all the time; on a myriad of semi-symbolic levels we continuously try to disambiguate conflicting sources of signal. The problem is that even if we had some modulation between the two, they are still so far apart in abstraction that it could be enormously difficult to interpret such a high-level symbolic clash in the very basic feature space of a deep net (a toy illustration of this one-way flow follows below).
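A toy illustration of the point above, assuming a completely made-up detection format and confidence values: the symbolic rule easily spots the contradiction, but in today's hybrid pipelines the conclusion typically stops there instead of flowing back into the perception stage.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    state: str
    confidence: float

# Hypothetical output of a perception front end (values invented for the example).
detections = [
    Detection("traffic_light_left_lane", "green", 0.81),
    Detection("traffic_light_left_lane", "red",   0.78),  # same signal, conflicting state
]

def find_conflicts(dets):
    """Symbolic rule: one signal cannot show two states at once."""
    conflicts = []
    for i, a in enumerate(dets):
        for b in dets[i + 1:]:
            if a.label == b.label and a.state != b.state:
                conflicts.append((a, b))
    return conflicts

for a, b in find_conflicts(detections):
    # In a typical hybrid system the story ends here: the conflict is logged or the
    # lower-confidence detection is dropped, but nothing routes the contradiction
    # back to re-examine the low-level features that produced it.
    print(f"conflict: {a} vs {b}")
```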
These are the two main reasons why I think this hybrid approach is not going to get us across the "common sense" barrier, although, again, in practice a lot of domain-specific applications of AI will end up using such a hybrid approach, since this is really the only thing we know how to do right now. And as a practical person I have nothing against such mixtures, as long as they are applied in domains where they have a high chance of actually working (see more below).
Constructive way
So what to do? There are two paths for making something with AI: one practical/commercial, and the other, potentially a lot more difficult and long-term - scientific. It is important not to mix the two, or when you have to mix them, to understand that any advancements on the science side may not lend themselves to commercialization for many years to come, while what might actually be a commercial success might look, from the scientific point of view, rather dumb (such as some hybrid Frankenstein architecture).
The commercial way
If you want to build a business with AI you have to get down to earth, look at the technology and ask yourself honestly - what kind of applications can we actually build with this?
We know that machine learning models are inherently statistical in nature, and even though they can achieve what seems difficult or impossible to write by hand (or via a symbolic system), they occasionally fail in very unpredictable ways and are generally brittle as soon as the sample at hand gets slightly outside the domain in which the system was trained. The best example of this are so-called adversarial examples in computer vision - minuscule perturbations of the input image which can throw off even the best deep learning models. Putting all of this together, we can describe the tasks we can and cannot solve:
- Any machine learning model you use in your flow will inherently make some errors (a few percent), unless your input data is extremely well defined and never goes outside of a very narrow set of specs. But if your data is that regular, you might actually be better off with simpler machine learning techniques, such as some clever, data-specific features and an SVM, or even a custom, hand-written algorithm.
- Assuming any ML-based piece of your system makes a few percent error, you have to establish whether the errors in your application average out or compound (see the toy simulation after this list). If your errors average out, great, your idea has a chance. However, most applications in real-world control and robotics compound errors, and that is bad news.
- It also matters how severe the errors are in your application. If they are critical (the cost of making a mistake is large and mistakes are irreversible), that is bad news. If your errors are irreversible, you can't average them out, and your application operates in an open domain where there will never be a shortage of out-of-distribution samples, you are royally screwed. That is why I'm so skeptical of self-driving cars, since that is exactly the application described. And we see it clearly in the diminishing returns of all these AV projects, while all of them remain substantially below a practical level of performance (not to mention economic feasibility).
- If your errors are reversible and don't cost much, then there is a fair bit of hope. And that is why I'm currently working on building a cashier-less store: even if the system makes a mistake, it can be easily reverted when either a human reviewer detects it or the customer himself files a complaint. The system can make money even when it operates at just a 95% or 98% performance level. It does not need to follow the asymptote of diminishing returns all the way to 99.99999% reliability before it becomes economically feasible.
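A toy calculation of the compounding-versus-averaging distinction above, with made-up numbers: assume each step of a pipeline errs with probability 0.02. If errors are irreversible and compound, the chance of a clean run decays with every step; if they stay independent and reversible, the aggregate cost stays close to the per-step error rate.

```python
p_err, steps = 0.02, 100

# Compounding: every single step must succeed for the end result to be correct.
p_clean_run = (1 - p_err) ** steps
print(f"P(no error in {steps} compounding steps) = {p_clean_run:.3f}")  # ~0.13

# Averaging / reversible: errors accumulate linearly and each can be caught and fixed.
expected_reversible_errors = p_err * steps
print(f"expected reversible errors in {steps} steps = {expected_reversible_errors:.1f}")  # ~2
```

A 2% per-step error rate leaves only about a 13% chance of a flawless 100-step run, which is exactly why compounding-error domains like driving are so unforgiving, while a store checkout that can revert roughly two mistakes per hundred transactions remains perfectly workable.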
Notably, the stuff deep learning is mostly successfully used for these days is not mission critical. Take Google image search: when a few mistakes show up among the images you searched for, no one cares. When you use Google Translate and occasionally get some nonsense, no one cares. If an advertising system occasionally displays something you already bought or aren't interested in - no one cares. When your Facebook feed gets populated with something you are not interested in, no one cares. When Netflix occasionally suggests a movie you don't like, no one cares. These are all non-critical things. When, however, your car occasionally crashes into a parked firetruck, that is a problem.
Furthermore, as a practical recommendation: in my opinion it is good practice not to shy away from better sensors just because some black-magic AI can supposedly accomplish the same thing according to some paper [where it was typically evaluated only on some narrow dataset, never looking at out-of-distribution samples]. The best example is Tesla and their stubborn stance against LIDAR. Yes, I agree, LIDAR is certainly not the ultimate answer to the self-driving car problem, but it is an independent source of pretty reliable signal that is very hard to obtain otherwise. Elon Musk's silly stance against LIDAR has certainly cost some people their lives, and probably caused many, many accidents. As of today, a $700 iPad has a solid-state LIDAR built in (granted, probably not automotive grade), and automotive LIDARs are available on the order of a few thousand dollars. That seems like a lot, but it is actually less than most Tesla rims. And even if Tesla deeply believes LIDAR is useless, had they mounted them in their large fleet it would have provided an enormous amount of automatically labeled data that could be used to train and evaluate their neural net, and the sensors could have been phased out once the results of the camera-based approach reached a satisfactory level (which, BTW, I doubt they ever will, for a number of reasons described above). That is obviously on top of keeping their customers much safer in the interim.
Anyway, bottom line: there are lots of cheap sensors out there that can solve a lot of the problems which AI is advertised to solve using, e.g., only a camera. At least at the beginning, don't shy away from these sensors. Their cost will be negligible in the end, and they will either solve the problem at hand entirely or allow you to improve your machine learning solution by providing the ground truth.
The science way
The scientific approach is really what this blog was all about before it veered into cynical posts about the general AI stupidity out there. As much as I will probably continue to poke fun at some of these pompous AI clowns, I want to generally switch gears toward something more constructive. At this point, those who see the nonsense don't need to be convinced anymore, and those who don't, well, they can't really be helped.
Anyway, the scientific approach should aim to build a "machine learning" system that could develop high-level symbolic representations, spontaneously manipulate those symbols, and in doing so modulate the state of even the very primitive front end (hence it needs to have feedback and be massively recurrent). This can also be viewed as bringing top-down reasoning into what is currently almost exclusively bottom-up. For this to happen I anticipate several steps:
- Forget about benchmarks. Benchmarks are useful when we know what we want to measure. The current visual benchmarks are to a large degree saturated and don't provide a useful signal as to what to optimize for. ImageNet has effectively been solved, while more general computer vision still remains rather rudimentary. This is somewhat problematic for young researchers, whose only way to make a name for themselves is to publish in the literature, where benchmarks are treated like a holy grail. Hardly anybody in the field cares about ideas at this point, just about squeezing the last juice out of the remaining benchmarks, even if it's just the result of some silly meta-optimization run using obscene compute resources at some big tech company. Likely you'll have to play that game for a while before you have enough freedom to do what you really want to do. Just remember it's just a game in a purely academic *&^% measuring contest.
- Take on tasks which span a smaller range of levels of abstraction. Image classification into human-generated categories spans an enormous distance in abstraction level. Remember, the computer does not know anything about the category you give it to recognize other than what correlates with that label in the dataset. That is it. There is no magic. The system will not recognize a clay figure of a cat if it has only seen furry real cats. The system won't even recognize a sketch of a cat. It has absolutely no idea of what a cat is. It does not know any affordances, or the relations a cat may have to its environment. If it ain't in the dataset, it ain't going to emerge in the deep net, period. And even if it is in the dataset, if it does not represent a noticeable gain in the loss function, it might still not get there, since the limited number of weights will get assigned to stronger features that give better gains in the loss.
Now, among tasks spanning less abstract labels, I think these are interesting:
- Self-supervised image colorization - a fun task to train and useful for making your old pictures shine in new colors. Moreover, training data is readily available: just take a color picture, convert it to black and white, and teach your network to recover the colors. This has been done before, but it's a great exercise to start with. Also run a comparison between the ground truth and the colorized pictures; note that it is easy to colorize a picture to "look good" while it may have nothing to do with the ground truth (see the example below, colorized from the black-and-white version using a state-of-the-art colorizing system, with the ground truth on the right).
- Context-based filling in - same as above, data is available everywhere. And much like above, it's not some highly abstract human label we are trying to "teach" here, but rather a relatively low-level, "pre-symbolic" feature.
- Identify the source of illumination of a scene. Slightly more difficult in that we'd actually need to generate the data, though to start that data could be rendered using a computer graphics engine. I'm not sure anyone has tried that before, but it would surely be very interesting.
- Related to the above, detect inconsistencies (errors) in scene illumination. That is something humans do rather spontaneously. If all objects in a scene cast shadows to the right, consistent with a light source on the left, and one object suddenly casts a shadow in the opposite direction, that quickly triggers a reaction.
- Detect mirrors. We do it without thinking; we understand that there is a certain unique geometry to reflections and can easily tell a mirror in a scene apart from anything else. Try to get the same performance from a deep neural net and you will be surprised. This task would again need new data, but could also be bootstrapped using rendering software.
- Related to mirrors, one exercise that could be simpler from the data point of view is to create a classifier detecting whether the input image contains an axis about which it is reflected - see the example photos below: left, a picture with a reflection; right, a regular picture. Get some human data to compare against as well; most likely humans will have no trouble at all with this task. This task again cannot be "cheated" via some low-level feature correlation and requires the system to know something about what is going on geometrically in the image. I plan to try this myself (a sketch of how such data could be generated follows after this list), but I'm pretty certain all the current state-of-the-art visual deep nets will fail miserably on this one.
- Try any number of things similar to the above and finally try to train them all in the same network.
- I'm not sure any of the current architectures would work, since I don't think it would be easy for a convolutional network to establish the relationships a scene needs to adhere to in order to determine that there is a mirror reflection in it. It is very likely that a network would have to see a sequence - multiple frames in which a mirror is shown and the particular geometric relationships are exposed in multiple contexts. That leads to the problem of time.
- We don't learn visual scenes the way contemporary deep nets do; we see them unfold in time. And it is very likely deep nets simply cannot extract the same level of understanding of visual scenes as we do. So we arrive at video streams and temporally organized data. One avenue which seems promising is to try temporal prediction, or some combination of temporal and spatial prediction. But it is not at all clear what it means to be able to predict a video frame. Since frames are very high-dimensional entities, we hit the age-old problem of measuring distances in high-dimensional spaces... So my current view, several years after publishing our initial predictive vision model, is that we need to go back and rethink exactly what we are trying to optimize here.
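As a concrete example of how the reflection-axis task mentioned above could be bootstrapped from ordinary photos, here is a minimal data-generation sketch: positives are built by mirroring one half of an image about its vertical axis, negatives are left untouched. File names and the image size are placeholders, and this is just one of several plausible ways to construct the dataset.

```python
from PIL import Image, ImageOps

def make_reflected(path, size=256):
    """Positive example: the right half is a mirror image of the left half."""
    img = Image.open(path).convert("RGB").resize((size, size))
    left = img.crop((0, 0, size // 2, size))
    mirrored = ImageOps.mirror(left)       # flip left-right
    out = Image.new("RGB", (size, size))
    out.paste(left, (0, 0))
    out.paste(mirrored, (size // 2, 0))
    return out

def make_regular(path, size=256):
    """Negative example: the unmodified photo."""
    return Image.open(path).convert("RGB").resize((size, size))

if __name__ == "__main__":
    # "some_photo.jpg" is a placeholder for any image in your collection.
    make_reflected("some_photo.jpg").save("positive_example.png")
    make_regular("some_photo.jpg").save("negative_example.png")
```

The interesting question is whether a classifier trained on such pairs actually picks up the geometry of reflection, or merely some low-level statistical quirk of the synthetic positives.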
Either way these tasks, although in my opinion absolutely necessary to move the field forward, reveal in their very statement how really, really rudimentary our current capabilities are. Hence, good luck finding funding for these sorts of things. At some point people will realize that we have to go back to the drawing board and pretty much throw out almost all of deep learning as a viable path towards AI (by that I mean most of the recent hacks and tricks and architectures optimized specifically for the benchmarks; the backprop algorithm will stay with us for good, I'm sure), but who knows when that realization will come.
Summary
Deep learning is no magic and is just yet another way to construct computer programs (which we happen to call artificial neural nets). In many ways machine learning is the polar opposite of earlier attempts to build AI based on formal logic. That attempt failed in the 80's and caused an AI winter. Deep learning, even though it enjoys great success in non-critical applications, is, due to its inherent statistical nature, bound to fail in applications where close to 100% reliability is needed. The failure of deep learning to deliver on many of its promises will likely lead to a similar winter. Even though deep learning and formal symbolic systems seem like polar opposites, combining them will not likely lead to AI that can deal with common sense, though it might lead to interesting, domain-specific applications. In my opinion, common sense will emerge only when a connectionist-like system has a chance to develop internal symbols to represent the relationships in the physical world. For that to happen, a system needs to be exposed to temporal input representing the physics and has to somehow be able to encode and represent the basic physical features of input reality. For the time being no one really knows how to do that, and worse, most so-called AI researchers aren't even interested in acknowledging the problem. Unfortunately the field is a victim of its own success - most funding sources have been led to believe that AI is a lot more capable than it actually is, hence it will be very difficult to walk these expectations back and secure research funding without trust in the field first imploding, repeating the pattern that has cursed AI research since the very beginning of the field in 1956.