Recently the world has been thrilled by the game of *go* played between the world top player and a computer program. The program eventually won 4 of the 5 rounds, marking the historical moment in which *go* had finally been solved. This is almost twenty years after another important game - *chess* had suffered similar defeat. Why did it take almost 20 years?

You will hear that *go* is apparently a lot more difficult than *chess* and therefore the search space is much larger and bla bla bla. Well did we know in advance it was so much harder? Probably not until we started trying to solve it. Do we frankly even now have any reasonable **intuition**^{1} as to why *go* is so much harder? I doubt.

OK, let's look at something simpler - graph problems. Some of them are easy, lets say minimum-spanning-tree or even all-pair-shortest-paths. Some of them are extremely hard e.g. traveling salesman problem. Even though these problems sound very similar. It's all about some minimal path in a graph, it would seem they should be similar, yet their solutions vary greatly in complexity.

** Figure 1. **Checkerboard pattern is solved by this net in several hundred training epochs. An example of an "easy" problem in 2d. Generated by the tensorflow playground.

**Figure 2.** Same network as above is struggling after 5000 epochs with the two spirals. Generated by the tensorflow playground.

Similarly with machine learning problems. We are aware that some geometries are harder to solve by neural networks than others, e.g. the famous two spirals problem^{2} compared to some simple gaussian clusters. Those of us who have been working with neural networks for years can sometimes tell if a given 2d separation problem would be easy or hard for a neural net. But our intuitions break down at higher dimensions and sadly most of the relevant problems in perception are very high dimensional.

So let's say we have a task of telling if there is a cat in the image. Is it hard or not? Certainly it's hard if we were to code it but let's say we train it. With contemporary deep nets we know we can do it, reasonably well. Maybe the high dimensional landscape of this task has some convenient symmetries that the network can exploit? Maybe the solution lies on relatively regular low dimensional manifold that can be approximated with good precision. We don't really know for sure. Now let's move on, say the problem is to tell if there is anything actionable in the scene. Or anything dangerous to a human. Or something strange. Or something that is physically impossible (e.g. like Escher's drawings). Do we have any intuitions as to whether these problems are easy or hard or in other words solvable by the current algorithms? Even though they may sound similar, I'm doubtful. We can try solving any of them, create a dataset and find out that it either works fine or not at all. The point is, we really have no idea how complicated region each of these problems carves out in the 10000+ dimensional space it is defined in. Some of these regions might be relatively regular, others could be horrible fractal-like convoluted messy hairballs (see example below of a quaternion Julia set) that would fry any hyperplane based neural network that attempts solving them (high dimensional equivalents of the two spirals). My personal take is that a great majority of perception problems are unfortunately very hairy.

What makes things worse is that we are willing to easily admit that some graph problems are simple and some are not, since we have a relatively well developed theory and no cognitive biases. On the other hand with problems in e.g. visual perception, we have no theory and some really huge cognitive biases on what we think should be simple, since our brain does it so well. We tend to forget that our brain has 80B neurons (while the largest contemporary conv-nets approximately 1M), that are connected to each other in every which way, with feedback, many years of training, working in real time, online. That is nowhere close to where machine learning is right now. There are myriads of problems that are effortless even for dumb humans and will continue to be unachievable for the most sophisticated ML models.

The conclusion from this is this: the fact we can do relatively well on ImageNet does not mean we've solved all the problems of visual perception. Quite the opposite, we barely scratched the surface. Different problems will require new approaches, something else than the Alexnet or your favourite choice of deep convnet. Things get even worse when you add time to the equation and begin asking questions about what is going on in the scene, but that is a subject for a different post.

Keep that in mind next time you promise somebody that you can solve their problem with machine learning.

- By intuition I mean a true, simple to explain intuition, not some elaborate calculation about the size of the decission tree. In fact the decission tree can be large but it might be easy to prune in some other games. ↩
- Kevin J. Lang and Michael J. Witbrock: Learning to Tell Two Spirals Apart. In: Proceedings of the 1988 Connectionist Models Summer School ↩