Those who regularly read my blog are aware that I'm a bit skeptical of the current AI "benchmarks" and whether they serve the field well. In particular, I think that the lack of a definition of intelligence is the major elephant in the room. For proof that this is apparently not a well recognized issue, take this recent Twitter thread:
Aside from the broader context of this thread, which discusses evolution and learning, Ilya Sutskever, one of the leading deep learning researchers, is expressing a nice-sounding empirical approach: we don't have to argue, we can just test. Well, as should be clear from my reply, I don't think this is really the case. I have no idea what Sutskever means by "obviously more intelligent" - do you? Does he mean a better ability to overfit existing datasets? To play yet another Atari game? I find this approach prevalent in the circles associated with deep learning, as if the field had some very well defined empirical measurement foundation. Quite the opposite is true: the field is driven by a dogma that a "dataset" (blessed as standard in the field by some committee) and some God-given measure (put Hinton, LeCun or some other dude in place of God) is the ultimate determination of "intelligence".
I already went over some of the dangers of measuring an inherently multidimensional object with isolated measures that produce one-dimensional scores, as well as the issues with defining intelligence, here and here, and I don't want to go over these "obvious" issues again. I don't want to be accused of empty criticism, so let me just suggest a few new tasks in the domain of vision, tasks that, if accomplished, would mark an impressive step towards scene understanding. I hope this will give better context to the discussion.
1 - find mirrors
I suppose the task is self-evident, but just to clarify: initially this can be a plain binary classification task. Given an image, answer yes (1.0) if there is at least one mirror in it. Later on one could add localization, as in the mockup above. Don't be fooled by the simplicity of the formulation - currently NO system can deal with this task!
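To make the binary formulation concrete, here is a minimal sketch of how such a benchmark could be scored. The function name `mirror_accuracy`, the 0.5 threshold and the plain accuracy metric are my assumptions for illustration, not part of any existing benchmark.

```python
from typing import Sequence


def mirror_accuracy(scores: Sequence[float], labels: Sequence[int],
                    threshold: float = 0.5) -> float:
    """Fraction of images where the thresholded score matches the label.

    scores: model outputs in [0, 1], where 1.0 means "at least one mirror".
    labels: ground truth, 1 if the image contains a mirror, else 0.
    The threshold of 0.5 is an arbitrary assumption.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one labeled image")
    correct = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return correct / len(labels)


if __name__ == "__main__":
    # Four hypothetical images: two answered correctly, two incorrectly.
    print(mirror_accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]))
```

If mirrors turn out to be rare in the collected images, plain accuracy would reward always answering "no", so a class-balanced metric would probably be the better choice.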
2 - attribute shadows
Again, it does not take a lot of explaining to convey the task. This task may in some cases not be extremely well defined, hence at least initially one could simplify it a bit by giving the system the locations of objects of interest. We can also make assumptions about the necessary crispness of shadows and so on. The mockup render above shows the intended structure of labels, but obviously the data does not have to look that synthetic. It is easy to improve the quality/realism by adding textures (see below) or actually using real photos with pronounced shadows.
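For the simplified variant, where object locations are given, the labels could be as simple as a mapping from each shadow to the object that casts it. The sketch below assumes exactly that; the string identifiers and the `attribution_accuracy` function are hypothetical and only illustrate one possible label structure.

```python
from typing import Dict


def attribution_accuracy(predicted: Dict[str, str],
                         truth: Dict[str, str]) -> float:
    """Fraction of shadows attributed to the correct object.

    Both dicts map a shadow id to the id of the object casting it.
    A shadow missing from `predicted` counts as an error.
    """
    if not truth:
        raise ValueError("need at least one labeled shadow")
    correct = sum(predicted.get(shadow) == obj for shadow, obj in truth.items())
    return correct / len(truth)


if __name__ == "__main__":
    # Hypothetical scene with three shadows; the system labels two of them
    # and gets one right.
    truth = {"shadow_1": "cube", "shadow_2": "sphere", "shadow_3": "cone"}
    predicted = {"shadow_1": "cube", "shadow_2": "cone"}
    print(attribution_accuracy(predicted, truth))
```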
3 - How many lamps?
Again, this task is not difficult to explain: given the structure of shadows, estimate the number of sources of illumination. For simplicity, we could initially restrict the number of sources to a small range, say from 1 up to 5. Examples of rendered mockups below:
Clearly, solving shadow attribution could simplify lamp counting, but to some degree it would be better to solve lamp counting before doing shadow attribution. Somehow the human visual system is capable of doing both at once!
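To illustrate how the two tasks feed each other, here is a toy geometric sketch. It assumes a single vertical pole at the origin, point lights above the pole top, perfectly crisp shadows and shadows already attributed to the pole. None of these assumptions hold in real photos, and the names `shadow_tip` and `count_lamps` are made up for this example.

```python
import math
from typing import Iterable, Tuple

Point3 = Tuple[float, float, float]
Point2 = Tuple[float, float]


def shadow_tip(light: Point3, pole_height: float) -> Point2:
    """Project the top of a vertical pole at the origin onto the ground z = 0."""
    lx, ly, lz = light
    if lz <= pole_height:
        raise ValueError("light must be above the pole top")
    # The ray from the light through (0, 0, pole_height) hits z = 0 here.
    scale = -pole_height / (lz - pole_height)
    return (scale * lx, scale * ly)


def count_lamps(tips: Iterable[Point2],
                angle_tol: float = math.radians(5)) -> int:
    """Count distinct shadow directions around the pole.

    Shadows closer than angle_tol in direction are merged, so two lamps
    that cast in the same direction are counted as one.
    """
    angles = sorted(math.atan2(y, x) for x, y in tips
                    if math.hypot(x, y) > 1e-9)  # skip zero-length shadows
    if not angles:
        return 0
    # Number of clusters on a circle = number of gaps wider than the tolerance.
    gaps = sum(((angles[i] - angles[i - 1]) % (2 * math.pi)) > angle_tol
               for i in range(len(angles)))
    return max(gaps, 1)


if __name__ == "__main__":
    lights = [(2.0, 0.0, 3.0), (0.0, 2.0, 3.0), (-2.0, -2.0, 4.0)]
    tips = [shadow_tip(light, pole_height=1.0) for light in lights]
    print(count_lamps(tips))
```

Even in this toy setting the count is ambiguous: two lamps lined up behind each other produce overlapping shadows and look like one source. That hints at why the real task, with soft shadows, many objects and no attribution given, is much harder.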
These three tasks are just the tip of the iceberg, and it is fairly easy to come up with new ones (though creating a large dataset for a benchmark will certainly not be so easy). One key feature of all these tasks is that they are relatively trivial for human vision, and yet nothing existing in the field of computer vision can come anywhere close to solving them. These are actually very important tasks: solving them allows an agent to further disambiguate and decipher the visual scene. To paraphrase the thread above, a system that could solve these problems (better yet, a single system that could solve all three) would be "obviously more intelligent" than anything we currently have.
What don't we need?
I'll take the liberty of listing a few things I don't think we need to burn our GPU cycles on:
- Fake celebrities
- More Atari-playing reinforcement learning
- Additional 0.001% on MNIST
- Additional 0.1% on ImageNet with its top-five methodology
- More style transfer demos
Some of these might be cool (and perhaps even have a niche application), but hey, seriously: is this what we are spending our efforts on when we can't solve any of the problems above?