Autopsy of a deep learning paper


I read a lot of deep learning papers, typically a few per week, and by now I have probably read several thousand. My general problem with papers in machine learning or deep learning is that they often sit in some strange no man's land between science and engineering; I call it "academic engineering". Let me describe what I mean:

  1. A scientific paper, IMHO, should convey an idea that has the ability to explain something: for example, a paper that proves a mathematical theorem or a paper that presents a model of some physical phenomenon. Alternatively, a scientific paper could be experimental, where the result of an experiment tells us something fundamental about reality. Either way, the central point of a scientific paper is a relatively concisely expressible idea of some nontrivial universality (and predictive power), or some nontrivial observation about the nature of reality.
  2. An engineering paper shows a method of solving a particular problem. Problems vary and depend on the application; sometimes they are really uninteresting and specific, but nevertheless useful for somebody somewhere. What matters for an engineering paper differs from what matters for a scientific paper: the universality of the solution may not be of paramount importance. What matters is that the solution works, can be practically implemented (e.g. given available components), is cheaper or more energy efficient than other solutions, and so on. The central point of an engineering paper is an application, and the rest is just a collection of ideas that make it possible to solve the problem at hand.

Machine learning sits somewhere in between. There are examples of clearly scientific papers (e.g. the paper that introduced backprop itself) and examples of clearly engineering papers, where a solution to a very particular practical problem is described. But the majority appear to be engineering, only they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority, some ad hoc trick is pulled out of nowhere (typically of extremely limited universality), and after some statistically insignificant testing a victory is announced.

There is also a fourth kind of paper, which does contain an idea. The idea may even be useful, but it happens to be trivial. To cover up that embarrassing fact, the heavy artillery of "academic engineering" is loaded again, so that overall the paper looks impressive.

This happens to be the case for "An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution", a recent piece from Uber AI Labs, which I will dissect in detail below.


So let's jump into the content. Conveniently, we don't even need to read the paper (though I did); we can just watch a video the authors have uploaded to YouTube:

I generally appreciate the clip, though the presentation format is a bit strange given the context. Particularly the blooper seems a bit out of line, but whatever. Perhaps I just can't get their sense of humor.

OK, so what do we have here: the central thesis of the paper is the observation that convolutional neural networks do not perform very well in tasks that require localization or, more precisely, where the output label is a more or less direct function of the coordinates of some input entity and not of any other property of that input. So far so good. Conv-nets indeed are not well equipped for that, as the collapsing hierarchy of feature maps modeled after the Neocognitron (itself modeled after the ventral visual pathway) was almost designed to ignore where things are. Next the authors propose a solution: add the coordinates as additional input maps in the convolutional layer.
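To make the proposal concrete, here is a minimal NumPy sketch of what "adding the coordinates as additional input maps" amounts to. It is an illustration only, not the authors' implementation: the function name `add_coord_channels`, the channels-last layout and the scaling of coordinates to [-1, 1] are all assumptions of this sketch.

```python
import numpy as np

def add_coord_channels(images):
    """Append an x map and a y map to a batch of images.

    images: array of shape (batch, height, width, channels).
    Returns an array of shape (batch, height, width, channels + 2).
    """
    batch, height, width, _ = images.shape
    # Assumption of this sketch: coordinates are scaled to [-1, 1]
    ys = np.linspace(-1.0, 1.0, height)
    xs = np.linspace(-1.0, 1.0, width)
    # With indexing="ij" both maps have shape (height, width)
    y_map, x_map = np.meshgrid(ys, xs, indexing="ij")
    coords = np.stack([x_map, y_map], axis=-1)  # (height, width, 2)
    # The same two maps are repeated for every image in the batch
    coords = np.broadcast_to(coords, (batch, height, width, 2))
    return np.concatenate([images, coords], axis=-1)
```

Whatever convolutional layer comes next simply sees two extra input channels, which is why this reads as plain feature engineering.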

Now that sounds extremely smart, but what the authors are in fact proposing is something any practitioner in the field would take for granted: they are adding a feature that is more suitable for decoding the desired output. Anyone doing any practical work in computer vision sees nothing extraordinary in adding features, though it is the subject of a heated, purely academic debate in some deep learning circles, where researchers totally detached from any practical application argue that we should only use learned features because that way is better. So perhaps it is not bad that the deep learning crowd is now beginning to appreciate feature engineering...

Anyway, so they added a feature: explicit values of the coordinates. Now they test the performance on a toy dataset which they call "Not-so-Clevr", perhaps to outrun critics like me. So are their experiments clever? Let us see. One of the tasks at hand is to generate a one-hot image from coordinates, or to generate the coordinates from a one-hot image. They show that adding the coordinates to the convolutional net does improve the performance significantly. Perhaps this would be less shocking if, instead of jumping straight to TensorFlow, they had sat down and realized that one can explicitly construct a neural net that solves the one-hot to coordinate association without any training whatsoever. For that incredible task I will use three operations: convolution, nonlinear activation and summation. Luckily all of those are basic components of a convolutional neural net. Let's see:


import numpy as np
import scipy.signal as sp

# Fix some image dimensions
I_width = 100
I_height = 70
# Generate input image (all zeros for now)
A = np.zeros((I_height, I_width))
# Generate random test position (randint's upper bound is exclusive)
pos_x = np.random.randint(0, I_width)
pos_y = np.random.randint(0, I_height)
# Put a pixel in a random test position
A[pos_y, pos_x] = 1
# Create what will be the coordinate features
X = np.zeros_like(A)
Y = np.zeros_like(A)
# Fill the X-coordinate value
for x in range(I_width):
    X[:, x] = x
# Fill the Y-coordinate value
for y in range(I_height):
    Y[y, :] = y
# Define the convolutional operators
op1 = np.array([[0, 0, 0],
                [0, -1, 0],
                [0, 0, 0]])

opx = np.array([[0, 0, 0],
                [0, I_width, 0],
                [0, 0, 0]])

opy = np.array([[0, 0, 0],
                [0, I_height, 0],
                [0, 0, 0]])
# Convolve to get the first feature map DY = I_height * A - Y
CA0 = sp.convolve2d(A, opy, mode='same')
CY0 = sp.convolve2d(Y, op1, mode='same')
DY = CA0 + CY0
# Convolve to get the second feature map DX = I_width * A - X
CA1 = sp.convolve2d(A, opx, mode='same')
CX0 = sp.convolve2d(X, op1, mode='same')
DX = CA1 + CX0
# Apply half rectifying nonlinearity: only the hot pixel stays positive
DX = np.maximum(DX, 0)
DY = np.maximum(DY, 0)
# Subtract from a constant (extra layer with a bias unit)
result_x = I_width - DX.sum()
result_y = I_height - DY.sum()
# Check the result
assert pos_x == int(result_x)
assert pos_y == int(result_y)
print(result_x)
print(result_y)

Lo and behold: one-hot pixel bitmap to coordinate translation! [I leave the opposite translation, from coordinate to one-hot bitmap, as an exercise to the reader.] One convolutional layer, one nonlinear activation, one all-to-one summation and finally one subtraction from a constant. That is it. No learning, just fifty-something lines of Python code (with comments)... Given the coordinate feature, this task is trivial! It is no wonder at all that this works. So far we have not gone beyond what a fresh student in ML 101 should be able to figure out. So they have to roll in the heavy artillery: GANs.
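As an aside, for the reverse direction left as an exercise above, here is one possible hand-built construction. It is a sketch under assumptions rather than anything taken from the paper: it assumes integer pixel coordinates, and the helper names `relu` and `coords_to_one_hot` are made up for illustration. The idea is to compute the city-block distance of every pixel from the target using only +1/-1 weights on the coordinate maps and rectification, then keep the single pixel whose distance is zero.

```python
import numpy as np

def relu(z):
    # Half rectifying nonlinearity
    return np.maximum(z, 0)

def coords_to_one_hot(pos_x, pos_y, width, height):
    # Coordinate feature maps, playing the role of X and Y in the listing above
    Y, X = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    # Layer 1: pointwise weights of +1/-1 on the coordinate maps and on the
    # (tiled) target coordinates, then ReLU. The four rectified terms add up
    # to |X - pos_x| + |Y - pos_y|.
    dist = (relu(X - pos_x) + relu(pos_x - X)
            + relu(Y - pos_y) + relu(pos_y - Y))
    # Layer 2: bias of 1 minus the distance, then ReLU. For integer
    # coordinates the distance is 0 at the target and at least 1 elsewhere.
    return relu(1 - dist)

bitmap = coords_to_one_hot(37, 12, width=100, height=70)
assert bitmap.sum() == 1 and bitmap[12, 37] == 1
```

Again no training is involved: two pointwise layers with fixed weights and a bias are enough once the coordinate maps are available as inputs.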

Yeah, let's try these synthetic generation tasks on GANs, one with the coordinate feature and one without. OK, at this point let me just scroll down, maybe there will be something else funny...

Oh there is, in the table in the appendix:

They tried this coordinate feature on ImageNet, adding it to the first layer of a ResNet-50 network. I suppose the expectation here is that it will not make much difference, because the category readout in ImageNet is not a function of position (and if there is such a bias, data augmentation during training should remove it entirely). So they trained it using 100 GPUs (100 GPUs, dear lord!), and got no difference until the fourth decimal digit! 100 GPUs to get a difference in the fourth decimal digit! I think somebody at Google or Facebook should reproduce this result using 10000 GPUs; perhaps they will get a difference in the third decimal digit. Or maybe not, but whatever, those GPUs need to do something, right? For one thing, I think it would suffice to just say that, as expected, this add-on does not make any difference on image classification tasks, where the coordinate of the object is explicitly or implicitly decorrelated from the label. Secondly, I think it would suffice to perform this exercise on a toy dataset (such as MNIST) at a fraction of the compute.


This paper is intriguing indeed. It exposes the shallowness of current deep learning research, masked by a ludicrous amount of compute. Why is Uber AI doing this? What is the point? I mean, if these were a bunch of random students at some small university somewhere, then whatever [it still would not make it a strong paper, but we all learn and we all have crappy papers on our records]. They did something, they wanted to go to a conference, fine. But Uber AI? I thought these guys were supposed to build a self-driving car, no? They should not be driven by the "publish or perish" paradigm, should they? I'd imagine they could afford to go to any ML conference they'd like without having to squeeze out some mediocre result. Oh, and what is even funnier than the paper itself are the praises of the acolytes ignorant of the triviality of this result. Read the forum below the post or on Twitter to find some of those, even from some prominent DL researchers. They apparently have spent too much time staring at the progress bars on their GPUs to realize they are praising something obvious, obvious to the point that it can be constructed by hand in several lines of Python code.

