Autopsy of a deep learning paper

Introduction

I read a lot of deep learning papers, typically a few per week. Over the years I've probably read several thousand. My general problem with papers in machine learning or deep learning is that they often sit in a strange no man's land between science and engineering; I call it "academic engineering". Let me describe what I mean:

  1. A scientific paper, IMHO, should convey an idea that has the power to explain something. For example, a paper that proves a mathematical theorem, or one that presents a model of some physical phenomenon. Alternatively, a scientific paper can be experimental, where the result of an experiment tells us something fundamental about reality. Either way, the central point of a scientific paper is a relatively concisely expressible idea of some nontrivial universality (and predictive power), or some nontrivial observation about the nature of reality.
  2. An engineering paper shows a method of solving a particular problem. Problems vary and depend on the application; sometimes they are really uninteresting and specific, but nevertheless useful to somebody somewhere. For an engineering paper, the things that matter are different from those in a scientific paper: the universality of the solution may not be of paramount importance. What matters is that the solution works, that it can be practically implemented, e.g. given available components, that it is cheaper or more energy efficient than other solutions, and so on. The central point of an engineering paper is the application, and the rest is just the collection of ideas that make that application possible.

Machine learning sits somewhere in between. There are examples of clearly scientific papers (such as the one that introduced backpropagation itself) and there are examples of clearly engineering papers, where a solution to a very particular practical problem is described. But the majority appear to be engineering papers, except that they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority, some ad-hoc trick is pulled out of nowhere (typically of extremely limited universality), and after some statistically insignificant testing a victory is announced.

There is also a fourth kind of paper, one that does contain an idea. The idea may even be useful, but it happens to be trivial. In order to cover up that embarrassing fact, the heavy artillery of "academic engineering" is rolled out again, so that the paper looks impressive overall.

This happens to be the case with "An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution", a recent piece from Uber AI Labs, which I will dissect in detail below.

Autopsy

So let's jump into the content. Conveniently, we don't even need to read the paper (though I did); we can just watch the video the authors uploaded to YouTube.

I generally appreciate the clip, though the presentation format is a bit strange given the context. The blooper in particular seems a bit out of place, but whatever. Perhaps I just don't get their sense of humor.

OK, so what do we have here? The central thesis of the paper is the observation that convolutional neural networks do not perform very well on tasks that require localization, or, more precisely, tasks where the output label is a more or less direct function of the coordinates of some input entity and not of any other property of that input. So far so good. Convnets indeed are not well equipped for that: the collapsing hierarchy of feature maps, modeled after the Neocognitron (itself modeled after the ventral visual pathway), was practically designed to ignore where things are. Next the authors propose a solution: add the coordinates as additional input maps to the convolutional layer.

Now that sounds extremely smart, but what the authors are in fact proposing is something any practitioner in the field would take for granted: they are adding a feature that is better suited to decoding the desired output. Anyone doing practical work in computer vision sees nothing extraordinary in adding features, though it is the subject of a heated, purely academic debate in some deep learning circles, where researchers totally detached from any practical application argue that we should only use learned features, because that way is better. So perhaps it is not bad that the deep learning crowd now begins to appreciate feature engineering...

Anyway, so they added a feature: the explicit values of the coordinates. Next they test the performance on a toy dataset which they call "Not-so-Clevr", perhaps to preempt critics like me. So are their experiments clever? Let us see. One of the tasks at hand is to generate a one-hot image from coordinates, or to recover the coordinates from a one-hot image. They show that adding the coordinate maps to the convolutional net improves performance significantly. Perhaps this would be less shocking if, instead of jumping straight into TensorFlow, they had sat down and realized that one can explicitly construct a neural net that solves the one-hot-to-coordinate association without any training whatsoever. For that incredible task I will use three operations: convolution, nonlinear activation and summation. Luckily all of those are basic components of a convolutional neural net, so let's see:


import scipy.signal as sp
import numpy as np
# Fix some image dimensions
I_width = 100
I_height = 70
# Generate input image
A=np.zeros((I_height,I_width))
# Generate random test position
pos_x = np.random.randint(0, I_width-1)
pos_y = np.random.randint(0, I_height-1)
# Put a pixel in a random test position
A[pos_y, pos_x]=1
# Create what will be the coordinate features
X=np.zeros_like(A)
Y=np.zeros_like(A)
# Fill the X-coordinate value
for x in range(I_width):
   X[:,x] = x
# Fill the Y-coordinate value
for y in range(I_height):
   Y[y,:] = y
# Define the convolutional operators
op1 = np.array([[0, 0, 0],
                [0, -1, 0],
                [0, 0, 0]])

opx = np.array([[0, 0, 0],
                [0, I_width, 0],
                [0, 0, 0]])

opy = np.array([[0, 0, 0],
                [0, I_height, 0],
                [0, 0, 0]])
# Convolve to get the first feature map DY
CA0 = sp.convolve2d(A, opy, mode='same')
CY0 = sp.convolve2d(Y, op1, mode='same')
DY=CA0+CY0
# Convolve to get the second feature map DX
CA1 = sp.convolve2d(A, opx, mode='same')
CX0 = sp.convolve2d(X, op1, mode='same')
DX=CA1+CX0
# Apply half rectifying nonlinearity
DX[np.where(DX<0)]=0
DY[np.where(DY<0)]=0
# Subtract from a constant (extra layer with a bias unit)
result_y=I_height-DY.sum()
result_x=I_width-DX.sum()
# Check the result
assert(pos_x == int(result_x))
assert(pos_y == int(result_y))
print(result_x)
print(result_y)

Lo and behold: one-hot pixel bitmap to coordinate translation! [I leave the opposite translation, from coordinates to a one-hot bitmap, as an exercise for the reader.] One convolutional layer, one nonlinear activation, one all-to-one summation and finally one subtraction from a constant. That is it. No learning, just fifty-something lines of Python code (with comments)... This task, given the coordinate feature, is trivial! There is absolutely no wonder that this works. So far we have not gone beyond what a fresh student in ML 101 should be able to figure out.
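As for that exercise, if you would rather not do it yourself, here is a sketch of one way the reverse direction could go, again using nothing beyond linear, 1x1-convolution-style channel mixing and half rectification (an illustration of the idea, not the only possible construction):

import numpy as np

# Image dimensions as before
I_width = 100
I_height = 70
# A random target coordinate
pos_x = np.random.randint(0, I_width)
pos_y = np.random.randint(0, I_height)
# Coordinate feature maps, same as in the forward example
X = np.tile(np.arange(I_width, dtype=float), (I_height, 1))
Y = np.tile(np.arange(I_height, dtype=float)[:, None], (1, I_width))
# The input coordinates, broadcast as two constant maps (extra input channels)
PX = np.full_like(X, pos_x)
PY = np.full_like(Y, pos_y)

def relu(z):
    return np.maximum(z, 0)

# Layer 1: |X - PX| and |Y - PY| assembled from half-rectified differences
# (each difference is a 1x1-convolution-style linear combination of channels)
abs_dx = relu(X - PX) + relu(PX - X)
abs_dy = relu(Y - PY) + relu(PY - Y)
# Layer 2: equals 1 at the target pixel, is <= 0 everywhere else, then rectify
B = relu(1.0 - abs_dx - abs_dy)
# Check: a one-hot bitmap with the 1 at (pos_y, pos_x)
assert B[pos_y, pos_x] == 1.0 and B.sum() == 1.0
print(int(B[pos_y, pos_x]), int(B.sum()))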

So they have to roll in the heavy artillery: GANs. Yeah, let's try these synthetic generation tasks on GANs, one with the coordinate feature and one without. OK, at this point let me just scroll down; maybe there will be something else funny...

Oh there is, in the table in the appendix:

They tried this coordinate feature on ImageNet, adding it to the first layer of a ResNet-50 network. I suppose the expectation here is that it will not make much difference, because the category readout in ImageNet is not a function of position (and if there is such a bias, data augmentation during training should remove it entirely). So they trained it using 100 GPUs (100 GPUs, dear lord!), and got no difference until the fourth decimal digit! 100 GPUs to get a difference at the fourth decimal digit! I think somebody at Google or Facebook should reproduce this result using 10,000 GPUs; perhaps they will get a difference at the third decimal digit. Or maybe not, but whatever, those GPUs need to do something, right? For one thing, I think it would suffice to just say that, as expected, this add-on makes no difference on image classification tasks, where the coordinates of the object are explicitly or implicitly de-correlated from the label. Secondly, I think it would suffice to perform this exercise on a toy dataset (such as MNIST) at a fraction of the compute.
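To give a sense of the compute gap, here is roughly what such a toy-scale sanity check could look like: a minimal sketch assuming tf.keras and MNIST (the tiny architecture, the [0, 1] scaling of the coordinate maps and the single epoch are my own arbitrary choices, not anything taken from the paper). Train it once with the two constant coordinate channels and once without, and compare.

import numpy as np
import tensorflow as tf

# Load MNIST and add a channel axis
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0
x_test = x_test[..., None].astype("float32") / 255.0

def add_coord_channels(images):
    # Append two constant coordinate maps (x and y, scaled to [0, 1]) as extra channels
    n, h, w, _ = images.shape
    ys, xs = np.meshgrid(np.arange(h) / (h - 1), np.arange(w) / (w - 1), indexing="ij")
    coords = np.stack([xs, ys], axis=-1).astype("float32")        # (H, W, 2)
    coords = np.broadcast_to(coords, (n, h, w, 2))
    return np.concatenate([images, coords], axis=-1)              # (N, H, W, 3)

x_train_c, x_test_c = add_coord_channels(x_train), add_coord_channels(x_test)

# A deliberately small network; rebuild it on x_train instead of x_train_c for the baseline
model = tf.keras.Sequential([
    tf.keras.Input(shape=x_train_c.shape[1:]),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train_c, y_train, epochs=1, batch_size=128, validation_data=(x_test_c, y_test))

On a classification task like this the extra channels should not matter much either way, which is precisely the kind of thing that does not need 100 GPUs to check.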

Summary

This paper is intriguing indeed. It exposes the shallowness of much current deep learning research, masked by a ludicrous amount of compute. Why is Uber AI doing this? What is the point? I mean, if these were a bunch of random students at some small university somewhere, then whatever [it still would not make this a strong paper, but we all learn and we all have crappy papers on our records]. They did something, they wanted to go to a conference, fine. But Uber AI? I thought these guys were supposed to be building a self-driving car, no? They should not be driven by the "publish or perish" paradigm. I'd imagine they could afford to go to any ML conference they'd like without having to squeeze out some mediocre result. Oh, and what is even funnier than the paper itself is the praise from acolytes ignorant of the triviality of this result. Read the forum below the post https://eng.uber.com/coordconv/ or on Twitter to find some of them, even from some prominent DL researchers. They have apparently spent too much time staring at the progress bars on their GPUs to realize they are praising something obvious, obvious to the point that it can be constructed by hand in several lines of Python code.


Comments



    1. In science it is crucial to try many approaches that fail, not only things we are sure will work. So yes, it's good that they burnt 100 GPUs on a problem where the idea didn't help. And in fact that is a much better standard than most deep learning papers I read, which focus mostly or only on the problems where their architecture is better.

      Plus, it works for object detection, so it's not a "MNIST-only trick".

      1. I don't think reporting failure is the behavior in question in this post. I agree that there should be more of it in general. What is in question is whether this particular idea requires that sort of testing. There is an infinite number of ideas and we obviously cannot afford to test all of them. We usually use our judgement to make selections and prune out the paths that are not worth investigating. I can think of many more productive ways to put those 100 GPUs to work. That is my only point here, really.

  1. @David Morris - Isn't the idea that convolutional neural networks are effectively designed to ignore location information, so when you add in a location feature it should be unsurprising that performance is unchanged?

  2. I think the bit about "perhaps it is not bad that the deep learning crowd now begins to appreciate feature engineering" hits the nail on the head, but also exposes the fallacy of this post.

    The whole point of deep learning, as it is used on a day-to-day basis, is that it is *supposed* to mostly take care of feature engineering for you. You take a bunch of standard components (convolutional, softmax, drop-out, etc), plug them together like lego blocks, tweak the hyperparameters, train the network, and (hopefully) out pops a system that can detect hotdogs.

    There isn't a step in that process for "consider whether we need to add a dozen extra channels of information because our standard components suck". And, for routine use cases, there shouldn't be the *need* for such a step. Again, that's the point of deep learning.

    So, on identifying a situation in which the built-in behaviour of a standard component is inadequate, the Uber team have instead suggested a *new* standard component that fixes the problem. They then demonstrated that it does actually fix the problem for a toy example, and (tried to) demonstrate that it can have benefits even for more complicated cases. Once Keras' API catches up, you'll probably be able to replace all your convolutional layers with CoordConv layers using a single search/replace - no extra thought required.

    Basically, they're replacing a green lego block with a yellow lego block, whereas you seem to be annoyed that they're not switching to meccano.

    Agreed that the ImageNet results are weird though...

    1. I don't see any fallacy. The only fallacy was to assume that this coordinate encoding feature was anything new or nontrivial (positional encoding was used eons ago). The main fallacy of the field is the religious assumption that features should be learned so that you don't have to worry about them. I find this a null argument. If you know you can construct a feature that allows for a better readout, do it. If your application demands it, do it. If you cannot engineer one, learn one, sure. There is no question here. Deep learning vs feature engineering is a false dichotomy; what matters is the performance in an application. Not every feature can be learned (as the paper discussed here shows, in the framework of convnets the coordinate feature is hard or impossible to learn), therefore there is clearly a need to equip DL models with extra channels of information. And in fact the coordinate feature can be very useful; that is not my point here. My point is that it is trivial, and should perhaps be a footnote in a paper that focuses on some application that needs it (as it BTW has been before), rather than become a full-blown paper and a pompous blog post.

      1. I agree with most of what you're saying. I guess my counterpoint is that they're not just saying "this is a new approach that worked well on this one project" (in which case I'd agree that this is footnote-worthy). Rather, they're saying "this should be the new standard approach". Trivial changes to a default behaviour aren't trivial.

        When someone decides to use LSTM instead of a vanilla RNN in their machine translation project, that's a footnote. (A fairly large footnote, admittedly.) When the entire deep learning community collectively decides that RNN means LSTM unless you specifically state otherwise (which is pretty much where we're at right now), that's a much more involved debate.

        Actually, per your comments on their GAN measurements, the discussion probably needs a lot *more* bikeshedding than has been done here. But, to get to a point where people talking about CNNs just automatically assume that you're referring to the CoordConv variant (assuming that does indeed turn out to be a sensible default!), there needs to be a first paper to kick things off. That's what they've done here.

        ...Or I guess Mark Cheshier could be right that they're being judged on paper count. One or the other.

        PS The reference to Keras implementation was just intended as an indicator of mainstreaming. I'm aware that this is trivial to implement if you're willing to do anything other than bang lego blocks together, although I did appreciate the specifics you gave.

    2. Oh and BTW: "Once Keras' API catches up, you'll probably be able to replace all your convolutional layers with CoordConv layers using a single search/replace - no extra thought required."

      First of all, dude, you don't need to wait for Keras to catch up. It is a matter of adding two extra CONSTANT pre-computed input channels to any input tensor. This is totally possible in Keras today.
      Secondly, you only need it at the input layer, since further on these coordinates either get picked up and transformed with all the other features or ignored. So technically it is just a matter of replacing your WxHx3 RGB image at the input with a WxHx5 tensor, with [:,:,3] and [:,:,4] being the coordinates, constant over all images. So again, it is trivial to do in any existing framework.
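      If you want the coordinate channels built inside the model rather than pre-computed, here is a rough sketch of the kind of thing I mean, assuming tf.keras (the Lambda layer and the tiny head after it are purely illustrative, not the paper's architecture):

      import numpy as np
      import tensorflow as tf

      H, W = 224, 224
      # Two constant coordinate maps, scaled to [0, 1]
      ys, xs = np.meshgrid(np.arange(H) / (H - 1), np.arange(W) / (W - 1), indexing="ij")
      coords = np.stack([xs, ys], axis=-1).astype("float32")   # (H, W, 2)

      def append_coords(x):
          # Tile the constant maps across the batch and concatenate: (H, W, 3) -> (H, W, 5)
          batch = tf.shape(x)[0]
          c = tf.tile(tf.constant(coords)[None, ...], tf.stack([batch, 1, 1, 1]))
          return tf.concat([x, c], axis=-1)

      inp = tf.keras.Input(shape=(H, W, 3))
      x = tf.keras.layers.Lambda(append_coords)(inp)
      x = tf.keras.layers.Conv2D(64, 7, strides=2, padding="same", activation="relu")(x)
      x = tf.keras.layers.GlobalAveragePooling2D()(x)
      out = tf.keras.layers.Dense(1000, activation="softmax")(x)
      model = tf.keras.Model(inp, out)
      model.summary()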

  3. My guess is that the personnel involved had promotion/performance concerns. Some sort of 'smart goal' that they should publish x papers in y time. Perhaps we should train models to replace HR. 🙂

  4. I guess there are two possibilities. 1) The authors were aware of the triviality of the problem and decided to ignore it and push for a paper, thus falling into the perils you described. 2) The authors were unaware of the triviality, and honestly tried hard to explore every angle of their idea they could think of.

    The main problem might be something larger. Deep learning is becoming such a specialized field that people working on it can't think about problems in terms other than deep learning. An "if you only have a hammer, every problem is a nail" kind of thing.

    Furthermore, this is why peer review is important, and why we need to trust journals and (some) conferences way more than arXiv posts. Your comment here would have been a proper review.

    1. I don't think peer review works particularly well. Let's face it: papers are often handed to PhD students for review. This stuff is done for free and you get what you pay for. Additionally, reviews are typically done within the same "community", and that facilitates forming cliques where such trivia flourishes because everyone is blind to it. Peer review is perhaps some indication to a layman that the stuff is legit, but to a peer scientist it should not mean anything. Like stock market recommendations: perhaps of some use to retail investors but not to serious traders who do their own research.

      What I do believe in is strong personal bullshit detectors. People should not take papers at face value; they should do their own research (as I did in this case and in many other examples). In fact I could write at least several similar posts about other papers but never had the time to do it. This one just struck me as a particularly stark example.

      And BTW, I have nothing personal against the authors. As far as I'm concerned their next paper could be the best piece of research in the world. I don't think the authors consciously decided to publish a trivial result; my suspicion is that they just sat for too long in an echo chamber which reinforced this idea rather than providing fair criticism. Echo chambers and cliques are the biggest threats to science.

      1. I agree that echo chambers and cliques are very dangerous. But I don't think there is a simple way to avoid them. By definition, peer review has to be done by experts, or at least people who understand enough about the topic and methods. And chances are that these people are already part of the community, so they are victims of the same echo chambers and cliques.

        Ideally, it would be really helpful to have public reviews, where people like you could post comments on papers. That way, even if the peer review missed something, it can still be exposed. However, I don't know how many able people would be interested in such efforts.

        Still, I really appreciate your take on the paper and I hope to read similar thoughts on other papers.

  5. I think your critique is well taken. However, I'm honestly so confused by the complete simplicity of this problem that I can't even appreciate why your solution requires a neural net approach. Can you explain what solving this problem with a neural net adds over a simple method that would just do a find(x,y==1)?

    1. Yes, obviously in this toy setting the problem is absolutely trivial and does not need any sort of neural net to be solved. However, in more sophisticated problems that are not as easily solvable, adding such a feature to a neural net could be beneficial. E.g. in some autonomy-related context a particular stimulus might have one interpretation when it is in the center of the visual field and a different one when seen in the periphery. Perhaps, for example, you want to pay attention to traffic signals in front of you, but ignore those seen in the periphery. In such cases localizing features could potentially be useful.

      This paper would have been much better if it had explored one such application in depth, instead of fiddling with toy problems or with problems where localization is explicitly de-correlated from the label, such as ImageNet.

  6. I think this type of work is exploratory; it does not necessarily introduce a new concept, but it sparks a debate in the community about a subject that is still worth discussing. If the community regards it as too trivial, then there probably would not be that much discussion. If the community does not consider it trivial, then it probably needs to think more about the subject. At the end of the day, it's all about communication. Of course, your article was insightful along these lines.

    1. Fundamentally I agree with your comment. The problem, though, is that these "AI" papers have become PR bombs for the companies involved, which makes what should be low-key scientific discourse more difficult. Papers are written in a victorious tone, claims are often quite outrageous, and that does not help. I suspect these guys were forced to publish something by some exec who demanded an AI PR piece no matter what. Again, I seriously don't want to be too snarky and I have nothing personal against them, but when I see big pieces missing from these papers I will react.
