The meta-parameter slot machine

Today we'll step back a bit and consider the psychology of a machine learning researcher at work, a subject which interests me deeply and one that I've already touched on in another post. Some of this comes from my own introspection, as I've been doing machine learning for quite a few years now.

Emails and ML models trigger dopamine

It is a well-known fact from biology that little achievements trigger the release of small amounts of dopamine - a neurotransmitter that is believed to be involved in reinforcement learning. The dopamine makes us feel good and also triggers plasticity in certain parts of the brain (likely allowing the brain to "remember" what behaviour led to the reward). Reinforcement learning, however, has its issues, since the reward can appear by coincidence and therefore reinforce the "wrong cause". This is very much visible these days with the Internet, emails and texts: receiving an important and rewarding message reinforces the behaviour which led to it, and since that behaviour was most likely pressing the "get mail" button, we get addicted to checking email! The same applies to social media and texting, and it is also the mechanism underlying gambling. In reality rewards are sparse, so the reinforcement is pretty strong when we happen to get one. In fact it is disproportionately strong compared to all the situations in which we receive no reward. Hence, even though we rarely win when gambling (just as we rarely get those amazing emails we keep waiting for), the occasional win is enough to trigger the addiction.

Gambling in a multidimensional space

Let's get back to training black boxes, which is the stuff we do in the field called machine learning, or more broadly in computer modeling.  There are generally two ways to proceed:

  • If we understand the system we are dealing with (the equations are known) and we measured all the needed constants/initial conditions, we put all of that into a computer, make sure there are no bugs and calculate the result. Done.
  • If we don't really understand the system we can still try to make progress by optimizing parameters and simulating until it works. And this is a dangerous path.

The second approach is very reminiscent of playing a slot machine. We look at the result, but because the stuff we try to model is typically extremely complex, we can't really understand how the meta-parameters affect the system. Ultimately we pick some random gradient (or even a completely random set of parameters), pull the lever and wait. Those with more resources can explore multiple configurations at once with parallel simulation (which is like playing an entire row of slot machines at once).

Then, perhaps by chance, we arrive at a set of parameters that "looks promising". Here is the reward signal. It is most likely spurious, but the behaviour that gets rewarded is clear: running more models. Before we know it, the researcher is addicted to running simulations and, like any other addict, confabulates about why this is great and moves humanity forward.
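To see how easily a spurious "promising" run appears, here is a minimal sketch of the casino in Python. Everything in it is an assumption made for illustration: the function names, the 200 runs, the 50-sample validation set, and above all the fact that every "model" is just a coin flip with no skill whatsoever. It is a toy, not a description of any real training pipeline.

```python
import random


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def coin_flips(rng, n):
    """n random binary values: used both as labels and as 'model' output."""
    return [rng.randint(0, 1) for _ in range(n)]


def play_the_casino(n_configs=200, n_val=50, n_test=2000, seed=0):
    """Pick the best of n_configs skill-free models on a small validation set.

    Returns (best validation accuracy, accuracy of that winner on fresh data).
    """
    rng = random.Random(seed)
    val_labels = coin_flips(rng, n_val)

    # Pull the lever n_configs times; every "configuration" is pure noise.
    best_val = max(
        accuracy(coin_flips(rng, n_val), val_labels) for _ in range(n_configs)
    )

    # The winner has no real skill, so on fresh data it is a coin flip again.
    test_labels = coin_flips(rng, n_test)
    test_acc = accuracy(coin_flips(rng, n_test), test_labels)
    return best_val, test_acc


best_val, test_acc = play_the_casino()
print(f"best validation accuracy over 200 runs: {best_val:.2f}")
print(f"same 'winner' on fresh data:            {test_acc:.2f}")
```

The best of many noisy runs on a small validation set will typically sit comfortably above 50%, while the same "winner" falls back to roughly chance on fresh data. The only thing that was optimized here is the number of lever pulls.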

Let me test this hypothesis just once more

The researcher is either a PhD student, a postdoc or an industrial engineer, and desperately needs to get a result from the data they have. Once the meta-tuning addiction is in place, a cascade of events follows: since running models is so addictive and brings pleasure any time things get even a tiny bit better, the researcher (a victim of the nonlinear parameter cocaine) will do everything to justify their behaviour. The best way to do so is to show results, preferably on an "established set". Objectively good results are rare. Things can, however, be helped along a bit by doing a few shady things, the repertoire of which includes:

  • take a different dataset where things are simpler
  • take a subset of the dataset where some spurious correlation happens to work (can be automated)
  • insert some priors into the model about the data/reality that are not contained in the data itself but come directly from the researcher's head (see the outside the box post)
  • test hypotheses again and again without any regard for Bonferroni corrections
  • cheat straight up by manufacturing the result altogether
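To put a number on the Bonferroni point in the list above: if each test has a 5% chance of a false positive, then repeating the test inflates the chance that at least one of them "works" by accident. The sketch below assumes the tests are independent and that there is no real effect to find; the function names are mine, chosen for illustration.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n_tests independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / n_tests


alpha = 0.05
for n_tests in (1, 5, 20, 100):
    uncorrected = family_wise_error_rate(alpha, n_tests)
    corrected = family_wise_error_rate(bonferroni_threshold(alpha, n_tests), n_tests)
    print(f"{n_tests:4d} tests: uncorrected {uncorrected:.3f}, Bonferroni {corrected:.3f}")
```

Under these assumptions, twenty uncorrected attempts already give roughly a 64% chance of at least one spurious "significant" result, whereas dividing the threshold by the number of attempts keeps the overall rate at or below the original 5%.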

Ultimately, as a result of this process we get a nice-looking paper, with an incremental result on a nice and established dataset, together with an algorithm equipped with tens of ad hoc hacks that make it work. The algorithm is full of seemingly arbitrary choices (which are in fact the result of iterative trial and error) without any deeper underlying idea. If we are lucky we get the source code, which allows us to reproduce the results (this is good, as it may indicate the researcher is not just shamelessly cheating). All the great results and measurements may in fact be hiding the fact that the paper does not really have any new "idea". The thinner the idea, the greater the emphasis on the measurement result.

Measurable progress may not mean anything

People trapped in the optimization casino will often say: but we are making measurable progress towards our goal! That is very vividly visible these days with all the companies developing self-driving cars and their swelling PR propaganda about how many miles they can drive without human intervention. Enough said: very significant and measurable progress towards getting man to the Moon was made around 399 BC with the invention of the catapult. Later on, another major, measurable step was made at some point in the 12th century with the invention of the cannon. Since then steady progress has been made, especially in the 19th and early 20th centuries, with cannons shooting higher and further. Needless to say, it was neither a cannon nor a catapult that eventually got man to the Moon.


Personally, I think that if an idea is good then it should be expressible in a few short sentences. If the explanation is that this "franken-model" happens to work on dataset A (in machine learning) or this "franken-model" explains experiment B (e.g. a lot of papers in neuroscience), then there is a strong possibility that the work is not going to hold water in the long run. This is to some degree a restatement of Occam's razor: if something works we should be able to understand why, and it should be the simplest of all entities that happen to work. The idea of the back-propagation algorithm is good because it follows a simple principle and does not need a dataset to be justified. The idea of a convnet is good because it is simple and happens to work for many things. The idea of adversarial training is good because it can be expressed easily and does not need a dataset to make sense. Similarly, the idea behind the PVM is to build a machine learning system which, via training by prediction, could simulate observed physics in its fabric - simple.

An "idea" such as a conv-net with 25 specifically defined layers, followed by a softmax layer, 5 types of regularisation and a specific learning rate update scheme that happens to work great on dataset A (but perhaps not on anything else) is likely snake oil (unless dataset A is exactly what everyone desperately needs).

With all that said, I acknowledge that every once in a while we have to step into the optimization casino and play some one-armed bandits. But much like in a real casino, one has to be wary of the deceptive nature of this process and be able to stop before it is too late. So to summarize (on a funny note), remember this: doing AI may be harmful to your own brain!
