Give me a dataset to train, and I shall move the world

There is a widespread belief among the "artificial intelligentsia" that with the advent of deep learning all it takes to conquer some new land (application) is to create a relevant dataset, get a fast GPU and train, train, train. However, as often happens with complex matters, this approach has certain limitations and hidden assumptions. There are at least two important epistemological assumptions:

  1. Given a big enough sample from some distribution, we can approximate it efficiently with a statistical/connectionist model
  2. A statistical sample of a phenomenon is enough to automate/reason about/predict the phenomenon

Neither of these assumptions is universally correct.

Universal approximation is not really universal

There is a theoretical result known as the universal approximation theorem. In summary, it states that any continuous function on a compact domain can be approximated to arbitrary precision by an (at least) three-level composition of real functions, such as a multilayer perceptron with sigmoidal activation. This is a mathematical statement, but a rather existential one. It does not say whether such an approximation would be practical or achievable with, say, gradient descent. It merely states that such an approximation exists. As with many such existential arguments, its applicability to the real world is limited. In the real world, we work with strictly finite (in fact even "small") systems that we simulate on a computer. Even though the models we exercise are "small", their parameter spaces are so large that we cannot possibly search any significant portion of them. Instead we use gradient descent methods to actually find a solution, but there is no guarantee that we will find the best solution (or even a good one). Although deep learning is a powerful framework, it is relatively easy to construct example datasets that are enormously difficult to approximate: starting from the famous two-spiral problem, one can proceed further into fractal regions (nested spirals of all sizes and shapes), up to multidimensional fractals of arbitrary complexity. Such datasets may require a horribly large number of hyperplanes to be separated with any reasonable accuracy. It is an open question whether such extremely irregular regions occur in real world data, which in many cases seems smooth and almost "Gaussian". My hunch is that plenty of aspects of reality are in some ways rough like a fractal, at least at certain scales.
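To make the "number of hyperplanes" point a bit more tangible, here is a minimal sketch (not a benchmark) of the two-spiral problem. It builds two interleaved spirals and brute-forces a grid of single separating lines. The function names, the grid resolution and the choice of three turns with 240 points per arm are my own assumptions for illustration; the only point is that no single line does much better than a coin flip on this data.

```python
import math

def two_spirals(n_per_class=240, turns=3.0):
    """Return (points, labels) for two interleaved Archimedean spirals."""
    points, labels = [], []
    for i in range(n_per_class):
        # t is used both as the angle and as the radius of the first arm
        t = turns * 2.0 * math.pi * (i + 1) / n_per_class
        x, y = t * math.cos(t), t * math.sin(t)
        points.append((x, y))
        labels.append(+1)      # first arm
        points.append((-x, -y))
        labels.append(-1)      # second arm: the first one rotated by 180 degrees
    return points, labels

def line_accuracy(points, labels, angle, offset):
    """Accuracy of sign(w . p + offset), where w is the unit vector at `angle`."""
    wx, wy = math.cos(angle), math.sin(angle)
    correct = 0
    for (x, y), label in zip(points, labels):
        predicted = 1 if wx * x + wy * y + offset > 0 else -1
        correct += predicted == label
    return correct / len(points)

def best_single_line(points, labels, n_angles=72, n_offsets=21):
    """Brute-force search over a grid of lines; return the best accuracy found."""
    radius = max(math.hypot(x, y) for x, y in points)
    best = 0.0
    for a in range(n_angles):
        angle = 2.0 * math.pi * a / n_angles
        for b in range(n_offsets):
            offset = -radius + 2.0 * radius * b / (n_offsets - 1)
            best = max(best, line_accuracy(points, labels, angle, offset))
    return best

if __name__ == "__main__":
    pts, lbs = two_spirals()
    print(f"best accuracy of a single line: {best_single_line(pts, lbs):.3f}")
```

Each extra turn of the spiral needs more line segments to carve the arms apart, and a nested or fractal version of the same construction only makes this worse.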

Central limit theorem has limits

The second issue is that any time a dataset is created, it is assumed that the data in the set contains a complete description of the phenomenon, and that there is a (high dimensional) distribution localised on some low dimensional manifold which captures the essence of the phenomenon. This strong belief stems in part from faith in the central limit theorem: a mathematical result that tells us something about averaging of random variables. The CLT goes approximately like this: take a sample of independent, identically distributed random variables with finite variance and a finite expected value. The arithmetic average of the sample, suitably centred and rescaled, converges in distribution to a Gaussian as the size of the sample grows (in practice often quite rapidly). Therefore, even though we don't know exactly what distributions generated the individual variables, we know that in the limit their average will have a Gaussian distribution. This is a powerful statement, as the Gaussian distribution is a very "nice" distribution. It decays very quickly and does not permit outliers (read: permits outliers with extremely low probability). Consequently the line of thought goes: even though the world is complex and the actions of individual entities may be extremely difficult to model, on average things smooth out and become much more regular. This approach works in many cases: essentially everything reasonably approximated by equilibrium thermodynamics obeys it. But there is fine print: the theorem assumes that the original random variables have finite variance and expected values, and are independent and identically distributed. Not all distributions have those properties! In particular, critical phenomena exhibit fat-tailed distributions which often have infinite variance and undefined expected values (not to mention that many real world samples are actually not independent or identically distributed).

This is the realm of large-deviation statistics: where randomness is truly random! It happens that many phenomena seen in the biosphere appear to be critical and follow fat-tailed statistics. These things don't average out. Their complexity stays the same, whether we take a single variable or average over many. To make things worse, critical phenomena may have long streaks of regular behaviour (in general they behave like an intermittently chaotic system), during which one can nicely fit a Gaussian and build a strong belief that the phenomenon is nice and regular, until all of a sudden the mode changes and everything goes wild (see e.g. the stock market and the black swan theory).
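The fine print is easy to see numerically. The sketch below is only an illustration under my own assumptions: the standard Cauchy distribution stands in for a fat-tailed phenomenon, and the helper names, the number of trials and the seed are arbitrary. It measures how spread out the sample means are (via the interquartile range) for small and large samples. For Gaussian draws the spread should shrink roughly like one over the square root of the sample size; for Cauchy draws it should not shrink at all, because the average of Cauchy variables is again a Cauchy variable with the same scale.

```python
import math
import random
import statistics

def gaussian_draw(rng):
    """One draw from a standard Gaussian (finite variance, the CLT applies)."""
    return rng.gauss(0.0, 1.0)

def cauchy_draw(rng):
    """One draw from a standard Cauchy via inverse-CDF sampling (no mean, no variance)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_sample_means(draw, sample_size, trials=300, seed=0):
    """Interquartile range of the sample mean across `trials` independent samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        total = sum(draw(rng) for _ in range(sample_size))
        means.append(total / sample_size)
    q1, _, q3 = statistics.quantiles(means, n=4)
    return q3 - q1

if __name__ == "__main__":
    for name, draw in (("gaussian", gaussian_draw), ("cauchy", cauchy_draw)):
        small = iqr_of_sample_means(draw, 10)
        large = iqr_of_sample_means(draw, 1000)
        print(f"{name}: spread of means, n=10: {small:.3f}, n=1000: {large:.3f}")
```

I use the interquartile range rather than the standard deviation on purpose: for the Cauchy case the standard deviation of the means is itself dominated by a handful of huge outliers and tells you very little.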

World is non-stationary

Aside from the fact that many real world phenomena may exhibit rather hairy fat-tailed distributions, there is another quirk which is often not captured in datasets: phenomena are not always stationary. In other words, the world keeps evolving, changing the context and consequently altering fundamental statistical properties of many real world phenomena. Therefore a stationary distribution (a snapshot) taken at a given time may not work indefinitely. Take a statement often found in the press these days: make a good enough dataset and AI will be better at detecting disease than doctors. This may work for some cases (there are already some cool examples), except any time a sudden shift in epidemiology happens and a new set of symptoms shows up. Notably, this happens all the time, as viruses and bacteria keep evolving at a fast pace, rendering past treatments or even diagnoses ineffective, or making a diagnosis specific to a geographical region, particular race, religion or other aspects outside of idealised/averaged clinical evaluation criteria. Even such things as personal assistants may be subject to this non-stationarity problem: people's habits change all the time with factors such as ageing, birth of children, change of job, disease, or shifts in personal relationships. These things happen all the time, potentially changing the daily/weekly/monthly patterns of behaviour. The pace at which things change may be too quick for statistical models to follow, even if they retrain online. Granted, there are certainly aspects of behaviour, and some humans, that are very stable: some people work in the same place for many years and have very stable patterns of behaviour. But this cannot be assumed in general.
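A toy sketch of what a snapshot does under drift, with everything in it assumed for illustration: a one-dimensional feature, two classes drawn from unit-variance Gaussians, a "model" that is just a threshold placed midway between the class means, and a shift of both class means by two units. None of this is meant to resemble a real diagnostic system; it only shows a frozen model quietly losing accuracy once the world has moved.

```python
import random

def sample(rng, n, mean_a, mean_b):
    """Draw n labelled 1-D points per class from unit-variance Gaussians."""
    data = [(rng.gauss(mean_a, 1.0), 0) for _ in range(n)]
    data += [(rng.gauss(mean_b, 1.0), 1) for _ in range(n)]
    return data

def fit_threshold(data):
    """'Train' by placing the decision threshold midway between the class means."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    return 0.5 * (sum(xs0) / len(xs0) + sum(xs1) / len(xs1))

def accuracy(data, threshold):
    """Fraction of points for which 'x above threshold' agrees with the label."""
    return sum((x > threshold) == (y == 1) for x, y in data) / len(data)

if __name__ == "__main__":
    rng = random.Random(1)
    snapshot = sample(rng, 2000, 0.0, 3.0)   # the world when the dataset was made
    later = sample(rng, 2000, 2.0, 5.0)      # the same world after a (hypothetical) shift
    threshold = fit_threshold(snapshot)
    print(f"accuracy on the snapshot: {accuracy(snapshot, threshold):.3f}")
    print(f"accuracy after the shift:  {accuracy(later, threshold):.3f}")
    print(f"after refitting:           {accuracy(later, fit_threshold(later)):.3f}")
```

Refitting recovers the accuracy here only because the shifted world then holds still; if the shift keeps happening faster than labelled data arrives, there is nothing stable to refit to.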


In summary, datasets are useful proxies for reality, but they are not reality itself, with all the consequences that entails. Whenever we work with an AI/machine learning widget, we should keep that in mind. Any time we train a network, we rely on the two strong assumptions stated above. In some cases these assumptions are valid, but not always. The thin line separating these two regimes is exactly the line between the success and failure of contemporary AI approaches.



