Fat tails are weird

If you have taken a statistics class, it may have included some basic measure theory: Lebesgue measures and integrals and their relations to other means of integration. If your course was math heavy (like mine was), it may have included Carathéodory's extension theorem and even the basics of operator theory on Hilbert spaces, Fourier transforms etc. Most of this mathematical tooling would be devoted to proving one of the most important theorems on which most of statistics is based - the central limit theorem (CLT).

The central limit theorem states that for a broad class of what we in math call random variables (which represent realizations of some experiment that involves randomness), as long as they satisfy certain seemingly basic conditions, their average, suitably centered and rescaled, converges in distribution to a random variable of a particular type, one we call normal, or Gaussian.

The two conditions that these variables need to satisfy are:

They are independent
They have finite variance

In human language this means that individual random measurements (experiments) "don't know" anything about each other, and that each one of these measurements "most of the time" sits within a bounded range of values, as in it can pretty much always be "measured" with an apparatus with a finite scale of values. Both of these assumptions seem reasonable and general, and we can quickly see where the Gaussian distribution should start popping up.
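For reference, the textbook form of the theorem can be written as below. This is the classical Lindeberg-Lévy version, which additionally assumes that the variables are identically distributed (a condition not listed above). The symbols X_i, mu, sigma and the sample mean are notation introduced here, not taken from the post.

```latex
X_1, X_2, \dots \ \text{i.i.d.}, \quad
\mathbb{E}[X_i] = \mu, \quad
0 < \operatorname{Var}(X_i) = \sigma^2 < \infty
\;\Longrightarrow\;
\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1),
\qquad
\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i .
```

In words: the fluctuations of the average around mu shrink like sigma divided by the square root of n, and they look Gaussian once magnified by that factor.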

Every time we deal with large numbers of agents, readouts or measurements which are not "connected to each other", we get a Gaussian. Like magic. And once we have a normal distribution we can say some things about these population averages. Since a Gaussian distribution is fully defined by just two numbers - mean and variance - we can, by collecting enough data, estimate these values rather precisely. And once we have estimated them we can start making predictions about e.g. the probability that a given sum of random variables will exceed some value. Almost all of what we call statistics is built on this foundation: various tests, models, etc. This is how we tame randomness. In fact, often after finishing a statistics course you may walk out thinking that the Gaussian bell curve is really the only distribution that matters, and everything else is just a mathematical curiosity without any practical applications. This, as we shall find out, is a grave mistake.

Let's go back to the seemingly benign assumptions of the CLT: we assume the variables are independent. But what exactly does that mean? Mathematically we simply wave our hands saying that the probability of a joint event of X and Y is the product of the probabilities of X and Y. In other words, the probability distribution of a joint event can be decomposed into projections of the probability distributions of the individual parts. From this it follows that knowing the result of X gives us exactly zero information about the result of Y. Among other things this means X does not in any way affect Y, and moreover nothing else simultaneously affects both X and Y. But in the real world, does this even happen?

Things become complicated, because in this strict mathematical sense, no two physical events that lie within each other's light cones, or even in a common light cone of another event, are technically "independent". They either in some capacity "know" about "each other" or they both "know" about some other event that took place in the past and potentially affected them both. In practice of course we ignore this. In most cases that "information" or "dependence" is so weak that the CLT works perfectly fine and statisticians live to see another day. But exactly how robust is the CLT if things are not exactly "independent"? That, unfortunately, is something many statistics courses don't teach or offer any intuitions about.

So let's run a small experiment. Below I simulate 400x200=80000 independent pixels, each taking a random value between zero and one [code available here]. I average them and plot a histogram of these averages. Individual realizations are marked with red vertical lines. We see the CLT in action, a bell curve exactly as we expected!
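A minimal sketch of this experiment in Python with NumPy follows. It is not the code linked above: the function name `independent_field_means`, the number of trials and the seed are assumptions made for illustration. Each trial draws a fresh 200x400 field of uniform values and records the field average; the spread of those averages is then compared with what the CLT predicts.

```python
import numpy as np

def independent_field_means(n_trials=500, shape=(200, 400), seed=0):
    """Average of a field of independent uniform[0, 1) pixels, repeated n_trials times."""
    rng = np.random.default_rng(seed)
    means = np.empty(n_trials)
    for i in range(n_trials):
        means[i] = rng.random(shape).mean()
    return means

n_pixels = 200 * 400
means = independent_field_means()

# A uniform[0, 1) variable has mean 1/2 and variance 1/12, so the CLT predicts
# field averages centered at 0.5 with standard deviation sqrt(1 / (12 * n_pixels)).
predicted_sigma = np.sqrt(1.0 / (12.0 * n_pixels))
z_scores = (means - 0.5) / predicted_sigma

print(f"predicted sigma of the average: {predicted_sigma:.5f}")
print(f"observed sigma of the average:  {means.std():.5f}")
print(f"largest deviation in sigmas:    {np.abs(z_scores).max():.2f}")
```

With independent pixels the observed spread should sit close to the predicted one, and deviations of many sigmas should be practically absent.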

Now let's modify this experiment just a tiny bit by adding a small random value (between -0.012 and 0.012) to each one of these pixels, simulating a weak external factor that affects them all. This small factor is negligible enough that it's hard to even notice any effect it may have on this field of pixels. But because averaging shrinks only the independent noise and leaves the shared part untouched, even such a tiny "common" bias has a devastating effect:

Immediately we see that the samples are no longer Gaussian; we regularly see deviations of 6-10 sigma and more, which under Gaussian conditions should almost never happen. The CLT is actually very, very fragile to even a slight amount of dependence. OK, but are there any consequences to that?
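The effect can be reproduced with a small sketch, under one assumption about the setup: in each trial a single offset is drawn uniformly from [-0.012, 0.012] and added to every pixel. The function name, trial count and seed are illustrative. Deviations are measured in units of the sigma that fully independent pixels would give.

```python
import numpy as np

def shifted_field_means(n_trials=500, shape=(200, 400), max_shift=0.012, seed=0):
    """Field averages when every pixel in a trial shares one small random offset."""
    rng = np.random.default_rng(seed)
    means = np.empty(n_trials)
    for i in range(n_trials):
        common = rng.uniform(-max_shift, max_shift)  # assumed: one offset shared by all pixels
        means[i] = (rng.random(shape) + common).mean()
    return means

n_pixels = 200 * 400
# Sigma of the average if the pixels were fully independent (about 0.00102).
sigma_independent = np.sqrt(1.0 / (12.0 * n_pixels))

shifted = shifted_field_means()
z = (shifted - 0.5) / sigma_independent

print(f"largest deviation in sigmas:      {np.abs(z).max():.1f}")
print(f"fraction of trials beyond 6 sigma: {(np.abs(z) > 6).mean():.2f}")
```

Averaging 80000 pixels shrinks the independent noise to a sigma of about 0.001 but leaves the shared offset untouched, so an offset of 0.012 alone is worth almost 12 sigma.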

Big deal, one might say: if it's not Gaussian then it is probably some other distribution, and we have equivalent tools for that case? Well... yes and no.

Other types of probability distributions have indeed been studied. The Cauchy distribution [1] looks almost like the Gaussian bell curve, only it has undefined mean and variance. There are even versions of the "CLT" for the Cauchy distribution, so one might think it really is just a "less compact" Gaussian. This could not be further from the truth. These distributions, which we often refer to as "fat tails" (for putting much more "weight" in the tail of the distribution, as in outside of the "central" part), are really weird and problematic. Much of what we take for granted in "Gaussian" statistics does not hold, and a bunch of weird properties connect these distributions to concepts such as complexity, fractals, intermittent chaos, strange attractors and ergodic dynamics in ways we don't really understand well yet.

For instance, let's take a look at how averages behave for the Gaussian, Cauchy and Pareto distributions:

Note these are not samples, but progressively longer averages. The Gaussian converges very quickly, as expected. The Cauchy never converges, but the Pareto with alpha=1.5 does, albeit much, much more slowly than the Gaussian. Going from 10 thousand to a million samples highlights more vividly what is going on:

So the Cauchy varies so much that the mean, literally the first moment of the distribution, never converges. Think about it: after millions and millions of samples another single sample will come which will shift the entire empirical average by a significant margin. And no matter how far you go with that, even after quadrillions of samples already "averaged", at some point just one sample will come that is large enough to move that entire average by a large margin. This is much, much wilder behavior than anything Gaussian.
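The running averages described above can be sketched as follows. The seed, the helper name `running_mean` and the choice of a standard (zero-centered, unit-scale) Gaussian and Cauchy are assumptions for illustration. NumPy's `Generator.pareto` samples the Lomax (Pareto II) form, so 1 is added to obtain the classical Pareto with minimum value 1, whose mean for alpha=1.5 is alpha/(alpha-1) = 3.

```python
import numpy as np

def running_mean(samples):
    """Average of the first k samples, for every k from 1 to len(samples)."""
    return np.cumsum(samples) / np.arange(1, len(samples) + 1)

rng = np.random.default_rng(0)
n = 1_000_000

gauss_avg = running_mean(rng.standard_normal(n))     # true mean is 0
cauchy_avg = running_mean(rng.standard_cauchy(n))    # mean is undefined
pareto_avg = running_mean(rng.pareto(1.5, n) + 1.0)  # true mean is 3, variance is infinite

for k in (10_000, 100_000, 1_000_000):
    print(f"n={k:>9}: gauss={gauss_avg[k - 1]: .4f}  "
          f"cauchy={cauchy_avg[k - 1]: .4f}  pareto={pareto_avg[k - 1]: .4f}")

# Largest change a single new sample causes in the running average,
# looking only at the second half of the sequence.
half = n // 2
gauss_jump = np.abs(np.diff(gauss_avg[half:])).max()
cauchy_jump = np.abs(np.diff(cauchy_avg[half:])).max()
print(f"largest late jump: gauss={gauss_jump:.2e}  cauchy={cauchy_jump:.2e}")
```

The last two numbers make the point of the paragraph above measurable: even after half a million samples, a single Cauchy draw can still move the average by far more than any Gaussian draw does.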

OK, but does it really matter in practice? E.g. does it have any consequences for statistical methods such as neural networks and AI? To see if this is the case we can run a very simple experiment: we will choose samples from distributions centered at -1 and 1. We want to estimate a "middle point" which will best separate samples from these two distributions, granted there might be some overlap. We can easily see from the symmetry of the entire setup that the best dividing point is at zero, but let's try to get it via an iterative process based on samples. We will initialize our guess of the best separation value and, with some decaying learning rate, in each step pull it towards the sample we got. If we choose an equal number of samples from each distribution, we expect this process to converge to the right value.

We will repeat this experiment for two instances of the Gaussian distribution and two instances of the Cauchy, as displayed below (notice how similar these distributions look at first glance):

So first with Gaussians, we get exactly the result we expected:

But with Cauchy things are a little more complicated:

The value of the iterative process eventually converges, since we are constantly decaying the learning rate, but it converges to a completely arbitrary value! We can repeat this experiment multiple times and we will get a different value every time. Imagine now for a moment that what we observe here is the convergence process of a weight somewhere deep in some neural network: even though every time it converges to "something" (due to the decreasing learning rate), that "something" is just about random and not optimal. If you prefer the language of energy landscapes or error minimization, this situation corresponds to a very flat landscape where gradients are pretty much zero, and the ultimate value of the parameter depends mostly on where it got tossed around by the samples that came before the learning rate became very small.
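Here is a minimal sketch of this experiment, under assumptions the post does not spell out: samples alternate between the two centers, the noise has unit scale, and the learning rate decays geometrically (`lr0`, `decay` and `start` are illustrative values). With such a schedule the final value is a weighted average of all samples seen, with non-negative weights summing to nearly one. For Gaussian noise that average is tight around zero. For Cauchy noise, a weighted average of this kind is spread as widely as a single Cauchy draw, so each run freezes somewhere else.

```python
import numpy as np

def estimate_midpoint(noise, lr0=0.05, decay=0.9995, start=0.5):
    """Pull a guess towards each incoming sample with a geometrically decaying learning rate."""
    # Samples alternate between the distribution centered at -1 and the one at +1.
    centers = np.where(np.arange(len(noise)) % 2 == 0, -1.0, 1.0)
    guess, lr = start, lr0
    for sample in centers + noise:
        guess += lr * (sample - guess)
        lr *= decay
    return float(guess)

def final_guesses(kind, n_runs=10, n_steps=20_000):
    """Repeat the experiment with different seeds and collect the value each run settles on."""
    results = []
    for seed in range(n_runs):
        rng = np.random.default_rng(seed)
        if kind == "gaussian":
            noise = rng.standard_normal(n_steps)
        else:
            noise = rng.standard_cauchy(n_steps)
        results.append(estimate_midpoint(noise))
    return np.array(results)

gaussian_finals = final_guesses("gaussian")
cauchy_finals = final_guesses("cauchy")

print("Gaussian runs settle at:", np.round(gaussian_finals, 3))
print("Cauchy runs settle at:  ", np.round(cauchy_finals, 3))
```

A 1/t learning rate would instead turn the guess into the plain running average, which for Cauchy samples never settles at all; the geometric schedule is used here because it reproduces the "converges, but to an arbitrary value" behavior described above.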

Fine, but do such distributions really exist in the real world? After all, the statistics professor said pretty much everything is Gaussian. Let's get back to why we see the Gaussian distribution anywhere at all. It's purely because of the central limit theorem and the fact that we observe something that is the result of averaging multiple independent random entities. Whether that be in social studies, physics of molecules, drift of the stock market, anything. And we know from the exercise above that the CLT fails as soon as there is just a bit of dependence. So how does that dependence come about in the real world? Typically via a perturbation that comes into the system from a different scale, either something big that changes everything or something small that changes everything.

So e.g. molecules of gas in a chamber will move according to the Maxwell-Boltzmann distribution (which you can think of as a clipped Gaussian), until a technician comes into the room and lets the gas out of the container, changing these motions entirely. Or a fire breaks out in the room below and injects thermal energy into the container, speeding up the molecules. Or a nuke blows up over the lab and evaporates the container along with its contents. Bottom line - in the complex, nonlinear reality we inhabit, the Gaussian distribution happens for a while for certain systems, between "interventions" originating from "outside" of the system, either spatially "outside" or scalewise "outside" or both.

So the real "fat-tails" we see in the real world are somewhat more cunning than your regular simple Cauchy or Pareto. They can behave like Gaussians for periods, sometimes for years or decades. And then suddenly flip by 10 sigma, and either go completely berserk or start behaving Gaussian again. This is best seen in stock market indices: since for long periods of time they are sums of relatively independent stocks, they behave more or less like Gaussian walks. Until some bank announces bankruptcy, investors panic and suddenly all stocks in the index are not only not independent but massively correlated. The same pattern applies elsewhere: weather, social systems, ecosystems, tectonic plates, avalanches. Almost nothing we experience is in an "equilibrium state"; rather, everything is in the process of finding the next pseudo-stable state, at first very slowly and eventually super fast.

Living creatures found ways of navigating these constant state transitions (within limits obviously), but static artifacts - especially complicated ones - such as human-made machines are typically fine only within a narrow range of conditions and require "adjustment" when conditions change. Society and the market in general are constantly making these adjustments, with feedback on themselves. We live in this huge river with pseudo-stable flow, but in reality the only thing that never changes is this constant relaxation into new local energy minima.

This may sound somewhat philosophical, but it has quite practical consequences. E.g. there is no such thing as "general intelligence" without the ability to constantly and instantly learn and adjust. There is no such thing as a safe computer system without the ability to constantly update it and fix vulnerabilities as they are being found and exploited. There is no such thing as a stable ecosystem where species live side by side in harmony. There is no such thing as an optimal market with all arbitrages closed.

The aim of this post is to convey just how limited purely statistical methods are in a world built on complex nonlinear dynamics. But can there be another way? Absolutely, biology is an existence proof of that. The confusing part is that biology is not free of "statistics" either; certainly in some ways it is statistical. But what you learn matters just as much as how you learn. For example, take random number generators. You see a bunch of seemingly random numbers; can you tell what generated them? Probably not. But now plot these numbers on a scatter plot, previous versus next. This is sometimes referred to as a spectral test, and you are likely to suddenly see structure (at least for the basic linear congruential random number generators). What did we just do? We defeated randomness by making the assumption that there is a dynamical process behind this sequence of numbers: that they originated not from a magic black box of randomness, but that there is a dynamical relation between consecutive samples. We then proceeded to discover that relationship. Similarly with machine learning: we can e.g. associate labels with static images, but that approach (albeit seemingly very successful) appears to saturate and give out a "tail" of silly mistakes. How to tackle that tail? Perhaps, and it's been my hypothesis all along in this blog, much like with a spectral test you need to look at temporal relations between frames. After all, everything we see is generated by a physical process that evolves according to the laws of nature, not by a magical black box flashing random images in front of our eyes.
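The random number generator example above can be made concrete with a toy linear congruential generator. The parameters a=45, c=1, m=2048 are hypothetical, chosen small so that the structure is easy to see; they are not taken from any real library. The sketch pairs each value with its successor and checks that every pair lies on one of a few parallel lines.

```python
def lcg(seed, n, a=45, c=1, m=2048):
    """Toy linear congruential generator: x_next = (a * x + c) mod m."""
    x = seed
    values = []
    for _ in range(n):
        x = (a * x + c) % m
        values.append(x)
    return values

A, C, M = 45, 1, 2048  # hypothetical toy parameters
xs = lcg(seed=7, n=2000, a=A, c=C, m=M)
pairs = list(zip(xs[:-1], xs[1:]))  # (previous, next)

# Every pair satisfies next = A * previous + C - k * M for some integer k,
# so all points of the previous-versus-next plot fall on parallel lines of slope A.
line_ids = sorted({(A * prev + C - nxt) // M for prev, nxt in pairs})
print(f"{len(pairs)} points lie on {len(line_ids)} parallel lines")
```

Plotting `pairs` as a scatter plot (for instance with matplotlib) shows the lines directly. Looking at the values one at a time reveals nothing; it is the assumed relation between consecutive samples that exposes the generator.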

I suspect that a system that "statistically learns world dynamics" will have very surprising emergent properties, just like the properties we see emerge in Large Language Models, which BTW attempt to learn the "dynamics of language". I expect e.g. single-shot abilities to emerge etc. But that is a story for another post.


Piekniewski's blog

On limits of deep learning and where to go next with AI.
