What actually is statistics?

In the modern era of computers and data science, a lot of what gets discussed is "statistical" in nature. Data science is essentially glorified statistics with a computer, AI is deeply statistical at its very core, and we use statistical analysis for pretty much everything from the economy to biology. But what actually is it? What exactly does it mean that something is statistical?

The short story of statistics

I don't want to get into the history of statistical studies, but rather take a bird's-eye view of the topic. Let's start with a basic fact: we live in a complex world which provides us with various signals. We tend to conceptualize these signals as mathematical functions. A function is the most basic way of representing the fact that some value changes with some argument (typically time in the physical world). We observe these signals and try to predict them. Why do we want to predict them? Because if we can predict the future evolution of some physical system, we can position ourselves to extract energy from it when that prediction turns out to be accurate [but this is a story for a whole other post]. This is very fundamental, and in principle it could mean many things: an Egyptian farmer can build irrigation systems to improve crop output based on predicting the level of the Nile, a trader can predict the price action of a security to increase their wealth, and so on. You get the idea.

Perhaps not entirely appreciated is the fact that the physical reality we inhabit is complex, and hence the nature of the various signals we may try to predict varies widely. So let's roughly sketch out the basic types of signals/systems we may deal with.

Types of signals in the world

Some signals originate from physical systems which can be isolated from everything else and reproduced. These are in a way the simplest (although not necessarily simple). This is the type of signal we can readily study in the lab, and in many cases we can describe the "mechanism" that generates it. We can model such mechanisms in the form of equations, and we might refer to such equations as describing the "dynamics" of such a system. Pretty much everything that we would today call classical physics is a set of formal descriptions of such systems. And even though such signals are a minority of everything we have to deal with, the ability to predict them allowed us to build a technical civilization, so this is a big deal.

But many other signals that we may want to study are not like that, for numerous reasons. For example, we may study a signal from a system we cannot directly observe or reproduce. We may observe a signal from a system we cannot isolate from other subsystems. Or we may observe a signal which is influenced by so many individual parts and feedback loops that we can't possibly ever dream of observing all the individual sub-states. That is where statistics comes in.

Statistics is a craft that allows us to analyze and predict a certain subset of complex signals that cannot be described in terms of dynamics. But not all of them! In fact, very few, and only in very specific circumstances, when certain assumptions hold. Statistics is the ability to recognize whether these assumptions are indeed valid in the case we'd like to study and, if so, to what degree we can gain confidence that a given signal has certain properties.

Now let me repeat this once again: statistics can be applied to some data sometimes. Not all data always. Yes, you can apply statistical tools to everything, but more often than not the results you get will be garbage. And I think this is a major problem with today's "data science". We teach people everything about how to use these tools, how to implement them in Python, this library, that library, but we don't ever teach them that first, basic evaluation - will a statistical method be effective for my case?

So what are these assumptions? Well, that is all the fine print in the individual theories or statistical tests that we'd like to use, but let me sketch out the most basic one: the central limit theorem. We observe the following:

  • when our observable (signal, function) is produced as a result of averaging multiple "smaller" signals,
  • and these smaller signals are "independent" of each other,
  • and these signals themselves have a finite variance (which is guaranteed, for example, when they vary in a bounded range)

then the function we observe, even though we might not be able to predict its exact values, will generally fit what we call a Gaussian distribution. And with that, we can quantitatively describe the behavior of such a function by giving two numbers - the mean value and the standard deviation (or variance).
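To make this less abstract, here is a minimal sketch of the situation described above, using only the Python standard library. Everything specific in it is an assumption chosen for illustration: the "smaller" signals are independent draws from a uniform distribution on [-1, 1], each observation averages 30 of them, and the helper names `averaged_signal` and `fraction_within` are made up for this example. If the averaged signal is close to Gaussian, roughly 68% of observations should fall within one standard deviation of the mean and roughly 95% within two.

```python
import random
import statistics

def averaged_signal(rng, n_parts):
    """One observation: the average of n_parts independent, bounded signals."""
    return sum(rng.uniform(-1.0, 1.0) for _ in range(n_parts)) / n_parts

def fraction_within(samples, k):
    """Fraction of samples lying within k standard deviations of their mean."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return sum(abs(x - mu) <= k * sigma for x in samples) / len(samples)

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [averaged_signal(rng, 30) for _ in range(20_000)]

# The two numbers that summarize a Gaussian signal.
mean = statistics.fmean(samples)
std = statistics.pstdev(samples)

within_1 = fraction_within(samples, 1)
within_2 = fraction_within(samples, 2)

print(f"mean={mean:.4f} std={std:.4f}")
print(f"within 1 std: {within_1:.3f} (Gaussian: about 0.683)")
print(f"within 2 std: {within_2:.3f} (Gaussian: about 0.954)")
```

Note that no individual uniform draw looks anything like a bell curve; the Gaussian shape only shows up in the average.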

I don't want to go into the details of what exactly you can do with such variables, since basically any statistics course will be all about that, but I want to highlight a few cases when the central limit theorem does not hold:

  • when the "smaller" signals are not independent - which to some degree is always the case. Nothing within a single light cone is ever entirely independent. So for all practical purposes, we have to get a feel for how "independent" the individual building blocks of our signal really are. Also, the smaller signals can be reasonably "independent" of each other, but can all depend on some other, bigger external thing.
  • when the smaller signals do not have a finite variance. In particular, it is enough for just one of the millions of smaller signals we are averaging to have an unbounded variance, and all this analysis can be dead on arrival.
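The second failure case can be sketched in a few lines. This is only an illustration under assumptions of my choosing: the heavy-tailed signal is a standard Cauchy distribution (a textbook example with no finite mean or variance), the well-behaved one is a standard Gaussian, and the helper names `cauchy` and `iqr_of_averages` are hypothetical. The spread is measured with the interquartile range rather than the standard deviation, because the sample standard deviation of Cauchy data does not settle down to anything meaningful.

```python
import math
import random
import statistics

def cauchy(rng):
    """Draw from a standard Cauchy distribution (no finite mean or variance)."""
    return math.tan(math.pi * (rng.random() - 0.5))

def gaussian(rng):
    """Draw from a standard Gaussian distribution (finite variance)."""
    return rng.gauss(0.0, 1.0)

def iqr_of_averages(draw, rng, n_parts, n_trials):
    """Interquartile range of n_trials averages, each taken over n_parts draws."""
    averages = [
        sum(draw(rng) for _ in range(n_parts)) / n_parts
        for _ in range(n_trials)
    ]
    q1, _, q3 = statistics.quantiles(averages, n=4)
    return q3 - q1

rng = random.Random(1)  # fixed seed so the run is repeatable
results = {}
for n_parts in (1, 20, 400):
    gauss_iqr = iqr_of_averages(gaussian, rng, n_parts, 1000)
    cauchy_iqr = iqr_of_averages(cauchy, rng, n_parts, 1000)
    results[n_parts] = (gauss_iqr, cauchy_iqr)
    print(f"n={n_parts:4d}  Gaussian IQR={gauss_iqr:.3f}  Cauchy IQR={cauchy_iqr:.3f}")
```

Averaging more Gaussian parts makes the result tighter and tighter, while averaging more Cauchy parts does not help at all: the average of 400 draws is about as wild as a single draw.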

Now, there are some more sophisticated statistical tools that give us weaker theories/tests when some weaker assumptions are met, but let's not get into the details of that, so as not to lose track of the main point. There are signals which appear not to satisfy even the weaker assumptions, and yet we tend to apply statistical methods to them too. This is the entire work of Nassim Nicholas Taleb, particularly in the context of the stock market.

I've been making a similar point on this blog: we make the same mistake with certain AI contraptions by training them on data from which they cannot, in principle, "infer" a meaningful solution, and yet we celebrate the apparent success of such methods, only to find out they suddenly fail in bizarre ways. This is really the same problem - the application of an essentially statistical system to a problem which does not satisfy the conditions to be statistically solvable. In these complex cases, e.g. with computer vision, it is often hard to judge exactly which problems will be solvable by some sort of regression and which will not.

There is an additional, finer point I'd like to make: whether a problem will be solvable by, say, a neural network obviously also depends on the "expressive power" of the network. Recurrent networks that can build "memory" will be able to internally implement certain aspects of the "mechanics" of the problem at hand. With more recurrence, more complex problems can in principle be tackled (though there could be other problems, such as training speed).

A high-dimensional signal such as a visual stream will be a composition of all sorts of signals: some of them fully mechanistic in origin, some of them stochastic (perhaps even Gaussian), and some wild, fat-tailed, chaotic signals. As with the stock market, certain signals can be dominant for prolonged periods of time and fool us into thinking that our toolkit works. The stock market, for example, behaves like a Gaussian random walk for the majority of the time, but every now and then it jumps by multiple standard deviations, because what used to be a sum of roughly independent individual stock prices suddenly becomes heavily dependent on a single important signal, such as the outbreak of a war or the unexpected bankruptcy of a big bank. Similarly, systems such as self-driving cars may behave quite well for miles until they get exposed to something never seen before, and then fail because, for example, they only applied statistics to something that can be understood mechanistically, but at a slightly higher level of organization. This is another point that makes everything even more complex: signals which appear completely random on one level can in fact be rather simple and mechanistic at a higher level of abstraction. And vice versa - averages of what are in principle mechanistic signals can suddenly become chaotic nightmares.
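The stock market story above can be mimicked with a toy simulation. To be clear, this is not a model of any real market, and every number in it is an assumption picked for illustration: an "index" is the average of 100 stock returns, each with its own independent Gaussian noise, and on roughly 1% of days a common shock hits all the stocks at once. The code keeps track of which days had a shock only so the two regimes can be compared; in real data nobody hands you that label.

```python
import random
import statistics

N_STOCKS = 100            # assumed number of stocks in the toy index
N_DAYS = 5000             # assumed number of simulated days
SHOCK_PROBABILITY = 0.01  # assumed chance that a day carries a common shock
SHOCK_SCALE = 3.0         # assumed std of the shock, in units of single-stock noise

rng = random.Random(2)  # fixed seed so the run is repeatable
calm_returns, shock_returns = [], []
for _ in range(N_DAYS):
    is_shock_day = rng.random() < SHOCK_PROBABILITY
    # On a shock day every stock moves together by the same amount.
    common = rng.gauss(0.0, SHOCK_SCALE) if is_shock_day else 0.0
    index_return = sum(rng.gauss(0.0, 1.0) + common for _ in range(N_STOCKS)) / N_STOCKS
    (shock_returns if is_shock_day else calm_returns).append(index_return)

# The volatility we would estimate if we only ever saw calm days.
calm_sigma = statistics.pstdev(calm_returns)
threshold = 6 * calm_sigma

calm_jumps = sum(abs(r) > threshold for r in calm_returns)
shock_jumps = sum(abs(r) > threshold for r in shock_returns)

print(f"calm-day sigma of the index: {calm_sigma:.3f}")
print(f"calm days beyond 6 sigma:  {calm_jumps} of {len(calm_returns)}")
print(f"shock days beyond 6 sigma: {shock_jumps} of {len(shock_returns)}")
```

On calm days the independent noise averages out and the index is quiet. On shock days the averaging buys nothing, because the common term is the same in every stock, so moves that a Gaussian model calibrated on calm days would call practically impossible show up regularly.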

We can build more sophisticated models of data (whether manually as a data scientist or automatically as part of training a machine learning system), but we need to be cognizant of these dangers.

We also have not yet created anything that has the capacity to learn both the mechanics and the statistics of the world on multiple levels the way the brain does (not necessarily the human brain, any brain really). Now, I don't think brains can characterize just any chaotic signal, and they make mistakes too, but they are still ridiculously good at inferring "what is going on", especially at the scale they evolved to inhabit (obviously we have much weaker "intuitions" at scales much larger or much smaller, much shorter or much longer than what we typically experience). But that is a story for another post.
