Every rule of thumb in data science has a counterexample. Including this one.
In this post I'd like to explore several simple and low dimensional examples that expose how our typical intuitions about the geometry of data may be fatally flawed. This is generally a practical post, focused on examples, but there is a subtle message I'd like to provide. In essence: be careful. It is easy to make data based conclusions which are totally wrong.
Dimensionality reduction is not always a good idea
It is a fairly common practice to reduce the input data dimension via some projection, typically via principal component analysis (PCA) to get a lower-dimensional, more "condensed" data. This often works fine, as often the directions along which data is separable align with the principal axis. But this does not have to be the case, see a synthetic example below:
from sklearn.neural_network import MLPClassifier from sklearn.preprocessing import StandardScaler from sklearn.decomposition import PCA from scipy.stats import ortho_group from mpl_toolkits.mplot3d import Axes3D import numpy as np import matplotlib.pyplot as plt N = 10 # Dimension of the data M = 500 # Number of samples # Random rotation matrix R = ortho_group.rvs(dim=N) # Data variances variances = np.sort(np.random.rand((N)))[::-1] # We want the variance on the last dimension be an order of magnitude lower than # any next smallest variance variances[-1] = 0.1 * variances[-2] # Set random means (centers) means = np.random.rand((N)) # Generate first part of data (to be labeled as zero) DataA = np.random.randn(M, N) for i in range(M): # Multiply random gaussian vector by variances and shift by means # Next rotate by unitary matrix R to "hide" the principle components DataA[i, :] = np.matmul(R, (means + np.multiply(DataA[i, :], variances))) shift = 10 * variances[-1] means[-1] += shift # Next bit of data will be shifted by 10 * variance in the last dimension DataB = np.random.randn(M, N) for i in range(M): # Multiply random gaussian vector by variances and shift by means # Next rotate by unitary matrix R to "hide" the principle components DataB[i, :] = np.matmul(R, (means + np.multiply(DataB[i, :], variances))) Data = np.concatenate((DataA, DataB), axis=0) Label = np.zeros(M*2) Label[M:] = 1 # This data is very well separable on the dimension of the smallest variance # and essentially inseparable in all other dimensions # Permute the data P = np.random.permutation(M*2) Data_shuffled = Data[P, :] Label_shuffled = Label[P] # Separate into training and testing set Training_set = Data_shuffled[:int(0.8*2*M), :] Testing_set = Data_shuffled[int(0.8*2*M):, :] Training_label = Label_shuffled[:int(0.8*2*M)] Testing_label = Label_shuffled[int(0.8*2*M):] # Scaler will scale data to have zero mean and unit variance scaler = StandardScaler() scaler.fit(Training_set) Training_set = scaler.transform(Training_set) Testing_set = scaler.transform(Testing_set) # Multilayer perceptron with some hidden layers clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 4), random_state=1, max_iter=1, warm_start=True) # Train for i in range(500): clf.fit(Training_set, Training_label, ) # Evaluate performance = 0 for x in range(Testing_label.shape): prediction = clf.predict([Testing_set[x, :]]) if np.abs(Testing_label[x]-prediction) < 0.01: performance += 1 print("Performance %2.2f" % ((100.0*performance)/Testing_label.shape)) # Performance should be good if N == 3: # Plot if data is low dimensional fig = plt.figure() ax = fig.gca(projection='3d') ax.scatter(Training_set[:,0], Training_set[:, 1], Training_set[:, 2], c=Training_label) plt.show() # Decompose data into principle components pca = PCA(n_components=N - 1) pca.fit(Training_set) Training_set_pca = pca.transform(Training_set) Testing_set_pca = pca.transform(Testing_set) # Scale it like previously scaler2 = StandardScaler() scaler2.fit(Training_set_pca) Training_set_pca = scaler2.transform(Training_set_pca) Testing_set_pca = scaler2.transform(Testing_set_pca) # Another such Multilayer perceptron with some hidden layers clf1 = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 4), random_state=1, max_iter=1, warm_start=True) # Train for i in range(500): clf1.fit(Training_set_pca, Training_label, ) # Evaluate performance = 0 for x in range(Testing_label.shape): prediction = clf1.predict([Testing_set_pca[x, :]]) if np.abs(Testing_label[x]-prediction) < 0.01: performance += 1 # Performance most likely sucks! print("Performance %2.2f" % ((100.0*performance)/Testing_label.shape)) if N == 3: # Plot if low dimensional fig = plt.figure() ax = fig.gca(projection='3d') ax.scatter(Training_set[:, 0], Training_set_pca[:, 1], c=Training_label) plt.show()
We can run this several times to get an idea of what is going on, usually the results will be close to 100% performance for the first classifier and in the low 50% for the second one (essentially baseline chance). If we set N=3 we can plot the data in 3d. The original data looks somewhat like this:
We see that when we rotate the data we find a direction along which the data is separable, there are two well defined pancakes, no wonder the perceptron had no problems finding that separation. However if we look at the data after PCA:
and we find that the data is completely mixed up and inseparable. Now if you run the above code several times you may find that sometimes the PCA does not converge and maintains some leftovers of the separable dimension but most of the time it is a complete flop. This data is explicitly constructed to make the PCA fail but this does happen on real data. And note: we only eliminated a single component with lowest variance. In many real world cases researchers reduce 90% or more of the components!
Average is only an average
People often bring averages in their debates. However an average is a very impoverished representation of data and may hide-away a huge amount of important information. Average is easily skewed by outliers and hence using averages for long tail distributions is very deceptive. For example in the case of distribution of income, or general pricing of any asset it is better to use the median which is vastly more stable.
But that is not the only thing that can be deceptive about averages, as they may indicate biases that are actually a lot more complex for data with multiple attributes. To show an example, let's take a dataset with two attributes: species and height. Let's say that we will have two species K - Klingons and R - Romulans. Each will have an extra attribute T - tall and S for short. We will analyze a third value which assigns to every one of our samples a scalar, let's call it performance in "parisses squares".
These 20 rows will be enough to make my point. What can we tell from this data? Let's take average score of Klingons and Romulans. It looks like Klingons on average score 106.6 while Romulans 120.5. It clearly follows that Romulans are better in Parisses squares than Klingons. Or perhaps there is a discrimination and arbiters don't favor Klingons! This is a contentious matter, and may lead to an interplanetary war! Perhaps we can dig deeper into this data and see what is going on. First of all, it looks like the best score of 150 actually belongs to a Klingon. Best Romulan only scored 140. Moreover it looks like tall players score higher than short ones, with 138.8 and 91.5 average scores respectively. So the feature that dominates the result appears to be unrelated to species, and the only reason why Klingons appear weaker on average is because they are most of the time shorter than Romulans. Controlling for height, it shows that tall Klingons on average scored 147.5 while tall Romulans only 136.4. Short Klingons scored on average 95 while short Romulans only 83.3. In both cases the Klingons were on average much better than Romulans. So perhaps if there is a discrimination, it might be in favor of Klingons! Depending on the feature across which the dataset is analyzed, the averages indicate opposite conclusions!
This example shows, that whenever you are faced with a study that analyzes a diverse set of entities with many features and attempts to use an average to support a thesis that e.g. one group is discriminated against the other, you should be extremely cautious. And in case of complex, multidimensional data even more stable estimates of centrality such as median can be subjected to similar errors (exercise to the reader to put together an example).
In general data is typically much more complex than data average, while the underlying reality will often be further much more complex than the gathered data.
Data can be tricky. For one thing, the data you have may not capture the true nature of the phenomenon you study. But even if your data is rich and sufficient you may still get tricked. Choosing principle components may easily amplify noise, to the point of completely getting rid of any useful information. Looking at averages and other simple estimators may reveal completely bogus relationships. Even if data tells you something, it may not tell you why a given relationship emerges.
On average statistics is easy to mess up or tweak to support a particular thesis and most people will fall for such shenanigans. Getting statistics right is hard. Often extremely hard.
If you found an error, highlight it and press Shift + Enter or click here to inform us.