Fooled by data

Every rule of thumb in data science has a counterexample. Including this one. 

In this post I'd like to explore several simple, low-dimensional examples that expose how our typical intuitions about the geometry of data may be fatally flawed. This is mostly a practical post, focused on examples, but there is a subtle message I'd like to convey. In essence: be careful. It is easy to draw data-based conclusions that are totally wrong.

Dimensionality reduction is not always a good idea

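It is fairly common practice to reduce the dimension of the input data via some projection, typically principal component analysis (PCA), to get a lower-dimensional, more "condensed" representation. This often works fine, because the directions along which the data is separable often align with the leading principal axes. But this does not have to be the case: PCA ranks directions by variance, not by how well they separate the classes, so the direction that matters may be a low-variance one that gets discarded. See a synthetic example below: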

from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import ortho_group
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt


N = 10 # Dimension of the data
M = 500 # Number of samples

# Random rotation matrix
R = ortho_group.rvs(dim=N)
# Data variances, sorted in decreasing order
variances = np.sort(np.random.rand(N))[::-1]
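
To see the effect in isolation, here is a minimal, self-contained sketch. It is not the original experiment from this post, just an illustration under stated assumptions: the standard deviations, the class shift of 0.3, the choice of K = 5 retained components, and the helper `linear_accuracy` are all made up for this example. PCA is done by hand with an SVD, and a plain least-squares linear classifier stands in for a real model.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 10    # dimension of the data
M = 500   # samples per class (assumed for this sketch)
K = 5     # number of principal components to keep (assumed)

# Per-axis standard deviations: nine wide axes and one narrow axis.
stds = np.append(np.linspace(3.0, 1.0, N - 1), 0.1)

# Two classes that differ only by a small shift along the narrow axis.
labels = np.repeat([-1.0, 1.0], M)
X = rng.normal(size=(2 * M, N)) * stds
X[:, -1] += 0.3 * labels

# Random orthogonal matrix (QR of a Gaussian matrix) hides the narrow axis.
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))
X = X @ Q

# PCA by hand: center, then project onto the top-K right singular vectors.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:K].T


def linear_accuracy(features, targets):
    """Fit a least-squares linear classifier; return its training accuracy."""
    A = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return float(np.mean(np.sign(A @ w) == targets))


acc_full = linear_accuracy(Xc, labels)
acc_reduced = linear_accuracy(X_reduced, labels)
print(f"full {N}-D data:        {acc_full:.3f}")
print(f"top {K} PCA components: {acc_reduced:.3f}")
```

In this setup the classes are separated only along the lowest-variance direction, so the full data should be almost perfectly linearly separable, while the top K principal components should carry close to no class information and leave the classifier near chance level.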