Every rule of thumb in data science has a counterexample. Including this one.
In this post I'd like to explore several simple, low-dimensional examples that expose how our typical intuitions about the geometry of data can be fatally flawed. This is mostly a practical post focused on examples, but there is a subtle message I'd like to convey. In essence: be careful. It is easy to draw data-based conclusions that are totally wrong.
Dimensionality reduction is not always a good idea
It is fairly common practice to reduce the dimension of the input data via some projection, typically principal component analysis (PCA), to get a lower-dimensional, more "condensed" representation. This often works fine, because the directions along which the data is separable often align with the principal axes. But this does not have to be the case: PCA ranks directions by variance, not by how well they separate the classes. See the synthetic example below:
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy.stats import ortho_group
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt

N = 10   # Dimension of the data
M = 500  # Number of samples

# Random rotation matrix
R = ortho_group.rvs(dim=N)

# Data variances
variances = np.sort(np.random.rand((N)))[::-1]…
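For a quick way to see the effect in isolation, here is a minimal, self-contained sketch. It is not the listing above but a simplified variant built on my own assumptions: two Gaussian classes that differ only along the lowest-variance axis, a shift of ±1.5 along that axis, fixed random seeds, and a plain logistic regression in place of the MLP to keep things fast and deterministic. The names `stds`, `acc_full` and `acc_pca` are made up for this illustration.

```python
import numpy as np
from scipy.stats import ortho_group
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

N = 10   # dimension of the data
M = 500  # samples per class

# Assumed per-axis standard deviations, largest first.
# The last axis has the smallest spread.
stds = np.linspace(5.0, 0.5, N)

# Two classes that differ ONLY along the last (low-variance) axis.
y = np.repeat([0, 1], M)
X = rng.normal(size=(2 * M, N)) * stds
X[:, -1] += np.where(y == 1, 1.5, -1.5)

# Random rotation, so the separating direction is not a coordinate axis.
R = ortho_group.rvs(dim=N, random_state=0)
X = X @ R.T

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Classifier on the full 10-dimensional data.
clf_full = LogisticRegression(max_iter=1000).fit(X_train, y_train)
acc_full = clf_full.score(X_test, y_test)

# Same classifier after keeping only the top 2 principal components.
pca = PCA(n_components=2).fit(X_train)
clf_pca = LogisticRegression(max_iter=1000).fit(pca.transform(X_train), y_train)
acc_pca = clf_pca.score(pca.transform(X_test), y_test)

print(f"accuracy, all {N} dimensions: {acc_full:.3f}")
print(f"accuracy, top 2 principal components: {acc_pca:.3f}")
print(f"variance explained by those 2 components: "
      f"{pca.explained_variance_ratio_.sum():.3f}")
```

In this setup the two retained components carry the largest share of the variance, yet the classifier trained on them should do little better than a coin flip, while the classifier on the full data separates the classes almost perfectly. The separating direction was simply among the ones PCA threw away.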