Transforming high-dimensional data into fewer dimensions while preserving essential structure.
Dimensionality reduction is a family of techniques that transform high-dimensional datasets into lower-dimensional representations, retaining as much meaningful structure as possible. In machine learning, models trained on data with many features often suffer from the "curse of dimensionality" — a phenomenon where the volume of the feature space grows so rapidly that available training data becomes sparse, leading to overfitting, poor generalization, and prohibitive computational costs. By compressing data into fewer dimensions, dimensionality reduction counteracts these effects and makes downstream modeling more tractable.
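One concrete symptom of the curse of dimensionality is distance concentration: as the number of dimensions grows, the gap between the nearest and farthest neighbor of a point shrinks relative to the distances themselves, so proximity-based methods lose discriminative power. The following sketch (using NumPy and SciPy; the function name and sample sizes are illustrative choices, not from any particular library) measures this effect on uniformly random points:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

def relative_contrast(dim, n_points=500):
    """(d_max - d_min) / d_min over all pairwise Euclidean distances
    of points drawn uniformly from the unit hypercube in `dim` dimensions."""
    pts = rng.random((n_points, dim))
    d = pdist(pts)  # condensed vector of pairwise distances
    return (d.max() - d.min()) / d.min()

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 2))
```

The printed contrast shrinks steadily as the dimension rises: in 2-D some point pairs are vastly closer than others, while in 1000-D nearly all pairs sit at almost the same distance.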
Techniques fall broadly into two categories: linear and nonlinear. Linear methods like Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) project data onto lower-dimensional subspaces by finding directions of maximum variance or maximum class separability, respectively. Nonlinear methods — often called manifold learning techniques — include t-Distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP), and autoencoders, which can capture complex curved structures in data that linear projections would distort or destroy. Each approach involves trade-offs between computational cost, interpretability, and fidelity to the original data geometry.
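As a minimal illustration of the linear case, the sketch below (assuming scikit-learn is available; the synthetic data and mixing matrix are made up for the example) applies PCA to 3-D points that lie near a 2-D plane, recovering the two directions that carry almost all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Synthetic 3-D data that actually lives near a 2-D plane:
# two latent coordinates pushed through a fixed mixing matrix, plus small noise.
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.4]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 3))

# Project onto the two directions of maximum variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)                     # (200, 2)
print(pca.explained_variance_ratio_)  # first two components carry nearly all variance
```

Because the noise is small, the two retained components explain well over 99% of the variance, and the third dimension can be discarded with almost no loss.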
Dimensionality reduction serves two broad purposes in practice. First, it acts as a preprocessing step that improves the performance and efficiency of classifiers, clustering algorithms, and regression models by removing redundant or noisy features. Second, it enables visualization: projecting data into two or three dimensions allows practitioners to inspect cluster structure, detect outliers, and build intuition about dataset geometry — insights that are simply inaccessible in the original high-dimensional space. Tools like t-SNE and UMAP have become especially popular for visualizing embeddings produced by deep neural networks.
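The visualization use case can be sketched with t-SNE from scikit-learn; here two well-separated Gaussian clusters in 50 dimensions (synthetic data invented for this example) are projected to 2-D, the form in which they could be scatter-plotted:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Two well-separated Gaussian clusters in 50 dimensions.
a = rng.normal(loc=0.0, size=(100, 50))
b = rng.normal(loc=5.0, size=(100, 50))
X = np.vstack([a, b])

# Embed into 2-D; perplexity trades off local vs. global structure.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (200, 2)
```

The resulting 2-D coordinates can be passed straight to a scatter plot, where the two clusters appear as distinct groups; note that t-SNE preserves neighborhood structure, not global distances, so spacing between clusters in the plot should not be over-interpreted.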
As datasets have grown larger and more complex — spanning genomics, computer vision, natural language processing, and beyond — dimensionality reduction has become an indispensable part of the machine learning workflow. Modern deep learning has also blurred the line between feature learning and dimensionality reduction: the internal representations learned by neural networks are themselves a form of learned dimensionality reduction, compressing raw inputs into compact, task-relevant codes.
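The idea that a neural network's internal representation is itself a learned dimensionality reduction can be made concrete with the simplest possible case: a linear autoencoder whose 2-unit bottleneck forces 10-D inputs through a compressed code. This is a hand-rolled NumPy sketch with manually derived gradients, written for illustration rather than as production training code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data lying near a 2-D subspace of R^10.
Z = rng.normal(size=(256, 2))
X = Z @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(256, 10))

# Linear autoencoder: a 2-unit bottleneck forces a compressed code.
W_enc = 0.1 * rng.normal(size=(10, 2))  # encoder weights: 10 -> 2
W_dec = 0.1 * rng.normal(size=(2, 10))  # decoder weights: 2 -> 10
lr = 0.01

def recon_loss(X, W_enc, W_dec):
    """Mean squared reconstruction error per sample."""
    return ((X @ W_enc @ W_dec - X) ** 2).sum() / len(X)

initial = recon_loss(X, W_enc, W_dec)
for _ in range(500):
    H = X @ W_enc                      # codes: the learned low-dim representation
    G = 2 * (H @ W_dec - X) / len(X)   # gradient of loss w.r.t. the reconstruction
    grad_dec = H.T @ G
    grad_enc = X.T @ (G @ W_dec.T)
    W_enc -= lr * grad_enc
    W_dec -= lr * grad_dec

final = recon_loss(X, W_enc, W_dec)
print(initial, final)  # reconstruction error drops as the code captures the subspace
```

A linear autoencoder trained this way converges toward the same subspace PCA would find; replacing the linear maps with nonlinear layers yields the deep autoencoders mentioned above, whose bottleneck activations serve as compact, task-relevant codes.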