The number of independent axes defining a vector space used to represent data.
In machine learning, dimension refers to the number of independent axes—or features—in the vector space used to represent data points. A single image patch described by pixel intensities, a word represented as a dense embedding, or a tabular record with dozens of measured attributes each occupies a space whose size is determined by how many coordinates are needed to uniquely locate any point within it. Choosing the right dimensionality is one of the most consequential decisions in model design, directly affecting what patterns a representation can capture and how efficiently it can be learned.
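The idea that dimensionality is just the number of coordinates needed to locate a point can be made concrete with a small sketch (using NumPy; the attribute values are invented for illustration):

```python
import numpy as np

# A 4x4 grayscale image patch flattened into pixel intensities:
# 16 coordinates are needed, so it lives in a 16-dimensional space.
patch = np.random.rand(4, 4).ravel()

# A tabular record with three measured attributes (hypothetical
# values: age, height in cm, income in thousands) lives in a
# 3-dimensional space.
record = np.array([34.0, 172.5, 58.2])

print(patch.shape)   # (16,)
print(record.shape)  # (3,)
```

The shape of each vector is exactly its dimensionality: the count of independent axes its space has.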
Higher-dimensional spaces allow richer, more expressive representations. Word embeddings, for instance, typically use 100 to 1,000 dimensions so that geometric relationships between vectors can encode semantic similarity, analogy, and syntactic structure simultaneously. Transformer-based language models work in embedding spaces of 768 to several thousand dimensions, enabling them to disentangle subtle contextual distinctions. The trade-off is computational cost: operations on high-dimensional vectors are expensive, and storing millions of such vectors demands significant memory.
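The "geometric relationships encode semantic similarity" claim is usually operationalized as cosine similarity between embedding vectors. A minimal sketch with made-up 4-dimensional toy vectors (real embeddings use hundreds of dimensions, and the values below are invented purely for illustration):

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means identical
    # direction, 0.0 means orthogonal (unrelated in this geometry).
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "embeddings" (hypothetical values).
cat   = np.array([0.9, 0.8, 0.1, 0.0])
dog   = np.array([0.8, 0.9, 0.2, 0.1])
stone = np.array([0.0, 0.1, 0.9, 0.8])

print(cosine_similarity(cat, dog))    # high: semantically close
print(cosine_similarity(cat, stone))  # low: semantically distant
```

With more dimensions, many such pairwise relationships can be encoded simultaneously, which is why richer tasks tend to demand larger embedding sizes.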
A well-known hazard of high dimensionality is the curse of dimensionality—as the number of dimensions grows, the volume of the space expands exponentially, causing data points to become increasingly sparse and distances between them to lose discriminative power. This makes density estimation, nearest-neighbor search, and many learning algorithms progressively harder to apply reliably. Practitioners counter this through dimensionality reduction techniques such as PCA, t-SNE, and UMAP, which project data into lower-dimensional spaces while preserving the most informative structure.
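Both effects described above can be demonstrated in a few lines. The sketch below (parameters are arbitrary) draws uniform random points, measures how the gap between the nearest and farthest pairwise distances shrinks as dimension grows, then projects high-dimensional data onto its top principal components via the SVD, which is one standard way to implement PCA:

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_distances(X):
    # Euclidean distances via the Gram-matrix identity
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y  (avoids a huge
    # intermediate array in high dimensions).
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.sqrt(np.maximum(d2, 0.0))

def distance_contrast(n_points, dim):
    # Relative spread of pairwise distances, (max - min) / min.
    # As dim grows this shrinks toward 0: distances "concentrate"
    # and lose discriminative power.
    X = rng.random((n_points, dim))
    d = pairwise_distances(X)[np.triu_indices(n_points, k=1)]
    return (d.max() - d.min()) / d.min()

for dim in (2, 10, 1000):
    print(dim, round(distance_contrast(200, dim), 3))

def pca_project(X, k):
    # Center the data, then use the SVD to project onto the k
    # directions of greatest variance (the principal components).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

X = rng.normal(size=(500, 50))
print(pca_project(X, 2).shape)  # (500, 2)
```

The contrast numbers drop sharply with dimension, which is exactly why nearest-neighbor search degrades, and the PCA projection keeps the most informative directions while discarding the rest.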
The practical importance of dimensionality became especially clear with the rise of dense vector representations in natural language processing during the 2000s and 2010s. Methods like Latent Semantic Analysis, Word2Vec, and GloVe demonstrated that a carefully chosen number of dimensions could compress vast co-occurrence statistics into compact, generalizable representations. Today, selecting embedding dimension remains a core hyperparameter tuning decision across domains ranging from recommendation systems and graph neural networks to protein structure prediction and multimodal learning.