Converting raw data into numerical vectors so machine learning algorithms can process it.
Vectorization is a foundational preprocessing technique in machine learning that transforms raw data—text, images, audio, or categorical variables—into numerical vectors that algorithms can compute over. Because most ML models operate on matrices of real numbers, any data that doesn't arrive in that form must be encoded before training or inference can proceed. The choice of vectorization strategy directly shapes what information is preserved and what is discarded, making it one of the most consequential decisions in a machine learning pipeline.
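As a concrete illustration of that encoding step, here is a minimal sketch that one-hot encodes a small categorical feature into a numeric matrix; the `colors` values are invented for illustration, and real pipelines would typically use a library encoder instead of hand-rolling this:

```python
import numpy as np

# A hypothetical categorical feature a model cannot consume as raw strings.
colors = ["red", "green", "blue", "green"]

# Map each distinct category to a column index, then to a one-hot row vector.
vocab = {c: i for i, c in enumerate(sorted(set(colors)))}
X = np.zeros((len(colors), len(vocab)))
for row, c in enumerate(colors):
    X[row, vocab[c]] = 1.0

print(X)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]
```

Each row is now a real-valued vector, so the data can participate in the matrix arithmetic that training and inference require.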
For text data, vectorization methods range from simple bag-of-words and one-hot encoding to weighted schemes like TF-IDF, which balances term frequency against how common a word is across documents. Dense word embeddings such as Word2Vec, GloVe, and contextual representations from transformer models like BERT represent a more sophisticated tier, mapping words or sentences into continuous vector spaces where geometric proximity reflects semantic similarity. For images, vectorization may be as straightforward as flattening pixel arrays into a 1D vector, or as rich as extracting feature maps from intermediate layers of a convolutional neural network. Tabular data requires its own strategies—ordinal encoding for ordered categories, one-hot encoding for nominal ones, and normalization for continuous features.
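A brief sketch of two of these strategies, assuming scikit-learn (1.0 or later, for `get_feature_names_out`) and NumPy are available; the toy corpus and the 28×28 image are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats chase dogs",
]

# TF-IDF weights each term's in-document frequency against its corpus rarity.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse matrix: (3 docs, vocabulary size)
print(X.shape)
print(vectorizer.get_feature_names_out())

# Image case: flatten a hypothetical 28x28 grayscale image into a 784-dim vector.
img = np.random.rand(28, 28)
flat = img.reshape(-1)                 # shape: (784,)
```

The result of `fit_transform` is sparse because each document contains only a small fraction of the vocabulary, which is part of what makes bag-of-words representations cheap to store and search.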
The importance of vectorization extends beyond mere format conversion. A well-chosen representation can make an otherwise intractable problem tractable: sparse TF-IDF vectors enabled scalable document retrieval long before deep learning, while dense embeddings unlocked transfer learning across NLP tasks. Poorly chosen representations, by contrast, can introduce noise, destroy structure, or inflate dimensionality to the point where models fail to generalize. The field of representation learning is in many ways a systematic effort to automate and optimize vectorization itself, learning the best encoding directly from data rather than engineering it by hand.
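To make the retrieval claim concrete, here is a hedged sketch of similarity search over TF-IDF vectors using scikit-learn; the three documents and the query are invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "machine learning with sparse vectors",
    "cooking recipes for pasta",
    "deep learning and neural networks",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)

# Encode the query in the same vector space, then rank documents by
# cosine similarity to it.
query_vec = vectorizer.transform(["neural network learning"])
scores = cosine_similarity(query_vec, doc_vectors)[0]
ranked = scores.argsort()[::-1]
print([docs[i] for i in ranked])   # most similar document first
```

Because both documents and queries live in the same vector space, retrieval reduces to cheap geometric comparison, which is what made this approach scale.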
Vectorization also carries a second, related meaning in numerical computing: the use of SIMD (single instruction, multiple data) hardware instructions to apply operations across entire arrays simultaneously rather than element by element. This computational sense of vectorization is what makes libraries like NumPy and frameworks like PyTorch and TensorFlow fast in practice, and the two meanings converge in ML workflows where data must first be encoded into vectors and then processed efficiently at scale.
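The sketch below illustrates this computational sense of the term, comparing an element-by-element Python loop against the equivalent single NumPy call; the array size and the one-shot timing harness are arbitrary choices for illustration:

```python
import timeit
import numpy as np

x = np.random.rand(1_000_000)

def loop_sum_squares():
    total = 0.0
    for v in x:          # one interpreted iteration per element
        total += v * v
    return total

def vectorized_sum_squares():
    return np.dot(x, x)  # one call into compiled, SIMD-capable code

print(timeit.timeit(loop_sum_squares, number=1))
print(timeit.timeit(vectorized_sum_squares, number=1))
```

On typical hardware the vectorized call runs orders of magnitude faster, since the per-element work happens in compiled code that can exploit SIMD instructions rather than in the Python interpreter loop.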