The examples, typically labeled input-output pairs, used to teach a machine learning model.
Training data is the foundational input that machine learning models learn from — a collection of examples, each typically consisting of an input paired with a desired output. During the training process, a model adjusts its internal parameters by repeatedly comparing its predictions against the correct answers in this dataset, gradually minimizing the error between the two. The composition of training data directly shapes what patterns a model can recognize and what behaviors it will exhibit, making data collection and curation as important as model architecture itself.
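A minimal sketch of that loop, assuming a toy linear model trained by gradient descent on synthetic input-output pairs (the data, learning rate, and epoch count below are illustrative, not prescriptive):

```python
import numpy as np

# Synthetic training data: inputs paired with desired outputs (y = 2x + 1, plus noise).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * X + 1.0 + rng.normal(scale=0.1, size=100)

w, b = 0.0, 0.0   # internal parameters the model will adjust
lr = 0.1          # learning rate: step size for each adjustment

for epoch in range(200):
    pred = w * X + b              # model predictions for every training input
    err = pred - y                # compare predictions against the correct answers
    loss = np.mean(err ** 2)      # mean squared error over the dataset
    # Gradient descent: nudge each parameter in the direction that reduces the error.
    w -= lr * np.mean(2.0 * err * X)
    b -= lr * np.mean(2.0 * err)

print(f"learned w={w:.2f}, b={b:.2f}, loss={loss:.4f}")  # w and b approach 2.0 and 1.0
```

Each pass over the dataset shrinks the gap between the model's predictions and the labels; the parameters the model ends up with reflect whatever patterns the data contains.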
The quality, quantity, and diversity of training data are critical determinants of model performance. A dataset that is too small may leave a model unable to capture meaningful patterns; one that is biased or unrepresentative will cause the model to perform poorly on the real-world inputs it was never exposed to. Two classic failure modes are tied to data problems: overfitting, where a model memorizes its training examples rather than learning generalizable patterns, and underfitting, where the model fails to capture the underlying structure of even its training data, often because that data is too sparse or uninformative. Techniques such as data augmentation, stratified sampling, and careful labeling protocols are used to address these challenges.
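As one concrete illustration of these techniques, scikit-learn's train_test_split supports stratified sampling, which preserves class proportions between the training and held-out splits; the imbalanced toy dataset below is hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 90 examples of class 0, 10 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 90/10 class ratio in both splits, so the rare class
# cannot end up missing from the held-out evaluation data by chance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(np.bincount(y_train))  # [72  8] -> same 90/10 proportion as the full set
print(np.bincount(y_test))   # [18  2]
```

Data augmentation plays a complementary role, expanding a small dataset by applying label-preserving transformations (for images, flips and crops, for example) to existing examples.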
As machine learning has scaled, the demands on training data have grown dramatically. Modern deep learning systems — particularly large language models and vision models — require datasets spanning billions of examples, often scraped from the web, digitized from archives, or generated synthetically. This scale has introduced new concerns around data provenance, consent, copyright, and the amplification of societal biases embedded in historical records. The curation and governance of training data have consequently become a distinct and active area of research.
Training data is central to nearly every machine learning discipline, from supervised and unsupervised learning to reinforcement learning, where interaction logs or simulated environments serve as the training signal. Its influence on model behavior has led to the widely cited principle that in machine learning, data is often more valuable than the algorithm itself — a recognition that no amount of architectural sophistication can compensate for a poorly constructed dataset.
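As a sketch of the reinforcement learning case, the training signal is often stored as logged environment transitions; the record layout and buffer size below are illustrative, not a standard API:

```python
import random
from collections import deque, namedtuple

# One logged environment interaction: the basic unit of training data in many
# reinforcement learning setups (field names here are illustrative).
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

# A bounded replay buffer: old interaction logs are evicted as new ones arrive.
buffer = deque(maxlen=10_000)
buffer.append(Transition(state=[0.1, -0.3], action=1, reward=0.0,
                         next_state=[0.2, -0.1], done=False))

# Training consumes random minibatches of logged transitions rather than
# fixed (input, label) pairs.
batch = random.sample(list(buffer), k=min(32, len(buffer)))
```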