Envisioning is an emerging technology research institute and advisory.


Dataset

A structured collection of data used to train, validate, and evaluate machine learning models.

Year: 1962 · Generality: 968

A dataset is an organized collection of data points—examples, observations, or records—used to train, validate, and test machine learning models. Datasets can take many forms depending on the task at hand: tabular spreadsheets for structured prediction, image libraries for computer vision, text corpora for natural language processing, or complex multimodal collections combining several data types. Each individual entry typically consists of input features and, in supervised settings, an associated label or target value that the model learns to predict.
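The feature/label structure described above can be sketched as follows. This is a minimal illustration with hypothetical house-price records (not data from the source), showing how each entry pairs input features with a target label in a supervised setting:

```python
# Hypothetical supervised dataset: each entry pairs input features
# with the label (here, a sale price) the model learns to predict.
dataset = [
    {"features": {"sqft": 1400, "bedrooms": 3}, "label": 310_000},
    {"features": {"sqft": 900,  "bedrooms": 2}, "label": 215_000},
    {"features": {"sqft": 2100, "bedrooms": 4}, "label": 480_000},
]

# Separate inputs from targets, as most training loops expect.
X = [entry["features"] for entry in dataset]
y = [entry["label"] for entry in dataset]
print(len(X), len(y))  # 3 3
```

In an unsupervised setting the `label` field would simply be absent; the structure of the feature records stays the same.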

In practice, datasets are almost always partitioned into distinct subsets serving different purposes. The training set exposes the model to examples from which it learns patterns; the validation set guides hyperparameter tuning and architecture decisions without contaminating the final evaluation; and the held-out test set provides an unbiased estimate of real-world performance. The relative sizes of these splits, commonly 70/15/15 or 80/10/10, depend on the total volume of available data and the sensitivity of the task.
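A 70/15/15 partition like the one mentioned above can be implemented in a few lines. This is a sketch using only the standard library; the function name and seed are illustrative, not from the source:

```python
import random

def split_dataset(data, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle and partition a dataset into train/validation/test subsets.

    The test set receives whatever remains after the train and
    validation fractions (here 15%). Shuffling with a fixed seed
    keeps the split reproducible.
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

Shuffling before splitting matters: if the data were sorted (by date, class, or source), a naive slice would give the model a systematically different distribution than the test set.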

Data quality is arguably the single most important factor in determining model performance—often more influential than algorithmic choices. Practitioners invest substantial effort in data cleaning (removing duplicates, correcting errors), normalization (scaling features to comparable ranges), and augmentation (synthetically expanding the dataset through transformations like cropping or paraphrasing). Poorly curated datasets introduce biases that propagate directly into model behavior, making responsible data collection and documentation an ethical as well as technical concern.
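Two of the preparation steps named above, cleaning and normalization, can be sketched together. This toy example (hypothetical values, and it assumes at least two distinct entries survive cleaning) deduplicates, drops missing values, and min-max scales the result to [0, 1]:

```python
def clean_and_normalize(values):
    """Drop missing entries, deduplicate, then min-max scale to [0, 1].

    Assumes at least two distinct non-None values remain; otherwise
    the min-max denominator would be zero.
    """
    cleaned = sorted({v for v in values if v is not None})
    lo, hi = min(cleaned), max(cleaned)
    return [(v - lo) / (hi - lo) for v in cleaned]

# Duplicates and a missing value are removed before scaling.
scaled = clean_and_normalize([10, None, 20, 20, 40])
print(scaled)
```

Augmentation, the third step mentioned, is task-specific (cropping for images, paraphrasing for text) and usually handled by domain libraries rather than hand-rolled code like this.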

Benchmark datasets have been transformative for the field, enabling fair comparisons across research groups and catalyzing rapid progress. Landmark examples include MNIST for handwritten digit recognition, ImageNet for large-scale visual recognition, and the Penn Treebank for language modeling. The public release of large, carefully labeled datasets has repeatedly unlocked new capabilities—ImageNet's publication in 2009, for instance, directly enabled the deep learning breakthroughs of the following decade. As models grow larger and more capable, the curation, governance, and licensing of datasets have become central concerns in both research and industry.

Related

Training Data

The labeled examples used to teach a machine learning model.

Generality: 920
Golden Dataset

A curated, high-quality reference dataset used to benchmark and evaluate AI models.

Generality: 520
Test Set

A held-out dataset used to evaluate a trained model's real-world generalization.

Generality: 820
Validation Data

A held-out dataset used to tune and evaluate models during training.

Generality: 820
Labeled Example

A data point paired with a known output used to train supervised learning models.

Generality: 794
Validation Set

A held-out dataset used to tune hyperparameters and guide model development.

Generality: 820