Envisioning is an emerging technology research institute and advisory.


EDA (Exploratory Data Analysis)

Analyzing datasets through statistics and visualization before formal modeling begins.

Year: 1977 · Generality: 838

Exploratory Data Analysis (EDA) is the practice of systematically investigating a dataset before applying formal statistical models or machine learning algorithms. Rather than jumping straight to hypothesis testing or predictive modeling, EDA encourages analysts to first develop an intuitive understanding of the data's structure, distributions, and relationships. This involves computing summary statistics such as means, medians, variances, and correlations, alongside visual tools like histograms, scatter plots, box plots, heatmaps, and pair plots. The goal is to let the data reveal its own story before assumptions are imposed upon it.
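The summary statistics and correlation checks described above can be sketched in a few lines of pandas. This is a minimal illustration on a synthetic dataset; the column names ("age", "income") and the random data are assumptions, not from the source.

```python
import numpy as np
import pandas as pd

# Synthetic dataset standing in for real data (names are illustrative)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "income": rng.lognormal(mean=10, sigma=0.5, size=200),
})

# Summary statistics: count, mean, std, quartiles per column
summary = df.describe()

# Robust central tendency for a skewed feature
median_income = df["income"].median()

# Pairwise Pearson correlations between numeric columns
corr = df.corr(numeric_only=True)

print(summary)
print(corr)
```

Visual counterparts follow the same pattern: `df.hist()` for histograms, `df.plot.scatter(x="age", y="income")` for a scatter plot, or `df.boxplot()` for box plots via matplotlib.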

In practice, EDA serves several interconnected purposes. It helps identify missing values, duplicate records, and outliers that could distort downstream models. It reveals the shape of feature distributions, which informs decisions about normalization, transformation, or encoding. It uncovers correlations between variables, highlighting potential multicollinearity or useful interaction terms. And it surfaces unexpected patterns or anomalies that may indicate data quality issues or genuinely interesting phenomena worth investigating further. Modern tools like pandas, seaborn, matplotlib, and dedicated EDA libraries such as ydata-profiling have made this process faster and more reproducible.
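The data-quality checks listed above (missing values, duplicate records, outliers) map directly onto standard pandas idioms. A minimal sketch, using a tiny hypothetical DataFrame and Tukey's 1.5 × IQR fence for outlier flagging:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value, one duplicate row,
# and one obvious outlier (500.0) planted for illustration
df = pd.DataFrame({
    "price": [10.0, 12.0, 11.5, np.nan, 10.8, 500.0, 12.0],
    "units": [3, 4, 4, 2, 3, 3, 4],
})

missing = df.isna().sum()          # missing values per column
n_dupes = df.duplicated().sum()    # fully duplicated rows

# Flag outliers outside the 1.5 * IQR fences on "price"
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
fence_lo, fence_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df["price"] < fence_lo) | (df["price"] > fence_hi)]

print(missing["price"], n_dupes, len(outliers))
```

Libraries such as ydata-profiling automate exactly these checks, producing a single HTML report of missingness, duplicates, distributions, and correlations.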

EDA is especially important in machine learning workflows because model performance is tightly coupled to data quality and feature understanding. A practitioner who skips EDA risks building models on flawed assumptions — feeding a linear model highly skewed features, for instance, or training a classifier on a severely imbalanced target variable without realizing it. EDA also guides feature engineering decisions, helping practitioners decide which variables to include, combine, or discard before training begins. In this sense, EDA is not merely a preliminary step but an ongoing dialogue between the analyst and the dataset.
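The two pitfalls mentioned above, a heavily skewed feature and an imbalanced target, are cheap to detect before training. A sketch on synthetic data (the column names and distributions are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    # Right-skewed feature, as lognormal data typically is
    "revenue": rng.lognormal(mean=8, sigma=1.0, size=1000),
    # Severely imbalanced binary target (~5% positives)
    "churned": rng.choice([0, 1], size=1000, p=[0.95, 0.05]),
})

# Sample skewness well above 1 suggests a transform before a linear model
skew = df["revenue"].skew()

# Class balance check: the minority share drives resampling/weighting choices
balance = df["churned"].value_counts(normalize=True)

# A log1p transform tames the skew
log_skew = np.log1p(df["revenue"]).skew()

print(f"skew={skew:.2f}, log-skew={log_skew:.2f}, minority={balance.min():.2%}")
```

Catching these in EDA determines concrete modeling decisions: transforming or binning the skewed feature, and stratifying, resampling, or reweighting for the imbalanced target.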

The concept was formalized by statistician John Tukey, whose 1977 book Exploratory Data Analysis argued for flexible, visual, and inductive approaches to data investigation as a counterweight to rigid confirmatory statistics. Tukey's contributions — including the box plot and stem-and-leaf plot — remain standard tools today. In the machine learning era, EDA has grown in scope and tooling, but its core philosophy remains unchanged: understand your data deeply before you model it.

Related

Data Analysis
Systematic examination of datasets to extract patterns, insights, and actionable knowledge.
Generality: 928

EDL (Experimentation Driven Learning)
A learning paradigm where AI agents improve by actively experimenting within their environment.
Generality: 322

ECDF (Empirical Cumulative Distribution Function)
A step-function estimator of a dataset's probability distribution requiring no parametric assumptions.
Generality: 692

TDA (Topological Data Analysis)
Applies algebraic topology to extract robust, shape-based features from high-dimensional data.
Generality: 520

Eval (Evaluation)
Measuring an AI model's performance against defined metrics and datasets.
Generality: 838

Anomaly Detection
Identifying data points that deviate significantly from expected or normal behavior.
Generality: 840