Envisioning is an emerging technology research institute and advisory.

Chunking

Breaking data into meaningful segments to improve processing and comprehension.

Year: 1970
Generality: 627

Chunking is the process of dividing continuous or complex data into discrete, meaningful segments—called chunks—to make that data easier to process, store, and reason about. Originally a concept from cognitive psychology, it entered AI and machine learning as a practical technique for managing the complexity of structured and unstructured data. In natural language processing, chunking typically refers to shallow parsing: identifying and grouping adjacent tokens into syntactic units such as noun phrases or verb phrases without constructing a full parse tree. This intermediate representation sits between raw tokenization and deep syntactic analysis, offering a computationally efficient way to extract structure from text.
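
As a rough illustration of the shallow parsing described above, the sketch below groups adjacent part-of-speech-tagged tokens into noun phrases without building a parse tree. The function name `np_chunks`, the tag names (`DET`, `ADJ`, `NOUN`), and the pattern "optional determiner, any adjectives, one or more nouns" are assumptions chosen for the example, not a standard grammar.

```python
from typing import List, Tuple

def np_chunks(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Group runs matching DET? ADJ* NOUN+ into noun-phrase chunks.

    `tagged` is a list of (token, pos_tag) pairs. The tag names are
    illustrative assumptions, not tied to any particular tagger.
    """
    chunks: List[List[str]] = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        # Optional determiner.
        if tagged[j][1] == "DET":
            j += 1
        # Any number of adjectives.
        while j < n and tagged[j][1] == "ADJ":
            j += 1
        # One or more nouns are required to form a chunk.
        k = j
        while k < n and tagged[k][1] == "NOUN":
            k += 1
        if k > j:
            chunks.append([token for token, _ in tagged[i:k]])
            i = k
        else:
            i += 1  # No noun phrase starts here; move on.
    return chunks

sentence = [
    ("The", "DET"), ("quick", "ADJ"), ("fox", "NOUN"),
    ("jumped", "VERB"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]
print(np_chunks(sentence))
```

The output is a flat list of phrases rather than a tree, which is what distinguishes this kind of chunking from full syntactic parsing.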

In practice, NLP chunking systems range from rule-based grammars to sequence labeling models such as conditional random fields and, more recently, neural architectures, all of which assign a chunk tag to each token. The IOB (Inside-Outside-Beginning) tagging scheme is the standard encoding: a token beginning a chunk receives a B tag, tokens continuing it receive I tags, and tokens outside any chunk receive O tags. The B and I tags usually carry the chunk type as well, for example B-NP and I-NP for a noun phrase. This framing converts chunking into a standard sequence labeling problem (one classification per token), making it straightforward to train on annotated corpora like the CoNLL-2000 shared task dataset.
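
The IOB encoding described above can be sketched as a pair of conversion functions. This is a minimal illustration under stated assumptions: chunks are given as hypothetical `(start, end_exclusive, label)` spans, the function names are made up for the example, and the decoder leniently treats a stray I tag as the start of a new chunk.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, chunk_label)

def spans_to_iob(n_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping chunk spans as one IOB tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # tokens continuing the chunk
    return tags

def iob_to_spans(tags: List[str]) -> List[Span]:
    """Decode IOB tags back into chunk spans."""
    spans: List[Span] = []
    start: Optional[int] = None
    label: Optional[str] = None
    # The trailing "O" is a sentinel that closes a chunk ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, label = i, tag[2:]
    return spans

tokens = ["The", "cat", "chased", "a", "mouse", "."]
chunks = [(0, 2, "NP"), (2, 3, "VP"), (3, 5, "NP")]
tags = spans_to_iob(len(tokens), chunks)
print(list(zip(tokens, tags)))
```

Because every token ends up with exactly one tag, any per-token classifier can be trained on the encoded form and its predictions decoded back into chunks.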

Beyond NLP, chunking appears in other ML contexts. In retrieval-augmented generation (RAG) pipelines, documents are split into chunks before being embedded and indexed, directly affecting retrieval quality and downstream generation. Chunk size and overlap are critical hyperparameters: chunks that are too small lose context, while chunks that are too large dilute relevance signals. In reinforcement learning, chunking has been studied as a mechanism for temporal abstraction—grouping sequences of actions into reusable macro-actions or skills, echoing its cognitive-science roots in how humans compress repeated action sequences into single units.
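
To make the size and overlap trade-off in RAG pipelines concrete, here is a minimal sliding-window splitter. It is a sketch under simplifying assumptions: the document is already tokenized, `size` and `overlap` are counted in tokens, and the function name `chunk_tokens` is hypothetical. Production pipelines often split on sentence or section boundaries instead of fixed windows.

```python
from typing import List

def chunk_tokens(tokens: List[str], size: int, overlap: int) -> List[List[str]]:
    """Split tokens into windows of up to `size` tokens.

    Consecutive windows share `overlap` tokens, so context that straddles
    a boundary appears in both neighbouring chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # The last window already reaches the end of the input.
    return chunks

words = "documents are split into chunks before being embedded and indexed".split()
for chunk in chunk_tokens(words, size=4, overlap=1):
    print(" ".join(chunk))
```

A larger `overlap` reduces the chance that a relevant passage is cut in half, at the cost of indexing more redundant text.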

Chunking matters because it bridges raw data and higher-level understanding. Whether segmenting sentences for information extraction, splitting documents for vector search, or abstracting action sequences in planning, the ability to identify meaningful boundaries in a stream of data is foundational to building systems that reason efficiently at multiple levels of granularity.

Related

Chunking Strategy
Grouping data into coherent segments to simplify processing and improve retrieval.
Generality: 521

Token Processing
Segmenting text into discrete units that serve as inputs for NLP models.
Generality: 720

Segmentation
Dividing images or data into meaningful regions to simplify analysis and recognition tasks.
Generality: 796

Decomposition
Breaking a complex problem into smaller, independently solvable subproblems.
Generality: 871

Context Compaction
Compressing or summarizing context to fit within a model's limited context window.
Generality: 339

Token
The basic unit of text that language models read, process, and generate.
Generality: 720