The basic unit of text that language models read, process, and generate.
A token is the atomic unit of text that a natural language processing (NLP) system operates on. Rather than processing raw character streams or entire sentences at once, models consume sequences of tokens — discrete chunks that may correspond to words, subwords, punctuation marks, or individual characters depending on the tokenization scheme in use. The process of converting raw text into tokens is called tokenization, and it sits at the very front of nearly every NLP pipeline, shaping everything that follows.
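To make the idea concrete, the short sketch below runs a sentence through an off-the-shelf tokenizer. It assumes the open-source tiktoken library (installed with `pip install tiktoken`) and its `cl100k_base` vocabulary; any other tokenizer would illustrate the same point.

```python
import tiktoken

# Load a widely used byte-level BPE vocabulary (assumes tiktoken is installed).
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits raw text into discrete pieces."
token_ids = enc.encode(text)                        # text -> list of integer token IDs
pieces = [enc.decode([tid]) for tid in token_ids]   # map each ID back to its text fragment

print(token_ids)  # one integer per token
print(pieces)     # the corresponding fragments: whole words, subwords, punctuation
```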
Modern systems most commonly use subword tokenization algorithms such as Byte Pair Encoding (BPE), WordPiece, or SentencePiece. These methods strike a balance between character-level and word-level representations: frequent words are kept intact as single tokens, while rare or unknown words are split into smaller, recognizable pieces. This approach allows a fixed vocabulary of tens of thousands of tokens to cover virtually any input text, including novel words, technical jargon, and multiple languages, without resorting to an "unknown" catch-all symbol.
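The core of BPE is easy to sketch: start from individual characters and repeatedly merge the most frequent adjacent pair into a new vocabulary symbol. The toy Python below is a simplification of the training step, not a production tokenizer, and the tiny word list is purely illustrative; it shows how frequent substrings quickly become single tokens.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    # Represent each word as a tuple of symbols, starting from single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair becomes one new symbol
        merges.append(best)
        # Rewrite every word with the chosen pair merged into a single symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

# Frequent substrings such as "lo" and "er" become single symbols after a few merges.
print(learn_bpe_merges(["low", "lower", "lowest", "newer", "wider"], 5))
```

Real implementations add details such as byte-level fallback, word-boundary markers, and a fixed merge budget that determines the final vocabulary size.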
Tokens are the currency of large language models (LLMs). A model's context window, the amount of text it can attend to at once, is measured in tokens, not words or characters. Pricing for commercial APIs is typically quoted per token. Generation speed and memory consumption grow with token count, and the cost of self-attention grows quadratically with it. Understanding what constitutes a token in a given system is therefore practically important: a single English word averages roughly 1.3 tokens under common schemes, while code, non-Latin scripts, or whitespace-heavy formatting can tokenize far less efficiently.
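Because budgets and prices are denominated in tokens, a common first step is simply to count them. The sketch below again assumes tiktoken; the context limit and per-token price are illustrative placeholders, not any provider's real figures.

```python
import tiktoken

CONTEXT_WINDOW = 8192       # hypothetical context limit, in tokens
PRICE_PER_1K_TOKENS = 0.01  # hypothetical price in dollars, for illustration only

enc = tiktoken.get_encoding("cl100k_base")

def token_stats(text):
    """Count tokens and relate the count to a (made-up) budget and price."""
    n_tokens = len(enc.encode(text))
    return {
        "tokens": n_tokens,
        "fits_in_context": n_tokens <= CONTEXT_WINDOW,
        "approx_cost_usd": round(n_tokens / 1000 * PRICE_PER_1K_TOKENS, 6),
    }

print(token_stats("The quick brown fox jumps over the lazy dog."))
```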
The choice of tokenization strategy has measurable downstream effects on model performance. A vocabulary that poorly represents a language or domain forces the model to reconstruct meaning from many small fragments, increasing sequence length and making learning harder. Conversely, a well-matched tokenizer compresses text efficiently, reduces computational cost, and gives the model cleaner, more semantically coherent units to learn from. As multilingual and multimodal models have grown in importance, tokenization design has become an increasingly active area of research in its own right.
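One quick way to observe this effect is to measure how many tokens a tokenizer needs per character of input for different kinds of text; material the vocabulary represents poorly yields noticeably higher ratios. The comparison below assumes tiktoken's `cl100k_base` vocabulary and uses illustrative sample strings; it computes the ratios rather than asserting particular values.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "English prose": "Tokenization sits at the front of the NLP pipeline.",
    "Greek prose":   "Η κατάτμηση κειμένου καθορίζει το μήκος της ακολουθίας.",
    "Python code":   "for i in range(10):\n    print(i ** 2)\n",
}

for label, text in samples.items():
    tokens_per_char = len(enc.encode(text)) / len(text)
    print(f"{label}: {tokens_per_char:.2f} tokens per character")
```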