A GPT variant that processes raw bytes instead of tokenized text or subwords.
bGPT is a variant of the GPT transformer architecture that operates directly on raw bytes rather than tokenized words or subword units. Traditional language models rely on tokenizers — algorithms that segment text into discrete units like words, wordpieces, or byte-pair encodings — before feeding data into the model. bGPT bypasses this preprocessing step entirely, treating every byte of input as a discrete token. This makes the model agnostic to language, encoding scheme, or data format, enabling it to handle arbitrary byte sequences including multilingual text, source code, binary data, and inputs containing emojis or non-standard characters without any domain-specific tokenization logic.
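Byte-level "tokenization" can be sketched in a few lines. The helper below is purely illustrative (not bGPT's actual input pipeline): it shows that mapping text or binary data to byte values yields a fixed vocabulary of 256 tokens, with no language- or format-specific logic.

```python
# Minimal sketch of byte-level input handling: every input reduces to a
# sequence of byte values in [0, 255], so the "vocabulary" is fixed at 256
# regardless of language, script, or file format.

def to_byte_tokens(data) -> list[int]:
    """Map text or binary data to a sequence of byte-value tokens."""
    if isinstance(data, str):
        data = data.encode("utf-8")  # multi-byte characters expand naturally
    return list(data)

print(to_byte_tokens("Hi"))         # [72, 105]
print(to_byte_tokens("é"))          # [195, 169] -- one character, two bytes
print(to_byte_tokens(b"\x00\xff"))  # [0, 255] -- arbitrary binary works too
```

Note how the accented character expands into two tokens: byte-level models trade a tiny, universal vocabulary for longer sequences, which is exactly the tension discussed next.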
The core challenge of byte-level modeling is sequence length. Because bytes are smaller units than words or subwords, any given input expands into a much longer sequence, which strains the self-attention mechanism of standard transformers, whose compute and memory costs grow quadratically with sequence length. bGPT and related byte-level models address this through architectural innovations such as hierarchical processing, local attention windows, or efficient attention variants that reduce the cost of attending over long sequences. These techniques allow the model to capture both fine-grained byte-level patterns and longer-range dependencies without prohibitive memory or compute requirements.
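A back-of-the-envelope cost model makes the hierarchical idea concrete. The sketch below is a simplified illustration, not the exact bGPT architecture: it assumes bytes are grouped into fixed-size patches, with full attention applied locally within each patch and globally only across patch summaries.

```python
# Illustrative cost model for hierarchical byte processing (assumed setup,
# not the exact bGPT design): local attention inside fixed-size patches,
# plus global attention over one summary position per patch.

import math

def attention_cost(seq_len: int) -> int:
    """Pairwise score count for full self-attention: seq_len ** 2."""
    return seq_len ** 2

def hierarchical_cost(n_bytes: int, patch_size: int) -> int:
    """Local attention within each patch plus global attention over patches."""
    num_patches = math.ceil(n_bytes / patch_size)
    local = num_patches * attention_cost(patch_size)  # fine-grained, per patch
    global_ = attention_cost(num_patches)             # long-range, over patches
    return local + global_

n = 16_384  # a byte sequence roughly equivalent to ~4k subword tokens
flat = attention_cost(n)
hier = hierarchical_cost(n, patch_size=16)
print(flat, hier)  # the hierarchical scheme is orders of magnitude cheaper
```

With these assumed numbers, flat attention computes 16384² ≈ 2.7 × 10⁸ pairwise scores, while the two-level scheme needs only about 1.3 × 10⁶, which is why patching makes byte-level sequence lengths tractable.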
The appeal of byte-level models lies in their universality and simplicity. By removing the tokenizer, bGPT eliminates a significant source of brittleness: tokenizers trained on one domain or language often perform poorly on others, and they can introduce artifacts that affect downstream model behavior. A byte-level model trained on sufficiently diverse data can, in principle, generalize across modalities and languages without any tokenization-related failure modes. This makes bGPT particularly attractive for tasks involving code, low-resource languages, or mixed-format documents where standard tokenizers struggle.
bGPT represents a broader research direction exploring whether the abstraction layer of tokenization is truly necessary or merely a historical convenience. While byte-level models have not yet displaced tokenized models in mainstream large-scale deployments — largely due to the computational overhead of longer sequences — ongoing efficiency research continues to close this gap, and byte-level approaches remain an active area of investigation in the pursuit of more general-purpose language models.