Training models on text and images interleaved within single sequences.
Interleaved multimodal training is a data composition strategy where text and image tokens are mixed within individual training sequences rather than processed in separate streams or batches.
Instead of training on pure text corpora and image corpora independently, interleaved training weaves multiple modalities together from the first training step. A sequence might contain a paragraph of text, followed by an image, followed by more text that discusses the image, creating natural cross-modal associations. This forces the model to learn how visual information relates to textual context in semantically grounded ways. By seeing images in their textual context — documents, code files, web pages — models develop deeper cross-modal understanding than they would from isolated modality data.
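To make the sequence layout concrete, here is a minimal sketch of flattening one document into a single interleaved token sequence. Everything in it is an assumption for illustration: the special tokens (`<img>`, `</img>`, `<img_patch>`), the fixed number of patch tokens per image, the whitespace tokenizer, and the `TextSegment` / `ImageSegment` names are hypothetical and not taken from any particular model or library.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical special tokens marking image boundaries and patch positions.
IMG_START = "<img>"
IMG_END = "</img>"
IMG_PATCH = "<img_patch>"


@dataclass
class TextSegment:
    text: str


@dataclass
class ImageSegment:
    image_id: str
    num_patches: int = 4  # assumed fixed patch-token count per image


Segment = Union[TextSegment, ImageSegment]


def interleave(segments: List[Segment]) -> List[str]:
    """Flatten a document's segments into one token sequence, keeping document order."""
    tokens: List[str] = []
    for seg in segments:
        if isinstance(seg, TextSegment):
            # Toy whitespace tokenizer standing in for a real text tokenizer.
            tokens.extend(seg.text.split())
        else:
            # Each image becomes a delimited run of placeholder patch tokens.
            tokens.append(IMG_START)
            tokens.extend([IMG_PATCH] * seg.num_patches)
            tokens.append(IMG_END)
    return tokens


# Text, then an image, then text that discusses the image.
example_tokens = interleave([
    TextSegment("A cat sits"),
    ImageSegment("cat.png", num_patches=2),
    TextSegment("on the mat"),
])
```

Because the image tokens sit between the text that precedes and follows them, a model trained on next-token prediction over such a sequence has to condition later text on the image, which is the cross-modal association the paragraph above describes.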
The key tradeoff is increased data complexity. Building high-quality interleaved corpora requires naturally multimodal documents, which are rarer than pure text or image-caption data. The training process is also more sensitive to the ratio and arrangement of modalities. Whether interleaving from step zero is strictly superior to staged multimodal training (separate modality pretraining followed by interleaved fine-tuning) remains an open question. How the benefits of interleaved training change with scale, and whether it yields emergent capabilities at large scale, is also under active research.
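The difference between the two regimes can be sketched as a data-mixing schedule. The function below is a hypothetical illustration, not a published recipe: the name `interleaved_fraction`, the schedule labels, the switch step, and the mixing value of 0.3 are all assumptions chosen only to show the shape of each schedule.

```python
def interleaved_fraction(
    step: int,
    total_steps: int,
    schedule: str,
    switch_step: int = 800,  # assumed step at which staged training begins interleaving
    mix: float = 0.3,        # assumed share of each batch drawn from interleaved documents
) -> float:
    """Return the fraction of a batch sampled from interleaved documents at `step`."""
    if not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if schedule == "from_step_zero":
        # Interleaved documents are present in every batch from the first step.
        return mix
    if schedule == "staged":
        # Single-modality data only until switch_step, interleaved data afterwards.
        return 0.0 if step < switch_step else mix
    raise ValueError(f"unknown schedule: {schedule!r}")
```

Under these assumed values, the "from_step_zero" schedule returns 0.3 at every step, while the "staged" schedule returns 0.0 before step 800 and 0.3 from step 800 onward. Comparing the two regimes fairly would require holding the total amount of interleaved data or the total compute fixed, which this sketch does not attempt.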
Further open questions include the optimal ratio of modalities at different training stages, whether automated or human-curated interleaving produces better results, and whether the semantic alignment quality of naturally interleaved data outweighs the curation overhead of finding such documents at scale.