When multimodal AI models produce confident visual analysis from images that were never provided
The mirage effect is a failure mode in multimodal AI models where the system constructs detailed visual descriptions, diagnoses, and reasoning traces from images that were silently removed from the input. Unlike hallucination — which involves generating incorrect details about a real input — the mirage effect involves building an entire fabricated perceptual reality and reasoning from it with high confidence. The term was introduced in a 2026 Stanford paper (MIRAGE) co-authored by Fei-Fei Li.
The researchers tested frontier models including GPT-5.1, Gemini 3 Pro, and Claude Opus 4.5 across six major vision benchmarks, both medical and general. When all images were silently removed while the prompts were left unchanged, the models continued to score 70-80% accuracy — describing nonexistent X-rays in detail, identifying nodules that were never shown, and diagnosing conditions from text patterns alone. None of the models detected the absence of visual input. More troubling, a text-only 3-billion-parameter model fine-tuned on the same benchmarks, with no images at all, outperformed every frontier multimodal model and even human radiologists on a held-out test set.
The findings expose a structural problem in how multimodal AI capabilities are evaluated: up to 77% of questions in standard vision benchmarks could be answered through text-pattern recognition alone, without any genuine visual understanding. This means that leaderboard scores, benchmark breakthroughs, and claims of multimodal capability may reflect linguistic shortcuts rather than actual perception. The practical implications are severe for high-stakes domains like medical imaging, where mirage-mode diagnoses showed systematic bias toward the most dangerous conditions — STEMI, melanoma, carcinoma — based on textual priors rather than visual evidence.
The mirage effect raises fundamental questions about what it means for a model to 'see.' When a system performs better in mirage mode — not knowing it is blind — than when explicitly told no image is present, the boundary between perception and confabulation becomes disturbingly unclear. The paper's benchmark cleanup methodology (B-Clean) and code are open-sourced, offering a path toward benchmarks that actually test visual understanding rather than text-pattern matching.
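The core idea behind a cleanup pass like B-Clean can be sketched as a text-only filter: answer each question with a model that never sees the image, and drop any question it gets right, since those are answerable through textual shortcuts. This is a hedged reconstruction of the general approach, not the released B-Clean code; `text_only_baseline` is a hypothetical stand-in (here a trivial keyword rule) for a real text-only model.

```python
def text_only_baseline(question: str) -> str:
    """Stand-in for a text-only model (assumption); a keyword rule here."""
    return "melanoma" if "irregular border" in question else "unknown"


def clean_benchmark(items: list[dict]) -> list[dict]:
    """Keep only questions the text-only baseline cannot answer."""
    kept = []
    for item in items:
        if text_only_baseline(item["question"]) == item["answer"]:
            continue  # solvable from text alone: a shortcut item, drop it
        kept.append(item)
    return kept


# Toy benchmark items (illustrative only).
items = [
    {"question": "Lesion with irregular border; diagnosis?",
     "answer": "melanoma"},           # answerable from the text alone
    {"question": "What structure is circled in the image?",
     "answer": "left atrium"},        # genuinely requires the image
]

cleaned = clean_benchmark(items)
print(f"kept {len(cleaned)} of {len(items)} questions")
```

On this toy set the filter discards the text-shortcut item and keeps the one that genuinely requires looking at the image, which is the property a vision benchmark needs for its scores to measure perception rather than pattern matching.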