Low-quality, generic AI-generated content that is verbose, repetitive, or contextually hollow.
"Slop" is informal slang for AI-generated content—particularly from large language models—that is technically fluent but substantively poor. It typically manifests as verbose, repetitive, or contextually hollow output that fills space without delivering genuine insight or precision. The term captures a specific failure mode distinct from factual hallucination: slop may be technically accurate yet still feel padded, generic, or disconnected from what the user actually needed. It is the textual equivalent of filler—words that satisfy surface-level coherence while missing the mark on depth or relevance.
Slop emerges from how LLMs are trained and prompted. Models optimized for human preference ratings can learn to produce responses that seem thorough and helpful—hedging extensively, restating the question, listing caveats—without actually being useful. Reinforcement learning from human feedback (RLHF) can inadvertently reward length and apparent comprehensiveness over conciseness and precision. Similarly, when models are deployed with system prompts encouraging politeness or thoroughness, the result is often bloated output that buries the answer in preamble and qualification.
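The length bias described above can be made concrete with a toy sketch. This is not a real reward model; it is a hypothetical scoring function (`toy_reward` and its weights are invented for illustration) showing how a preference signal that correlates with hedging phrases and token count can rank a padded answer above a concise, correct one.

```python
# Illustrative sketch only: a toy "reward" mixing a thoroughness proxy
# (hedging phrases that *look* careful) with a per-token length bonus,
# mimicking a preference model that has drifted toward verbosity.

def toy_reward(response: str, length_weight: float = 0.02) -> float:
    """Score a response with a hypothetical length bias.

    `base` counts hedge words that superficially signal thoroughness;
    `length_weight` adds reward per token, standing in for a learned
    "longer = more helpful" correlation. All values are illustrative.
    """
    tokens = response.split()
    hedges = {"however", "importantly", "notably", "various", "generally"}
    base = sum(1 for t in tokens if t.lower().strip(",.!") in hedges)
    return base + length_weight * len(tokens)

concise = "Use a dict for O(1) lookups."
padded = (
    "Great question! Generally speaking, there are various factors to "
    "consider. Importantly, however, a dict notably offers O(1) lookups, "
    "which is generally what you want in various common scenarios."
)

# The padded answer adds no information, yet outscores the concise one.
assert toy_reward(padded) > toy_reward(concise)
```

The point of the sketch is that nothing in the score measures whether the question was answered; a model optimized against such a signal learns to produce exactly the bloat the paragraph describes.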
The concept gained cultural traction around 2023–2024 as LLM-generated content flooded search results, content farms, customer service interfaces, and social media. Critics began using "slop" to describe not just chatbot verbosity but entire ecosystems of AI-generated articles, product descriptions, and summaries that were syntactically correct but intellectually vacant. The term extended beyond individual responses to characterize a broader degradation of information quality online, where high-volume AI output crowds out carefully crafted human writing.
For practitioners, slop is a practical alignment and evaluation challenge. Metrics like BLEU or perplexity do not capture it well, since sloppy output can score highly on fluency benchmarks while failing real users. Addressing it requires better reward modeling, tighter prompt engineering, output length constraints, and evaluation frameworks that penalize unnecessary verbosity. As LLMs become embedded in more high-stakes workflows, distinguishing genuinely useful generation from polished-sounding slop remains an open and important problem.
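One of the mitigations mentioned above, penalizing unnecessary verbosity in evaluation, can be sketched as a simple heuristic. The function name, token budget, and penalty weight here are assumptions chosen for illustration, not an established metric: two responses judged equally fluent are discounted differently once one exceeds a length budget.

```python
# Hypothetical evaluation heuristic: discount a quality score (0..1)
# by a penalty per token beyond a target budget. Weights and budget
# are illustrative assumptions, not a standard benchmark.

def length_penalized_score(quality: float, n_tokens: int,
                           budget: int = 50, penalty: float = 0.01) -> float:
    """Return `quality` reduced for each token past `budget`, floored at 0."""
    overflow = max(0, n_tokens - budget)
    return max(0.0, quality - penalty * overflow)

# Two answers a fluency-style metric would rate identically:
concise_score = length_penalized_score(0.9, n_tokens=40)   # within budget
padded_score = length_penalized_score(0.9, n_tokens=120)   # padded answer
assert concise_score > padded_score
```

A real evaluation framework would need a genuine quality estimate (human or model judgment of relevance and correctness) in place of the `quality` input; the sketch only shows how a length term separates outputs that surface-level metrics cannot distinguish.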