A prompting framework that guides LLMs to explore multiple reasoning paths simultaneously.
Tree of Thoughts (ToT) is a prompting and inference framework for large language models that structures the reasoning process as a search over a tree of intermediate "thoughts" — coherent text fragments representing partial steps toward a solution. Rather than generating a single linear chain of reasoning, ToT allows the model to branch into multiple candidate continuations at each step, evaluate the promise of each branch, and use search strategies such as breadth-first or depth-first traversal to navigate toward a final answer. This mirrors the way humans deliberate by considering several approaches before committing to one.
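To make the branching concrete, here is a minimal Python sketch of how partial reasoning states and candidate continuations might be represented. The `llm` helper, the `Node` class, and the prompt wording are illustrative assumptions for this sketch, not code from the original framework:

```python
from dataclasses import dataclass, field


def llm(prompt: str, n: int = 1) -> list[str]:
    """Stand-in for a real model call; returns n sampled completions."""
    raise NotImplementedError("wire up an actual LLM client here")


@dataclass
class Node:
    """One node in the thought tree: the sequence of thoughts so far."""
    thoughts: list[str] = field(default_factory=list)


def branch(node: Node, problem: str, k: int = 3) -> list[Node]:
    """Sample k candidate next thoughts rather than committing to one."""
    prompt = (
        f"{problem}\nSteps so far:\n"
        + "\n".join(node.thoughts)
        + "\nPropose the next step:"
    )
    return [Node(node.thoughts + [step]) for step in llm(prompt, n=k)]
```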
The mechanics of ToT involve three core components: a thought generator that produces candidate next steps, an evaluator that scores or classifies each partial solution's viability (often using the LLM itself as a judge), and a search algorithm that decides which branches to expand or prune. This design separates the generative and evaluative roles of the model, enabling systematic exploration of the problem space rather than greedy, left-to-right decoding. The framework is particularly effective on tasks requiring multi-step planning, mathematical reasoning, and combinatorial problem-solving, where a single misstep early in a chain can derail the entire solution.
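Continuing the sketch above, a simplified breadth-first variant might tie the three components together as follows. The 0-to-10 judging prompt and the `evaluate` and `tot_bfs` names are hypothetical choices for illustration:

```python
def evaluate(node: Node, problem: str) -> float:
    """LLM-as-judge: ask the model to rate a partial solution from 0 to 10."""
    judge_prompt = (
        f"{problem}\nPartial solution:\n"
        + "\n".join(node.thoughts)
        + "\nRate how promising this partial solution is on a 0-10 scale. "
        "Reply with a single number:"
    )
    reply = llm(judge_prompt, n=1)[0]
    try:
        return float(reply.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0  # an unparseable rating counts as unpromising


def tot_bfs(problem: str, depth: int = 3, k: int = 3, b: int = 2) -> Node:
    """Breadth-first ToT: expand, score, and prune to the b best branches."""
    frontier = [Node()]
    for _ in range(depth):
        # Expand every surviving node into k candidate continuations...
        children = [c for n in frontier for c in branch(n, problem, k)]
        # ...then keep only the b most promising, the pruning step that
        # plain left-to-right chain-of-thought decoding lacks.
        children.sort(key=lambda c: evaluate(c, problem), reverse=True)
        frontier = children[:b]
    return frontier[0]  # best-scored surviving node
```

Here `k` is the branching factor and `b` the beam width; a depth-first variant would replace the frontier list with a stack and backtrack when a branch scores poorly, matching the DFS traversal mentioned above.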
Tree of Thoughts matters because it addresses a significant limitation of standard chain-of-thought prompting: its vulnerability to early errors that compound without correction. By treating inference as a search problem, ToT substantially improves performance on challenging benchmarks. On Game of 24, where the goal is to combine four numbers into an expression equal to 24 (e.g., 4, 9, 10, 13 via (10 - 4) * (13 - 9)), the original paper reported GPT-4's success rate rising from 4% with chain-of-thought to 74% with ToT, and the method also helps on creative writing tasks with structural constraints. It further connects modern LLM reasoning to classical AI search literature, suggesting that deliberate, tree-structured planning is a powerful complement to the pattern-matching strengths of neural language models. The framework has influenced subsequent work on agent architectures, self-refinement, and inference-time compute scaling.