Envisioning is an emerging technology research institute and advisory.




Best-of-N

Generate multiple candidate outputs and select the best using a scoring function.

Year: 2017 · Generality: 398

Best-of-N (also written Best-of-n or BoN) is an inference-time strategy in which a model generates N independent candidate outputs and a scoring function selects the highest-ranked one. Rather than relying on a single forward pass, the approach exploits stochastic variation in the model's output distribution—through sampling with nonzero temperature, for example—to produce a diverse pool of candidates. The selected output is whichever candidate receives the highest score from a reward model, verifier, or other evaluation criterion. Because the probability of at least one high-quality sample increases with N, the strategy trades additional compute at inference time for measurable gains in output quality.
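The select-the-highest-scored-candidate loop described above can be sketched in a few lines. This is a minimal illustration with toy stand-ins: the "generator" and "scorer" below are hypothetical placeholders, not a real model or reward model.

```python
import random

def best_of_n(generate, score, n, seed=None):
    """Draw n candidate outputs and return the highest-scoring one."""
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins (hypothetical): "generation" samples a number in [0, 1),
# and the "reward model" prefers values close to a target.
target = 0.8
generate = lambda rng: rng.random()   # stochastic sampling step
score = lambda x: -abs(x - target)    # scoring function (higher is better)

single = best_of_n(generate, score, n=1, seed=0)
best = best_of_n(generate, score, n=64, seed=0)
# With a larger candidate pool, the selected output can only be
# at least as good as the single-sample baseline under this scorer.
```

Because the n=64 pool is a superset of the n=1 pool under the same seed, the selected candidate is never worse, which is the intuition behind trading extra inference compute for quality.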

Best-of-N is especially prominent in large language model (LLM) alignment and reasoning research. When a reward model trained via reinforcement learning from human feedback (RLHF) is used as the scorer, Best-of-N becomes a simple but powerful baseline for aligning model outputs to human preferences without any additional fine-tuning. In mathematical reasoning and code generation, verifiers or unit tests serve as the scoring function, allowing the system to filter out incorrect solutions. The strategy is also used as a reference point when evaluating the efficiency of more sophisticated search algorithms such as beam search, Monte Carlo Tree Search, or process-reward-guided decoding.
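For the code-generation case, the unit-test-as-scorer idea can be sketched as follows. The candidate "programs" here are hypothetical hand-written stand-ins for model samples, and `verifier_score` is an assumed name for the test-pass counter.

```python
# Hypothetical sampled "programs" for a small task (sorting a list);
# in practice these would be model-generated code candidates.
candidates = [
    lambda xs: xs,                        # incorrect: returns input unchanged
    lambda xs: sorted(xs, reverse=True),  # incorrect: descending order
    lambda xs: sorted(xs),                # correct
]

# A unit-test suite acts as the Best-of-N scoring function:
# score = number of test cases a candidate passes.
TESTS = [([3, 1, 2], [1, 2, 3]), ([], []), ([5], [5])]

def verifier_score(fn):
    passed = 0
    for inp, expected in TESTS:
        try:
            if fn(list(inp)) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate simply fails that test
    return passed

best = max(candidates, key=verifier_score)
```

Selecting by test-pass count filters out the incorrect candidates automatically, which is why verifiers make Best-of-N particularly effective on tasks with checkable answers.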

A key theoretical property of Best-of-N is its predictable scaling behavior: expected reward improves roughly logarithmically with N, and the KL divergence between the BoN distribution and the base model grows as log(N). This makes it a useful tool for studying the compute-quality tradeoff at inference time and for benchmarking how well reward models generalize. Researchers have used BoN scaling curves to compare reward model quality and to understand the limits of inference-time compute scaling.
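The log(N) divergence claim is often quoted as a closed-form bound. Assuming the scorer induces no ties among samples, the KL divergence between the Best-of-N policy and the base policy satisfies:

```latex
D_{\mathrm{KL}}\!\left(\pi_{\mathrm{BoN}} \,\Vert\, \pi\right) \;\le\; \log N - \frac{N-1}{N}
```

which grows as log N for large N, consistent with the roughly logarithmic reward gains described above.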

Despite its simplicity, Best-of-N has practical limitations. Generating N full outputs multiplies inference cost by N, making large values expensive in latency-sensitive or resource-constrained settings. The strategy is also only as good as its scorer—a miscalibrated reward model can consistently select outputs that score well but fail on true quality metrics, a phenomenon known as reward hacking. Nevertheless, its transparency and ease of implementation make it a foundational technique in modern LLM deployment and evaluation.

Related

Top-K

Selecting the k highest-scoring items from a model's output for ranking or generation.

Generality: 650
Inference Scaling

Improving model outputs by allocating more compute during inference rather than during training.

Generality: 812
NTP (Next Token Prediction)

A training objective where language models learn to predict the next token in a sequence.

Generality: 795
Reward Model Ensemble

Multiple reward models combined to produce more robust, accurate reinforcement learning feedback.

Generality: 293
1-N Systems

Architectures where a single controller or input manages multiple outputs or agents.

Generality: 398
Greedy Decoding

A sequence generation strategy that always selects the single most probable next token.

Generality: 601