Generate multiple candidate outputs and select the best using a scoring function.
Best-of-N (also written Best-of-n or BoN) is an inference-time strategy in which a model generates N independent candidate outputs and a scoring function selects the highest-ranked one. Rather than relying on a single generated sample, the approach exploits stochastic variation in the model's output distribution (for example, by sampling at nonzero temperature) to produce a diverse pool of candidates. The selected output is whichever candidate receives the highest score from a reward model, verifier, or other evaluation criterion. Because the probability of drawing at least one high-quality sample increases with N, the strategy trades additional inference-time compute for measurable gains in output quality.
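The selection loop itself is small. The sketch below is a minimal illustration rather than any particular library's API: `generate` stands in for a sampling-based generation call and `score` for whatever reward model, verifier, or heuristic is available; the toy usage at the bottom exists only to show the selection mechanics.

```python
# Minimal Best-of-N sketch. `generate` and `score` are hypothetical stand-ins
# for a sampling-based generator and a scoring function (reward model,
# verifier, heuristic); they are not a specific library's API.
import random
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # samples one candidate (nonzero temperature)
    score: Callable[[str, str], float],  # higher is better
    n: int = 16,
) -> str:
    """Generate n independent candidates and return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))


if __name__ == "__main__":
    # Toy usage: a "generator" that emits random guesses and a scorer that
    # prefers guesses close to a target value.
    target = 42
    gen = lambda _prompt: str(random.randint(0, 100))
    scr = lambda _prompt, c: -abs(int(c) - target)
    print(best_of_n("guess a number", gen, scr, n=32))
```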
Best-of-N is especially prominent in large language model (LLM) alignment and reasoning research. When the scorer is a reward model trained on human preference data, as in the reinforcement learning from human feedback (RLHF) pipeline, Best-of-N becomes a simple but powerful baseline for aligning model outputs to human preferences without any additional fine-tuning. In mathematical reasoning and code generation, verifiers or unit tests serve as the scoring function, allowing the system to filter out incorrect solutions. The strategy is also used as a reference point when evaluating the efficiency of more sophisticated search algorithms such as beam search, Monte Carlo Tree Search, or process-reward-guided decoding.
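As an illustration of verifier-based scoring, the following hypothetical sketch uses a small unit-test suite as the scorer for generated code. `generate_solution` is a placeholder for a sampling call, and each candidate is assumed to define a function named `solve`; in practice the candidate code would be executed in a sandbox rather than with a bare `exec`.

```python
# Hedged sketch: unit tests as the Best-of-N scorer for code generation.
from typing import Callable, List, Tuple


def test_pass_rate(candidate_src: str, tests: List[Tuple[tuple, object]]) -> float:
    """Run the candidate source and return the fraction of (args, expected) tests it passes."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # sketch only: sandbox this in real use
        solve = namespace["solve"]       # assumes the candidate defines `solve`
    except Exception:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if solve(*args) == expected:
                passed += 1
        except Exception:
            pass                          # runtime errors count as failures
    return passed / len(tests) if tests else 0.0


def best_of_n_code(
    prompt: str,
    generate_solution: Callable[[str], str],  # hypothetical sampling call
    tests: List[Tuple[tuple, object]],
    n: int = 8,
) -> str:
    """Sample n candidate programs and keep the one passing the most tests."""
    candidates = [generate_solution(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: test_pass_rate(c, tests))
```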
A key theoretical property of Best-of-N is its predictable scaling behavior: expected reward improves roughly logarithmically with N, and the KL divergence between the BoN distribution and the base model grows as log(N). This makes it a useful tool for studying the compute-quality tradeoff at inference time and for benchmarking how well reward models generalize. Researchers have used BoN scaling curves to compare reward model quality and to understand the limits of inference-time compute scaling.
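A commonly cited closed form makes the log(N) behavior precise. Under the idealization that no two candidates tie (for example, when scores follow a continuous distribution), the KL divergence of the Best-of-N selection distribution from the base policy is often estimated as

$$\mathrm{KL}\!\left(\pi_{\mathrm{BoN}} \,\middle\|\, \pi_{\mathrm{base}}\right) = \log N - \frac{N-1}{N},$$

which is bounded above by log(N) and approaches log(N) − 1 for large N. This estimate is what allows BoN to be plotted on the same reward-versus-KL axes used to evaluate RLHF-tuned policies.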
Despite its simplicity, Best-of-N has practical limitations. Generating N full outputs multiplies inference cost by N, making large values expensive in latency-sensitive or resource-constrained settings. The strategy is also only as good as its scorer—a miscalibrated reward model can consistently select outputs that score well but fail on true quality metrics, a phenomenon known as reward hacking. Nevertheless, its transparency and ease of implementation make it a foundational technique in modern LLM deployment and evaluation.