Inference-Time Scaling

Inference-time scaling refers to improving model performance by allocating additional computational resources during inference rather than before inference via training on larger models or more data. In the context of reasoning systems, this typically means increasing the number of computational steps (recursion depth) or the number of parallel reasoning paths (trajectory sampling) at test time. The goal is to extract more capable reasoning from a fixed model by using it more intensively.

For autoregressive language models, inference-time scaling has been explored via chain-of-thought prompting, self-consistency voting, and test-time compute allocation strategies. For recursive reasoning models, inference-time scaling operates differently: depth scaling increases the number of shared transition function applications, while trajectory scaling increases the number of independently sampled latent paths explored in parallel. Both trade latency for reasoning quality, but trajectory scaling provides solution diversity that depth scaling alone cannot.

GRAM demonstrates that inference-time scaling for recursive reasoning has two independent axes that can be combined: recursion depth (more steps per trajectory) and trajectory count (more parallel trajectories). Increasing either axis improves performance, but with different tradeoffs. Depth scaling is more memory-efficient but hits diminishing returns when the model begins looping or converging. Trajectory scaling provides orthogonal gains — more trajectories means more chances to find alternative solutions — but grows memory linearly with trajectory count.

The key insight enabling inference-time scaling for RRMs is that the transition function is shared across all recursion steps and all trajectories. This means adding compute does not require adding parameters. However, inference-time scaling is not free: deeper recursion increases latency proportionally, and parallel trajectories increase memory usage proportionally. Whether inference-time compute can substitute for model scale remains an open empirical question with implications for deploying capable reasoning systems on resource-constrained hardware.

Inference-Time Scaling

Research this in Signals

Inference-Time Scaling

Research this in Signals