
Weak-to-Strong Generalization

A stronger AI model's ability to generalize correctly beyond the imperfect supervision of a weaker model, a key challenge in AI alignment.

Year: 2023 · Generality: 650

Weak-to-Strong Generalization is the problem of how a weak AI model can supervise a stronger one without the supervisor becoming a ceiling on the trainee's capabilities. The concept was formalized by OpenAI in a 2023 paper that demonstrated a surprising result: although a model trained on a weak supervisor's labels might be expected to merely match the weak model's performance, the stronger model often generalizes beyond its supervisor, learning correct patterns that the weaker model could not identify on its own. This phenomenon is significant because it suggests that aligning superhuman models may not require equally superhuman overseers; capable-but-imperfect supervisors may still provide sufficient training signal.
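The setup can be made concrete with a toy experiment. The sketch below is a hypothetical scikit-learn illustration, not the paper's actual pipeline: a small "weak" model is trained on limited ground truth, its predictions become the only labels a higher-capacity "strong" student ever sees, and both are compared against a "ceiling" model trained directly on ground truth. The final lines compute the paper's headline metric, performance gap recovered (PGR); the dataset, model sizes, and split sizes here are arbitrary assumptions.

```python
# Toy weak-to-strong experiment (illustrative sketch; dataset, models, and
# sizes are assumptions, not the configuration used in the 2023 paper).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=6000, n_features=40, n_informative=20,
                           random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=500,
                                                  random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest,
                                                    test_size=0.5,
                                                    random_state=0)

# Weak supervisor: a small linear model trained on limited ground truth.
weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)
weak_labels = weak.predict(X_train)  # imperfect labels for the student

# Strong student: higher-capacity model that only ever sees the weak labels.
strong = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=500,
                       random_state=0).fit(X_train, weak_labels)

# Strong ceiling: same architecture trained on ground truth, for reference.
ceiling = MLPClassifier(hidden_layer_sizes=(128, 128), max_iter=500,
                        random_state=0).fit(X_train, y_train)

weak_acc = weak.score(X_test, y_test)
w2s_acc = strong.score(X_test, y_test)
ceiling_acc = ceiling.score(X_test, y_test)
print(f"weak: {weak_acc:.3f}  weak-to-strong: {w2s_acc:.3f}  "
      f"ceiling: {ceiling_acc:.3f}")

# Performance gap recovered: fraction of the weak-to-ceiling gap closed.
pgr = (w2s_acc - weak_acc) / (ceiling_acc - weak_acc)
print(f"performance gap recovered: {pgr:.2f}")
```

Whether the student actually lands above its supervisor depends on the dataset and the capacity gap; the interesting empirical fact is how often and how far it does.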

The mechanism involves standard RLHF or behavioral cloning pipelines in which a weak reward model or supervisor generates training labels for a stronger model. The key empirical finding is that stronger student models recover much of the performance gap between the weak supervisor and their own ceiling even when trained on labels from supervisors they would outperform, in some cases achieving near-ceiling performance on tasks where the supervisor had low accuracy. This generalization appears to happen because the stronger model can infer the true underlying structure of the task from the weak supervision signal in ways the weak model cannot, essentially reverse-engineering the intended behavior from imperfect demonstrations rather than simply memorizing the supervisor's outputs.
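Naive finetuning on weak labels is not the whole story: the 2023 paper also reports that an auxiliary confidence loss, which lets the student increasingly trust its own confident predictions over the weak labels, substantially improves generalization. The PyTorch sketch below is a simplified reading of that idea; the function name, the argmax hardening rule, and the fixed alpha weight are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of an auxiliary confidence loss in the spirit of the 2023 paper:
# blend cross-entropy against the weak supervisor's soft labels with
# cross-entropy against the student's own hardened predictions.
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits: torch.Tensor,
                        weak_probs: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    log_probs = F.log_softmax(strong_logits, dim=-1)

    # Term 1: imitate the weak supervisor's (possibly noisy) soft labels.
    ce_weak = -(weak_probs * log_probs).sum(dim=-1).mean()

    # Term 2: sharpen toward the student's own argmax predictions, letting
    # a confident student override weak labels it disagrees with.
    hard_self = F.one_hot(strong_logits.argmax(dim=-1),
                          num_classes=strong_logits.size(-1)).float()
    ce_self = -(hard_self * log_probs).sum(dim=-1).mean()

    return (1 - alpha) * ce_weak + alpha * ce_self
```

A natural design choice, under the same assumptions, is to ramp alpha up from zero over training so the student first absorbs the weak signal and only later bootstraps from its own predictions.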

The implications for AI safety and alignment are direct: if future models become much more capable than any available human supervisor, weak-to-strong generalization determines whether we can still train them safely using today's oversight methods. The primary concern is that beyond a certain capability gap, the stronger model may learn to satisfy the weak supervisor's stated preferences while pursuing different actual goals, a form of Goodhart's Law applied to alignment. The phenomenon also connects to model distillation, helping explain why a student trained on a teacher's outputs can sometimes exceed the teacher rather than merely copy it.

Open questions include locating the capability threshold beyond which weak supervision stops generalizing merely imperfectly and starts generalizing incorrectly, developing better methods to elicit the strong model's latent knowledge during weak supervision, and understanding whether the generalization gap correlates with specific task types or is broadly consistent across domains. The relationship between the base model's latent knowledge and what the weak supervisor can actually elicit also remains poorly understood.