A neural architecture that infers and renders 3D scenes from limited viewpoint observations.
A Generative Query Network (GQN) is a neural network architecture developed by DeepMind in 2018 that enables machines to build internal representations of visual scenes from a small number of observed viewpoints and to use those representations to render the scene from entirely new, unseen perspectives. Unlike traditional 3D reconstruction pipelines that rely on explicit geometric supervision or depth sensors, GQNs learn purely from images paired with the camera viewpoints from which they were captured, with no depth maps or geometry labels, making them a landmark example of unsupervised spatial reasoning in deep learning.
The architecture consists of two coupled components: a representation network and a generation network. The representation network processes one or more observed images of a scene, each paired with the camera pose from which it was taken, and encodes them into a compact latent summary vector; when several observations are available, their per-view encodings are summed into a single scene representation. This vector captures the scene's content without being tied to any particular viewpoint. The generation network then conditions on this representation along with a query camera pose and produces a predicted image of the scene from that new angle. Training uses a variational objective, encouraging the model to learn structured, generalizable scene representations rather than memorizing specific views.
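To make the two-network structure concrete, the following is a minimal, heavily simplified sketch in PyTorch. It is illustrative only: the paper's "tower" encoder and its recurrent, DRAW-style latent generator are replaced with a small convolutional encoder and a deterministic deconvolutional decoder, and all dimensions (64x64 RGB images, a 7-dimensional camera pose, a 256-dimensional representation) are placeholder assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn


class RepresentationNetwork(nn.Module):
    """Encodes one (image, camera pose) pair into a scene encoding.

    A simplified stand-in for the GQN paper's "tower" encoder; per-view
    outputs are intended to be summed into a single scene representation.
    """

    def __init__(self, pose_dim: int = 7, r_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),    # 64x64 -> 32x32
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),  # 32x32 -> 16x16
            nn.ReLU(),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(128 + pose_dim, r_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dims into a compact vector
        )

    def forward(self, image: torch.Tensor, pose: torch.Tensor) -> torch.Tensor:
        h = self.conv(image)                                    # (B, 128, 16, 16)
        # Tile the camera pose across the feature map, then fuse it in.
        p = pose[:, :, None, None].expand(-1, -1, h.size(2), h.size(3))
        return self.fuse(torch.cat([h, p], dim=1)).flatten(1)  # (B, r_dim)


class GenerationNetwork(nn.Module):
    """Renders a predicted image from the scene representation and a query pose.

    The paper's generator is a recurrent latent-variable model trained with a
    variational objective; this deterministic decoder only illustrates how the
    prediction is conditioned on the representation and the query viewpoint.
    """

    def __init__(self, pose_dim: int = 7, r_dim: int = 256):
        super().__init__()
        self.project = nn.Linear(r_dim + pose_dim, 128 * 16 * 16)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),    # 32 -> 64
            nn.Sigmoid(),  # pixel intensities in [0, 1]
        )

    def forward(self, r: torch.Tensor, query_pose: torch.Tensor) -> torch.Tensor:
        h = self.project(torch.cat([r, query_pose], dim=1))
        return self.deconv(h.view(-1, 128, 16, 16))  # (B, 3, 64, 64)
```

In the real model the generator is stochastic, and training maximizes an evidence lower bound (a reconstruction likelihood term minus a KL divergence between the inference and prior distributions over the latents); the deterministic stand-in above would instead be trained with a plain pixel reconstruction loss.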
What makes GQNs particularly significant is their implicit learning of 3D scene structure. The model is never told about geometry, lighting, or object identity; it infers these properties indirectly through the consistency constraints imposed by multi-view prediction. This allows GQNs to disentangle scene content from viewpoint, a capability that had previously required heavily engineered, task-specific pipelines. Experiments demonstrated that GQNs could generalize to novel scenes and novel query angles with high fidelity, even in environments with multiple objects and complex spatial relationships.
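Continuing the sketch above, the hypothetical snippet below shows the multi-view workflow that produces those consistency constraints: per-view encodings are summed into a single scene representation (as in the paper, which makes the representation invariant to the order and number of observations) and then queried from an unseen pose. The batch size, view count, and random inputs are placeholders standing in for a real multi-view dataset.

```python
# Hypothetical usage of the sketch above, with dummy data in place of a
# real multi-view dataset.
rep_net, gen_net = RepresentationNetwork(), GenerationNetwork()

batch, n_views = 4, 3
images = torch.rand(batch, n_views, 3, 64, 64)  # observed frames
poses = torch.rand(batch, n_views, 7)           # matching camera poses

# Sum the per-view encodings into one scene representation per example.
r = sum(rep_net(images[:, i], poses[:, i]) for i in range(n_views))

query_pose = torch.rand(batch, 7)       # a viewpoint the model never saw
prediction = gen_net(r, query_pose)     # (4, 3, 64, 64) predicted rendering
```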
GQNs matter because they represent a principled step toward machines that can reason about the physical world in a human-like way—forming mental models of environments from partial information. Their influence extends into robotics, where agents must navigate and act in 3D spaces from limited sensory input, and into the broader field of neural rendering, which has since expanded dramatically with techniques like NeRF. The GQN established that generative models could serve as genuine scene understanding engines, not merely image synthesis tools.