Neural networks that encode spatial relationships between features using vector-valued groups of neurons called capsules.
Capsule Networks (CapsNets) are a class of neural network architectures designed to overcome a fundamental weakness of convolutional neural networks: the loss of spatial and pose information caused by pooling operations. Where a CNN might correctly identify that eyes, a nose, and a mouth are present in an image without caring about their relative positions, a capsule network explicitly encodes the spatial relationships between features. Each capsule is a small group of neurons whose output is a vector rather than a scalar: the vector's magnitude represents the probability that a particular entity exists, while its direction encodes instantiation parameters such as position, orientation, scale, and deformation.
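The magnitude-as-probability reading is made possible by the "squash" nonlinearity from the 2017 paper, which compresses a capsule's raw output vector to a length between 0 and 1 while preserving its direction. A minimal NumPy sketch (the pose values here are arbitrary illustrative numbers):

```python
import numpy as np

def squash(s, eps=1e-8):
    # Squash nonlinearity (Sabour et al., 2017): scales the vector by
    # ||s||^2 / (1 + ||s||^2), so its length lies in (0, 1) and can be
    # interpreted as an existence probability; direction is unchanged.
    norm = np.linalg.norm(s)
    return (norm**2 / (1.0 + norm**2)) * (s / (norm + eps))

raw = np.array([3.0, 4.0])        # hypothetical raw capsule output, ||raw|| = 5
v = squash(raw)
prob = np.linalg.norm(v)          # existence probability, here 25/26 ~ 0.96
direction = v / prob              # unit direction encoding pose parameters
```

Note that long raw vectors saturate toward length 1 ("entity almost certainly present") while short ones shrink toward 0, which is what lets a single vector carry both detection confidence and pose.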
The key mechanism enabling capsule networks is dynamic routing, introduced in the landmark 2017 paper "Dynamic Routing Between Capsules" by Sara Sabour, Nicholas Frosst, and Geoffrey Hinton. Instead of fixed pooling, lower-level capsules iteratively negotiate with higher-level capsules to determine which parent capsule they should send their output to. A lower-level capsule routes its output to a higher-level capsule whose current prediction best agrees with its own — a process called routing by agreement. This allows the network to build part-whole relationships in a principled way, making it naturally equivariant to spatial transformations rather than relying on data augmentation to learn invariance.
Capsule networks matter because they address a deep theoretical concern about how CNNs represent structured visual information. They show strong performance on tasks requiring viewpoint robustness and generalize better from limited training data on certain benchmarks. However, they have not yet displaced CNNs in mainstream practice, largely due to computational cost and difficulty scaling to complex, high-resolution datasets. Research into more efficient routing algorithms and hybrid architectures continues, and capsule networks remain an important conceptual framework for thinking about how neural networks could more faithfully represent the compositional, hierarchical structure of the visual world.