A core algebraic operation that multiplies two matrices to produce a third.
Matrix multiplication is a binary operation that takes two matrices and produces a third by computing dot products between the rows of the first matrix and the columns of the second. Formally, if matrix A has dimensions m×n and matrix B has dimensions n×p, their product C = AB is an m×p matrix in which each element C[i,j] is the sum of A[i,k]×B[k,j] over k = 1, …, n. This row-column dot product structure means the inner dimensions must match, and the operation is generally not commutative: AB ≠ BA.
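To make the definition concrete, here is a minimal NumPy sketch: the textbook triple loop, checked against the library's optimized `@` operator. The function name `matmul_naive` and the example shapes are ours, for illustration only.

```python
import numpy as np

def matmul_naive(A, B):
    """Multiply A (m×n) by B (n×p) with the textbook triple loop."""
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must match"
    C = np.zeros((m, p))
    for i in range(m):          # rows of A
        for j in range(p):      # columns of B
            for k in range(n):  # the shared inner dimension
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(2, 3)
B = np.random.rand(3, 4)
assert np.allclose(matmul_naive(A, B), A @ B)
# Non-commutativity: B @ A is not even defined here (4 ≠ 2 inner dims);
# even for square matrices, A @ B generally differs from B @ A.
```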
In machine learning, matrix multiplication is arguably the single most executed computational primitive. Every forward pass through a neural network layer is essentially a matrix multiplication: the input data (batched as a matrix) is multiplied by a weight matrix, optionally followed by a bias addition and nonlinear activation. Backpropagation similarly relies on matrix multiplications to propagate gradients through layers. Attention mechanisms in transformers, convolutional operations reformulated as matrix products, and even embedding lookups (viewed as multiplication by one-hot matrices) all reduce to this same core operation at the hardware level.
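A short PyTorch sketch of a single layer's forward pass; the dimensions below are arbitrary choices for illustration, not values from the text.

```python
import torch

batch, d_in, d_out = 32, 512, 256   # illustrative sizes
x = torch.randn(batch, d_in)        # input batch as a (32×512) matrix
W = torch.randn(d_out, d_in)        # weight matrix
b = torch.zeros(d_out)              # bias

h = x @ W.T + b                     # the layer's matmul: (32×512)(512×256) → (32×256)
y = torch.relu(h)                   # nonlinear activation

# torch.nn.Linear packages the same x @ W.T + b computation:
assert torch.nn.Linear(d_in, d_out)(x).shape == y.shape
```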
The practical importance of matrix multiplication in ML is inseparable from hardware acceleration. Modern GPUs and TPUs are architecturally optimized to perform thousands of multiply-accumulate operations in parallel, making large matrix multiplications extremely fast. Libraries like cuBLAS and frameworks like PyTorch and TensorFlow expose highly tuned implementations that exploit this parallelism, enabling the training of models with billions of parameters. Algorithmic improvements, such as Strassen's algorithm (which lowers the naive O(n³) complexity to roughly O(n^2.81)), and hardware-aware techniques like tiling and mixed-precision arithmetic further push efficiency.
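To illustrate the tiling idea, here is a simplified blocked multiplication in NumPy. This is only a sketch of the cache-reuse principle: real kernels in cuBLAS and similar libraries choose tile sizes per hardware and execute on the accelerator, and the tile size below is an arbitrary assumption.

```python
import numpy as np

def matmul_tiled(A, B, tile=64):
    """Blocked matmul: work on tile×tile submatrices so each block
    stays resident in fast memory while it is reused."""
    m, n = A.shape
    _, p = B.shape
    C = np.zeros((m, p))
    for i0 in range(0, m, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, n, tile):
                C[i0:i0+tile, j0:j0+tile] += (
                    A[i0:i0+tile, k0:k0+tile] @ B[k0:k0+tile, j0:j0+tile]
                )
    return C

A, B = np.random.rand(256, 256), np.random.rand(256, 256)
assert np.allclose(matmul_tiled(A, B), A @ B)
```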
As models have grown in scale, matrix multiplication has become a bottleneck that drives hardware design decisions, chip architectures (e.g., NVIDIA's Tensor Cores), and even model design choices like low-rank factorization and sparse attention. Understanding its mechanics and computational cost is essential for anyone working on model efficiency, hardware deployment, or architecture design in modern AI.
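As a back-of-the-envelope example of why low-rank factorization pays off (the sizes below are assumptions for illustration): replacing a d×d weight matrix with a rank-r pair of factors shrinks the multiply-accumulate count per input row from d² to 2dr.

```python
# Hypothetical sizes: a d×d layer factored as W ≈ U @ V, with U: d×r and V: r×d.
d, r = 4096, 64
full_macs = d * d                 # x @ W for one input row
low_rank_macs = d * r + r * d     # (x @ U) @ V for the same row
print(f"full: {full_macs:,} MACs  low-rank: {low_rank_macs:,} MACs  "
      f"({full_macs / low_rank_macs:.0f}x fewer)")
# full: 16,777,216 MACs  low-rank: 524,288 MACs  (32x fewer)
```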