AlphaTensor

Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph.^[1] Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors (perhaps over a network).

Directly applying the mathematical definition of matrix multiplication gives an algorithm that takes on the order of $n^3$ field operations to multiply two $n \times n$ matrices over that field ($\Theta(n^3)$ in big O notation). Better asymptotic bounds on the time required to multiply matrices have been known since Strassen's algorithm in the 1960s, but the optimal time (that is, the computational complexity of matrix multiplication) remains unknown. As of April 2024, the best announced bound on the asymptotic complexity of a matrix multiplication algorithm is $O(n^{2.371552})$ time, given by Williams, Xu, Xu, and Zhou.^[2]^[3] This improves on the bound of $O(n^{2.3728596})$ time, given by Alman and Williams.^[4]^[5] However, this algorithm is a galactic algorithm: its large constant factors mean it cannot be realized practically.

Iterative algorithm

The definition of matrix multiplication is that if $C = AB$ for an $n \times m$ matrix $A$ and an $m \times p$ matrix $B$ , then $C$ is an $n \times p$ matrix with entries

c_{ij}=\sum _{k=1}^{m}a_{ik}b_{kj}.

From this, a simple algorithm can be constructed which loops over the indices $i$ from 1 through $n$ and $j$ from 1 through $p$ , computing the above using a nested loop:

Input: matrices $A$ and $B$
Let $C$ be a new matrix of the appropriate size
For i from 1 to n:
- For j from 1 to p:
  - Let $sum = 0$
  - For k from 1 to m:
    - Set $sum \leftarrow sum + A_{ik} \times B_{kj}$
  - Set $C_{ij} \leftarrow sum$
Return $C$

This algorithm takes time $\Theta(nmp)$ (in asymptotic notation).^[1] A common simplification for the purpose of algorithm analysis is to assume that the inputs are all square matrices of size $n \times n$, in which case the running time is $\Theta(n^3)$, i.e., cubic in the size of the dimension.^[6]
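The pseudocode above can be sketched in Python as follows. This is a minimal illustration, assuming matrices are stored as lists of rows and using 0-based indices; the function name `matmul_naive` is hypothetical and not taken from any library.

```python
def matmul_naive(A, B):
    """Multiply an n x m matrix A by an m x p matrix B (lists of rows)."""
    n, m, p = len(A), len(B), len(B[0])
    if any(len(row) != m for row in A):
        raise ValueError("inner dimensions of A and B do not match")
    C = [[0] * p for _ in range(n)]
    for i in range(n):              # rows of A and C
        for j in range(p):          # columns of B and C
            total = 0
            for k in range(m):      # shared inner dimension
                total += A[i][k] * B[k][j]
            C[i][j] = total
    return C


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
print(matmul_naive(A, B))  # [[58, 64], [139, 154]]
```

The innermost statement runs exactly $n \cdot m \cdot p$ times (here $2 \cdot 3 \cdot 2 = 12$), which is where the $\Theta(nmp)$ bound comes from.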

Cache behavior

The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or asymptotic running time. However, the order can have a considerable impact on practical performance due to the memory access patterns and cache use of the algorithm;^[1] which order is best also depends on whether the matrices are stored in row-major order, column-major order, or a mix of both.
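As an illustration of loop reordering, the sketch below uses the order i, k, j, so that the innermost loop walks along a row of $B$ and a row of $C$ rather than down a column of $B$. The function name `matmul_ikj` is hypothetical. Python lists of lists are not laid out contiguously in memory, so this shows the access order only and is not a performance measurement.

```python
def matmul_ikj(A, B):
    """Same product as the i-j-k loop nest, with the two inner loops swapped."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        row_c = C[i]                # row of C being accumulated
        for k in range(m):
            a_ik = A[i][k]          # reused across the whole inner loop
            row_b = B[k]            # row k of B, traversed left to right
            for j in range(p):
                row_c[j] += a_ik * row_b[j]
    return C


print(matmul_ikj([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Each entry of $C$ still receives the same $m$ products, only in a different order of visits, which is why correctness and the asymptotic operation count are unchanged.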

In particular, in the idealized case of a fully associative cache consisting of $M$ bytes and $b$ bytes per cache line (i.e. $M/b$ cache lines), the above algorithm is sub-optimal for $A$ and $B$ stored in row-major order. When $n > M/b$, every iteration of the inner loop (a simultaneous sweep through a row of $A$ and a column of $B$) incurs a cache miss when accessing an element of $B$. This means that the algorithm incurs $\Theta(n^3)$ cache misses in the worst case. As of 2010, the speed of memories compared to that of processors is such that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices.^[7]

The optimal variant of the iterative algorithm for $A$ and $B$ in row-major layout is a tiled version, where the matrix is implicitly divided into square tiles of size $\sqrt M$ by $\sqrt M$ :^[7]^[8]

Input: matrices $A$ and $B$
Let $C$ be a new matrix of the appropriate size, initialized to zero
Pick a tile size $T = Θ(\sqrt M)$
For I from 1 to n in steps of T:
- For J from 1 to p in steps of T:
  - For K from 1 to m in steps of T:
    - Multiply $A_{I:I+T,\,K:K+T}$ and $B_{K:K+T,\,J:J+T}$ into $C_{I:I+T,\,J:J+T}$, that is:
    - For i from I to min(I + T - 1, n):
      - For j from J to min(J + T - 1, p):
        - Let $sum = 0$
        - For k from K to min(K + T - 1, m):
          - Set $sum \leftarrow sum + A_{ik} \times B_{kj}$
        - Set $C_{ij} \leftarrow C_{ij} + sum$
Return $C$

In the idealized cache model, this algorithm incurs only $\Theta(n^3 / (b \sqrt{M}))$ cache misses; the divisor $b \sqrt{M}$ amounts to several orders of magnitude on modern machines, so that the actual calculations dominate the running time, rather than the cache misses.^[7]
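The tiled loop nest can be sketched in Python as below. This is an illustration under stated assumptions: matrices are lists of rows, indices are 0-based with half-open tile ranges, and the tile size `T` is passed in as a plain parameter rather than derived from a cache size $M$. The name `matmul_tiled` is hypothetical.

```python
def matmul_tiled(A, B, T=2):
    """Tiled product of an n x m matrix A and an m x p matrix B."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]          # must start at zero: tiles accumulate
    for I in range(0, n, T):
        for J in range(0, p, T):
            for K in range(0, m, T):
                # Multiply tile A[I:I+T, K:K+T] by tile B[K:K+T, J:J+T]
                # and add the result into tile C[I:I+T, J:J+T].
                for i in range(I, min(I + T, n)):
                    for j in range(J, min(J + T, p)):
                        total = 0
                        for k in range(K, min(K + T, m)):
                            total += A[i][k] * B[k][j]
                        C[i][j] += total
    return C


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]
print(matmul_tiled(A, B, T=2))  # [[58, 64], [139, 154]]
```

The `min(...)` bounds handle dimensions that are not multiples of `T`, as in the example where the inner dimension 3 is split into tiles of width 2 and 1.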

Divide-and-conquer algorithm

An alternative to the iterative algorithm is the divide-and-conquer algorithm for matrix multiplication. This relies on the block partitioning

C={\begin{pmatrix}C_{11}&C_{12}\\C_{21}&C_{22}\\\end{pmatrix}},\,A={\begin{pmatrix}A_{11}&A_{12}\\A_{21}&A_{22}\\\end{pmatrix}},\,B={\begin{pmatrix}B_{11}&B_{12}\\B_{21}&B_{22}\\\end{pmatrix}},

which works for all square matrices whose dimensions are powers of two, i.e., the shapes are $2^n \times 2^n$ for some $n$. The matrix product is now

{\begin{pmatrix}C_{11}&C_{12}\\C_{21}&C_{22}\\\end{pmatrix}}={\begin{pmatrix}A_{11}&A_{12}\\A_{21}&A_{22}\\\end{pmatrix}}{\begin{pmatrix}B_{11}&B_{12}\\B_{21}&B_{22}\\\end{pmatrix}}={\begin{pmatrix}A_{11}B_{11}+A_{12}B_{21}&A_{11}B_{12}+A_{12}B_{22}\\A_{21}B_{11}+A_{22}B_{21}&A_{21}B_{12}+A_{22}B_{22}\\\end{pmatrix}}

which consists of eight multiplications of pairs of submatrices, followed by an addition step. The divide-and-conquer algorithm computes the smaller multiplications recursively, using the scalar multiplication $c_{11} = a_{11} b_{11}$ as its base case.

The complexity of this algorithm as a function of $n$ is given by the recurrence^[6]

T(1)=\Theta (1);

T(n)=8T(n/2)+\Theta (n^{2}),

accounting for the eight recursive calls on matrices of size $n/2$ and $\Theta(n^2)$ to sum the four pairs of resulting matrices element-wise. Application of the master theorem for divide-and-conquer recurrences shows this recursion to have the solution $\Theta(n^3)$, the same as the iterative algorithm.^[6]
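A minimal Python sketch of this recursion follows, assuming square inputs whose side length is a power of two and which are stored as lists of rows. The helper names `split`, `add`, and `matmul_recursive` are hypothetical. The eight recursive calls and four block additions correspond directly to the terms of the recurrence; in master-theorem terms, $\log_2 8 = 3$ exceeds the exponent 2 of the combining cost, which gives $\Theta(n^3)$.

```python
def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Return the four quadrants M11, M12, M21, M22 of a square matrix."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def matmul_recursive(A, B):
    """Divide-and-conquer product; side length must be a power of two."""
    n = len(A)
    if n == 1:                                   # scalar base case
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Eight recursive multiplications, combined by four block additions.
    C11 = add(matmul_recursive(A11, B11), matmul_recursive(A12, B21))
    C12 = add(matmul_recursive(A11, B12), matmul_recursive(A12, B22))
    C21 = add(matmul_recursive(A21, B11), matmul_recursive(A22, B21))
    C22 = add(matmul_recursive(A21, B12), matmul_recursive(A22, B22))
    # Stitch the quadrants back into one matrix.
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom


print(matmul_recursive([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The sketch copies submatrices at every level for clarity; an implementation concerned with speed would more likely pass index ranges into the original arrays instead.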

Non-square matrices

A variant of this algorithm that works for matrices of arbitrary shapes and is faster in practice^[7] splits matrices into two instead of four submatrices, as follows.^[9] Splitting a matrix now means dividing it into two parts of equal size, or as close to equal sizes as possible in the case of odd dimensions.

Inputs: matrices $A$ of size $n \times m$ , $B$ of size $m \times p$ .
Base case: if $max(n, m, p)$ is below some threshold, use an unrolled version of the iterative algorithm.
Recursive cases: