Here’s a fact that has been rediscovered many times in many different contexts: The way you parenthesize matrix products can greatly change the time it takes to compute the product. This is important, for example, for the backpropagation algorithm in deep learning.
Let A, B, and C be matrices that are compatible for multiplication. Then (AB)C = A(BC). That is, matrix multiplication is associative. But as far as efficiency is concerned, matrix multiplication is not associative: One side of the equation may be much faster to compute than the other.
For any matrix M, let rows(M) be the number of rows in M and let cols(M) be the number of columns. When we multiply two matrices M and N, we multiply every row of M by every column of N [1]. So the number of vector inner products is rows(M) cols(N), and the number of multiplications [2] in each inner product is cols(M). (We must have cols(M) = rows(N) because we implicitly assumed it’s possible to multiply M and N.)
That means the total number of scalar multiplications required to compute MN equals
rows(M) cols(M) cols(N) = rows(M) rows(N) cols(N).
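Here is a minimal Python sketch (the function name and the tiny test matrices are mine, purely for illustration) that counts the scalar multiplications in a naive matrix product and checks the count against the formula above.

    # Count scalar multiplications in a naive matrix product and compare
    # the count to rows(M) * cols(M) * cols(N).
    def naive_mult_count(M, N):
        rows_M, cols_M = len(M), len(M[0])
        rows_N, cols_N = len(N), len(N[0])
        assert cols_M == rows_N, "matrices are not compatible"

        count = 0
        product = [[0] * cols_N for _ in range(rows_M)]
        for i in range(rows_M):          # one pass per row of M ...
            for j in range(cols_N):      # ... times one pass per column of N ...
                for k in range(cols_M):  # ... times one multiplication per term
                    product[i][j] += M[i][k] * N[k][j]
                    count += 1
        return product, count

    M = [[1, 2, 3], [4, 5, 6]]        # 2 by 3
    N = [[1, 0], [0, 1], [1, 1]]      # 3 by 2
    _, count = naive_mult_count(M, N)
    print(count, 2 * 3 * 2)           # both are 12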
Now let’s apply this to computing ABC. Suppose A is an m by n matrix and C is a p by q matrix. Then B has to be an n by p matrix in order to be compatible for multiplication.
Computing AB requires mnp multiplications. Once we have AB, an m by p matrix, the number of multiplications to compute (AB)C is mpq. So the total number of multiplications to compute ABC by first computing AB is mnp + mpq = mp(n + q).
If we compute BC first, this requires npq multiplications, and then multiplying A by BC requires mnq multiplications, for a total of npq + mnq = nq(m + p).
In summary, computing (AB)C requires mp(n + q) multiplications, and computing A(BC) requires (m + p)nq multiplications. I turned the second term around to emphasize that in both expressions you do something to m and p, and something else to n and q. You either multiply and add, or add and multiply.
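As a concrete check, here is a short sketch (the function names are mine) that computes both counts for given dimensions, with A being m by n, B being n by p, and C being p by q.

    def cost_AB_first(m, n, p, q):
        # mnp multiplications for AB, then mpq more for (AB)C
        return m*n*p + m*p*q      # = m * p * (n + q)

    def cost_BC_first(m, n, p, q):
        # npq multiplications for BC, then mnq more for A(BC)
        return n*p*q + m*n*q      # = (m + p) * n * q

    print(cost_AB_first(1000, 10, 1000, 10))   # 20,000,000
    print(cost_BC_first(1000, 10, 1000, 10))   # 200,000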
If you’re trying to minimize these terms, you’d rather add big numbers together than multiply them. So if m and p are large relative to n and q, i.e. both A and C are tall matrices, having more rows than columns, then multiplying A(BC) is going to be faster than multiplying (AB)C.
For example, suppose both A and C have a million rows and a hundred columns. Necessarily B would have a hundred rows and a million columns. Computing (AB)C would require 2×10^14 multiplications, but computing A(BC) would take 2×10^10 multiplications. That is, the latter would be 10,000 times faster.
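If you want to see the effect on your own machine, here is a rough NumPy timing sketch. The dimensions are scaled down from the example above so that the intermediate product AB fits comfortably in memory; the exact timings will depend on your hardware and BLAS library, but the ordering effect should be clear.

    import time
    import numpy as np

    rng = np.random.default_rng(0)
    m = p = 5000                  # A and C are tall
    n = q = 50

    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, p))
    C = rng.standard_normal((p, q))

    t0 = time.perf_counter()
    X = (A @ B) @ C               # builds a 5000 by 5000 intermediate
    t1 = time.perf_counter()
    Y = A @ (B @ C)               # builds only a 50 by 50 intermediate
    t2 = time.perf_counter()

    print("(AB)C:", t1 - t0, "seconds")
    print("A(BC):", t2 - t1, "seconds")
    print("max difference:", np.max(np.abs(X - Y)))   # tiny, up to rounding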
Related: Applied linear algebra
***
[1] This post assumes we compute matrix products the usual way, taking the product of every row in the left matrix with every column in the right matrix. It’s theoretically possible, though not practical, to multiply matrices in fewer operations. If the matrices have some exploitable special structure then there could be a faster way to multiply them, but not in general.
[2] It is common in numerical linear algebra to just count multiplications because this gives a good idea of how efficient an algorithm is. Counting additions and multiplications would usually lead to the same conclusions. If you really want to be precise, you’d need to count memory operations. In practice memory management makes more of a difference than whether we count just multiplications or multiplications and additions.
to [1]: AFAICT Strassen’s algorithm is actually practically used occasionally, for some modest speed-up. It is not “entirely” theoretical.
Hello John, excellent article!
I have one question, in the second-to-last paragraph when you say “So if m and p are large relative to n and q, i.e. both A and B are tall matrices”, do you mean A and C are tall matrices?
Yes. I’ll fix that. Thanks.
The number of ways of adding parentheses to a chain of associative operators is given by the Catalan numbers (i.e. it grows exponentially in the number of terms). But there is a nice O(n^3) dynamic programming algorithm to find the cheapest way to parenthesize the chain.
https://en.wikipedia.org/wiki/Matrix_chain_multiplication
However, curiously, this algorithm doesn’t necessarily give the fastest wall-clock time for matrix multiplication in parallel systems, because it produces a binary tree of operations, and if that tree is not close to being balanced, some of the available processors can be sitting idle waiting for work to do. (This assumes you are not parallelizing in a fine-grained way within a single matrix multiplication.)
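For reference, here is a compact sketch of the O(n^3) matrix chain ordering dynamic program discussed on the linked Wikipedia page (the memoized form below is my own phrasing). It uses the scalar-multiplication count from the post as its cost model.

    from functools import lru_cache

    def matrix_chain_cost(dims):
        """Minimum scalar multiplications to multiply a chain of matrices,
        where matrix i has shape dims[i] by dims[i+1]."""
        n = len(dims) - 1   # number of matrices in the chain

        @lru_cache(maxsize=None)
        def best(i, j):
            if i == j:
                return 0
            # Split the chain as (i..k)(k+1..j) and take the cheapest split.
            return min(
                best(i, k) + best(k + 1, j) + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)
            )

        return best(0, n - 1)

    # The post's example: A is 10^6 by 100, B is 100 by 10^6, C is 10^6 by 100.
    print(matrix_chain_cost([10**6, 100, 10**6, 100]))   # 2 * 10^10, i.e. compute BC first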
There’s a typo, “conmpute”, somewhere.