Fast Matrix Multiplication

On this page we show one of several possible fast matrix multiplication guarantees, where we approximate $\boldsymbol{A}^\intercal\boldsymbol{B}$ as $(\boldsymbol{S}\boldsymbol{A})^\intercal(\boldsymbol{S}\boldsymbol{B})$, where $\boldsymbol{S}$ is a randomized sampling matrix. Notably, these guarantees generally have a polynomial dependence on the failure probability $\delta$, though this can be boosted to a $\log(\frac1\delta)$ dependence by repeating the algorithm.

We consider the following algorithm:

Algorithm 1: Fast Matrix Multiplication


input: Matrices $\boldsymbol{A}\in\mathbb{R}^{n \times d}$, $\boldsymbol{B}\in\mathbb{R}^{n \times m}$. Number $k$ of subsamples. Probabilities $p_1,\ldots,p_n$

output: Sketched matrix $\boldsymbol{C}\in\mathbb{R}^{d \times m}$

  1. Sample indices $s_1,\ldots,s_k\in[n]$ i.i.d. according to $p_1,\ldots,p_n$

  2. Build the sample-and-rescale matrix $\boldsymbol{S}\in\mathbb{R}^{k \times n}$:

    Row $t$ of $\boldsymbol{S}$ has the form $\begin{bmatrix}0&0&\cdots&0&\frac{1}{\sqrt{k p_{s_t}}}&0&\cdots&0\end{bmatrix}$, where the nonzero entry sits in column $s_t$.

  3. Return $\boldsymbol{C} = (\boldsymbol{S}\boldsymbol{A})^\intercal(\boldsymbol{S}\boldsymbol{B})$
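
To make the procedure concrete, here is a minimal NumPy sketch of Algorithm 1. The function name `approx_matmul` and its interface are my own choices for illustration, not something taken from the references below.

```python
import numpy as np

def approx_matmul(A, B, k, p, rng=None):
    """Approximate A^T B by sampling k rows of A and B i.i.d. according to p.

    A: (n, d) array, B: (n, m) array, p: length-n probability vector.
    Returns C = (SA)^T (SB) in R^{d x m} for the sample-and-rescale matrix S.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    # Step 1: sample k row indices s_1, ..., s_k i.i.d. according to p.
    idx = rng.choice(n, size=k, p=p)
    # Step 2: rescale each sampled row by 1 / sqrt(k * p_{s_t}).
    scale = 1.0 / np.sqrt(k * p[idx])   # shape (k,)
    SA = A[idx] * scale[:, None]        # rows of S A, shape (k, d)
    SB = B[idx] * scale[:, None]        # rows of S B, shape (k, m)
    # Step 3: return (SA)^T (SB), a sum of k rescaled outer products.
    return SA.T @ SB
```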

Since $(\boldsymbol{S}\boldsymbol{A})^\intercal \in \mathbb{R}^{d \times k}$ and $\boldsymbol{S}\boldsymbol{B}\in\mathbb{R}^{k \times m}$, we can compute $(\boldsymbol{S}\boldsymbol{A})^\intercal(\boldsymbol{S}\boldsymbol{B})$ in $O(kdm)$ time instead of $O(ndm)$ time, just using naive matrix multiplication. We show the following:

Theorem 1: Fast Matrix Multiplication

Fix $\varepsilon>0$ and $\delta\in(0,1)$. Let $\boldsymbol{C}$ be the result of fast matrix multiplication with $k \geq \frac{1}{\varepsilon^2\delta}$ and $p_\ell = \frac{\|\mathbf{a}_\ell\|_2^2}{\|\boldsymbol{A}\|_F^2}$, where $\mathbf{a}_\ell$ is the $\ell^{th}$ row of $\boldsymbol{A}$. Then with probability $1-\delta$,

$$ \|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F \leq \varepsilon\|\boldsymbol{A}\|_F \|\boldsymbol{B}\|_F $$

Notably, we are not hiding any constants when we say $k \geq \frac1{\varepsilon^2 \delta}$. We prove the result in two steps: first we show a bound for arbitrary sampling probabilities, and then we prove Theorem 1. Also, there is a lot of indexing in this analysis, so to keep things clean we consistently use $t \in [k]$, $\ell \in [n]$, $i \in [d]$, and $j \in [m]$.
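
Before diving into the proof, here is a quick way to exercise Theorem 1 numerically, reusing the `approx_matmul` sketch from above; the matrix sizes, seed, and parameter choices below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 30, 20
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, m))

eps, delta = 0.25, 0.1
k = int(np.ceil(1 / (eps**2 * delta)))     # k >= 1 / (eps^2 * delta) = 160
p = np.sum(A**2, axis=1) / np.sum(A**2)    # p_l = ||a_l||_2^2 / ||A||_F^2

C = approx_matmul(A, B, k, p, rng=rng)
err = np.linalg.norm(C - A.T @ B)          # Frobenius norm by default for matrices
bound = eps * np.linalg.norm(A) * np.linalg.norm(B)
print(err <= bound)                        # True with probability at least 1 - delta
```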

Lemma 1: Expected Squared Error

For any sampling probabilities $p_1,\ldots,p_n$, we have

$$ \mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2] \leq \frac1k \sum_{\ell=1}^n \frac1{p_\ell} \|\mathbf{a}_\ell\|_2^2 \|\mathbf{b}_\ell\|_2^2 $$

where $\mathbf{a}_\ell$ and $\mathbf{b}_\ell$ are the $\ell^{th}$ rows of $\boldsymbol{A}$ and $\boldsymbol{B}$ respectively.

Proof. Let $\boldsymbol{R}_t = \frac1{k p_{s_t}} \mathbf{a}_{s_t} \mathbf{b}_{s_t}^\intercal$, so that we have $\boldsymbol{C} = \sum_{t=1}^k \boldsymbol{R}_t$:
$$\begin{aligned} \boldsymbol{C} &= (\boldsymbol{S}\boldsymbol{A})^\intercal(\boldsymbol{S}\boldsymbol{B}) \\ &= \sum_{t=1}^k \frac1{\sqrt{k p_{s_t}}} \mathbf{a}_{s_t} \cdot \frac1{\sqrt{k p_{s_t}}} \mathbf{b}_{s_t}^\intercal \\ &= \sum_{t=1}^k \boldsymbol{R}_t \end{aligned}$$
In particular, we see that $\mathbb{E}[\boldsymbol{R}_t] = \sum_{\ell=1}^n p_\ell \frac{1}{kp_\ell} \mathbf{a}_\ell\mathbf{b}_\ell^\intercal = \frac1k \boldsymbol{A}^\intercal\boldsymbol{B}$, which in turn implies $\mathbb{E}[\boldsymbol{C}] = \boldsymbol{A}^\intercal\boldsymbol{B}$. We can then expand and simplify using independence of the samples, linearity of variance, and the bound $\text{Var}[x] \leq \mathbb{E}[x^2]$:
$$\begin{aligned} \mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2] &= \sum_{i=1}^d\sum_{j=1}^m \mathbb{E}\left[ ([\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}]_{i,j})^2 \right] \\ &= \sum_{i=1}^d\sum_{j=1}^m \mathbb{E}\left[ \left(\textstyle{\sum_{t=1}^k [\boldsymbol{R}_t]_{i,j} - \mathbb{E}[\boldsymbol{R}_t]_{i,j}}\right)^2 \right] \\ &= \sum_{i=1}^d\sum_{j=1}^m \text{Var}\left[ \textstyle{\sum_{t=1}^k [\boldsymbol{R}_t]_{i,j}} \right] \\ &= k \sum_{i=1}^d\sum_{j=1}^m \text{Var}\left[ [\boldsymbol{R}_1]_{i,j} \right] \\ &\leq k \sum_{i=1}^d\sum_{j=1}^m \mathbb{E}\left[ \left([\boldsymbol{R}_1]_{i,j}\right)^2 \right] \\ &= k \, \mathbb{E}\left[ \|\boldsymbol{R}_1\|_F^2 \right] \end{aligned}$$
Since $\boldsymbol{R}_1$ is rank-one, it is simple to compute its Frobenius norm:
$$\begin{aligned} \|\boldsymbol{R}_1\|_F^2 &= \text{tr}(\boldsymbol{R}_1^\intercal\boldsymbol{R}_1) \\ &= \frac1{k^2 p_{s_1}^2} \text{tr}((\mathbf{a}_{s_1} \mathbf{b}_{s_1}^\intercal)^\intercal(\mathbf{a}_{s_1}\mathbf{b}_{s_1}^\intercal)) \\ &= \frac1{k^2 p_{s_1}^2} \text{tr}(\mathbf{b}_{s_1}\mathbf{a}_{s_1}^\intercal\mathbf{a}_{s_1}\mathbf{b}_{s_1}^\intercal) \\ &= \frac{\|\mathbf{a}_{s_1}\|_2^2 \|\mathbf{b}_{s_1}\|_2^2}{k^2 p_{s_1}^2} \end{aligned}$$
And overall, we conclude that
$$\begin{aligned} \mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2] &\leq k \, \mathbb{E}\left[ \|\boldsymbol{R}_1\|_F^2 \right] \\ &= k \cdot \sum_{\ell=1}^n p_\ell \frac{\|\mathbf{a}_\ell\|_2^2 \|\mathbf{b}_\ell\|_2^2}{k^2 p_\ell^2} \\ &= \frac1k \cdot \sum_{\ell=1}^n \frac{\|\mathbf{a}_\ell\|_2^2 \|\mathbf{b}_\ell\|_2^2}{p_\ell} \end{aligned}$$
$\blacksquare$
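
As a quick sanity check on Lemma 1 (not part of the original analysis), the following Monte Carlo sketch compares the empirical mean squared error against the right-hand side of the lemma for an arbitrary choice of sampling probabilities; sizes and seed are made up, and it reuses `approx_matmul` from above.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, k = 200, 5, 4, 10
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, m))
p = rng.random(n) + 0.5                  # arbitrary probabilities, bounded away from 0
p /= p.sum()

row_a = np.sum(A**2, axis=1)             # ||a_l||_2^2
row_b = np.sum(B**2, axis=1)             # ||b_l||_2^2
bound = np.sum(row_a * row_b / p) / k    # (1/k) sum_l ||a_l||^2 ||b_l||^2 / p_l

errs = [np.linalg.norm(approx_matmul(A, B, k, p, rng=rng) - A.T @ B) ** 2
        for _ in range(2000)]
# The empirical mean should sit at or below the lemma's bound, up to Monte Carlo noise.
print(f"empirical mean squared error: {np.mean(errs):.1f}   lemma bound: {bound:.1f}")
```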

With this core technical claim in hand, Theorem 1 follows as a short corollary by just plugging in the chosen sampling probabilities. In fact, we prove something slightly more general:

Corollary 1: Oversampling Fast Multiplication

Let $\tilde\tau_1,\ldots,\tilde\tau_n$ be numbers such that $\tilde\tau_\ell \geq \|\mathbf{a}_\ell\|_2^2$ for all $\ell\in[n]$, and let $T := \sum_{\ell=1}^n \tilde\tau_\ell$. Let $p_\ell := \frac{\tilde\tau_\ell}{T}$ and run Algorithm 1. Then, so long as $k \geq \frac{1}{\varepsilon^2 \delta} \cdot \frac{T}{\|\boldsymbol{A}\|_F^2}$, with probability $1-\delta$ we get

$$ \|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F \leq \varepsilon\|\boldsymbol{A}\|_F \|\boldsymbol{B}\|_F $$

Proof. We first bound the expected squared error using Lemma 1, which gives
$$\begin{aligned} \mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2] &\leq \frac1k \cdot \sum_{\ell=1}^n \frac{\|\mathbf{a}_\ell\|_2^2 \|\mathbf{b}_\ell\|_2^2}{p_\ell} \\ &= \frac{T}{k} \cdot \sum_{\ell=1}^n \frac{\|\mathbf{a}_\ell\|_2^2}{\tilde\tau_\ell} \|\mathbf{b}_\ell\|_2^2 \\ &\leq \frac{T}{k} \cdot \sum_{\ell=1}^n \|\mathbf{b}_\ell\|_2^2 \\ &= \frac{T}{k} \|\boldsymbol{B}\|_F^2 \end{aligned}$$
We apply Markov's inequality, which tells us that
$$\begin{aligned} \Pr[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2 > \varepsilon^2\|\boldsymbol{A}\|_F^2\|\boldsymbol{B}\|_F^2] &\leq \frac{\mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2]}{\varepsilon^2\|\boldsymbol{A}\|_F^2\|\boldsymbol{B}\|_F^2} \\ &\leq \frac{\frac 1k T\|\boldsymbol{B}\|_F^2}{\varepsilon^2\|\boldsymbol{A}\|_F^2\|\boldsymbol{B}\|_F^2} \\ &= \frac{T}{k\varepsilon^2\|\boldsymbol{A}\|_F^2} \end{aligned}$$
which is at most $\delta$ when $k \geq \frac{1}{\delta\varepsilon^2} \cdot \frac{T}{\|\boldsymbol{A}\|_F^2}$.
$\blacksquare$

Note that when we compute the norms exactly, so that $\tilde\tau_\ell = \|\mathbf{a}_\ell\|_2^2$ for all $\ell$, we get $T = \sum_\ell \tilde\tau_\ell = \|\boldsymbol{A}\|_F^2$, which recovers Theorem 1 exactly.
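
As one illustration of Corollary 1 (again a sketch with made-up overestimates, reusing `approx_matmul`), any cheap upper bounds on the squared row norms can stand in for the exact values, at the cost of a proportionally larger $k$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 10_000, 30, 20
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, m))

# Hypothetical overestimates: pad each squared row norm by 50%, so T = 1.5 ||A||_F^2.
tau = 1.5 * np.sum(A**2, axis=1)          # tau_l >= ||a_l||_2^2
T = tau.sum()
p = tau / T

eps, delta = 0.25, 0.1
k = int(np.ceil((T / np.sum(A**2)) / (eps**2 * delta)))   # k >= (1/(eps^2 delta)) * T / ||A||_F^2

C = approx_matmul(A, B, k, p, rng=rng)
err = np.linalg.norm(C - A.T @ B)
print(err <= eps * np.linalg.norm(A) * np.linalg.norm(B))  # True w.p. >= 1 - delta
```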

See Also

The analysis here is a blend of Drineas, Mahoney (2018) and Nelson (2015), as well as some personal notes by Christopher Musco. Note that randomized matrix multiplication often does not use the exact sampling probabilities discussed here, and the references below discuss a variety of slightly different schemes.

Here are some relevant papers:

  • Drineas, Kannan, Mahoney (2006) is (afaik) the original paper on the topic. Table 1 of this paper compares a great variety of sampling schemes.

  • Drineas, Mahoney (2018) is a book with a section that this page partially copies.

  • Nelson (2015) are lecture notes that this page partially copies.

  • Avron et al. (2019) generalizes this to approximate linear operator multiplication in Claim 45, the "Approximate Operator Application".

  • Let me know if anything is missing

Bibliography