input: Matrices $\boldsymbol{A}\in\mathbb{R}^{n \times d}$, $\boldsymbol{B}\in\mathbb{R}^{n \times m}$. Number $k$ of subsamples. Probabilities $p_1,\ldots,p_n$.
output: Sketched matrix $\boldsymbol{C}\in\mathbb{R}^{d \times m}$.
Sample indices $s_1,\ldots,s_k\in[n]$ i.i.d. according to $p_1,\ldots,p_n$.
Build the sample-and-rescale matrix $\boldsymbol{S}\in\mathbb{R}^{k \times n}$:
Row $t$ of $\boldsymbol{S}$ has the form $\begin{bmatrix}0&0&\cdots&0&\frac{1}{\sqrt{k p_{s_t}}}&0&\cdots&0\end{bmatrix}$, where the nonzero entry sits at index $s_t$.
Return $\boldsymbol{C} = (\boldsymbol{S}\boldsymbol{A})^\intercal(\boldsymbol{S}\boldsymbol{B})$.
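In code, the whole procedure is just a few lines. Below is a minimal NumPy sketch; the function name `approx_matmul` is my own, and it forms $\boldsymbol{S}\boldsymbol{A}$ and $\boldsymbol{S}\boldsymbol{B}$ implicitly by indexing and rescaling rows rather than materializing $\boldsymbol{S}$:

```python
import numpy as np

def approx_matmul(A, B, k, p, rng=None):
    """Approximate A.T @ B by sampling k rows of A and B i.i.d. according to p."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    # Sample k row indices i.i.d. according to p_1, ..., p_n.
    s = rng.choice(n, size=k, p=p)
    # Rescale each sampled row by 1 / sqrt(k * p_{s_t}); this gives S @ A and S @ B
    # without ever building the k x n matrix S.
    scale = 1.0 / np.sqrt(k * p[s])
    SA = scale[:, None] * A[s, :]   # shape (k, d)
    SB = scale[:, None] * B[s, :]   # shape (k, m)
    return SA.T @ SB                # shape (d, m), costs O(kdm)
```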
Since $(\boldsymbol{S}\boldsymbol{A})^\intercal \in \mathbb{R}^{d \times k}$ and $\boldsymbol{S}\boldsymbol{B}\in\mathbb{R}^{k \times m}$, we can compute $(\boldsymbol{S}\boldsymbol{A})^\intercal(\boldsymbol{S}\boldsymbol{B})$ in $O(kdm)$ time instead of $O(ndm)$ time, using only naive matrix multiplication. We show the following:
Fix $\varepsilon>0$ and $\delta\in(0,1)$. Let $\boldsymbol{C}$ be the result of the fast matrix multiplication above with $k \geq \frac{1}{\varepsilon^2\delta}$ and $p_\ell = \frac{\|\mathbf{a}_\ell\|_2^2}{\|\boldsymbol{A}\|_F^2}$, where $\mathbf{a}_\ell$ is the $\ell^{th}$ row of $\boldsymbol{A}$. Then with probability $1-\delta$,
$$\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F \leq \varepsilon\|\boldsymbol{A}\|_F \|\boldsymbol{B}\|_F.$$
Notably, we are not hiding any constants when we say $k \geq \frac{1}{\varepsilon^2 \delta}$. We prove the result in two steps: first we show a bound for arbitrary sampling probabilities, and then we prove Theorem 1. There is a lot of indexing in this analysis, so to keep things clean we consistently use $t \in [k]$, $\ell \in [n]$, $i \in [d]$, and $j \in [m]$.
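As a quick sanity check of Theorem 1 (not from the references, just a rough experiment), the snippet below feeds the `approx_matmul` sketch above the row-norm probabilities $p_\ell = \|\mathbf{a}_\ell\|_2^2/\|\boldsymbol{A}\|_F^2$ and $k = \lceil 1/(\varepsilon^2\delta) \rceil$ on random Gaussian matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 50, 40
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, m))

eps, delta = 0.25, 0.1
k = int(np.ceil(1.0 / (eps**2 * delta)))                 # k >= 1/(eps^2 delta), no hidden constants
p = np.sum(A**2, axis=1) / np.linalg.norm(A, "fro")**2   # p_l = ||a_l||^2 / ||A||_F^2

C = approx_matmul(A, B, k, p, rng=rng)
err = np.linalg.norm(C - A.T @ B, "fro")
bound = eps * np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")
print(f"error = {err:.3g}, bound = {bound:.3g}")         # error <= bound w.p. >= 1 - delta
```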
For any sampling probabilities $p_1,\ldots,p_n$, we have
$$\mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2] \leq \frac1k \sum_{\ell=1}^n \frac1{p_\ell} \|\mathbf{a}_\ell\|_2^2 \|\mathbf{b}_\ell\|_2^2,$$
where $\mathbf{a}_\ell$ and $\mathbf{b}_\ell$ are the $\ell^{th}$ rows of $\boldsymbol{A}$ and $\boldsymbol{B}$ respectively.
Proof. Let $\boldsymbol{R}_t = \frac1{k p_{s_t}} \mathbf{a}_{s_t} \mathbf{b}_{s_t}^\intercal$, so that $\boldsymbol{C} = \sum_{t=1}^k \boldsymbol{R}_t$:
$$\begin{aligned} \boldsymbol{C} &= (\boldsymbol{S}\boldsymbol{A})^\intercal(\boldsymbol{S}\boldsymbol{B}) \\ &= \sum_{t=1}^k \frac1{\sqrt{k p_{s_t}}} \mathbf{a}_{s_t} \cdot \frac1{\sqrt{k p_{s_t}}} \mathbf{b}_{s_t}^\intercal \\ &= \sum_{t=1}^k \boldsymbol{R}_t \end{aligned}$$
In particular, we see that
$\mathbb{E}[\boldsymbol{R}_t] = \sum_{\ell=1}^n p_\ell \frac{1}{kp_\ell} \mathbf{a}_\ell\mathbf{b}_\ell^\intercal = \frac1k \boldsymbol{A}^\intercal\boldsymbol{B}$, which in turn implies $\mathbb{E}[\boldsymbol{C}] = \boldsymbol{A}^\intercal\boldsymbol{B}$. We can then expand and simplify using independence, linearity of variance, and the bound $\text{Var}[x] \leq \mathbb{E}[x^2]$:
$$\begin{aligned} \mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2] &= \sum_{i=1}^d\sum_{j=1}^m \mathbb{E}\left[ ([\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}]_{i,j})^2 \right] \\ &= \sum_{i=1}^d\sum_{j=1}^m \mathbb{E}\left[ \left(\textstyle{\sum_{t=1}^k [\boldsymbol{R}_t]_{i,j} - \mathbb{E}[\boldsymbol{R}_t]_{i,j}}\right)^2 \right] \\ &= \sum_{i=1}^d\sum_{j=1}^m \text{Var}\left[ \textstyle{\sum_{t=1}^k [\boldsymbol{R}_t]_{i,j}} \right] \\ &= k \sum_{i=1}^d\sum_{j=1}^m \text{Var}\left[ [\boldsymbol{R}_1]_{i,j} \right] \\ &\leq k \sum_{i=1}^d\sum_{j=1}^m \mathbb{E}\left[ \left([\boldsymbol{R}_1]_{i,j}\right)^2 \right] \\ &= k\, \mathbb{E}\left[ \|\boldsymbol{R}_1\|_F^2 \right] \end{aligned}$$
Since
$\boldsymbol{R}_1$ is rank-one, it is simple to compute its Frobenius norm:
$$\begin{aligned} \|\boldsymbol{R}_1\|_F^2 &= \text{tr}(\boldsymbol{R}_1^\intercal\boldsymbol{R}_1) \\ &= \frac1{k^2 p_{s_1}^2} \text{tr}((\mathbf{a}_{s_1} \mathbf{b}_{s_1}^\intercal)^\intercal(\mathbf{a}_{s_1}\mathbf{b}_{s_1}^\intercal)) \\ &= \frac1{k^2 p_{s_1}^2} \text{tr}(\mathbf{b}_{s_1}\mathbf{a}_{s_1}^\intercal\mathbf{a}_{s_1}\mathbf{b}_{s_1}^\intercal) \\ &= \frac{\|\mathbf{a}_{s_1}\|_2^2 \|\mathbf{b}_{s_1}\|_2^2}{k^2 p_{s_1}^2} \end{aligned}$$
And overall, we conclude that
$$\begin{aligned} \mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2] &\leq k\, \mathbb{E}\left[ \|\boldsymbol{R}_1\|_F^2 \right] \\ &= k \cdot \sum_{\ell=1}^n p_\ell \frac{\|\mathbf{a}_\ell\|_2^2 \|\mathbf{b}_\ell\|_2^2}{k^2 p_\ell^2} \\ &= \frac1k \cdot \sum_{\ell=1}^n \frac{\|\mathbf{a}_\ell\|_2^2 \|\mathbf{b}_\ell\|_2^2}{p_\ell} \end{aligned}$$
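Before moving on, here is a rough Monte Carlo check of Lemma 1 (my own illustration, reusing the hypothetical `approx_matmul` sketch from above) that compares the empirical mean squared error under uniform sampling against the right-hand side of the lemma:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m, k = 2_000, 20, 15, 200
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, m))
p = np.full(n, 1.0 / n)   # uniform sampling: any valid distribution works in Lemma 1

trials = 200
sq_errs = [np.linalg.norm(approx_matmul(A, B, k, p, rng=rng) - A.T @ B, "fro")**2
           for _ in range(trials)]
# Lemma 1 bound: (1/k) * sum_l ||a_l||^2 ||b_l||^2 / p_l
bound = np.sum(np.sum(A**2, axis=1) * np.sum(B**2, axis=1) / p) / k
print(f"empirical E[error^2] ~ {np.mean(sq_errs):.3g}, Lemma 1 bound = {bound:.3g}")
```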
With this core technical claim in hand, Theorem 1 follows as a short corollary by plugging in the chosen sampling probabilities. In fact, we prove something slightly broader:

Let $\tilde\tau_1,\ldots,\tilde\tau_n$ be numbers such that $\tilde\tau_\ell \geq \|\mathbf{a}_\ell\|_2^2$ for all $\ell\in[n]$, and let $T := \sum_{\ell=1}^n \tilde\tau_\ell$. Set $p_\ell := \frac{\tilde\tau_\ell}{T}$ and run the algorithm from Theorem 1 with these probabilities. Then, so long as $k \geq \frac{1}{\varepsilon^2 \delta} \cdot \frac{T}{\|\boldsymbol{A}\|_F^2}$, with probability $1-\delta$ we get
$$\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F \leq \varepsilon\|\boldsymbol{A}\|_F \|\boldsymbol{B}\|_F.$$
Proof. We first bound the expected error using Lemma 1, which gives
$$\begin{aligned} \mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2] &\leq \frac1k \cdot \sum_{\ell=1}^n \frac{\|\mathbf{a}_\ell\|_2^2 \|\mathbf{b}_\ell\|_2^2}{p_\ell} \\ &= \frac{T}{k} \cdot \sum_{\ell=1}^n \frac{\|\mathbf{a}_\ell\|_2^2}{\tilde\tau_\ell} \|\mathbf{b}_\ell\|_2^2 \\ &\leq \frac{T}{k} \cdot \sum_{\ell=1}^n \|\mathbf{b}_\ell\|_2^2 \\ &= \frac{T}{k} \|\boldsymbol{B}\|_F^2 \end{aligned}$$
We then apply Markov's inequality, which tells us that
$$\begin{aligned} \Pr[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2 > \varepsilon^2\|\boldsymbol{A}\|_F^2\|\boldsymbol{B}\|_F^2] &\leq \frac{\mathbb{E}[\|\boldsymbol{C}-\boldsymbol{A}^\intercal\boldsymbol{B}\|_F^2]}{\varepsilon^2\|\boldsymbol{A}\|_F^2\|\boldsymbol{B}\|_F^2} \\ &\leq \frac{\frac{T}{k}\|\boldsymbol{B}\|_F^2}{\varepsilon^2\|\boldsymbol{A}\|_F^2\|\boldsymbol{B}\|_F^2} \\ &= \frac{T}{k\varepsilon^2\|\boldsymbol{A}\|_F^2} \end{aligned}$$
which is at most $\delta$ when $k \geq \frac{1}{\delta\varepsilon^2} \cdot \frac{T}{\|\boldsymbol{A}\|_F^2}$. Taking square roots inside the event gives the claimed bound.
Note that when we compute the norms exactly, so that $\tilde\tau_\ell = \|\mathbf{a}_\ell\|_2^2$ for all $\ell$, we get $T = \sum_\ell \tilde\tau_\ell = \|\boldsymbol{A}\|_F^2$, which recovers Theorem 1 exactly.
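To illustrate the overestimate version (again a hypothetical example, not from the references), the snippet below takes each $\tilde\tau_\ell$ to be $\|\mathbf{a}_\ell\|_2^2$ rounded up to the next power of two, so the oversampling factor $T/\|\boldsymbol{A}\|_F^2$ is at most $2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m = 5_000, 30, 25
A = rng.standard_normal((n, d))
B = rng.standard_normal((n, m))

row_norms_sq = np.sum(A**2, axis=1)
tau = 2.0 ** np.ceil(np.log2(row_norms_sq))     # tau_l >= ||a_l||^2 for every row
T = tau.sum()
p = tau / T

eps, delta = 0.25, 0.1
oversample = T / np.linalg.norm(A, "fro")**2    # extra factor paid relative to Theorem 1
k = int(np.ceil(oversample / (eps**2 * delta)))
C = approx_matmul(A, B, k, p, rng=rng)
print(np.linalg.norm(C - A.T @ B, "fro") <=
      eps * np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro"))
```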
The analysis here is a blend of Drineas, Mahoney (2018) and Nelson (2015) , as well as some personal notes by Christopher Musco. Note that randomized matrix multiplication often does not use the exact sampling probabilities discussed here, and the references below discuss a variety of slightly different schemes.
Here are some relevant papers:
Drineas, Kannan, Mahoney (2006) is (afaik) the original paper on the topic. Table 1 of this paper compares a great variety of sampling schemes.
Drineas, Mahoney (2018) is a book with a section that this page partially copies.
Nelson (2015) are lecture notes that this page partially copies.
Avron et al. (2019) generalizes this to approximate linear operator multiplication in Claim 45, the "Approximate Operator Application".
Let me know if anything is missing