Sample indices s_1, …, s_k ∈ [n] i.i.d. with respect to the leverage scores of A
Let p_i := τ_i/d denote the probability of sampling row i
Build the sample-and-rescale matrix S ∈ ℝ^{k×n}:
Row i of S has the form [0 0 ⋯ 0 1/√(k·p_{s_i}) 0 ⋯ 0], where the nonzero entry is in position s_i.
Return Ã = SA
This algorithm is equivalent to building Ã by sampling k rows of A with respect to leverage scores, rescaling each sampled row by 1/√(k·p_{s_i}), and stacking those rows up. The matrix S is just a formalization that makes this algorithm easier to analyze.
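To make this concrete, here is a minimal NumPy sketch of the sample-and-rescale procedure. The function name, the QR-based computation of exact leverage scores, and the test matrix are our own illustrative choices, not part of the algorithm statement.

```python
import numpy as np

def leverage_score_sample(A, k, rng):
    """Sample-and-rescale sketch (Algorithm 2): returns S and A_tilde = S @ A."""
    n, d = A.shape
    # Exact leverage scores: tau_i = ||u_i||_2^2, where u_i are the rows of an
    # orthonormal basis U for the column space of A (via QR; assumes full column rank).
    U, _ = np.linalg.qr(A)
    tau = np.sum(U**2, axis=1)             # leverage scores; they sum to d
    p = tau / tau.sum()                    # sampling probabilities p_i = tau_i / d
    s = rng.choice(n, size=k, p=p)         # iid indices s_1, ..., s_k
    S = np.zeros((k, n))
    S[np.arange(k), s] = 1.0 / np.sqrt(k * p[s])   # row i of S: 1/sqrt(k p_{s_i}) in column s_i
    return S, S @ A

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 20))
S, A_tilde = leverage_score_sample(A, k=1000, rng=rng)
print(A_tilde.shape)   # (1000, 20)
```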
Let U ∈ ℝ^{n×d} be any matrix with orthonormal columns that span the column space of A. Then the condition in Theorem 1 is equivalent to ∥U⊺S⊺SU − I∥₂ ≤ ε.
Proof. First, let U ∈ ℝ^{n×d} be any matrix with orthonormal columns and the same column space as A (e.g. the U matrix in the SVD of A). Then, for any x ∈ ℝ^d we have a one-to-one map to a vector y ∈ ℝ^d such that Ax = Uy. So, we need to show that

(1−ε)·∥Uy∥₂² ≤ ∥SUy∥₂² ≤ (1+ε)·∥Uy∥₂²  for all y ∈ ℝ^d.

Next, we expand the ℓ₂ norms:

(1−ε)·y⊺U⊺Uy ≤ y⊺U⊺S⊺SUy ≤ (1+ε)·y⊺U⊺Uy  for all y ∈ ℝ^d.

Note that U⊺U = I by definition, so we get

(1−ε)·y⊺y ≤ y⊺U⊺S⊺SUy ≤ (1+ε)·y⊺y  for all y ∈ ℝ^d.

For y = 0 the claim trivially holds, so assuming y ≠ 0 we can divide by y⊺y on both sides:

1−ε ≤ (y⊺U⊺S⊺SUy)/(y⊺y) ≤ 1+ε  for all y ≠ 0.

By the Courant–Fischer–Weyl min-max principle, this is equivalent to guaranteeing that

1−ε ≤ λ_i(U⊺S⊺SU) ≤ 1+ε  for all i ∈ [d].

Rearranging, this becomes

|λ_i(U⊺S⊺SU) − 1| ≤ ε  for all i ∈ [d],

or, equivalently,
∥U⊺S⊺SU − I∥₂ ≤ ε
■
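As a quick numerical sanity check of this equivalence (with an arbitrary made-up test matrix), the worst-case distortion of ∥SAx∥₂²/∥Ax∥₂² over random directions x never exceeds ∥U⊺S⊺SU − I∥₂:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 2000, 15, 600
A = rng.standard_normal((n, d)) * np.logspace(0, 2, d)   # arbitrary ill-conditioned test matrix

# Build S as in Algorithm 2 (exact leverage scores via QR).
U, _ = np.linalg.qr(A)
p = np.sum(U**2, axis=1) / d
s = rng.choice(n, size=k, p=p)
S = np.zeros((k, n))
S[np.arange(k), s] = 1.0 / np.sqrt(k * p[s])

eps_spec = np.linalg.norm(U.T @ S.T @ S @ U - np.eye(d), 2)   # ||U^T S^T S U - I||_2

# Distortion of ||SAx||^2 / ||Ax||^2 over many random directions x.
X = rng.standard_normal((d, 2000))
ratios = np.sum((S @ A @ X)**2, axis=0) / np.sum((A @ X)**2, axis=0)
print(eps_spec, np.abs(ratios - 1).max())   # the empirical distortion is at most eps_spec
```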
Note that this guarantee seems reasonable. Below, we verify that E[S⊺S] = I, so we have E[U⊺S⊺SU] = U⊺U = I, and thus ∥E[U⊺S⊺SU] − I∥₂ = 0. In other words, this guarantee asks for concentration of a random matrix around its mean.
Let r_i ∈ ℝ^n denote the ith row of S, viewed as a column vector. We can equivalently define r_i entrywise, with an indicator variable: [r_i]_j = (1/√(k·p_j))·χ[s_i = j]. By the outer-product form of matrix-matrix multiplication, note that
E[S⊺S] = E[∑_{i=1}^k r_i r_i⊺] = k·E[r₁r₁⊺]
So, it suffices to analyze the expected value of r₁r₁⊺. In particular, note that r₁r₁⊺ is a diagonal matrix whose jth diagonal entry is (1/(k·p_j))·χ[s₁ = j]. This lets us compute the expected value of an arbitrary diagonal entry of r₁r₁⊺:

E[(r₁r₁⊺)_{jj}] = Pr[s₁ = j]·(1/(k·p_j)) = p_j·(1/(k·p_j)) = 1/k,

so E[r₁r₁⊺] = (1/k)·I and therefore E[S⊺S] = k·(1/k)·I = I.
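A quick empirical check of this identity, averaging S⊺S over independent draws of the sampling matrix (the test matrix, trial count, and sketch size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k, trials = 100, 8, 50, 5000
A = rng.standard_normal((n, d))

U, _ = np.linalg.qr(A)
p = np.sum(U**2, axis=1) / d

# S^T S is diagonal (each r_i has a single nonzero), so it suffices to track the diagonal.
counts = np.zeros(n)
for _ in range(trials):
    s = rng.choice(n, size=k, p=p)
    np.add.at(counts, s, 1.0 / (k * p[s]))

diag_avg = counts / trials              # Monte Carlo estimate of diag(E[S^T S])
print(np.abs(diag_avg - 1).max())       # close to 0; shrinks as trials grows
```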
Let R₁, …, R_k ∈ ℝ^{d×d} be i.i.d. symmetric random matrices such that E[R] = 0, ∥R∥₂ ≤ γ, and ∥E[R²]∥₂ ≤ σ². Then,

Pr[ ∥(1/k)·∑_{i=1}^k R_i∥₂ ≥ ε ] ≤ 2d·exp( −k·ε² / (2σ² + (2/3)·γ·ε) )
(This is a heavy simplification of Theorem 6.1.1 from Tropp (2015))
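To see how this tail bound turns into a sample-complexity statement, set the right-hand side to δ and solve for k:

2d·exp( −k·ε² / (2σ² + (2/3)·γ·ε) ) ≤ δ   ⟺   k ≥ (2σ² + (2/3)·γ·ε)·ε^{-2}·log(2d/δ),

so k = O((σ² + γ·ε)·ε^{-2}·log(d/δ)) samples suffice for the averaged deviation to be at most ε with probability 1 − δ. We will use this rearrangement once we identify γ and σ² for our particular random matrices.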
We now prove the main theorem:
Proof. We decompose I − U⊺S⊺SU = (1/k)·∑_{i=1}^k R_i for some matrices R_i. First, note that the ith row of SU is (1/√(k·p_{s_i}))·u_{s_i}, where u_j denotes the jth row of U. In particular, by the outer-product view of matrix-matrix products,
U⊺S⊺SU = (SU)⊺(SU) = ∑_{i=1}^k (1/(k·p_{s_i}))·u_{s_i}u_{s_i}⊺
This motivates us to let R_i := I − (1/p_{s_i})·u_{s_i}u_{s_i}⊺ (notice that k is dropped from the scaling) be our i.i.d. matrices.
By the above equation, we then have (1/k)·∑_{i=1}^k R_i = I − U⊺S⊺SU. It remains to verify the three conditions of Matrix Bernstein.
E[R] = I − ∑_{j=1}^n p_j·(1/p_j)·u_ju_j⊺ = I − ∑_{j=1}^n u_ju_j⊺ = 0
For all j ∈ [n], ∥R∥₂ ≤ ∥I∥₂ + (1/p_j)·∥u_ju_j⊺∥₂ = 1 + (d/∥u_j∥₂²)·∥u_j∥₂² = 1 + d (using Lemma 3 Part 4 from here), so we may take γ = d + 1.
Finally, we verify the variance term E[R²] = (d−1)·I. Expanding the square and taking the expectation over the sampled row index,

E[R²] = E[I − (2/p_{s₁})·u_{s₁}u_{s₁}⊺ + (1/p_{s₁}²)·u_{s₁}u_{s₁}⊺u_{s₁}u_{s₁}⊺] = I − 2I + ∑_{j=1}^n p_j·(1/p_j²)·∥u_j∥₂²·u_ju_j⊺ = I − 2I + ∑_{j=1}^n (d/∥u_j∥₂²)·∥u_j∥₂²·u_ju_j⊺ = I − 2I + d·I = (d−1)·I,

so ∥E[R²]∥₂ = d − 1 and we may take σ² = d − 1. Plugging γ = d + 1 and σ² = d − 1 into the rearranged Matrix Bernstein bound shows that k = O(ε^{-2}·d·log(d/δ)) samples suffice to have ∥I − U⊺S⊺SU∥₂ ≤ ε with probability 1 − δ, which by the Lemma is the desired guarantee. ■
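As an optional numerical sanity check of the two bounds just used (γ = d + 1 and E[R²] = (d−1)·I), with an arbitrary made-up test matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 6
A = rng.standard_normal((n, d))

U, _ = np.linalg.qr(A)
p = np.sum(U**2, axis=1) / d

# R_j = I - (1/p_j) u_j u_j^T for every possible row index j, stacked as shape (n, d, d).
Rs = np.eye(d) - U[:, :, None] * U[:, None, :] / p[:, None, None]

spec = np.linalg.norm(Rs, ord=2, axis=(1, 2))        # ||R_j||_2 for every j
ER2 = np.einsum('j,jab,jbc->ac', p, Rs, Rs)          # E[R^2] = sum_j p_j R_j^2
print(spec.max() <= d + 1 + 1e-9)                    # True: gamma = d + 1 is valid
print(np.allclose(ER2, (d - 1) * np.eye(d)))         # True: E[R^2] = (d - 1) I
```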
We now discuss a very important and common extension of row sampling. As stated, Theorem 1 only works if we compute the leverage scores of A exactly. This is not necessary: approximate leverage scores suffice, as characterized below:
Fix ε > 0, δ > 0. Let τ₁, …, τ_n be the leverage scores of A ∈ ℝ^{n×d}, and let τ̃₁, …, τ̃_n be overestimates of the leverage scores, so that τ̃_i ≥ τ_i for all i. Let d̃ := ∑_{i=1}^n τ̃_i. Then run Algorithm 2 with k = O((d̃/ε²)·log(d/δ)) and sampling probabilities p_i = τ̃_i/d̃. Then, with probability 1 − δ, we have for all x ∈ ℝ^d

(1−ε)·∥Ax∥₂² ≤ ∥SAx∥₂² ≤ (1+ε)·∥Ax∥₂²
Proof. This proof almost exactly matches that of Theorem 1, except that we bound a couple of the terms in Matrix Bernstein a little differently:
E[R²] = E[I − (2/p_{s₁})·u_{s₁}u_{s₁}⊺ + (1/p_{s₁}²)·u_{s₁}u_{s₁}⊺u_{s₁}u_{s₁}⊺]
= I − 2I + E[(1/p_{s₁}²)·u_{s₁}u_{s₁}⊺u_{s₁}u_{s₁}⊺]
= I − 2I + ∑_{j=1}^n p_j·(1/p_j²)·u_ju_j⊺u_ju_j⊺
= I − 2I + ∑_{j=1}^n (1/p_j)·∥u_j∥₂²·u_ju_j⊺
= I − 2I + ∑_{j=1}^n (d̃·τ_j/τ̃_j)·u_ju_j⊺
⪯ I − 2I + d̃·∑_{j=1}^n u_ju_j⊺   (using τ̃_j ≥ τ_j)
= I − 2I + d̃·I = (d̃−1)·I,

from which the result follows by Matrix Bernstein.
■
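To illustrate Theorem 3 mechanically, here is a sketch where the overestimates τ̃_i are produced artificially by inflating the exact scores (in practice they would come from a cheap approximation); the point is only that the sample size scales with d̃ = ∑_i τ̃_i and the probabilities become p_i = τ̃_i/d̃. The inflation rule and the constants in k are our own arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps = 4000, 10, 0.25
A = rng.standard_normal((n, d))

U, _ = np.linalg.qr(A)
tau = np.sum(U**2, axis=1)                 # exact leverage scores, sum to d

# Overestimates tau_tilde_i >= tau_i (an artificial inflation just for illustration).
tau_tilde = np.maximum(tau, 2.0 * d / n)
d_tilde = tau_tilde.sum()                  # d_tilde >= d measures the oversampling

k = int(np.ceil(d_tilde / eps**2 * np.log(d)))   # k scales with d_tilde (constants/delta omitted)
p = tau_tilde / d_tilde
s = rng.choice(n, size=k, p=p)
A_tilde = A[s] / np.sqrt(k * p[s])[:, None]      # the sketch S A

# Empirical subspace-embedding distortion ||U^T S^T S U - I||_2; compare against eps.
SU = U[s] / np.sqrt(k * p[s])[:, None]
print(d_tilde, k, np.linalg.norm(SU.T @ SU - np.eye(d), 2))
```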
Recall that the true leverage scores satisfy ∑_{i=1}^n τ_i = d, so d̃ in Theorem 3 measures how inaccurate our approximate leverage scores are. A nice example to pair with this theorem is how it explains when uniform sampling works:
Let τ* := max_{i∈[n]} τ_i(A). Then run Algorithm 2 with k = O((n·τ*/ε²)·log(d/δ)) and uniform sampling probabilities p_i = 1/n. Then, with probability 1 − δ, we have for all x ∈ ℝ^d

(1−ε)·∥Ax∥₂² ≤ ∥SAx∥₂² ≤ (1+ε)·∥Ax∥₂²
Matrices with large τ* ≈ 1 have a row that is not well represented by any other row in A, so Õ(n) uniform row samples are necessary just to hit that one row. Matrices with small τ* ≈ d/n (the smallest possible value) have rows that all look roughly interchangeable, so sampling Õ(n·τ*) = Õ(d) uniform rows suffices to effectively recover all of A. This explains when uniform sampling can compete with leverage score sampling (while avoiding the cost of computing leverage scores) and when it is much worse, depending on the exact matrix A. This is also why (in)coherence assumptions often appear in the Matrix Completion literature.
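A tiny illustration of this dichotomy (both test matrices are made up for the example): the rows of an i.i.d. Gaussian matrix all look alike, so τ* stays near d/n, whereas planting a single row that alone covers one coordinate direction forces τ* ≈ 1.

```python
import numpy as np

def max_leverage(A):
    """Largest leverage score tau* of A (assumes full column rank)."""
    U, _ = np.linalg.qr(A)
    return np.max(np.sum(U**2, axis=1))

rng = np.random.default_rng(5)
n, d = 10000, 20

A_incoherent = rng.standard_normal((n, d))

A_coherent = rng.standard_normal((n, d))
A_coherent[:, -1] = 0.0          # no row touches the last coordinate direction ...
A_coherent[0, -1] = 1.0          # ... except row 0, which must therefore be sampled

print(max_leverage(A_incoherent), d / n)   # small, on the order of d/n: ~d uniform rows suffice
print(max_leverage(A_coherent))            # ~1: uniform sampling needs ~n rows to see row 0
```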