Subspace Embedding via Leverage Score Sampling

On this page, we show how sampling a matrix via its leverage scores yields a subspace embedding guarantee. Subspace embeddings are core to much of RandNLA, so many papers either assume they are given a subspace embedding or prove a guarantee much like the one below (fine-tuned to their research setting).

Prerequisite: Basic Properties of Leverage Scores

We consider the following algorithm:

Algorithm 1: Leverage Score Subspace Embedding


input: Matrix $\boldsymbol{A}\in\mathbb{R}^{n \times d}$. Number $k$ of subsamples.

output: Sketched matrix $\tilde{\boldsymbol{A}}\in\mathbb{R}^{k \times d}$

  1. Sample indices $s_1,\ldots,s_k\in[n]$ iid with respect to the leverage scores of $\boldsymbol{A}$

  2. Let $p_i := \frac{\tau_i}{d}$ denote the probability of sampling row $i$

  3. Build the sample-and-rescale matrix $\boldsymbol{S}\in\mathbb{R}^{k \times n}$:

    Row $i$ of $\boldsymbol{S}$ has the form $\begin{bmatrix}0&0&\cdots&0&\frac{1}{\sqrt{k p_{s_i}}}&0&\cdots&0\end{bmatrix}$, where the nonzero entry sits in column $s_i$.

  4. Return $\tilde{\boldsymbol{A}} = \boldsymbol{S}\boldsymbol{A}$

This algorithm is equivalent to building $\tilde{\boldsymbol{A}}$ by sampling $k$ rows of $\boldsymbol{A}$ with respect to leverage scores, rescaling those rows by $\frac{1}{\sqrt{k p_{s_i}}}$, and stacking those rows up. The matrix $\boldsymbol{S}$ is just a formalization that makes it easier to analyze this algorithm.
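For concreteness, here is a minimal NumPy sketch of Algorithm 1. The helper names (`leverage_scores`, `leverage_score_sketch`) are ours, and we assume $\boldsymbol{A}$ has full column rank so that the exact leverage scores can be read off a thin QR factorization.

```python
# Minimal sketch of Algorithm 1 (helper names are ours, not from any library).
# Assumes A has full column rank so the leverage scores sum to d.
import numpy as np

def leverage_scores(A):
    """Exact leverage scores of A: squared row norms of an orthonormal basis."""
    Q, _ = np.linalg.qr(A)            # thin QR; columns of Q span col(A)
    return np.sum(Q**2, axis=1)       # tau_i = ||i-th row of Q||_2^2

def leverage_score_sketch(A, k, rng=None):
    """Return S @ A: sample k rows iid with p_i = tau_i / d, then rescale."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    p = leverage_scores(A) / d        # sampling probabilities, sum to 1
    idx = rng.choice(n, size=k, p=p)  # s_1, ..., s_k
    return A[idx] / np.sqrt(k * p[idx])[:, None]  # stack sampled, rescaled rows
```

Computing a full QR of $\boldsymbol{A}$ just to sketch it defeats the purpose in practice; the oversampling discussion later on this page is exactly what lets us plug in cheaper approximate leverage scores instead.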

Theorem 1: Subspace Embedding

Fix $\varepsilon > 0$, $\delta > 0$, and run the above algorithm with $k = O(\frac{d\log(\frac d\delta)}{\varepsilon^2})$. Then with probability $1-\delta$, we have for all $\mathbf{x}\in\mathbb{R}^{d}$

$$ (1-\varepsilon)\|\boldsymbol{A}\mathbf{x}\|_2^2 \leq \|\boldsymbol{S}\boldsymbol{A}\mathbf{x}\|_2^2 \leq (1+\varepsilon)\|\boldsymbol{A}\mathbf{x}\|_2^2 $$


To prove this result, we will first massage the Subspace Embedding Guarantee into a more approachable form.

Lemma 1: Equivalent form of Subspace Embedding

Let $\boldsymbol{U}\in\mathbb{R}^{n \times d}$ be any matrix with orthonormal columns that span the column space of $\boldsymbol{A}$. Then the condition in Theorem 1 is equivalent to

$$ \|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U} - \boldsymbol{I}\|_{2}\leq\varepsilon $$

Proof. First, let $\boldsymbol{U}\in\mathbb{R}^{n \times d}$ be any matrix with orthonormal columns with the same column space as $\boldsymbol{A}$ (e.g. the $\boldsymbol{U}$ matrix in the SVD of $\boldsymbol{A}$). Then, for any $\mathbf{x}\in\mathbb{R}^d$ we have a one-to-one map to a vector $\mathbf{y}\in\mathbb{R}^d$ such that $\boldsymbol{A}\mathbf{x}=\boldsymbol{U}\mathbf{y}$. So, we need to show that

$$ (1-\varepsilon)\|\boldsymbol{U}\mathbf{y}\|_2^2 \leq \|\boldsymbol{S}\boldsymbol{U}\mathbf{y}\|_2^2 \leq (1+\varepsilon)\|\boldsymbol{U}\mathbf{y}\|_2^2 \qquad \forall\mathbf{y}\in\mathbb{R}^d $$

Next, we expand the $\ell_2$ norms:

$$ (1-\varepsilon)\mathbf{y}^\intercal\boldsymbol{U}^\intercal\boldsymbol{U}\mathbf{y} \leq \mathbf{y}^\intercal\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}\mathbf{y} \leq (1+\varepsilon)\mathbf{y}^\intercal\boldsymbol{U}^\intercal\boldsymbol{U}\mathbf{y} \qquad \forall\mathbf{y}\in\mathbb{R}^d $$

Note that $\boldsymbol{U}^\intercal\boldsymbol{U}=\boldsymbol{I}$ by definition, so we get

$$ (1-\varepsilon)\mathbf{y}^\intercal\mathbf{y} \leq \mathbf{y}^\intercal\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}\mathbf{y} \leq (1+\varepsilon)\mathbf{y}^\intercal\mathbf{y} \qquad \forall\mathbf{y}\in\mathbb{R}^d $$

For $\mathbf{y}=\mathbf{0}$, the claim trivially holds, so assuming $\mathbf{y}\neq\mathbf{0}$ we divide by $\mathbf{y}^\intercal\mathbf{y}$ on both sides:

$$ (1-\varepsilon) \leq \frac{\mathbf{y}^\intercal\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}\mathbf{y}}{\mathbf{y}^\intercal\mathbf{y}} \leq (1+\varepsilon) \qquad \forall\mathbf{y}\in\mathbb{R}^d $$

By the Courant–Fischer–Weyl min-max principle, this is equivalent to guaranteeing that

$$ 1-\varepsilon \leq \lambda_i(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}) \leq 1+\varepsilon \qquad \forall i\in[d] $$

Rearranging the claim, this becomes

$$ \left|\lambda_i(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}) - 1\right| \leq \varepsilon \qquad \forall i\in[d] $$

Or, equivalently,

$$ \|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U} - \boldsymbol{I}\|_2 \leq \varepsilon $$

$\blacksquare$

Note that this guarantee seems reasonable. Below, we verify that $\mathbb{E}[\boldsymbol{S}^\intercal\boldsymbol{S}]=\boldsymbol{I}$, so we have $\mathbb{E}[\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}] = \boldsymbol{U}^\intercal\boldsymbol{U} = \boldsymbol{I}$, and thus $\|\mathbb{E}[\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}] - \boldsymbol{I}\|_2 = 0$. In other words, this guarantee looks like the concentration of a random variable around its mean.

Lemma 2: Unbiased Expectation

The sample-and-rescale matrix $\boldsymbol{S}$ has $\mathbb{E}[\boldsymbol{S}^\intercal\boldsymbol{S}]=\boldsymbol{I}$.

Proof. First, we let $\mathbf{r}_i\in\mathbb{R}^n$ denote row $i$ of $\boldsymbol{S}$, viewed as a column vector:

$$ \mathbf{r}_i^\intercal = \begin{bmatrix}0&0&\cdots&0&\frac{1}{\sqrt{k p_{s_i}}}&0&\cdots&0\end{bmatrix} $$

We can equivalently define $\mathbf{r}_i$ entrywise, with an indicator variable: $[\mathbf{r}_i]_j = \frac1{\sqrt{k p_j}} \chi_{[s_i = j]}$. By the outer-product form of matrix-matrix multiplication, note that

$$ \mathbb{E}[\boldsymbol{S}^\intercal\boldsymbol{S}] = \mathbb{E}\left[\sum_{i=1}^k \mathbf{r}_i\mathbf{r}_i^\intercal \right] = k \, \mathbb{E}[\mathbf{r}_1\mathbf{r}_1^\intercal] $$

So, it suffices to analyze the expected value of $\mathbf{r}_1\mathbf{r}_1^\intercal$. In particular, note that $\mathbf{r}_1\mathbf{r}_1^\intercal$ is a diagonal matrix whose $j^{th}$ diagonal entry is $\frac1{k p_j} \chi_{[s_1 = j]}$. We can then compute the expected value of an arbitrary diagonal entry of $\mathbf{r}_1\mathbf{r}_1^\intercal$:

$$ \mathbb{E}\big[[\mathbf{r}_1\mathbf{r}_1^\intercal]_{j,j}\big] = \mathbb{E}\left[\tfrac1{k p_j} \chi_{[s_1 = j]}\right] = \tfrac1{k p_j} \Pr[s_1 = j] = \tfrac1{k p_j} p_j = \tfrac1{k} $$

And so, we have $\mathbb{E}[\mathbf{r}_1\mathbf{r}_1^\intercal] = \frac1k\boldsymbol{I}$, and thus $\mathbb{E}[\boldsymbol{S}^\intercal\boldsymbol{S}] = k\,\mathbb{E}[\mathbf{r}_1\mathbf{r}_1^\intercal]=\boldsymbol{I}$.

$\blacksquare$
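As a quick sanity check of Lemma 2, the following self-contained snippet (our own toy setup, not part of the analysis) averages $\boldsymbol{S}^\intercal\boldsymbol{S}$ over many independent draws of $\boldsymbol{S}$ and confirms that the average approaches $\boldsymbol{I}$.

```python
# Monte Carlo check of Lemma 2: the average of S^T S over many draws of S
# should approach the n x n identity. All names and constants here are ours.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, trials = 100, 5, 20, 5000
A = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(A)                     # orthonormal basis for col(A)
p = np.sum(Q**2, axis=1) / d               # leverage-score sampling probabilities

avg = np.zeros((n, n))
for _ in range(trials):
    idx = rng.choice(n, size=k, p=p)       # s_1, ..., s_k for one draw of S
    S = np.zeros((k, n))
    S[np.arange(k), idx] = 1.0 / np.sqrt(k * p[idx])
    avg += S.T @ S / trials

print(np.linalg.norm(avg - np.eye(n), 2))  # shrinks toward 0 as `trials` grows
```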


With this reformulation, we now turn to ensuring the subspace embedding guarantee $\|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U} - \boldsymbol{I}\|_2 \leq \varepsilon$. In particular, we import the following theorem:

Theorem 2: Matrix Bernstein (Simplified)

Let $\boldsymbol{R}_1,\ldots,\boldsymbol{R}_k\in\mathbb{R}^{d \times d}$ be iid symmetric random matrices such that $\mathbb{E}[\boldsymbol{R}]=\boldsymbol{0}$, $\|\boldsymbol{R}\|_2\leq\gamma$, and $\|\mathbb{E}[\boldsymbol{R}^2]\|_2\leq\sigma^2$. Then,

$$ \Pr\left[\left\|\tfrac1k\sum_{i=1}^k \boldsymbol{R}_i\right\|_2 \geq \varepsilon\right] \leq 2d\,e^{\frac{-k\varepsilon^2}{2\sigma^2+\frac{2}{3}\gamma\varepsilon}} $$

(This is a heavy simplification of Theorem 6.1.1 from Tropp (2015))
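To get a feel for the bound, here is a toy simulation of our own (not from Tropp (2015)): we take $\boldsymbol{R} = \boldsymbol{I} - d\,\mathbf{e}_j\mathbf{e}_j^\intercal$ with $j$ uniform over $[d]$, which satisfies the hypotheses with $\gamma = \sigma^2 = d-1$, and compare the empirical tail probability to the bound.

```python
# Toy simulation of Matrix Bernstein (our own construction): R = I - d e_j e_j^T
# with j uniform in [d], so E[R] = 0, ||R||_2 = d-1, and E[R^2] = (d-1) I.
import numpy as np

rng = np.random.default_rng(3)
d, k, eps, trials = 5, 400, 0.5, 2000
gamma = sigma2 = d - 1

exceed = 0
for _ in range(trials):
    counts = np.bincount(rng.integers(d, size=k), minlength=d)
    M = np.eye(d) - d * np.diag(counts / k)      # equals (1/k) sum_i R_i
    exceed += np.linalg.norm(M, 2) >= eps        # spectral norm of the average

bound = 2 * d * np.exp(-k * eps**2 / (2 * sigma2 + (2 / 3) * gamma * eps))
print(exceed / trials, "<=", bound)              # empirical tail vs. the bound
```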

We now prove the main theorem:

Proof. We decompose $\boldsymbol{I}-\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U} = \frac1k\sum_{i=1}^k \boldsymbol{R}_i$ for some matrices $\boldsymbol{R}_i$. First, note that the $i^{th}$ row of $\boldsymbol{S}\boldsymbol{U}$ is $\frac{1}{\sqrt{k p_{s_i}}} \mathbf{u}_{s_i}^\intercal$, where $\mathbf{u}_j\in\mathbb{R}^d$ denotes the $j^{th}$ row of $\boldsymbol{U}$, viewed as a column vector. In particular, by the outer-product view of matrix-matrix products, note that

$$ \boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U} = (\boldsymbol{S}\boldsymbol{U})^\intercal(\boldsymbol{S}\boldsymbol{U}) = \sum_{i=1}^k \tfrac{1}{kp_{s_i}} \mathbf{u}_{s_i}\mathbf{u}_{s_i}^\intercal $$

This motivates us to let $\boldsymbol{R}_i := \boldsymbol{I} - \frac{1}{p_{s_i}} \mathbf{u}_{s_i}\mathbf{u}_{s_i}^\intercal$ (notice $k$ is dropped from the denominator) be our iid matrices.

  • By the above equation, we know that $\frac1k\sum_{i=1}^k\boldsymbol{R}_i = \boldsymbol{I}-\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}$.

  • $\mathbb{E}[\boldsymbol{R}] = \boldsymbol{I} - \sum_{j=1}^n p_j \frac1{p_j}\mathbf{u}_j\mathbf{u}_j^\intercal = \boldsymbol{I} - \sum_{j=1}^n \mathbf{u}_j\mathbf{u}_j^\intercal = \boldsymbol{0}$

  • For all $j\in[n]$, $\|\boldsymbol{R}\|_2 \leq \|\boldsymbol{I}\|_2 + \frac1{p_j}\|\mathbf{u}_j\mathbf{u}_j^\intercal\|_2 = 1 + \frac{d}{\|\mathbf{u}_j\|_2^2} \|\mathbf{u}_j\|_2^2 = 1 + d$ (using Lemma 3 Part 4 from here)

  • We verify below that $\mathbb{E}[\boldsymbol{R}^2] = (d-1)\boldsymbol{I}$, so that $\|\mathbb{E}[\boldsymbol{R}^2]\|_2 = d-1$:

$$\begin{aligned} \mathbb{E}[\boldsymbol{R}^2] &= \mathbb{E}\left[\boldsymbol{I} - \tfrac2{p_j} \mathbf{u}_j\mathbf{u}_j^\intercal + \tfrac1{p_j^2} \mathbf{u}_j\mathbf{u}_j^\intercal\mathbf{u}_j\mathbf{u}_j^\intercal\right] \\ &= \boldsymbol{I} - 2\boldsymbol{I} + \mathbb{E}\left[\tfrac1{p_j^2} \mathbf{u}_j\mathbf{u}_j^\intercal\mathbf{u}_j\mathbf{u}_j^\intercal\right] \\ &= \boldsymbol{I} - 2\boldsymbol{I} + \sum_{j=1}^n p_j \tfrac1{p_j^2} \mathbf{u}_j\mathbf{u}_j^\intercal\mathbf{u}_j\mathbf{u}_j^\intercal \\ &= \boldsymbol{I} - 2\boldsymbol{I} + \sum_{j=1}^n \tfrac1{p_j} \|\mathbf{u}_j\|_2^2 \, \mathbf{u}_j\mathbf{u}_j^\intercal \\ &= \boldsymbol{I} - 2\boldsymbol{I} + \sum_{j=1}^n \tfrac{d}{\|\mathbf{u}_j\|_2^2} \|\mathbf{u}_j\|_2^2 \, \mathbf{u}_j\mathbf{u}_j^\intercal \\ &= \boldsymbol{I} - 2\boldsymbol{I} + d\sum_{j=1}^n \mathbf{u}_j\mathbf{u}_j^\intercal \\ &= \boldsymbol{I} - 2\boldsymbol{I} + d\,\boldsymbol{I} \\ &= (d-1)\boldsymbol{I} \end{aligned}$$

And so, by the Matrix Bernstein bound,

$$ \Pr[\|\boldsymbol{I}-\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}\|_2 \geq \varepsilon] \leq 2d\, e^{-\frac{k\varepsilon^2}{2(d-1) + \frac23 (d+1)\varepsilon}} $$

Plugging in $k = O(\frac{d \log(\frac d\delta)}{\varepsilon^2})$ makes the right-hand side at most $\delta$: for $\varepsilon \leq 1$ the denominator $2(d-1) + \frac23(d+1)\varepsilon$ is at most $Cd$ for an absolute constant $C$, so taking $k \geq C d \log(\frac{2d}{\delta}) / \varepsilon^2$ makes the exponent at most $-\log(\frac{2d}{\delta})$ and the bound at most $\delta$. This completes the proof.

$\blacksquare$
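The guarantee is also easy to check numerically. The sketch below uses our own test matrix, sets the constant hidden in the $O(\cdot)$ to $10$, and evaluates the quantity from Lemma 1 directly.

```python
# Empirical check of Theorem 1 on a synthetic matrix with non-uniform rows.
# The constant 10 in front of the sample size is our own choice.
import numpy as np

rng = np.random.default_rng(1)
n, d, eps, delta = 20000, 10, 0.25, 0.1
A = rng.standard_normal((n, d)) * np.logspace(0, 3, n)[:, None]  # skewed row norms
U, _ = np.linalg.qr(A)                           # orthonormal basis for col(A)
tau = np.sum(U**2, axis=1)                       # leverage scores (sum to d)
p = tau / d
k = int(10 * d * np.log(d / delta) / eps**2)     # sample size from Theorem 1

idx = rng.choice(n, size=k, p=p)
SU = U[idx] / np.sqrt(k * p[idx])[:, None]       # the rows of S U
err = np.linalg.norm(SU.T @ SU - np.eye(d), 2)   # the quantity from Lemma 1
print(err, err <= eps)                           # expected: True with high probability
```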

Oversampling: Using approximate leverage scores

Here we discuss a very important and common extension of row sampling. As stated, Theorem 1 only works if we compute the leverage scores of $\boldsymbol{A}$ exactly. This is not necessary: approximate leverage scores suffice, as characterized below.

Algorithm 2: Subspace Embedding from Given Sampling Probabilities


input: Matrix $\boldsymbol{A}\in\mathbb{R}^{n \times d}$. Number $k$ of subsamples. Sampling probabilities $p_1,\ldots,p_n$.

output: Sketched matrix $\tilde{\boldsymbol{A}}\in\mathbb{R}^{k \times d}$

  1. Sample indices $s_1,\ldots,s_k\in[n]$ iid with respect to $p_1,\ldots,p_n$

  2. Build the sample-and-rescale matrix $\boldsymbol{S}\in\mathbb{R}^{k \times n}$:

    Row $i$ of $\boldsymbol{S}$ has the form $\begin{bmatrix}0&0&\cdots&0&\frac{1}{\sqrt{k p_{s_i}}}&0&\cdots&0\end{bmatrix}$, where the nonzero entry sits in column $s_i$.

  3. Return $\tilde{\boldsymbol{A}} = \boldsymbol{S}\boldsymbol{A}$
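A minimal NumPy sketch of Algorithm 2 follows (the function name `row_sample_sketch` is ours): it is the same sample-and-rescale construction as before, except the caller supplies the sampling probabilities.

```python
# Minimal sketch of Algorithm 2: sample-and-rescale with caller-supplied
# probabilities p_1, ..., p_n (function name is ours).
import numpy as np

def row_sample_sketch(A, p, k, rng=None):
    """Return S @ A, where S samples k rows iid from p and rescales them."""
    rng = np.random.default_rng(rng)
    n, _ = A.shape
    idx = rng.choice(n, size=k, p=p)             # s_1, ..., s_k ~ p, iid
    return A[idx] / np.sqrt(k * p[idx])[:, None]

# For Theorem 3 below, p comes from overestimates tau_tilde >= tau:
#   p = tau_tilde / tau_tilde.sum(),  k = O(tau_tilde.sum() * log(d/delta) / eps**2)
```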

 

Theorem 3: Subspace Embedding (Oversampling)

Fix $\varepsilon > 0$, $\delta > 0$. Let $\tau_1,\ldots,\tau_n$ be the leverage scores for $\boldsymbol{A}\in\mathbb{R}^{n \times d}$, and let $\tilde\tau_1,\ldots,\tilde\tau_n$ be overestimates of the leverage scores, so that $\tilde\tau_i \geq \tau_i$. Let $\tilde d := \sum_{i=1}^n \tilde\tau_i$. Then run Algorithm 2 with $k = O(\frac{\tilde d\log(\frac d\delta)}{\varepsilon^2})$ and sampling probabilities $p_i = \frac{\tilde\tau_i}{\tilde d}$. Then with probability $1-\delta$, we have for all $\mathbf{x}\in\mathbb{R}^{d}$

$$ (1-\varepsilon)\|\boldsymbol{A}\mathbf{x}\|_2^2 \leq \|\boldsymbol{S}\boldsymbol{A}\mathbf{x}\|_2^2 \leq (1+\varepsilon)\|\boldsymbol{A}\mathbf{x}\|_2^2 $$

Proof. This proof almost exactly matches that of Theorem 1, except we bound a couple of terms in the Matrix Bernstein bound a little differently:

  • $\|\boldsymbol{R}\|_2 \leq \|\boldsymbol{I}\|_2 + \frac1{p_j}\|\mathbf{u}_j\mathbf{u}_j^\intercal\|_2 = 1 + \frac{\tilde d}{\tilde\tau_j} \|\mathbf{u}_j\|_2^2 = 1 + \frac{\tilde d}{\tilde \tau_j} \tau_j \leq 1 + \tilde d$

  • And the variance analysis:

$$\begin{aligned} \mathbb{E}[\boldsymbol{R}^2] &= \mathbb{E}\left[\boldsymbol{I} - \tfrac2{p_j} \mathbf{u}_j\mathbf{u}_j^\intercal + \tfrac1{p_j^2} \mathbf{u}_j\mathbf{u}_j^\intercal\mathbf{u}_j\mathbf{u}_j^\intercal\right] \\ &= \boldsymbol{I} - 2\boldsymbol{I} + \mathbb{E}\left[\tfrac1{p_j^2} \mathbf{u}_j\mathbf{u}_j^\intercal\mathbf{u}_j\mathbf{u}_j^\intercal\right] \\ &= \boldsymbol{I} - 2\boldsymbol{I} + \sum_{j=1}^n p_j \tfrac1{p_j^2} \mathbf{u}_j\mathbf{u}_j^\intercal\mathbf{u}_j\mathbf{u}_j^\intercal \\ &= \boldsymbol{I} - 2\boldsymbol{I} + \sum_{j=1}^n \tfrac1{p_j} \|\mathbf{u}_j\|_2^2 \, \mathbf{u}_j\mathbf{u}_j^\intercal \\ &= \boldsymbol{I} - 2\boldsymbol{I} + \sum_{j=1}^n \tfrac{\tilde d}{\tilde\tau_j} \tau_j \, \mathbf{u}_j\mathbf{u}_j^\intercal \\ &\preceq \boldsymbol{I} - 2\boldsymbol{I} + \tilde d\sum_{j=1}^n \mathbf{u}_j\mathbf{u}_j^\intercal \\ &= \boldsymbol{I} - 2\boldsymbol{I} + \tilde d\,\boldsymbol{I} \\ &= (\tilde d-1)\boldsymbol{I} \end{aligned}$$

so that $\|\mathbb{E}[\boldsymbol{R}^2]\|_2 \leq \tilde d - 1$, from which the result follows by Matrix Bernstein.

$\blacksquare$

Recall that the true leverage scores have $\sum_{i=1}^n\tau_i = d$, so $\tilde d$ in Theorem 3 measures how inaccurate our approximate leverage scores are. A nice example to pair with this theorem is how it explains when uniform sampling works:

Corollary 1: Subspace Embedding via Uniform Sampling

Let $\tau^* := \max_{i\in[n]} \tau_i(\boldsymbol{A})$. Then run Algorithm 2 with $k = O(\frac{n\tau^* \log(\frac{d}{\delta})}{\varepsilon^2})$ and uniform probabilities $p_i = \frac1n$. Then, with probability $1-\delta$, we have for all $\mathbf{x}\in\mathbb{R}^{d}$

$$ (1-\varepsilon)\|\boldsymbol{A}\mathbf{x}\|_2^2 \leq \|\boldsymbol{S}\boldsymbol{A}\mathbf{x}\|_2^2 \leq (1+\varepsilon)\|\boldsymbol{A}\mathbf{x}\|_2^2 $$

Matrices with large $\tau^*\approx1$ have a unique row that is not well represented by any other row in $\boldsymbol{A}$, so $\tilde O(n)$ uniform row samples are necessary just to hit that one unique row. Matrices with small $\tau^*\approx\frac dn$ have rows that all look roughly interchangeable, so sampling just $\tilde O(d)$ uniform rows suffices to effectively recover all of $\boldsymbol{A}$, matching the guarantee for exact leverage score sampling. This explains when uniform sampling is nearly as good as leverage score sampling and when it is much worse, depending on the exact matrix $\boldsymbol{A}$. This is also why (in)coherence assumptions often appear in the matrix completion literature.
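The following small experiment (our own synthetic example) makes this concrete: planting one dominant row drives $\tau^*$ close to $1$, and uniform sampling with $k \ll n$ usually misses that row, while leverage score sampling almost always catches it.

```python
# Coherence demo (our own synthetic example): one dominant row gives tau* ~ 1.
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 10000, 5, 200
A = rng.standard_normal((n, d))
A[0] *= 1e3                                   # a single dominant row

U, _ = np.linalg.qr(A)
tau = np.sum(U**2, axis=1)                    # leverage scores
print(tau.max())                              # close to 1: a highly coherent matrix

idx_unif = rng.choice(n, size=k)              # uniform sampling
idx_lev = rng.choice(n, size=k, p=tau / d)    # leverage score sampling
print(0 in idx_unif, 0 in idx_lev)            # typically prints: False True
```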

See Also

There are many extensions and implementations of subspace embedding in the literature:

  • Let me know if anything is missing

Bibliography

Joel A. Tropp. An Introduction to Matrix Concentration Inequalities. Foundations and Trends in Machine Learning, 8(1–2):1–230, 2015.