Sample indices s_1, …, s_k ∈ [n] i.i.d. with respect to the leverage scores of A
Let p_i := τ_i/d denote the probability of sampling row i
Build the sample-and-rescale matrix S ∈ ℝ^{k×n}:
Row i of S has the form [0 0 ⋯ 0 1/√(k·p_{s_i}) 0 ⋯ 0], where the nonzero entry is in position s_i.
Return Ã = SA
This algorithm is equivalent to building Ã by sampling k rows of A with respect to leverage scores, rescaling each sampled row by 1/√(k·p_{s_i}), and stacking those rows up. The matrix S is just a formalization that makes this algorithm easier to analyze.
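To make this concrete, here is a minimal NumPy sketch of the sample-and-rescale procedure. The function name, the QR-based computation of exact leverage scores, and the test matrix are our own illustrative choices, not part of the algorithm statement.

```python
import numpy as np

def leverage_score_sample(A, k, rng):
    """Sample-and-rescale sketch (Algorithm 2): returns S and A_tilde = S @ A."""
    n, d = A.shape
    # Exact leverage scores: tau_i = ||u_i||_2^2, where u_i are the rows of an
    # orthonormal basis U for the column space of A (via QR; assumes full column rank).
    U, _ = np.linalg.qr(A)
    tau = np.sum(U**2, axis=1)             # leverage scores; they sum to d
    p = tau / tau.sum()                    # sampling probabilities p_i = tau_i / d
    s = rng.choice(n, size=k, p=p)         # iid indices s_1, ..., s_k
    S = np.zeros((k, n))
    S[np.arange(k), s] = 1.0 / np.sqrt(k * p[s])   # row i of S: 1/sqrt(k p_{s_i}) in column s_i
    return S, S @ A

rng = np.random.default_rng(0)
A = rng.standard_normal((5000, 20))
S, A_tilde = leverage_score_sample(A, k=1000, rng=rng)
print(A_tilde.shape)   # (1000, 20)
```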
Let U ∈ ℝ^{n×d} be any matrix with orthonormal columns that span the column space of A. Then the condition in Theorem 1 is equivalent to ∥U⊺S⊺SU − I∥₂ ≤ ε.
Proof. First, let U ∈ ℝ^{n×d} be any matrix with orthonormal columns and the same column space as A (e.g. the U matrix in the SVD of A). Then, for any x ∈ ℝ^d we have a one-to-one map to a vector y ∈ ℝ^d such that Ax = Uy. So, we need to show that

(1−ε)·∥Uy∥₂² ≤ ∥SUy∥₂² ≤ (1+ε)·∥Uy∥₂²  for all y ∈ ℝ^d.

Next, we expand the ℓ₂ norms:

(1−ε)·y⊺U⊺Uy ≤ y⊺U⊺S⊺SUy ≤ (1+ε)·y⊺U⊺Uy  for all y ∈ ℝ^d.

Note that U⊺U = I by definition, so we get

(1−ε)·y⊺y ≤ y⊺U⊺S⊺SUy ≤ (1+ε)·y⊺y  for all y ∈ ℝ^d.

For y = 0 the claim trivially holds, so assuming y ≠ 0 we can divide by y⊺y on both sides:

1−ε ≤ (y⊺U⊺S⊺SUy)/(y⊺y) ≤ 1+ε  for all y ≠ 0.

By the Courant–Fischer–Weyl min-max principle, this is equivalent to guaranteeing that

1−ε ≤ λ_i(U⊺S⊺SU) ≤ 1+ε  for all i ∈ [d].

Rearranging, this becomes

|λ_i(U⊺S⊺SU) − 1| ≤ ε  for all i ∈ [d],

or, equivalently,
∥U⊺S⊺SU − I∥₂ ≤ ε
■
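As a quick numerical sanity check of this equivalence (with an arbitrary made-up test matrix), the worst-case distortion of ∥SAx∥₂²/∥Ax∥₂² over random directions x never exceeds ∥U⊺S⊺SU − I∥₂:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 2000, 15, 600
A = rng.standard_normal((n, d)) * np.logspace(0, 2, d)   # arbitrary ill-conditioned test matrix

# Build S as in Algorithm 2 (exact leverage scores via QR).
U, _ = np.linalg.qr(A)
p = np.sum(U**2, axis=1) / d
s = rng.choice(n, size=k, p=p)
S = np.zeros((k, n))
S[np.arange(k), s] = 1.0 / np.sqrt(k * p[s])

eps_spec = np.linalg.norm(U.T @ S.T @ S @ U - np.eye(d), 2)   # ||U^T S^T S U - I||_2

# Distortion of ||SAx||^2 / ||Ax||^2 over many random directions x.
X = rng.standard_normal((d, 2000))
ratios = np.sum((S @ A @ X)**2, axis=0) / np.sum((A @ X)**2, axis=0)
print(eps_spec, np.abs(ratios - 1).max())   # the empirical distortion is at most eps_spec
```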
Note that this guarantee seems reasonable. Below, we verify that E[S⊺S] = I, so we have E[U⊺S⊺SU] = U⊺U = I, and thus ∥E[U⊺S⊺SU] − I∥₂ = 0. In other words, this guarantee asks for concentration of a random matrix around its mean.
Let r_i ∈ ℝ^n denote the ith row of S, viewed as a column vector. We can equivalently define r_i entrywise, with an indicator variable: [r_i]_j = (1/√(k·p_j))·χ[s_i = j]. By the outer-product form of matrix-matrix multiplication, note that
E[S⊺S] = E[∑_{i=1}^k r_i r_i⊺] = k·E[r₁r₁⊺]
So, it suffices to analyze the expected value of r₁r₁⊺. In particular, note that r₁r₁⊺ is a diagonal matrix whose jth diagonal entry is (1/(k·p_j))·χ[s₁ = j]. This lets us compute the expected value of an arbitrary diagonal entry of r₁r₁⊺:

E[(r₁r₁⊺)_{jj}] = Pr[s₁ = j]·(1/(k·p_j)) = p_j·(1/(k·p_j)) = 1/k,

so E[r₁r₁⊺] = (1/k)·I and therefore E[S⊺S] = k·(1/k)·I = I.
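A quick empirical check of this identity, averaging S⊺S over independent draws of the sampling matrix (the test matrix, trial count, and sketch size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k, trials = 100, 8, 50, 5000
A = rng.standard_normal((n, d))

U, _ = np.linalg.qr(A)
p = np.sum(U**2, axis=1) / d

# S^T S is diagonal (each r_i has a single nonzero), so it suffices to track the diagonal.
counts = np.zeros(n)
for _ in range(trials):
    s = rng.choice(n, size=k, p=p)
    np.add.at(counts, s, 1.0 / (k * p[s]))

diag_avg = counts / trials              # Monte Carlo estimate of diag(E[S^T S])
print(np.abs(diag_avg - 1).max())       # close to 0; shrinks as trials grows
```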
Let R₁, …, R_k ∈ ℝ^{d×d} be i.i.d. symmetric random matrices such that E[R] = 0, ∥R∥₂ ≤ γ, and ∥E[R²]∥₂ ≤ σ². Then,

Pr[ ∥(1/k)·∑_{i=1}^k R_i∥₂ ≥ ε ] ≤ 2d·exp( −k·ε² / (2σ² + (2/3)·γ·ε) )
(This is a heavy simplification of Theorem 6.1.1 from Tropp (2015))
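To see how this tail bound turns into a sample-complexity statement, set the right-hand side to δ and solve for k:

2d·exp( −k·ε² / (2σ² + (2/3)·γ·ε) ) ≤ δ   ⟺   k ≥ (2σ² + (2/3)·γ·ε)·ε^{-2}·log(2d/δ),

so k = O((σ² + γ·ε)·ε^{-2}·log(d/δ)) samples suffice for the averaged deviation to be at most ε with probability 1 − δ. We will use this rearrangement once we identify γ and σ² for our particular random matrices.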
We now prove the main theorem:
Proof. We decompose I − U⊺S⊺SU = (1/k)·∑_{i=1}^k R_i for some matrices R_i. First, note that the ith row of SU is (1/√(k·p_{s_i}))·u_{s_i}, where u_j denotes the jth row of U. In particular, by the outer-product view of matrix-matrix products,
U⊺S⊺SU = (SU)⊺(SU) = ∑_{i=1}^k (1/(k·p_{s_i}))·u_{s_i}u_{s_i}⊺
This motivates us to let R_i := I − (1/p_{s_i})·u_{s_i}u_{s_i}⊺ (notice that k is dropped from the scaling) be our i.i.d. matrices.
By the above equation, we then have (1/k)·∑_{i=1}^k R_i = I − U⊺S⊺SU. It remains to verify the three conditions of Matrix Bernstein.
E[R] = I − ∑_{j=1}^n p_j·(1/p_j)·u_ju_j⊺ = I − ∑_{j=1}^n u_ju_j⊺ = 0
For all j ∈ [n], ∥R∥₂ ≤ ∥I∥₂ + (1/p_j)·∥u_ju_j⊺∥₂ = 1 + (d/∥u_j∥₂²)·∥u_j∥₂² = 1 + d (using Lemma 3 Part 4 from here), so we may take γ = d + 1.
Finally, we verify the variance term E[R²] = (d−1)·I. Expanding the square and taking the expectation over the sampled row index,

E[R²] = E[I − (2/p_{s₁})·u_{s₁}u_{s₁}⊺ + (1/p_{s₁}²)·u_{s₁}u_{s₁}⊺u_{s₁}u_{s₁}⊺] = I − 2I + ∑_{j=1}^n p_j·(1/p_j²)·∥u_j∥₂²·u_ju_j⊺ = I − 2I + ∑_{j=1}^n (d/∥u_j∥₂²)·∥u_j∥₂²·u_ju_j⊺ = I − 2I + d·I = (d−1)·I,

so ∥E[R²]∥₂ = d − 1 and we may take σ² = d − 1. Plugging γ = d + 1 and σ² = d − 1 into the rearranged Matrix Bernstein bound shows that k = O(ε^{-2}·d·log(d/δ)) samples suffice to have ∥I − U⊺S⊺SU∥₂ ≤ ε with probability 1 − δ, which by the Lemma is the desired guarantee. ■
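As an optional numerical sanity check of the two bounds just used (γ = d + 1 and E[R²] = (d−1)·I), with an arbitrary made-up test matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 6
A = rng.standard_normal((n, d))

U, _ = np.linalg.qr(A)
p = np.sum(U**2, axis=1) / d

# R_j = I - (1/p_j) u_j u_j^T for every possible row index j, stacked as shape (n, d, d).
Rs = np.eye(d) - U[:, :, None] * U[:, None, :] / p[:, None, None]

spec = np.linalg.norm(Rs, ord=2, axis=(1, 2))        # ||R_j||_2 for every j
ER2 = np.einsum('j,jab,jbc->ac', p, Rs, Rs)          # E[R^2] = sum_j p_j R_j^2
print(spec.max() <= d + 1 + 1e-9)                    # True: gamma = d + 1 is valid
print(np.allclose(ER2, (d - 1) * np.eye(d)))         # True: E[R^2] = (d - 1) I
```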
We now discuss a very important and common extension of row sampling. As stated, Theorem 1 only works if we compute the leverage scores of A exactly. This is not necessary: approximate leverage scores suffice, as characterized below:
Fix ε > 0, δ > 0. Let τ₁, …, τ_n be the leverage scores of A ∈ ℝ^{n×d}, and let τ̃₁, …, τ̃_n be overestimates of the leverage scores, so that τ̃_i ≥ τ_i for all i. Let d̃ := ∑_{i=1}^n τ̃_i. Then run Algorithm 2 with k = O((d̃/ε²)·log(d/δ)) and sampling probabilities p_i = τ̃_i/d̃. Then, with probability 1 − δ, we have for all x ∈ ℝ^d

(1−ε)·∥Ax∥₂² ≤ ∥SAx∥₂² ≤ (1+ε)·∥Ax∥₂²
Proof. This proof almost exactly matches that of Theorem 1, except that we bound a couple of the terms in Matrix Bernstein a little differently:
E[R²] = E[I − (2/p_{s₁})·u_{s₁}u_{s₁}⊺ + (1/p_{s₁}²)·u_{s₁}u_{s₁}⊺u_{s₁}u_{s₁}⊺]
= I − 2I + E[(1/p_{s₁}²)·u_{s₁}u_{s₁}⊺u_{s₁}u_{s₁}⊺]
= I − 2I + ∑_{j=1}^n p_j·(1/p_j²)·u_ju_j⊺u_ju_j⊺
= I − 2I + ∑_{j=1}^n (1/p_j)·∥u_j∥₂²·u_ju_j⊺
= I − 2I + ∑_{j=1}^n (d̃·τ_j/τ̃_j)·u_ju_j⊺
⪯ I − 2I + d̃·∑_{j=1}^n u_ju_j⊺   (using τ̃_j ≥ τ_j)
= I − 2I + d̃·I = (d̃−1)·I,

from which the result follows by Matrix Bernstein.
■
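To illustrate Theorem 3 mechanically, here is a sketch where the overestimates τ̃_i are produced artificially by inflating the exact scores (in practice they would come from a cheap approximation); the point is only that the sample size scales with d̃ = ∑_i τ̃_i and the probabilities become p_i = τ̃_i/d̃. The inflation rule and the constants in k are our own arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, eps = 4000, 10, 0.25
A = rng.standard_normal((n, d))

U, _ = np.linalg.qr(A)
tau = np.sum(U**2, axis=1)                 # exact leverage scores, sum to d

# Overestimates tau_tilde_i >= tau_i (an artificial inflation just for illustration).
tau_tilde = np.maximum(tau, 2.0 * d / n)
d_tilde = tau_tilde.sum()                  # d_tilde >= d measures the oversampling

k = int(np.ceil(d_tilde / eps**2 * np.log(d)))   # k scales with d_tilde (constants/delta omitted)
p = tau_tilde / d_tilde
s = rng.choice(n, size=k, p=p)
A_tilde = A[s] / np.sqrt(k * p[s])[:, None]      # the sketch S A

# Empirical subspace-embedding distortion ||U^T S^T S U - I||_2; compare against eps.
SU = U[s] / np.sqrt(k * p[s])[:, None]
print(d_tilde, k, np.linalg.norm(SU.T @ SU - np.eye(d), 2))
```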
Recall that the true leverage scores satisfy ∑_{i=1}^n τ_i = d, so d̃ in Theorem 3 measures how inaccurate our approximate leverage scores are. A nice example to pair with this theorem is how it explains when uniform sampling works:
Let τ* := max_{i∈[n]} τ_i(A). Then run Algorithm 2 with k = O((n·τ*/ε²)·log(d/δ)) and uniform sampling probabilities p_i = 1/n. Then, with probability 1 − δ, we have for all x ∈ ℝ^d

(1−ε)·∥Ax∥₂² ≤ ∥SAx∥₂² ≤ (1+ε)·∥Ax∥₂²
Matrices with large τ* ≈ 1 have a row that is not well represented by any other row in A, so Õ(n) uniform row samples are necessary just to hit that one row. Matrices with small τ* ≈ d/n (the smallest possible value) have rows that all look roughly interchangeable, so sampling Õ(n·τ*) = Õ(d) uniform rows suffices to effectively recover all of A. This explains when uniform sampling can compete with leverage score sampling (while avoiding the cost of computing leverage scores) and when it is much worse, depending on the exact matrix A. This is also why (in)coherence assumptions often appear in the Matrix Completion literature.
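A tiny illustration of this dichotomy (both test matrices are made up for the example): the rows of an i.i.d. Gaussian matrix all look alike, so τ* stays near d/n, whereas planting a single row that alone covers one coordinate direction forces τ* ≈ 1.

```python
import numpy as np

def max_leverage(A):
    """Largest leverage score tau* of A (assumes full column rank)."""
    U, _ = np.linalg.qr(A)
    return np.max(np.sum(U**2, axis=1))

rng = np.random.default_rng(5)
n, d = 10000, 20

A_incoherent = rng.standard_normal((n, d))

A_coherent = rng.standard_normal((n, d))
A_coherent[:, -1] = 0.0          # no row touches the last coordinate direction ...
A_coherent[0, -1] = 1.0          # ... except row 0, which must therefore be sampled

print(max_leverage(A_incoherent), d / n)   # small, on the order of d/n: ~d uniform rows suffice
print(max_leverage(A_coherent))            # ~1: uniform sampling needs ~n rows to see row 0
```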