input: Matrix $A \in \mathbb{R}^{n \times d}$. Query access to $b \in \mathbb{R}^n$. Number $k$ of subsamples.
output: Approximate solution $\tilde{x} \in \mathbb{R}^d$
Sample indices $s_1, \dots, s_k \in [n]$ i.i.d. with respect to the leverage scores of $A$
Let $p_i := \frac{\tau_i}{d}$ denote the probability of sampling row $i$, where $\tau_i$ is the $i$-th leverage score
Build the sample-and-rescale matrix $S \in \mathbb{R}^{k \times n}$:
Row $i$ of $S$ has the form $[\,0 \;\; 0 \;\cdots\; 0 \;\; \tfrac{1}{\sqrt{k p_{s_i}}} \;\; 0 \;\cdots\; 0\,]$, where column $s_i$ holds the nonzero entry.
Return $\tilde{x} = \arg\min_x \|SAx - Sb\|_2$
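To make the procedure concrete, here is a minimal numpy sketch of Algorithm 1. The function names (`leverage_scores`, `active_leverage_regression`) and the choice to compute exact leverage scores via a thin QR decomposition are ours, not the note's; `b_query` models query access to $b$, so $b$ is only read at the $k$ sampled indices.

```python
import numpy as np

def leverage_scores(A):
    # Leverage scores of A: squared row norms of an orthonormal basis
    # for the column span of A (here from a thin QR decomposition).
    Q, _ = np.linalg.qr(A, mode="reduced")
    return np.sum(Q**2, axis=1)  # tau_i; sums to d when A has full column rank

def active_leverage_regression(A, b_query, k, seed=None):
    """Sketch of Algorithm 1: leverage score sampling for active L2 regression.

    A       : (n, d) design matrix, fully known.
    b_query : callable i -> b[i]; b is accessed only at the sampled indices.
    k       : number of row subsamples.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    tau = leverage_scores(A)
    p = tau / tau.sum()                                # p_i = tau_i / d
    idx = rng.choice(n, size=k, replace=True, p=p)     # sample s_1, ..., s_k
    scale = 1.0 / np.sqrt(k * p[idx])                  # rescale by 1 / sqrt(k p_{s_i})
    SA = scale[:, None] * A[idx]                       # rows of S A
    Sb = scale * np.array([b_query(i) for i in idx])   # only k queries to b
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)  # solve the subsampled problem
    return x_tilde
```

A quick sanity check is to compare `np.linalg.norm(A @ x_tilde - b)` against the full-data optimum from `np.linalg.lstsq(A, b, rcond=None)` on a random tall instance.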
Crucially, this algorithm only looks at the rows of b selected by the sampling matrix, and this sampling matrix does not know anything about b. Hence, the algorithm fits in the active regression model. We show the following:
Fix $\varepsilon, \delta > 0$. Let $A \in \mathbb{R}^{n \times d}$ be a tall-and-skinny matrix, and let $b \in \mathbb{R}^n$. Let $\tilde{x}$ be the result of Algorithm 1 run with $k = \Omega(d\log(\frac{d}{\delta}) + \frac{d}{\delta\varepsilon})$. Then, with probability $1-\delta$, we have
$$\|A\tilde{x} - b\|_2 \le (1+\varepsilon)\min_{x \in \mathbb{R}^d}\|Ax - b\|_2$$
Proof. Let $x^* := \arg\min_x \|Ax - b\|_2$ be the true solution to the full $\ell_2$ regression problem. Recall that by setting the gradient of $\|Ax - b\|_2^2$ to zero, we find that $A^\intercal A x^* - A^\intercal b = 0$. Equivalently, $A^\intercal(Ax^* - b) = 0$, or in other words, $Ax^* - b$ is orthogonal to the range of $A$. In particular, $Ax^* - b$ is orthogonal to $A\tilde{x} - Ax^*$, and so by the Pythagorean theorem we get
$$\|A\tilde{x} - b\|_2^2 = \|Ax^* - b\|_2^2 + \|A\tilde{x} - Ax^*\|_2^2$$
Let $U \in \mathbb{R}^{n \times d}$ be a matrix with orthonormal columns that spans the range of $A$ (e.g. the $U$ matrix from the SVD of $A$). Then, for every $x \in \mathbb{R}^d$ there exists some $y \in \mathbb{R}^d$ such that $Ax = Uy$. Let $\tilde{y}$ and $y^*$ be the vectors such that $A\tilde{x} = U\tilde{y}$ and $Ax^* = Uy^*$. Then, the above equation can be rewritten as
$$\|U\tilde{y} - b\|_2^2 = \|Uy^* - b\|_2^2 + \|U\tilde{y} - Uy^*\|_2^2$$
Since $U$ has orthonormal columns, we have $\|U\tilde{y} - Uy^*\|_2^2 = \|\tilde{y} - y^*\|_2^2$. Our goal is to bound the right-hand side by $(1+\varepsilon)\|Ax^* - b\|_2^2$, and so it suffices to show that $\|\tilde{y} - y^*\|_2^2 \le \varepsilon\|Uy^* - b\|_2^2$. Recall from Lemma 1 of the Subspace Embedding Page that since $k = \Omega(d\log(\frac{d}{\delta}))$, with probability $1 - \frac{\delta}{2}$ we have $\|U^\intercal S^\intercal S U - I\|_2 \le \frac{1}{2}$. Then, since $\tilde{y} = \arg\min_y \|SUy - Sb\|_2$, the normal equations give $(SU)^\intercal SU(\tilde{y} - y^*) = (SU)^\intercal S(b - Uy^*)$, and since every eigenvalue of $(SU)^\intercal SU$ is at least $\frac{1}{2}$, we get $\|\tilde{y} - y^*\|_2 \le 2\|(SU)^\intercal S(Uy^* - b)\|_2$. With probability $1 - \frac{\delta}{2}$, the fast matrix multiplication guarantee (see the discussion below) bounds this right-hand side by $\frac{\sqrt{\varepsilon}}{2}\|Uy^* - b\|_2$, using that $U^\intercal(Uy^* - b) = 0$ and $\|U\|_F = \sqrt{d}$. Combining, $\|\tilde{y} - y^*\|_2^2 \le \varepsilon\|Uy^* - b\|_2^2$, which completes the proof.
Notably, this proof uses the sampling probabilities in only two places: showing the subspace embedding guarantee $\|U^\intercal S^\intercal S U - I\|_2 \le \frac{1}{2}$ and showing the fast matrix multiplication guarantee $\|(SU)^\intercal S(Uy^* - b)\|_2 \le \frac{\sqrt{\varepsilon}}{2\sqrt{d}}\|U\|_F\|Uy^* - b\|_2$. Both of these proofs work with inexact leverage score sampling probabilities, namely via oversampling, as shown on their respective pages (Theorem 3 here and Corollary 1 here). We can therefore state a more general algorithm and guarantee for leverage score sampling with the exact same proof analysis:
Fix $\varepsilon > 0$ and $\delta > 0$. Let $\tau_1, \dots, \tau_n$ be the leverage scores of $A \in \mathbb{R}^{n \times d}$, and let $\tilde{\tau}_1, \dots, \tilde{\tau}_n$ be overestimates of the leverage scores, so that $\tilde{\tau}_i \ge \tau_i$. Let $\tilde{d} := \sum_{i=1}^n \tilde{\tau}_i$. Then let $\tilde{x}$ be the output of Algorithm 1 run with $k = \Omega(\tilde{d}\log(\frac{d}{\delta}) + \frac{\tilde{d}}{\varepsilon\delta})$ and probabilities $p_i = \frac{\tilde{\tau}_i}{\tilde{d}}$. With probability $1-\delta$, we then have
$$\|A\tilde{x} - b\|_2^2 \le (1+\varepsilon)\min_{x \in \mathbb{R}^d}\|Ax - b\|_2^2$$
Proof. To get the subspace embedding, we directly appeal to Theorem 3 here. We trace through the fast matrix multiplication more carefully: we appeal to Corollary 1 here with error tolerance $\varepsilon_0 = \sqrt{\frac{\varepsilon}{d}}$, and with $T = \tilde{d}$ since the $\tilde{\tau}_i$ overestimate the squared row norms of $U$. So, we need to sample $k = \Omega(\frac{\tilde{d}}{\varepsilon\delta})$ rows for this step, which is accounted for by the second term in the bound on $k$.
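For the more general guarantee, the only change to the earlier sketch is the sampling distribution. The following is again our own illustrative code, not the note's: it takes any overestimates $\tilde{\tau}_i \ge \tau_i$ (for instance, exact scores inflated by a constant factor, or cheaper-to-compute upper bounds) and samples with $p_i = \tilde{\tau}_i / \tilde{d}$.

```python
import numpy as np

def active_regression_overestimates(A, b_query, tau_tilde, k, seed=None):
    """Same sampling scheme, but with overestimates tau_tilde_i >= tau_i.

    Samples with p_i = tau_tilde_i / d_tilde, where d_tilde = sum(tau_tilde);
    per the guarantee above, k should scale with d_tilde rather than d.
    """
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    p = tau_tilde / tau_tilde.sum()                    # p_i = tau_tilde_i / d_tilde
    idx = rng.choice(n, size=k, replace=True, p=p)
    scale = 1.0 / np.sqrt(k * p[idx])                  # rescale by 1 / sqrt(k p_{s_i})
    SA = scale[:, None] * A[idx]
    Sb = scale * np.array([b_query(i) for i in idx])   # still only k queries to b
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_tilde
```

For example, `tau_tilde = 2.0 * leverage_scores(A)` (with `leverage_scores` from the earlier sketch) is a trivially valid set of overestimates with $\tilde{d} = 2d$.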
The proofs above are a near-exact copy of an unpublished note by Christopher Musco, and they closely track an analysis in Woodruff (2014), which studies regression under oblivious sketching instead of leverage score sampling.
There are resources with similar analyses in the literature:
(Chen & Price (2019)) Explicitly studies active regression under Leverage Functions, which applies to Linear Operators more generally, not just matrices.
(Musco et al. (2019)) Studies active regression beyond the $\ell_2$ norm, such as in the $\ell_p$ norm.
(Meyer & Musco (2020)) The appendix studies active $\ell_2$ regression for both matrices and operators, but where we choose one of many design matrices $A_1, \dots, A_q$, and can only achieve constant-factor error (i.e. $\varepsilon = \Omega(1)$).
(Kapralov et al. (2023)) Contains an analysis of active $\ell_2$ regression, again where we choose one of many design matrices $A_1, \dots, A_q$; it achieves $(1+\varepsilon)$-factor error, but with an $\tilde{O}(\tilde{d}^2)$ dependence.