Active $L_2$ Regression via Leverage Score Sampling

Consider trying to solve $\min_{\mathbf{x}} \|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2$, but under the active regression regime. That is, we see the full matrix $\boldsymbol{A}$ a priori, but do not know the corresponding entries of $\mathbf{b}$. We can observe entries of $\mathbf{b}$, but this is expensive, and we want to read as few entries as possible.

Prerequisite: Subspace Embedding via Leverage Score Sampling (Matrix Bernstein)

Prerequisite: Fast Matrix-Matrix Multiplication

Without this constraint on the information in $\mathbf{b}$, solving $\min_{\mathbf{x}} \|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2$ from subsampling a small number of rows is relatively straightforward: build a subspace embedding for the rows of the matrix $\bar{\boldsymbol{A}} \;{\vcentcolon=}\; [\boldsymbol{A} \ \mathbf{b}]$, sampling by the leverage scores of the expanded matrix.
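
Concretely (a standard sketch-and-solve argument, paraphrased here for orientation rather than quoted from the prerequisite page): if $\boldsymbol{S}$ is an $\varepsilon$-subspace embedding for the column span of $\bar{\boldsymbol{A}}$, then every residual $\boldsymbol{A}\mathbf{x}-\mathbf{b} = \bar{\boldsymbol{A}}\begin{bmatrix}\mathbf{x}\\ -1\end{bmatrix}$ lies in that span, so

$$ (1-\varepsilon)\,\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2^2 \;\leq\; \|\boldsymbol{S}\boldsymbol{A}\mathbf{x}-\boldsymbol{S}\mathbf{b}\|_2^2 \;\leq\; (1+\varepsilon)\,\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2^2 \qquad \text{for all } \mathbf{x}\in\mathbb{R}^d, $$

and minimizing the middle quantity yields a $\frac{1+\varepsilon}{1-\varepsilon}$-approximate minimizer of the original objective. The catch is that the leverage scores of $\bar{\boldsymbol{A}}$ depend on $\mathbf{b}$, so computing them requires reading every entry of $\mathbf{b}$, which the active model forbids.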

We will analyze the following algorithm:

Algorithm 1: Leverage Scores for Active Regression


input: Matrix $\boldsymbol{A}\in\mathbb{R}^{n \times d}$. Query access to $\mathbf{b}\in\mathbb{R}^{n}$. Number $k$ of subsamples.

output: Approximate solution $\tilde{\mathbf{x}}\in\mathbb{R}^{d}$

  1. Sample indices $s_1,\ldots,s_k\in[n]$ i.i.d. with respect to the leverage scores of $\boldsymbol{A}$

  2. Let $p_i \;{\vcentcolon=}\; \frac{\tau_i}{d}$ denote the probability of sampling row $i$

  3. Build the sample-and-rescale matrix $\boldsymbol{S}\in\mathbb{R}^{k \times n}$:

    Row $i$ of $\boldsymbol{S}$ has the form $\begin{bmatrix}0&0&\cdots&0&\frac{1}{\sqrt{k p_{s_i}}}&0&\cdots&0\end{bmatrix}$, where index $s_i$ is the nonzero entry.

  4. Return $\tilde{\mathbf{x}} = \argmin_{\mathbf{x}}\|\boldsymbol{S}\boldsymbol{A}\mathbf{x}-\boldsymbol{S}\mathbf{b}\|_2$
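
The following is a minimal NumPy sketch of Algorithm 1, not taken from the original note: the function name `active_l2_regression` and the `query_b` callback are illustrative stand-ins. It computes leverage scores via a thin QR factorization and only evaluates $\mathbf{b}$ at the sampled indices.

```python
import numpy as np

def active_l2_regression(A, query_b, k, seed=None):
    """Sketch of Algorithm 1: active L2 regression via leverage score sampling.

    A        : (n, d) design matrix, fully observed.
    query_b  : callable taking an index array and returning those entries of b
               (stands in for the expensive label queries of the active model).
    k        : number of sampled rows.
    Returns an approximate minimizer of ||A x - b||_2.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # Leverage scores are the squared row norms of an orthonormal basis for range(A).
    Q, _ = np.linalg.qr(A, mode="reduced")   # n x d, orthonormal columns
    tau = np.sum(Q**2, axis=1)               # leverage scores, summing to d
    p = tau / tau.sum()                      # sampling probabilities p_i = tau_i / d

    # Step 1: sample k indices i.i.d. from p; only now do we query b at those indices.
    idx = rng.choice(n, size=k, replace=True, p=p)
    b_sampled = query_b(idx)

    # Step 3: sample-and-rescale, i.e. multiply row j by 1 / sqrt(k * p_{s_j}).
    scale = 1.0 / np.sqrt(k * p[idx])
    SA = A[idx] * scale[:, None]
    Sb = b_sampled * scale

    # Step 4: solve the small least-squares problem min_x ||S A x - S b||_2.
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_tilde


# Toy usage: in a real active setting, `query_b` would be the costly oracle.
rng = np.random.default_rng(0)
n, d = 100_000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_tilde = active_l2_regression(A, lambda idx: b[idx], k=2_000, seed=1)
```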

Crucially, this algorithm only reads the entries of $\mathbf{b}$ selected by the sampling matrix, and the sampling matrix is built without any knowledge of $\mathbf{b}$. Hence, the algorithm fits in the active regression model. We show the following:

Theorem 1: Active $L_2$ Regression

Let $\boldsymbol{A}\in\mathbb{R}^{n \times d}$ be a tall-and-skinny matrix, and let $\mathbf{b}\in\mathbb{R}^{n}$. Then let $\tilde{\mathbf{x}}$ be the result of Algorithm 1 run with $k = \Omega(d \log(\frac d\delta) + \frac{d}{\delta\varepsilon})$. With probability $1-\delta$, we then have

$$ \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2 \leq (1+\varepsilon) \min_{\mathbf{x}\in\mathbb{R}^{d}}\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2 $$

Proof. Let $\mathbf{x}^* \;{\vcentcolon=}\; \argmin_{\mathbf{x}} \|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2$ be the true solution to the full $L_2$ regression problem. Recall that by setting the derivative of $\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2^2$ to zero, we find that $\boldsymbol{A}^\intercal\boldsymbol{A}\mathbf{x}^*-\boldsymbol{A}^\intercal\mathbf{b}=0$. Equivalently, $\boldsymbol{A}^\intercal(\boldsymbol{A}\mathbf{x}^*-\mathbf{b})=0$, or in other words, $\boldsymbol{A}\mathbf{x}^*-\mathbf{b}$ is orthogonal to the range of $\boldsymbol{A}$. In particular, $\boldsymbol{A}\mathbf{x}^*-\mathbf{b}$ is orthogonal to $\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*$, and so by the Pythagorean theorem we get

$$ \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2^2 = \|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2 + \|\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*\|_2^2 $$
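
To spell out this step (a routine expansion, included only for completeness), write $\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b} = (\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*) + (\boldsymbol{A}\mathbf{x}^*-\mathbf{b})$ and expand the squared norm; the cross term vanishes by the orthogonality above:

$$\begin{aligned} \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2^2 &= \|\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*\|_2^2 + 2(\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*)^\intercal(\boldsymbol{A}\mathbf{x}^*-\mathbf{b}) + \|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2 \\ &= \|\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*\|_2^2 + 2(\tilde{\mathbf{x}}-\mathbf{x}^*)^\intercal\underbrace{\boldsymbol{A}^\intercal(\boldsymbol{A}\mathbf{x}^*-\mathbf{b})}_{=\,0} + \|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2. \end{aligned}$$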

Let $\boldsymbol{U}\in\mathbb{R}^{n \times d}$ be a matrix with orthonormal columns that spans the range of $\boldsymbol{A}$ (e.g. the $\boldsymbol{U}$ matrix from the SVD of $\boldsymbol{A}$). Then, for every $\mathbf{x}\in\mathbb{R}^d$ there exists some $\mathbf{y}\in\mathbb{R}^d$ such that $\boldsymbol{A}\mathbf{x}=\boldsymbol{U}\mathbf{y}$. Let $\tilde{\mathbf{y}}$ and $\mathbf{y}^*$ be the vectors such that $\boldsymbol{A}\tilde{\mathbf{x}}=\boldsymbol{U}\tilde{\mathbf{y}}$ and $\boldsymbol{A}\mathbf{x}^*=\boldsymbol{U}\mathbf{y}^*$. Then, the above equality can be written as

$$ \|\boldsymbol{U}\tilde{\mathbf{y}}-\mathbf{b}\|_2^2 = \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + \|\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{U}\mathbf{y}^*\|_2^2 $$

Since $\boldsymbol{U}$ has orthonormal columns, we have $\|\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{U}\mathbf{y}^*\|_2^2 = \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2^2$. Our goal is to bound the right hand side by $(1+\varepsilon)\|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2$, and so it suffices to show that $\|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2^2 \leq \varepsilon \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2$. Recall from Lemma 1 of the Subspace Embedding Page that since $k = \Omega(d \log(\frac d\delta))$, with probability $1-\frac\delta2$, we have $\|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}-\boldsymbol{I}\|_2 \leq \frac12$. Then, by the triangle inequality,

$$\begin{aligned} \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2 &\leq \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 + \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U} - \boldsymbol{I})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 \\ &\leq \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 + \|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U} - \boldsymbol{I}\|_2\|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2 \\ &\leq \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 + \tfrac{1}{2} \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2 \\ \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2 &\leq 2\|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 \end{aligned}$$
Next, since $\tilde{\mathbf{y}} = \argmin_{\mathbf{y}}\|\boldsymbol{S}\boldsymbol{U}\mathbf{y}-\boldsymbol{S}\mathbf{b}\|_2$, we know that $\boldsymbol{S}\boldsymbol{U}\tilde{\mathbf{y}} - \boldsymbol{S}\mathbf{b}$ is orthogonal to everything in the range of $\boldsymbol{S}\boldsymbol{U}$. That is, $(\boldsymbol{S}\boldsymbol{U})^\intercal(\boldsymbol{S}\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{S}\mathbf{b})=0$, so we get that
$$\begin{aligned} \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 &= \|(\boldsymbol{S}\boldsymbol{U})^\intercal(\boldsymbol{S}\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{S}\boldsymbol{U}\mathbf{y}^*)\|_2 \\ &= \|(\boldsymbol{S}\boldsymbol{U})^\intercal(\boldsymbol{S}\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{S}\mathbf{b}+\boldsymbol{S}\mathbf{b}-\boldsymbol{S}\boldsymbol{U}\mathbf{y}^*)\|_2 \\ &= \|0 + (\boldsymbol{S}\boldsymbol{U})^\intercal(\boldsymbol{S}\mathbf{b}-\boldsymbol{S}\boldsymbol{U}\mathbf{y}^*)\|_2 \\ &= \|(\boldsymbol{S}\boldsymbol{U})^\intercal \boldsymbol{S}(\boldsymbol{U}\mathbf{y}^*-\mathbf{b})\|_2 \end{aligned}$$
This final norm matches the shape of the Fast Matrix-Matrix Multiplication algorithm we studied. In particular, this is approximate multiplication of $\boldsymbol{U}^\intercal$ and the vector $\boldsymbol{U}\mathbf{y}^*-\mathbf{b}$, whose exact product $\boldsymbol{U}^\intercal(\boldsymbol{U}\mathbf{y}^*-\mathbf{b}) = \boldsymbol{U}^\intercal(\boldsymbol{A}\mathbf{x}^*-\mathbf{b})$ is zero, so the norm above is exactly the approximation error of the sampled product. Moreover, the sampling probabilities are proportional to the leverage scores of $\boldsymbol{A}$, which by Lemma 3 of the Basic Properties of Leverage Scores Page are equal to the squared row norms of $\boldsymbol{U}$. So, by Theorem 1 of the Fast Matrix Multiplication page with $k = \Omega(\frac{d}{\delta\varepsilon})$ and with probability $1-\frac\delta2$, we have

$$ \|(\boldsymbol{S}\boldsymbol{U})^\intercal \boldsymbol{S}(\boldsymbol{U}\mathbf{y}^*-\mathbf{b})\|_2 \leq \frac{\sqrt{\varepsilon}}{2\sqrt{d}} \|\boldsymbol{U}\|_F \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2 $$

$\boldsymbol{U}$ has orthonormal columns, so each of its $d$ columns has norm 1, and hence $\|\boldsymbol{U}\|_F^2 = d$. Therefore, we get

$$ \|(\boldsymbol{S}\boldsymbol{U})^\intercal \boldsymbol{S}(\boldsymbol{U}\mathbf{y}^*-\mathbf{b})\|_2 \leq \frac{\sqrt{\varepsilon}}{2} \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2 $$

Backing up, we have overall proven that

$$\begin{aligned} \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2^2 &= \|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2 + \|\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*\|_2^2 \\ &= \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + \|\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{U}\mathbf{y}^*\|_2^2 \\ &= \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2^2 \\ &\leq \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + 4\|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2^2 \\ &\leq \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + \varepsilon\|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 \\ &= (1+\varepsilon)\|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 \\ &= (1+\varepsilon)\|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2 \end{aligned}$$
By a union bound, the two events above hold simultaneously with probability at least $1-\delta$. Taking square roots and using $\sqrt{1+\varepsilon} \leq 1+\varepsilon$ then gives the claimed bound, which completes the proof.
$\blacksquare$


Notably, this proof uses the sampling probabilities in only two places: showing the subspace embedding guarantee $\|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}-\boldsymbol{I}\|_2\leq\frac12$ and showing the fast matrix multiplication bound $\|(\boldsymbol{S}\boldsymbol{U})^\intercal \boldsymbol{S}(\boldsymbol{U}\mathbf{y}^*-\mathbf{b})\|_2 \leq \frac{\sqrt{\varepsilon}}{2\sqrt{d}} \|\boldsymbol{U}\|_F \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2$. Both of these results also hold under inexact leverage score sampling probabilities, namely via oversampling, as shown on their respective pages (Theorem 3 here and Corollary 1 here). We can therefore state a more general algorithm and guarantee for leverage score sampling with the exact same proof analysis:
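
As a quick sanity check on these two quantities (not part of the original note; sizes and constants are arbitrary), the snippet below empirically estimates $\|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}-\boldsymbol{I}\|_2$ and $\|(\boldsymbol{S}\boldsymbol{U})^\intercal\boldsymbol{S}\mathbf{r}\|_2$ for a residual $\mathbf{r}$ orthogonal to the range of $\boldsymbol{A}$, under leverage score sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50_000, 10, 4_000

# Random tall matrix and an orthonormal basis U for its range.
A = rng.standard_normal((n, d))
U, _ = np.linalg.qr(A, mode="reduced")

# A residual direction r orthogonal to range(A), playing the role of Uy* - b.
r = rng.standard_normal(n)
r -= U @ (U.T @ r)

# Leverage score sampling probabilities p_i = ||u_i||^2 / d, then sample and rescale.
p = np.sum(U**2, axis=1) / d
idx = rng.choice(n, size=k, replace=True, p=p)
scale = 1.0 / np.sqrt(k * p[idx])
SU = U[idx] * scale[:, None]
Sr = r[idx] * scale

# Quantity 1: subspace embedding error ||U^T S^T S U - I||_2 (should be well below 1/2).
print(np.linalg.norm(SU.T @ SU - np.eye(d), ord=2))

# Quantity 2: matrix multiplication error ||(SU)^T S r||_2,
# to be compared against ||U||_F ||r||_2 / sqrt(d) = ||r||_2.
print(np.linalg.norm(SU.T @ Sr), np.linalg.norm(r))
```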

Algorithm 2: Leverage Scores for Active Regression


input: Matrix $\boldsymbol{A}\in\mathbb{R}^{n \times d}$. Query access to $\mathbf{b}\in\mathbb{R}^{n}$. Number $k$ of subsamples. Sampling probabilities $p_1,\ldots,p_n$.

output: Approximate solution $\tilde{\mathbf{x}}\in\mathbb{R}^{d}$

  1. Sample indices $s_1,\ldots,s_k\in[n]$ i.i.d. with respect to $p_1,\ldots,p_n$

  2. Build the sample-and-rescale matrix $\boldsymbol{S}\in\mathbb{R}^{k \times n}$:

    Row $i$ of $\boldsymbol{S}$ has the form $\begin{bmatrix}0&0&\cdots&0&\frac{1}{\sqrt{k p_{s_i}}}&0&\cdots&0\end{bmatrix}$, where index $s_i$ is the nonzero entry.

  3. Return $\tilde{\mathbf{x}} = \argmin_{\mathbf{x}}\|\boldsymbol{S}\boldsymbol{A}\mathbf{x}-\boldsymbol{S}\mathbf{b}\|_2$
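
A minimal way to adapt the earlier sketch to Algorithm 2 (again illustrative, not the note's code) is to accept arbitrary sampling probabilities; the function name `active_l2_regression_general` and the capped factor-2 overestimate below are hypothetical choices matching the setup of Theorem 2.

```python
import numpy as np

def active_l2_regression_general(A, query_b, k, p, seed=None):
    """Algorithm 2 sketch: sample rows i.i.d. from user-supplied probabilities p."""
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    idx = rng.choice(n, size=k, replace=True, p=p)
    scale = 1.0 / np.sqrt(k * p[idx])        # rescale row j by 1 / sqrt(k * p_{s_j})
    SA = A[idx] * scale[:, None]
    Sb = query_b(idx) * scale
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_tilde

# Example: leverage score overestimates tau_tilde >= tau, normalized by
# d_tilde = sum(tau_tilde), giving p_i = tau_tilde_i / d_tilde as in Theorem 2.
rng = np.random.default_rng(0)
n, d = 20_000, 15
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

Q, _ = np.linalg.qr(A, mode="reduced")
tau = np.sum(Q**2, axis=1)
tau_tilde = np.minimum(2.0 * tau, 1.0)       # a valid overestimate (tau_i <= 1 always)
p = tau_tilde / tau_tilde.sum()              # d_tilde = tau_tilde.sum()

x_tilde = active_l2_regression_general(A, lambda idx: b[idx], k=3_000, p=p, seed=1)
```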

Theorem 2: Active $L_2$ Regression (Oversampling)

Fix $\varepsilon>0,\delta>0$. Let $\tau_1,\ldots,\tau_n$ be the leverage scores of $\boldsymbol{A}\in\mathbb{R}^{n \times d}$, and let $\tilde\tau_1,\ldots,\tilde\tau_n$ be overestimates of the leverage scores, so that $\tilde\tau_i \geq \tau_i$. Let $\tilde d \;{\vcentcolon=}\; \sum_{i=1}^n \tilde\tau_i$. Then let $\tilde{\mathbf{x}}$ be the output of Algorithm 2 with $k=\Omega(\tilde d \log(\frac d \delta) + \frac{\tilde d}{\varepsilon \delta})$ and probabilities $p_i = \frac{\tilde\tau_i}{\tilde d}$. With probability $1-\delta$, we then have

$$ \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2^2 \leq (1+\varepsilon) \min_{\mathbf{x}\in\mathbb{R}^{d}}\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2^2 $$

Proof. To get the subspace embedding, we directly appeal to Theorem 3 here. We trace through the fast matrix multiplication step more carefully. We appeal to Corollary 1 here with error tolerance $\varepsilon_0 = \sqrt{\frac\varepsilon d}$, and with $T = \tilde d$, since the $\tilde\tau_i$ overestimate the squared row norms of $\boldsymbol{U}$. So, we need to sample

$$ k \geq \frac{1}{\varepsilon_0^2 \delta} \cdot \frac{T}{\|\boldsymbol{U}\|_F^2} = \frac{d}{\varepsilon \delta} \cdot \frac{\tilde d}{d} = \frac{\tilde d}{\varepsilon \delta} $$

rows to maintain correctness.

$\blacksquare$
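
For a rough sense of scale (illustrative numbers only, with the $\Omega$ constant suppressed and $\log$ taken as the natural log): suppose $d = 50$, $\tilde d = 2d = 100$ (a factor-2 overestimate), $\varepsilon = 0.1$, and $\delta = 0.05$. Then

$$ \tilde d \log\Big(\frac d\delta\Big) = 100 \log(1000) \approx 690, \qquad \frac{\tilde d}{\varepsilon\delta} = \frac{100}{0.1 \cdot 0.05} = 20{,}000, $$

so on the order of tens of thousands of entries of $\mathbf{b}$ suffice, independently of $n$.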

See Also

The proofs above are a near-exact copy of an unpublished note by Christopher Musco, which closely tracks an analysis of Woodruff (2014) that studies regression under oblivious sketching instead of leverage score sampling.

There are resources with similar analyses in the literature:

  • (Sarlos (2017)) is the original paper and analysis.

  • (Chen Price (2019)) explicitly studies active regression under Leverage Functions, which extends the setting from matrices to Linear Operators more generally.

  • (Musco et al. (2019)) studies active regression beyond the $\ell_2$ norm, such as in the $\ell_p$ norm.

  • (Meyer Musco (2020)) The appendix studies active $\ell_2$ regression for both matrices and operators, but where we choose one of many design matrices $\boldsymbol{A}_1,\ldots,\boldsymbol{A}_q$, and can only achieve constant factor error (i.e. $\varepsilon = \Omega(1)$).

  • (Kapralov et al. (2023)) Contains an analysis of active $\ell_2$ regression, but where we choose one of many design matrices $\boldsymbol{A}_1,\ldots,\boldsymbol{A}_q$, and can achieve $(1+\varepsilon)$ factor error, but with an $\tilde O(\tilde d^2)$ dependence.

  • Let me know if anything is missing

Bibliography