Active $L_2$ Regression via Leverage Score Sampling

Consider trying to solve $\min_{\mathbf{x}} \|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2$, but under the active regression regime. That is, we see the full matrix $\boldsymbol{A}$ a priori, but do not know the corresponding entries of $\mathbf{b}$. We can observe entries of $\mathbf{b}$, but this is expensive, and we want to read as few entries as possible.

Prerequisite: Subspace Embedding via Leverage Score Sampling (Matrix Bernstein)

Prerequisite: Fast Matrix-Matrix Multiplication

Without this constraint on the information in $\mathbf{b}$, solving $\min_{\mathbf{x}} \|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2$ from subsampling a small number of rows is relatively straightforward: build a subspace embedding for the rows of the matrix $\bar{\boldsymbol{A}} \;{\vcentcolon=}\; [\boldsymbol{A} \ \mathbf{b}]$, sampling by the leverage scores of the expanded matrix.
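
Concretely (a standard sketch-and-solve argument, paraphrased here for orientation rather than quoted from the prerequisite page): if $\boldsymbol{S}$ is an $\varepsilon$-subspace embedding for the column span of $\bar{\boldsymbol{A}}$, then every residual $\boldsymbol{A}\mathbf{x}-\mathbf{b} = \bar{\boldsymbol{A}}\begin{bmatrix}\mathbf{x}\\ -1\end{bmatrix}$ lies in that span, so

$$ (1-\varepsilon)\,\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2^2 \;\leq\; \|\boldsymbol{S}\boldsymbol{A}\mathbf{x}-\boldsymbol{S}\mathbf{b}\|_2^2 \;\leq\; (1+\varepsilon)\,\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2^2 \qquad \text{for all } \mathbf{x}\in\mathbb{R}^d, $$

and minimizing the middle quantity yields a $\frac{1+\varepsilon}{1-\varepsilon}$-approximate minimizer of the original objective. The catch is that the leverage scores of $\bar{\boldsymbol{A}}$ depend on $\mathbf{b}$, so computing them requires reading every entry of $\mathbf{b}$, which the active model forbids.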

We will analyze the following algorithm:

Algorithm 1: Leverage Scores for Active Regression


input: Matrix $\boldsymbol{A}\in\mathbb{R}^{n \times d}$. Query access to $\mathbf{b}\in\mathbb{R}^{n}$. Number $k$ of subsamples.

output: Approximate solution $\tilde{\mathbf{x}}\in\mathbb{R}^{d}$

  1. Sample indices $s_1,\ldots,s_k\in[n]$ i.i.d. with respect to the leverage scores of $\boldsymbol{A}$

  2. Let $p_i \;{\vcentcolon=}\; \frac{\tau_i}{d}$ denote the probability of sampling row $i$

  3. Build the sample-and-rescale matrix $\boldsymbol{S}\in\mathbb{R}^{k \times n}$:

    Row $i$ of $\boldsymbol{S}$ has the form $\begin{bmatrix}0&0&\cdots&0&\frac{1}{\sqrt{k p_{s_i}}}&0&\cdots&0\end{bmatrix}$, where index $s_i$ is the nonzero entry.

  4. Return $\tilde{\mathbf{x}} = \argmin_{\mathbf{x}}\|\boldsymbol{S}\boldsymbol{A}\mathbf{x}-\boldsymbol{S}\mathbf{b}\|_2$
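
The following is a minimal NumPy sketch of Algorithm 1, not taken from the original note: the function name `active_l2_regression` and the `query_b` callback are illustrative stand-ins. It computes leverage scores via a thin QR factorization and only evaluates $\mathbf{b}$ at the sampled indices.

```python
import numpy as np

def active_l2_regression(A, query_b, k, seed=None):
    """Sketch of Algorithm 1: active L2 regression via leverage score sampling.

    A        : (n, d) design matrix, fully observed.
    query_b  : callable taking an index array and returning those entries of b
               (stands in for the expensive label queries of the active model).
    k        : number of sampled rows.
    Returns an approximate minimizer of ||A x - b||_2.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape

    # Leverage scores are the squared row norms of an orthonormal basis for range(A).
    Q, _ = np.linalg.qr(A, mode="reduced")   # n x d, orthonormal columns
    tau = np.sum(Q**2, axis=1)               # leverage scores, summing to d
    p = tau / tau.sum()                      # sampling probabilities p_i = tau_i / d

    # Step 1: sample k indices i.i.d. from p; only now do we query b at those indices.
    idx = rng.choice(n, size=k, replace=True, p=p)
    b_sampled = query_b(idx)

    # Step 3: sample-and-rescale, i.e. multiply row j by 1 / sqrt(k * p_{s_j}).
    scale = 1.0 / np.sqrt(k * p[idx])
    SA = A[idx] * scale[:, None]
    Sb = b_sampled * scale

    # Step 4: solve the small least-squares problem min_x ||S A x - S b||_2.
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_tilde


# Toy usage: in a real active setting, `query_b` would be the costly oracle.
rng = np.random.default_rng(0)
n, d = 100_000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_tilde = active_l2_regression(A, lambda idx: b[idx], k=2_000, seed=1)
```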

Crucially, this algorithm only reads the entries of $\mathbf{b}$ selected by the sampling matrix, and the sampling matrix is built without any knowledge of $\mathbf{b}$. Hence, the algorithm fits in the active regression model. We show the following:

Theorem 1: Active $L_2$ Regression

Let $\boldsymbol{A}\in\mathbb{R}^{n \times d}$ be a tall-and-skinny matrix, and let $\mathbf{b}\in\mathbb{R}^{n}$. Then let $\tilde{\mathbf{x}}$ be the result of Algorithm 1 run with $k = \Omega(d \log(\frac d\delta) + \frac{d}{\delta\varepsilon})$. With probability $1-\delta$, we then have

$$ \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2 \leq (1+\varepsilon) \min_{\mathbf{x}\in\mathbb{R}^{d}}\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2 $$

Proof. Let $\mathbf{x}^* \;{\vcentcolon=}\; \argmin_{\mathbf{x}} \|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2$ be the true solution to the full $L_2$ regression problem. Recall that by setting the derivative of $\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2^2$ to zero, we find that $\boldsymbol{A}^\intercal\boldsymbol{A}\mathbf{x}^*-\boldsymbol{A}^\intercal\mathbf{b}=0$. Equivalently, $\boldsymbol{A}^\intercal(\boldsymbol{A}\mathbf{x}^*-\mathbf{b})=0$, or in other words, $\boldsymbol{A}\mathbf{x}^*-\mathbf{b}$ is orthogonal to the range of $\boldsymbol{A}$. In particular, $\boldsymbol{A}\mathbf{x}^*-\mathbf{b}$ is orthogonal to $\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*$, and so by the Pythagorean theorem we get

$$ \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2^2 = \|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2 + \|\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*\|_2^2 $$
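
To spell out this step (a routine expansion, included only for completeness), write $\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b} = (\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*) + (\boldsymbol{A}\mathbf{x}^*-\mathbf{b})$ and expand the squared norm; the cross term vanishes by the orthogonality above:

$$\begin{aligned} \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2^2 &= \|\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*\|_2^2 + 2(\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*)^\intercal(\boldsymbol{A}\mathbf{x}^*-\mathbf{b}) + \|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2 \\ &= \|\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*\|_2^2 + 2(\tilde{\mathbf{x}}-\mathbf{x}^*)^\intercal\underbrace{\boldsymbol{A}^\intercal(\boldsymbol{A}\mathbf{x}^*-\mathbf{b})}_{=\,0} + \|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2. \end{aligned}$$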

Let $\boldsymbol{U}\in\mathbb{R}^{n \times d}$ be a matrix with orthonormal columns that spans the range of $\boldsymbol{A}$ (e.g. the $\boldsymbol{U}$ matrix from the SVD of $\boldsymbol{A}$). Then, for every $\mathbf{x}\in\mathbb{R}^d$ there exists some $\mathbf{y}\in\mathbb{R}^d$ such that $\boldsymbol{A}\mathbf{x}=\boldsymbol{U}\mathbf{y}$. Let $\tilde{\mathbf{y}}$ and $\mathbf{y}^*$ be the vectors such that $\boldsymbol{A}\tilde{\mathbf{x}}=\boldsymbol{U}\tilde{\mathbf{y}}$ and $\boldsymbol{A}\mathbf{x}^*=\boldsymbol{U}\mathbf{y}^*$. Then, the above equality can be written as

$$ \|\boldsymbol{U}\tilde{\mathbf{y}}-\mathbf{b}\|_2^2 = \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + \|\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{U}\mathbf{y}^*\|_2^2 $$

Since $\boldsymbol{U}$ has orthonormal columns, we have $\|\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{U}\mathbf{y}^*\|_2^2 = \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2^2$. Our goal is to bound the right hand side by $(1+\varepsilon)\|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2$, and so it suffices to show that $\|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2^2 \leq \varepsilon \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2$. Recall from Lemma 1 of the Subspace Embedding Page that since $k = \Omega(d \log(\frac d\delta))$, with probability $1-\frac\delta2$, we have $\|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}-\boldsymbol{I}\|_2 \leq \frac12$. Then, by the triangle inequality,

$$\begin{aligned} \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2 &\leq \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 + \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U} - \boldsymbol{I})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 \\ &\leq \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 + \|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U} - \boldsymbol{I}\|_2\|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2 \\ &\leq \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 + \tfrac{1}{2} \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2 \\ \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2 &\leq 2\|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 \end{aligned}$$
Next, since $\tilde{\mathbf{y}} = \argmin_{\mathbf{y}}\|\boldsymbol{S}\boldsymbol{U}\mathbf{y}-\boldsymbol{S}\mathbf{b}\|_2$, we know that $\boldsymbol{S}\boldsymbol{U}\tilde{\mathbf{y}} - \boldsymbol{S}\mathbf{b}$ is orthogonal to everything in the range of $\boldsymbol{S}\boldsymbol{U}$. That is, $(\boldsymbol{S}\boldsymbol{U})^\intercal(\boldsymbol{S}\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{S}\mathbf{b})=0$, so we get that
$$\begin{aligned} \|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2 &= \|(\boldsymbol{S}\boldsymbol{U})^\intercal(\boldsymbol{S}\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{S}\boldsymbol{U}\mathbf{y}^*)\|_2 \\ &= \|(\boldsymbol{S}\boldsymbol{U})^\intercal(\boldsymbol{S}\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{S}\mathbf{b}+\boldsymbol{S}\mathbf{b}-\boldsymbol{S}\boldsymbol{U}\mathbf{y}^*)\|_2 \\ &= \|0 + (\boldsymbol{S}\boldsymbol{U})^\intercal(\boldsymbol{S}\mathbf{b}-\boldsymbol{S}\boldsymbol{U}\mathbf{y}^*)\|_2 \\ &= \|(\boldsymbol{S}\boldsymbol{U})^\intercal \boldsymbol{S}(\boldsymbol{U}\mathbf{y}^*-\mathbf{b})\|_2 \end{aligned}$$
This final norm matches the shape of the Fast Matrix-Matrix Multiplication algorithm we studied. In particular, this is approximate multiplication of $\boldsymbol{U}^\intercal$ and the vector $\boldsymbol{U}\mathbf{y}^*-\mathbf{b}$, whose exact product $\boldsymbol{U}^\intercal(\boldsymbol{U}\mathbf{y}^*-\mathbf{b}) = \boldsymbol{U}^\intercal(\boldsymbol{A}\mathbf{x}^*-\mathbf{b})$ is zero, so the norm above is exactly the approximation error of the sampled product. Moreover, the sampling probabilities are proportional to the leverage scores of $\boldsymbol{A}$, which by Lemma 3 of the Basic Properties of Leverage Scores Page are equal to the squared row norms of $\boldsymbol{U}$. So, by Theorem 1 of the Fast Matrix Multiplication page with $k = \Omega(\frac{d}{\delta\varepsilon})$ and with probability $1-\frac\delta2$, we have

$$ \|(\boldsymbol{S}\boldsymbol{U})^\intercal \boldsymbol{S}(\boldsymbol{U}\mathbf{y}^*-\mathbf{b})\|_2 \leq \frac{\sqrt{\varepsilon}}{2\sqrt{d}} \|\boldsymbol{U}\|_F \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2 $$

$\boldsymbol{U}$ has orthonormal columns, so each of its $d$ columns has norm 1, and hence $\|\boldsymbol{U}\|_F^2 = d$. Therefore, we get

$$ \|(\boldsymbol{S}\boldsymbol{U})^\intercal \boldsymbol{S}(\boldsymbol{U}\mathbf{y}^*-\mathbf{b})\|_2 \leq \frac{\sqrt{\varepsilon}}{2} \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2 $$

Backing up, we have overall proven that

$$\begin{aligned} \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2^2 &= \|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2 + \|\boldsymbol{A}\tilde{\mathbf{x}}-\boldsymbol{A}\mathbf{x}^*\|_2^2 \\ &= \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + \|\boldsymbol{U}\tilde{\mathbf{y}}-\boldsymbol{U}\mathbf{y}^*\|_2^2 \\ &= \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + \|\tilde{\mathbf{y}}-\mathbf{y}^*\|_2^2 \\ &\leq \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + 4\|(\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U})(\tilde{\mathbf{y}}-\mathbf{y}^*)\|_2^2 \\ &\leq \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 + \varepsilon\|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 \\ &= (1+\varepsilon)\|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2^2 \\ &= (1+\varepsilon)\|\boldsymbol{A}\mathbf{x}^*-\mathbf{b}\|_2^2 \end{aligned}$$
By a union bound, the two events above hold simultaneously with probability at least $1-\delta$. Taking square roots and using $\sqrt{1+\varepsilon} \leq 1+\varepsilon$ then gives the claimed bound, which completes the proof.
$\blacksquare$


Notably, this proof uses the sampling probabilities in only two places: showing the subspace embedding guarantee $\|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}-\boldsymbol{I}\|_2\leq\frac12$ and showing the fast matrix multiplication bound $\|(\boldsymbol{S}\boldsymbol{U})^\intercal \boldsymbol{S}(\boldsymbol{U}\mathbf{y}^*-\mathbf{b})\|_2 \leq \frac{\sqrt{\varepsilon}}{2\sqrt{d}} \|\boldsymbol{U}\|_F \|\boldsymbol{U}\mathbf{y}^*-\mathbf{b}\|_2$. Both of these results also hold under inexact leverage score sampling probabilities, namely via oversampling, as shown on their respective pages (Theorem 3 here and Corollary 1 here). We can therefore state a more general algorithm and guarantee for leverage score sampling with the exact same proof analysis:
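
As a quick sanity check on these two quantities (not part of the original note; sizes and constants are arbitrary), the snippet below empirically estimates $\|\boldsymbol{U}^\intercal\boldsymbol{S}^\intercal\boldsymbol{S}\boldsymbol{U}-\boldsymbol{I}\|_2$ and $\|(\boldsymbol{S}\boldsymbol{U})^\intercal\boldsymbol{S}\mathbf{r}\|_2$ for a residual $\mathbf{r}$ orthogonal to the range of $\boldsymbol{A}$, under leverage score sampling.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50_000, 10, 4_000

# Random tall matrix and an orthonormal basis U for its range.
A = rng.standard_normal((n, d))
U, _ = np.linalg.qr(A, mode="reduced")

# A residual direction r orthogonal to range(A), playing the role of Uy* - b.
r = rng.standard_normal(n)
r -= U @ (U.T @ r)

# Leverage score sampling probabilities p_i = ||u_i||^2 / d, then sample and rescale.
p = np.sum(U**2, axis=1) / d
idx = rng.choice(n, size=k, replace=True, p=p)
scale = 1.0 / np.sqrt(k * p[idx])
SU = U[idx] * scale[:, None]
Sr = r[idx] * scale

# Quantity 1: subspace embedding error ||U^T S^T S U - I||_2 (should be well below 1/2).
print(np.linalg.norm(SU.T @ SU - np.eye(d), ord=2))

# Quantity 2: matrix multiplication error ||(SU)^T S r||_2,
# to be compared against ||U||_F ||r||_2 / sqrt(d) = ||r||_2.
print(np.linalg.norm(SU.T @ Sr), np.linalg.norm(r))
```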

Algorithm 2: Leverage Scores for Active Regression


input: Matrix $\boldsymbol{A}\in\mathbb{R}^{n \times d}$. Query access to $\mathbf{b}\in\mathbb{R}^{n}$. Number $k$ of subsamples. Sampling probabilities $p_1,\ldots,p_n$.

output: Approximate solution $\tilde{\mathbf{x}}\in\mathbb{R}^{d}$

  1. Sample indices $s_1,\ldots,s_k\in[n]$ i.i.d. with respect to $p_1,\ldots,p_n$

  2. Build the sample-and-rescale matrix $\boldsymbol{S}\in\mathbb{R}^{k \times n}$:

    Row $i$ of $\boldsymbol{S}$ has the form $\begin{bmatrix}0&0&\cdots&0&\frac{1}{\sqrt{k p_{s_i}}}&0&\cdots&0\end{bmatrix}$, where index $s_i$ is the nonzero entry.

  3. Return $\tilde{\mathbf{x}} = \argmin_{\mathbf{x}}\|\boldsymbol{S}\boldsymbol{A}\mathbf{x}-\boldsymbol{S}\mathbf{b}\|_2$
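
A minimal way to adapt the earlier sketch to Algorithm 2 (again illustrative, not the note's code) is to accept arbitrary sampling probabilities; the function name `active_l2_regression_general` and the capped factor-2 overestimate below are hypothetical choices matching the setup of Theorem 2.

```python
import numpy as np

def active_l2_regression_general(A, query_b, k, p, seed=None):
    """Algorithm 2 sketch: sample rows i.i.d. from user-supplied probabilities p."""
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    idx = rng.choice(n, size=k, replace=True, p=p)
    scale = 1.0 / np.sqrt(k * p[idx])        # rescale row j by 1 / sqrt(k * p_{s_j})
    SA = A[idx] * scale[:, None]
    Sb = query_b(idx) * scale
    x_tilde, *_ = np.linalg.lstsq(SA, Sb, rcond=None)
    return x_tilde

# Example: leverage score overestimates tau_tilde >= tau, normalized by
# d_tilde = sum(tau_tilde), giving p_i = tau_tilde_i / d_tilde as in Theorem 2.
rng = np.random.default_rng(0)
n, d = 20_000, 15
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

Q, _ = np.linalg.qr(A, mode="reduced")
tau = np.sum(Q**2, axis=1)
tau_tilde = np.minimum(2.0 * tau, 1.0)       # a valid overestimate (tau_i <= 1 always)
p = tau_tilde / tau_tilde.sum()              # d_tilde = tau_tilde.sum()

x_tilde = active_l2_regression_general(A, lambda idx: b[idx], k=3_000, p=p, seed=1)
```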

Theorem 2: Active $L_2$ Regression (Oversampling)

Fix $\varepsilon>0,\delta>0$. Let $\tau_1,\ldots,\tau_n$ be the leverage scores of $\boldsymbol{A}\in\mathbb{R}^{n \times d}$, and let $\tilde\tau_1,\ldots,\tilde\tau_n$ be overestimates of the leverage scores, so that $\tilde\tau_i \geq \tau_i$. Let $\tilde d \;{\vcentcolon=}\; \sum_{i=1}^n \tilde\tau_i$. Then let $\tilde{\mathbf{x}}$ be the output of Algorithm 2 with $k=\Omega(\tilde d \log(\frac d \delta) + \frac{\tilde d}{\varepsilon \delta})$ and probabilities $p_i = \frac{\tilde\tau_i}{\tilde d}$. With probability $1-\delta$, we then have

$$ \|\boldsymbol{A}\tilde{\mathbf{x}}-\mathbf{b}\|_2^2 \leq (1+\varepsilon) \min_{\mathbf{x}\in\mathbb{R}^{d}}\|\boldsymbol{A}\mathbf{x}-\mathbf{b}\|_2^2 $$

Proof. To get the subspace embedding, we directly appeal to Theorem 3 here. We trace through the fast matrix multiplication step more carefully. We appeal to Corollary 1 here with error tolerance $\varepsilon_0 = \sqrt{\frac\varepsilon d}$, and with $T = \tilde d$, since the $\tilde\tau_i$ overestimate the squared row norms of $\boldsymbol{U}$. So, we need to sample

$$ k \geq \frac{1}{\varepsilon_0^2 \delta} \cdot \frac{T}{\|\boldsymbol{U}\|_F^2} = \frac{d}{\varepsilon \delta} \cdot \frac{\tilde d}{d} = \frac{\tilde d}{\varepsilon \delta} $$

rows to maintain correctness.

$\blacksquare$
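
For a rough sense of scale (illustrative numbers only, with the $\Omega$ constant suppressed and $\log$ taken as the natural log): suppose $d = 50$, $\tilde d = 2d = 100$ (a factor-2 overestimate), $\varepsilon = 0.1$, and $\delta = 0.05$. Then

$$ \tilde d \log\Big(\frac d\delta\Big) = 100 \log(1000) \approx 690, \qquad \frac{\tilde d}{\varepsilon\delta} = \frac{100}{0.1 \cdot 0.05} = 20{,}000, $$

so on the order of tens of thousands of entries of $\mathbf{b}$ suffice, independently of $n$.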

See Also

The proofs above are a near-exact copy of an unpublished note by Christopher Musco, which closely tracks an analysis of Woodruff (2014) that studies regression under oblivious sketching instead of leverage score sampling.

There are resources with similar analyses in the literature:

  • (Sarlos (2017)) is the original paper and analysis.

  • (Chen Price (2019)) explicitly studies active regression under Leverage Functions, which extends the setting from matrices to Linear Operators more generally.

  • (Musco et al. (2019)) studies active regression beyond the $\ell_2$ norm, such as in the $\ell_p$ norm.

  • (Meyer Musco (2020)) The appendix studies active $\ell_2$ regression for both matrices and operators, but where we choose one of many design matrices $\boldsymbol{A}_1,\ldots,\boldsymbol{A}_q$, and can only achieve constant factor error (i.e. $\varepsilon = \Omega(1)$).

  • (Kapralov et al. (2023)) Contains an analysis of active $\ell_2$ regression, but where we choose one of many design matrices $\boldsymbol{A}_1,\ldots,\boldsymbol{A}_q$, and can achieve $(1+\varepsilon)$ factor error, but with an $\tilde O(\tilde d^2)$ dependence.

  • Let me know if anything is missing

Bibliography