On this page, we show that three different definitions of Leverage Scores are equivalent. As a small supplement, we also use two of these forms to bound the maximum value of a leverage score and to compute the sum of all leverage scores.
Definition 1 (Max Characterization). Let $\boldsymbol{A}\in\mathbb{R}^{n \times d}$. Then the Maximum Characterization of the Leverage Score for row $i$ of $\boldsymbol{A}$ is

$$\tau_i(\boldsymbol{A}) \;{\vcentcolon=}\; \max_{\mathbf{x}\in\mathbb{R}^d} \frac{[\boldsymbol{A}\mathbf{x}]_i^2}{\|\boldsymbol{A}\mathbf{x}\|_2^2}$$
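For intuition, here is a small worked example (not from the original note): a $3 \times 2$ matrix whose first two rows are standard basis vectors and whose third row is zero.

$$\boldsymbol{A} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}, \qquad \tau_1(\boldsymbol{A}) = \max_{\mathbf{x}\in\mathbb{R}^2} \frac{x_1^2}{x_1^2 + x_2^2} = 1, \qquad \tau_3(\boldsymbol{A}) = \max_{\mathbf{x}\in\mathbb{R}^2} \frac{0}{x_1^2 + x_2^2} = 0.$$

A row that is essential to the range of $\boldsymbol{A}$ gets leverage 1, while a row that contributes nothing gets leverage 0.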
Lemma 1 (Inner Product Characterization). Let $\mathbf{a}_i$ be the $i^{th}$ row of $\boldsymbol{A}$. Then,

$$\tau_i(\boldsymbol{A}) = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i$$
Proof. This is proven in two steps. First, we show that the inner product characterization upper bounds the max characterization. Then, we show a matching lower bound. For simplicity, we prove this for full-rank $\boldsymbol{A}$.
Before getting started, we note a useful equation:

$$\begin{aligned} \|\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i\|_2^2 &= \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\boldsymbol{A}^\intercal\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i \\ &= \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i \end{aligned}$$
Upper Bound: To create the upper bound, we relate

$$\begin{aligned} [\boldsymbol{A}\mathbf{x}]_i^2 &= (\mathbf{a}_i^\intercal\mathbf{x})^2 \\ &= (\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}(\boldsymbol{A}^\intercal\boldsymbol{A})\mathbf{x})^2 \\ &= ((\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i)^\intercal (\boldsymbol{A}\mathbf{x}))^2 \\ &\leq \|\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i\|_2^2 \cdot \|\boldsymbol{A}\mathbf{x}\|_2^2 \\ &= (\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i) \cdot \|\boldsymbol{A}\mathbf{x}\|_2^2 \end{aligned}$$

where the inequality used is the Cauchy-Schwarz Inequality: $(\mathbf{v}^\intercal\mathbf{y})^2 \leq \|\mathbf{v}\|_2^2 \cdot \|\mathbf{y}\|_2^2$. We can then give an upper bound to the max characterization:
$$\tau_i(\boldsymbol{A}) = \max_{\mathbf{x}\in\mathbb{R}^d} \frac{[\boldsymbol{A}\mathbf{x}]_i^2}{\|\boldsymbol{A}\mathbf{x}\|_2^2} \leq \max_{\mathbf{x}\in\mathbb{R}^d} \frac{(\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i) \cdot \|\boldsymbol{A}\mathbf{x}\|_2^2}{\|\boldsymbol{A}\mathbf{x}\|_2^2} = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i$$

Lower Bound: For the lower bound, we just plug $\mathbf{x}=(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i$ into the max characterization:

$$\tau_i(\boldsymbol{A}) \geq \frac{[\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i]_i^2}{\|\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i\|_2^2} = \frac{(\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i)^2}{\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i} = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i$$

which completes the proof.
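As a quick numerical sanity check of Lemma 1, here is a minimal sketch in NumPy (the matrix, seed, and row index are arbitrary illustrative choices, not anything from the text above). It compares the inner product formula against the max characterization, evaluated over random directions plus the explicit maximizer $\mathbf{x} = (\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i$ from the lower-bound step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, i = 8, 3, 2                      # arbitrary small example
A = rng.standard_normal((n, d))
a_i = A[i]

# Inner product characterization: tau_i = a_i^T (A^T A)^{-1} a_i
tau_inner = a_i @ np.linalg.solve(A.T @ A, a_i)

# Max characterization: the ratio [Ax]_i^2 / ||Ax||_2^2, checked over random
# directions plus the explicit maximizer x = (A^T A)^{-1} a_i from the proof.
def ratio(x):
    Ax = A @ x
    return Ax[i] ** 2 / np.linalg.norm(Ax) ** 2

candidates = [rng.standard_normal(d) for _ in range(10_000)]
candidates.append(np.linalg.solve(A.T @ A, a_i))
tau_max = max(ratio(x) for x in candidates)

print(tau_inner, tau_max)              # the two values agree up to floating point
```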
Lemma 2 (Min-Norm Characterization). Let $\mathbf{a}_i$ be the $i^{th}$ row of $\boldsymbol{A}$. Then

$$\tau_i(\boldsymbol{A}) = \min_{\mathbf{y}\in\mathbb{R}^n,\,\boldsymbol{A}^\intercal\mathbf{y}=\mathbf{a}_i} \|\mathbf{y}\|_2^2$$
Proof. For simplicity, we assume that $\boldsymbol{A}$ is both full-rank and tall-and-skinny.
Notice that this minimization problem is a minimum-norm underdetermined least-squares problem, with known solution $\mathbf{y}=\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i$. So, we know that

$$\min_{\mathbf{y}\in\mathbb{R}^n,\,\boldsymbol{A}^\intercal\mathbf{y}=\mathbf{a}_i} \|\mathbf{y}\|_2^2 = \|\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i\|_2^2 = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i = \tau_i(\boldsymbol{A})$$

where the second equality is shown at the start of the proof of Lemma 1, and the last equality is Lemma 1.
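Lemma 2 is also easy to check numerically, since the minimum-norm solution of an underdetermined linear system is exactly what a standard least-squares routine returns. A minimal sketch in NumPy (again with an arbitrary matrix, seed, and row index):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, i = 8, 3, 0                      # arbitrary small example
A = rng.standard_normal((n, d))
a_i = A[i]

# Minimum-norm solution of the underdetermined system A^T y = a_i;
# np.linalg.lstsq returns the minimum-norm solution in this setting.
y, *_ = np.linalg.lstsq(A.T, a_i, rcond=None)

tau_min_norm = np.linalg.norm(y) ** 2
tau_inner = a_i @ np.linalg.solve(A.T @ A, a_i)
print(tau_min_norm, tau_inner)         # the two values agree up to floating point
```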
Lemma 3 (Properties of Leverage Scores). Let $\boldsymbol{A}\in\mathbb{R}^{n \times d}$ be full-rank with $n \geq d$. Then,

1. Each leverage score has $\tau_i(\boldsymbol{A})\in[0,1]$.
2. If $\boldsymbol{B}\in\mathbb{R}^{d \times d}$ is full-rank, then $\tau_i(\boldsymbol{A}\boldsymbol{B}) = \tau_i(\boldsymbol{A})$.
3. If $\boldsymbol{A}^\intercal\boldsymbol{A}=\boldsymbol{I}$, then $\tau_i(\boldsymbol{A})=\|\mathbf{a}_i\|_2^2$.
4. If $\boldsymbol{U}\in\mathbb{R}^{n \times d}$ with $\boldsymbol{U}^\intercal\boldsymbol{U}=\boldsymbol{I}$ has the same columnspace as $\boldsymbol{A}$, then $\tau_i(\boldsymbol{A})=\|\mathbf{u}_i\|_2^2$, where $\mathbf{u}_i$ is the $i^{th}$ row of $\boldsymbol{U}$.
5. The sum of leverage scores is $\sum_{i=1}^n \tau_i(\boldsymbol{A})=d$.
Proof.
Point 1 follows directly from the Max Characterization (Definition 1): since $[\boldsymbol{A}\mathbf{x}]_i^2 \leq \|\boldsymbol{A}\mathbf{x}\|_2^2$ for every $\mathbf{x}$, the ratio being maximized always lies in $[0,1]$.
Point 2 also follows from the Max Characterization. In particular, for any $\mathbf{x}\in\mathbb{R}^{d}$ we can define $\mathbf{y}\;{\vcentcolon=}\;\boldsymbol{B}\mathbf{x}$. Since $\boldsymbol{B}$ is invertible, maximizing over all $\mathbf{x}\in\mathbb{R}^d$ is equivalent to maximizing over all $\mathbf{y}\in\mathbb{R}^d$:
$$\tau_i(\boldsymbol{A}\boldsymbol{B}) = \max_{\mathbf{x}\in\mathbb{R}^d} \frac{[\boldsymbol{A}\boldsymbol{B}\mathbf{x}]_i^2}{\|\boldsymbol{A}\boldsymbol{B}\mathbf{x}\|_2^2} = \max_{\mathbf{y}\in\mathbb{R}^d} \frac{[\boldsymbol{A}\mathbf{y}]_i^2}{\|\boldsymbol{A}\mathbf{y}\|_2^2} = \tau_i(\boldsymbol{A})$$

Point 3 follows from the Inner Product Characterization:
$$\tau_i(\boldsymbol{A}) = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i = \mathbf{a}_i^\intercal\mathbf{a}_i = \|\mathbf{a}_i\|_2^2$$

Point 4 follows from Points 2 and 3: since $\boldsymbol{U}$ and $\boldsymbol{A}$ are full-rank with the same columnspace, we can write $\boldsymbol{A} = \boldsymbol{U}\boldsymbol{B}$ for some invertible $\boldsymbol{B}\in\mathbb{R}^{d \times d}$, so $\tau_i(\boldsymbol{A}) = \tau_i(\boldsymbol{U}) = \|\mathbf{u}_i\|_2^2$.
Point 5 follows from Point 4: since every $\boldsymbol{A}$ has a thin SVD $\boldsymbol{A}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\intercal$, we can always find a matrix $\boldsymbol{U}$ with orthonormal columns that satisfies Point 4. Let $\mathbf{u}_i$ denote the $i^{th}$ row of $\boldsymbol{U}$ and let $\hat{\mathbf{u}}_j$ denote the $j^{th}$ column of $\boldsymbol{U}$. Then,
$$\sum_{i=1}^n \tau_i(\boldsymbol{A}) = \sum_{i=1}^n \|\mathbf{u}_i\|_2^2 = \sum_{i=1}^n \sum_{j=1}^d \boldsymbol{U}_{ij}^2 = \sum_{j=1}^d \|\hat{\mathbf{u}}_j\|_2^2 = d$$

There are some intuitive implications from these bullet points:
The leverage scores of $\boldsymbol{A}$ depend only on the range of $\boldsymbol{A}$. Any other matrix with the same column space has the same leverage scores.
If $\boldsymbol{A}=\boldsymbol{Q}\boldsymbol{R}$ is the economic QR decomposition of $\boldsymbol{A}$, and if $\mathbf{q}_1,\ldots,\mathbf{q}_n$ are the rows of $\boldsymbol{Q}$, then $\tau_i(\boldsymbol{A})=\|\mathbf{q}_i\|_2^2$ (see the short sketch below).
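Here is a minimal sketch of that recipe in NumPy (the matrix and seed are arbitrary illustrative choices): compute a reduced QR decomposition, read off the leverage scores as squared row norms of $\boldsymbol{Q}$, and confirm the bounds and the sum from Lemma 3.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 3                            # arbitrary small example
A = rng.standard_normal((n, d))

# Economic (reduced) QR: Q is n x d with orthonormal columns and the same
# column space as A, so the leverage scores are the squared row norms of Q.
Q, R = np.linalg.qr(A, mode='reduced')
tau = np.sum(Q ** 2, axis=1)

print(tau)            # every entry lies in [0, 1]
print(tau.sum())      # sums to d = 3
```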
The proofs above are most closely related to the highly generalized analysis in Theorem 5 of Avron et al. (2019).
There are many extensions of leverage scores in the literature:
(Cohen et al. (2017)) Ridge Leverage Scores, which relate to Ridge Regression.
(Alaoui and Mahoney (2015)) Kernel Ridge Leverage Scores, which relate to Kernel Ridge Regression.
(Avron et al. (2017)) Kernel Ridge Leverage Function, which relates to Random Fourier Features for Kernel Ridge Regression.
(Chen and Price (2019)) Leverage Function, which relates to Linear Operators instead of matrices. Sometimes called a Sensitivity Score.
(Avron et al. (2019)) Ridge Leverage Function, which relates to Linear Operators instead of matrices for Ridge Regression.
(Cohen and Peng (2015)) Lewis Weights, which relate to $L_p$ norms instead of just $L_2$ norms.
Let me know if anything is missing.
Alaoui and Mahoney. Fast Randomized Kernel Methods with Statistical Guarantees. NIPS 2015.
Avron, Kapralov, Musco, Musco, Velingker, and Zandieh. Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees. ICML 2017.
Avron, Kapralov, Musco, Musco, Velingker, and Zandieh. A Universal Sampling Method for Reconstructing Signals with Simple Fourier Transforms. STOC 2019.
Chen and Price. Active Regression via Linear-Sample Sparsification. COLT 2019.
Cohen, Kapralov, Musco, and Musco. Input Sparsity Time Low-Rank Approximation via Ridge Leverage Score Sampling. SODA 2017.
Cohen and Peng. $\ell_p$ Row Sampling by Lewis Weights. STOC 2015.