Basic Properties of Leverage Scores

On this page, we will show that three different definitions of Leverage Scores are equivalent. As a small supplement, we also use two of these forms to bound the maximum value of a leverage score and to compute the sum of all leverage scores.

Definition 1: Max Characterization
Definition

Let $\boldsymbol{A}\in\mathbb{R}^{n \times d}$. Then the Maximum Characterization of the Leverage Score for row $i$ of $\boldsymbol{A}$ is

$$ \tau_i(\boldsymbol{A}) \;{\vcentcolon=}\; \max_{\mathbf{x}\in\mathbb{R}^d} \frac{[\boldsymbol{A}\mathbf{x}]_i^2}{\|\boldsymbol{A}\mathbf{x}\|_2^2} $$
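
To make the definition concrete, the following is a minimal numpy sketch (my own addition; the random Gaussian test matrix and the row index are hypothetical choices). It evaluates the ratio above at many random directions $\mathbf{x}$; every sample lower bounds $\tau_i(\boldsymbol{A})$, and the running maximum creeps up toward the true leverage score. The lemmas below give exact formulas instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, i = 50, 3, 7                      # small hypothetical test problem
A = rng.standard_normal((n, d))

# Evaluate [Ax]_i^2 / ||Ax||_2^2 at many random directions x.
# Each sample is a lower bound on tau_i(A); the running max slowly
# approaches the true leverage score from below.
best = 0.0
for _ in range(100_000):
    x = rng.standard_normal(d)
    Ax = A @ x
    best = max(best, Ax[i] ** 2 / np.dot(Ax, Ax))

print(f"sampled estimate of tau_{i}(A): {best:.4f}")
```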


Lemma 1: Inner Product Characterization
Lemma

Let $\mathbf{a}_i$ be the $i^{th}$ row of $\boldsymbol{A}$. Then,

$$ \tau_i(\boldsymbol{A}) = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i $$

Proof. This is proven in two steps. First, we show that the inner product characterization upper bounds the max characterization. Then, we show a matching lower bound. For simplicity, we prove this for full-rank $\boldsymbol{A}$, so that $\boldsymbol{A}^\intercal\boldsymbol{A}$ is invertible.

Before getting started, we note a useful equation:
$$\begin{aligned} \|\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i\|_2^2 &= \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\boldsymbol{A}^\intercal\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i \\ &= \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i \end{aligned}$$

Upper Bound: To create the upper bound, we relate
$$\begin{aligned} [\boldsymbol{A}\mathbf{x}]_i^2 &= (\mathbf{a}_i^\intercal\mathbf{x})^2 \\ &= (\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}(\boldsymbol{A}^\intercal\boldsymbol{A})\mathbf{x})^2 \\ &= ((\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i)^\intercal (\boldsymbol{A}\mathbf{x}))^2 \\ &\leq \|\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i\|_2^2 \cdot \|\boldsymbol{A}\mathbf{x}\|_2^2 \\ &= (\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i) \cdot \|\boldsymbol{A}\mathbf{x}\|_2^2, \end{aligned}$$
where the inequality is the Cauchy-Schwarz Inequality: $(\mathbf{v}^\intercal\mathbf{y})^2\leq \|\mathbf{v}\|_2^2 \cdot \|\mathbf{y}\|_2^2$. We can then give an upper bound on the max characterization:

$$ \tau_i(\boldsymbol{A}) = \max_{\mathbf{x}\in\mathbb{R}^d} \frac{[\boldsymbol{A}\mathbf{x}]_i^2}{\|\boldsymbol{A}\mathbf{x}\|_2^2} \leq \max_{\mathbf{x}\in\mathbb{R}^d} \frac{(\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i) \cdot \|\boldsymbol{A}\mathbf{x}\|_2^2}{\|\boldsymbol{A}\mathbf{x}\|_2^2} = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i $$

Lower Bound: For the lower bound, we just plug $\mathbf{x}=(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i$ into the max characterization:
$$ \tau_i(\boldsymbol{A}) \geq \frac{[\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i]_i^2}{\|\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i\|_2^2} = \frac{(\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i)^2}{\mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i} = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i, $$
which completes the proof.

$\blacksquare$
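
As a numerical sanity check (my own sketch, assuming numpy and a random Gaussian test matrix), the closed form from Lemma 1 can be computed for all rows at once as the diagonal of $\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\boldsymbol{A}^\intercal$, and compared against the max characterization evaluated at the maximizer $\mathbf{x}=(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i$ used in the lower bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.standard_normal((n, d))

gram = A.T @ A
# Leverage scores via the inner product characterization (Lemma 1),
# computed for all rows as the diagonal of A (A^T A)^{-1} A^T.
tau = np.einsum("ij,ij->i", A @ np.linalg.inv(gram), A)

# Check that x = (A^T A)^{-1} a_i attains the max characterization.
for i in range(n):
    x = np.linalg.solve(gram, A[i])
    Ax = A @ x
    assert np.isclose(Ax[i] ** 2 / np.dot(Ax, Ax), tau[i])

print(tau[:5])
```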


Lemma 2: Minimum Characterization
Lemma

Let $\mathbf{a}_i$ be the $i^{th}$ row of $\boldsymbol{A}$. Then

$$ \tau_i(\boldsymbol{A}) = \min_{\mathbf{y}\in\mathbb{R}^n,\ \boldsymbol{A}^\intercal\mathbf{y}=\mathbf{a}_i} \|\mathbf{y}\|_2^2 $$

Proof. For simplicity, we assume that $\boldsymbol{A}$ is both full-rank and tall-and-skinny (i.e., $n \geq d$).

Notice this minimization problem is a minimum-norm underdetermined least-squares problem, with known solution $\mathbf{y}=\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i$. So, we know that

$$ \min_{\mathbf{y}\in\mathbb{R}^n,\ \boldsymbol{A}^\intercal\mathbf{y}=\mathbf{a}_i} \|\mathbf{y}\|_2^2 = \|\boldsymbol{A}(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i\|_2^2 = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i = \tau_i(\boldsymbol{A}) $$

where the second equality is the identity shown at the start of the proof of Lemma 1, and the last equality is Lemma 1.

$\blacksquare$
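
This characterization is also easy to check numerically. The sketch below (my addition; it relies on numpy's np.linalg.lstsq, which returns the minimum-norm solution of an underdetermined system, and a random Gaussian test matrix) solves $\boldsymbol{A}^\intercal\mathbf{y}=\mathbf{a}_i$ and compares $\|\mathbf{y}\|_2^2$ to the inner product formula from Lemma 1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.standard_normal((n, d))

gram = A.T @ A
for i in range(n):
    # Minimum-norm solution of the underdetermined system A^T y = a_i.
    y, *_ = np.linalg.lstsq(A.T, A[i], rcond=None)
    tau_min = np.dot(y, y)                       # ||y||_2^2        (Lemma 2)
    tau_ip = A[i] @ np.linalg.solve(gram, A[i])  # a_i^T (A^T A)^{-1} a_i (Lemma 1)
    assert np.isclose(tau_min, tau_ip)

print("minimum-norm and inner-product characterizations agree")
```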


Lemma 3: Basic Properties of $\tau_i(\boldsymbol{A})$
Lemma

Let $\boldsymbol{A}\in\mathbb{R}^{n \times d}$ be full-rank with $n \geq d$. Then,

  1. Each leverage score has $\tau_i(\boldsymbol{A})\in[0,1]$

  2. If $\boldsymbol{B}\in\mathbb{R}^{d \times d}$ is full-rank, then $\tau_i(\boldsymbol{A}\boldsymbol{B}) = \tau_i(\boldsymbol{A})$

  3. If $\boldsymbol{A}^\intercal\boldsymbol{A}=\boldsymbol{I}$, then $\tau_i(\boldsymbol{A})=\|\mathbf{a}_i\|_2^2$

  4. If $\boldsymbol{U}\in\mathbb{R}^{n \times d}$ with $\boldsymbol{U}^\intercal\boldsymbol{U}=\boldsymbol{I}$ has the same column space as $\boldsymbol{A}$, then $\tau_i(\boldsymbol{A})=\|\mathbf{u}_i\|_2^2$, where $\mathbf{u}_i$ is the $i^{th}$ row of $\boldsymbol{U}$.

  5. The sum of the leverage scores is $\sum_{i=1}^n \tau_i(\boldsymbol{A})=d$

Proof.

Point 1 follows directly from the Max Characterization (Definition 1): the ratio is nonnegative, and since $[\boldsymbol{A}\mathbf{x}]_i^2 \leq \sum_{j=1}^n [\boldsymbol{A}\mathbf{x}]_j^2 = \|\boldsymbol{A}\mathbf{x}\|_2^2$, it is at most one.

Point 2 also follows from the Max Characterization. In particular, for any $\mathbf{x}\in\mathbb{R}^{d}$ we can define $\mathbf{y}\;{\vcentcolon=}\;\boldsymbol{B}\mathbf{x}$. Since $\boldsymbol{B}$ is invertible, maximizing over all $\mathbf{x}\in\mathbb{R}^d$ is equivalent to maximizing over all $\mathbf{y}\in\mathbb{R}^d$:

$$ \tau_i(\boldsymbol{A}\boldsymbol{B}) = \max_{\mathbf{x}\in\mathbb{R}^d} \frac{[\boldsymbol{A}\boldsymbol{B}\mathbf{x}]_i^2}{\|\boldsymbol{A}\boldsymbol{B}\mathbf{x}\|_2^2} = \max_{\mathbf{y}\in\mathbb{R}^d} \frac{[\boldsymbol{A}\mathbf{y}]_i^2}{\|\boldsymbol{A}\mathbf{y}\|_2^2} = \tau_i(\boldsymbol{A}) $$

Point 3 follows from the Inner Product Characterization:

$$ \tau_i(\boldsymbol{A}) = \mathbf{a}_i^\intercal(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}\mathbf{a}_i = \mathbf{a}_i^\intercal\mathbf{a}_i = \|\mathbf{a}_i\|_2^2 $$

Point 4 follows from Points 2 and 3: since $\boldsymbol{U}$ and $\boldsymbol{A}$ have the same column space, there is an invertible $\boldsymbol{B}\in\mathbb{R}^{d \times d}$ with $\boldsymbol{A}=\boldsymbol{U}\boldsymbol{B}$, so Point 2 gives $\tau_i(\boldsymbol{A})=\tau_i(\boldsymbol{U})$ and Point 3 gives $\tau_i(\boldsymbol{U})=\|\mathbf{u}_i\|_2^2$.

Point 5 follows from Point 4: since every $\boldsymbol{A}$ has a (thin) SVD $\boldsymbol{A}=\boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\intercal$, we can always find a matrix $\boldsymbol{U}$ with orthonormal columns that satisfies Point 4. Let $\mathbf{u}_i$ denote the $i^{th}$ row of $\boldsymbol{U}$ and let $\hat{\mathbf{u}}_j$ denote the $j^{th}$ column of $\boldsymbol{U}$. Then,

$$ \sum_{i=1}^n \tau_i(\boldsymbol{A}) = \sum_{i=1}^n \|\mathbf{u}_i\|_2^2 = \sum_{i=1}^n \sum_{j=1}^d \boldsymbol{U}_{ij}^2 = \sum_{j=1}^d \|\hat{\mathbf{u}}_j\|_2^2 = d $$

$\blacksquare$
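
The properties above can be verified numerically as well. A small sketch (mine, assuming numpy and a random Gaussian test matrix; the square matrix $\boldsymbol{B}$ below is random, hence invertible with probability one) checks Points 1, 2, and 5:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.standard_normal((n, d))

def leverage_scores(M):
    # Inner product characterization: diagonal of M (M^T M)^{-1} M^T.
    return np.einsum("ij,ij->i", M @ np.linalg.inv(M.T @ M), M)

tau = leverage_scores(A)
B = rng.standard_normal((d, d))   # generic square B; invertible w.p. 1

assert np.all((tau >= 0) & (tau <= 1))           # Point 1
assert np.allclose(leverage_scores(A @ B), tau)  # Point 2
assert np.isclose(tau.sum(), d)                  # Point 5
print("Points 1, 2, and 5 verified on a random example")
```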

There are some intuitive implications of these properties:

  • The leverage scores of $\boldsymbol{A}$ depend only on the range of $\boldsymbol{A}$. Any other matrix with the same column space has the same leverage scores.

  • If $\boldsymbol{A}=\boldsymbol{Q}\boldsymbol{R}$ is the economic QR decomposition of $\boldsymbol{A}$, and if $\mathbf{q}_1,\ldots,\mathbf{q}_n$ are the rows of $\boldsymbol{Q}$, then $\tau_i(\boldsymbol{A})=\|\mathbf{q}_i\|_2^2$ (see the sketch below).
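
That QR observation also gives a convenient way to compute leverage scores without explicitly forming $(\boldsymbol{A}^\intercal\boldsymbol{A})^{-1}$. A minimal numpy sketch (my addition; np.linalg.qr with mode="reduced" computes the economic factorization, and the test matrix is a random Gaussian):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
A = rng.standard_normal((n, d))

# Economic QR: Q is n x d with orthonormal columns and the same column
# space as A, so tau_i(A) is the squared norm of the i-th row of Q.
Q, R = np.linalg.qr(A, mode="reduced")
tau_qr = np.sum(Q ** 2, axis=1)

# Compare against the direct inner product formula from Lemma 1.
tau_direct = np.einsum("ij,ij->i", A @ np.linalg.inv(A.T @ A), A)
assert np.allclose(tau_qr, tau_direct)

print(tau_qr[:5])
```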

See Also

The proofs above are most closely related to the highly generalized analysis in Theorem 5 of Avron et al. (2019).

There are many extensions of leverage scores in the literature:

  • (Cohen et al. (2017)) Ridge Leverage Scores, which relate to Ridge Regression.

  • (Alaoui & Mahoney (2015)) Kernel Ridge Leverage Scores, which relate to Kernel Ridge Regression.

  • (Avron et al. (2017)) Kernel Ridge Leverage Function, which relates to Random Fourier Features for Kernel Ridge Regression.

  • (Chen & Price (2019)) Leverage Function, which relates to Linear Operators instead of matrices. Sometimes called a Sensitivity Score.

  • (Avron et al. (2019)) Ridge Leverage Function, which relates to Linear Operators instead of matrices for Ridge Regression.

  • (Cohen & Peng (2019)) Lewis Weights, which relate to $L_p$ norms instead of just $L_2$ norms.

  • Let me know if anything is missing

Bibliography