Introduction

Consider the problem of estimating the trace of a PSD matrix $\boldsymbol{A}$ from a small number of matrix-vector products. There are a variety of algorithms which achieve this goal using merely $\mathcal{O}(1/\varepsilon)$ matrix-vector products. Here, we will show a succinct proof of the complementary lower bound: that $\Omega(1/\varepsilon)$ matrix-vector products are necessary in the worst case.
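To make the query model concrete, below is a minimal numpy sketch (not taken from any of the papers discussed here) of the classic Girard–Hutchinson estimator, which accesses $\boldsymbol{A}$ only through products $\boldsymbol{A}\mathbf{x}$. This simple estimator needs $\mathcal{O}(1/\varepsilon^2)$ products for relative error $\varepsilon$; more sophisticated methods like Hutch++ achieve the $\mathcal{O}(1/\varepsilon)$ rate mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_trace(matvec, n, k):
    """Estimate tr(A) from k matrix-vector products with A."""
    total = 0.0
    for _ in range(k):
        x = rng.choice([-1.0, 1.0], size=n)  # random sign (Rademacher) query vector
        total += x @ matvec(x)               # E[x^T A x] = tr(A)
    return total / k

# Example: a random PSD matrix A = G^T G, accessed only via products A @ x.
n = 200
G = rng.standard_normal((n, n))
A = G.T @ G
print(hutchinson_trace(lambda x: A @ x, n, k=100), np.trace(A))
```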

Theorem 1: Trace Estimation Lower Bound

Any algorithm that accesses a PSD matrix $\boldsymbol{A}$ via $k$ (possibly adaptive) matrix-vector products and outputs an estimate $\tilde{t}$ of $\text{tr}(\boldsymbol{A})$ such that $\sqrt{\mathbb{E}[|\tilde{t} - \text{tr}(\boldsymbol{A})|^2]} \leq \varepsilon\text{tr}(\boldsymbol{A})$ must use at least $k \geq \frac{1}{2\sqrt{2}\varepsilon}$ matrix-vector products.

To prove this lower bound, we will need two ingredients: the hidden Wishart theorem, and the conditional expectation.

The Hidden Wishart Theorem

We will rely on the following remarkable result, whose proof we omit. This is the core theorem that enables this entire lower bound technique.

Theorem 2: Hidden Wishart Theorem

Let $\boldsymbol{G} \in \mathbb{R}^{n \times n}$ be a random matrix with iid $\mathcal{N}(0,1)$ entries, and let $\boldsymbol{A} = \boldsymbol{G}^\intercal\boldsymbol{G}$. Suppose an algorithm computes $k$ (possibly adaptive) matrix-vector products with $\boldsymbol{A}$, denoted $\mathbf{y}_1 = \boldsymbol{A} \mathbf{x}_1, \ldots, \mathbf{y}_k = \boldsymbol{A} \mathbf{x}_k$.

Then, there exists a matrix $\boldsymbol{\Delta} \in \mathbb{R}^{n \times n}$ and orthogonal matrix $\boldsymbol{V} \in \mathbb{R}^{n \times n}$, each constructed deterministically from the queries $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^k$, such that

$$ \boldsymbol{V}^\intercal\boldsymbol{A}\boldsymbol{V} = \boldsymbol{\Delta} + \begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} & \tilde{\boldsymbol{A}} \end{bmatrix}, $$

where $\tilde{\boldsymbol{A}}=\tilde{\boldsymbol{G}}^\intercal\tilde{\boldsymbol{G}} \in \mathbb{R}^{(n-k) \times (n-k)}$ and $\tilde{\boldsymbol{G}} \in \mathbb{R}^{(n-k) \times (n-k)}$ has iid $\mathcal{N}(0,1)$ entries independent of $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^k$.

A succinct proof of Theorem 2 can be found in Appendix B.1 of Amsel et al. (2026).

The matrices $\boldsymbol{A}$ and $\tilde{\boldsymbol{A}}$ follow the Wishart distribution, hence the name of the theorem. The result shows that after $k$ matrix-vector products, there remains a large random component $\tilde{\boldsymbol{A}}\in\mathbb{R}^{(n-k) \times (n-k)}$ of the matrix $\boldsymbol{A}\in\mathbb{R}^{n \times n}$ which is completely independent of the algorithm's matrix-vector queries. This is the titular "hidden Wishart".

Notice that this theorem has robbed the matrix-vector algorithm of any and all agency: no matter how the method chooses its (possibly adaptive) queries, there is always a large random component of the matrix about which it has no information. Since we can explicitly characterize this unseen part of $\boldsymbol{A}$, we can use simple statistical tools to prove Theorem 1.

A simple statistical observation

In Theorem 1, we care about minimizing the mean squared error of our algorithms given some data about $\boldsymbol{A}$ (namely, a sequence of matrix-vector products). Classical statistics tells us that the best possible algorithm for this goal is the conditional expectation:

Lemma 1: Conditional Expectation Minimizes MSE

Let $X$ and $Y$ be (possibly dependent) random variables. Suppose an algorithm observes $Y$ and outputs an estimate $\tilde{X}$ of $X$ based on $Y$. Then, the error of $\tilde{X}$ is lower bounded as

$$ \mathbb{E}\big[|\tilde{X} - X|^2\big] \geq \mathbb{E}\big[\text{Var}[X\,|\,Y]\big], $$

and this lower bound is achieved by the conditional expectation $\hat{X} := \mathbb{E}[X\,|\,Y]$.

Proof. By the tower rule,

$$ \mathbb{E}\big[|\tilde{X} - X|^2\big] = \mathbb{E}\left[\mathbb{E}\big[|\tilde{X} - X|^2 ~|~ Y\big]\right]. $$

For any fixed $Y=y$, the inner expectation $\mathbb{E}\big[|\tilde{X} - X|^2 ~|~ Y=y\big]$ is minimized by choosing $\tilde{X} = \mathbb{E}[X\,|\,Y=y]$. Depending on who you ask, this is either the definition of conditional expectation or a basic fact about it. Either way, we have

$$ \mathbb{E}\big[|\tilde{X} - X|^2 ~|~ Y\big] \geq \mathbb{E}\left[\big| X - \mathbb{E}[X\,|\,Y] \big|^2 ~|~ Y\right] = \text{Var}[X\,|\,Y]. $$

Taking the expectation over $Y$ finishes the proof.

$\blacksquare$
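To see Lemma 1 in action, here is a quick Monte Carlo sketch using a hypothetical toy model (not part of the argument above): $X = Z_1 + Z_2$ and $Y = Z_1$ with $Z_1, Z_2$ iid standard normal, so that $\mathbb{E}[X \,|\, Y] = Y$ and $\mathbb{E}[\text{Var}[X \,|\, Y]] = 1$. The conditional mean should attain mean squared error $1$, and any other function of $Y$ should do worse.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 1_000_000
Z1, Z2 = rng.standard_normal(m), rng.standard_normal(m)
X, Y = Z1 + Z2, Z1   # toy model: E[X | Y] = Y and Var[X | Y] = 1

mse_cond_mean = np.mean((Y - X) ** 2)        # the conditional-mean estimator E[X|Y] = Y
mse_other     = np.mean((0.5 * Y - X) ** 2)  # some other function of Y

print(f"MSE of conditional mean: {mse_cond_mean:.3f}  (lower bound E[Var[X|Y]] = 1)")
print(f"MSE of 0.5 * Y:          {mse_other:.3f}  (should be larger)")
```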

Proof of Trace Estimation Lower Bound

From this result, we can now prove the theorem. It will be helpful to keep in mind that if $Z \sim \chi^2_d$ is a chi-squared random variable with $d$ degrees of freedom, then $\mathbb{E}[Z] = d$ and $\text{Var}[Z] = 2d$.
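Here is a small numerical sanity check, not part of the proof, of these facts in the form we will use them: $\text{tr}(\boldsymbol{G}^\intercal\boldsymbol{G}) = \|\boldsymbol{G}\|_{\rm F}^2$ is a sum of $n^2$ squared iid $\mathcal{N}(0,1)$ entries, i.e. a $\chi^2_{n^2}$ random variable, so its mean should be $n^2$ and its variance $2n^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 20, 100_000
# tr(G^T G) = ||G||_F^2 = sum of the n^2 squared iid N(0,1) entries of G
traces = np.array([np.sum(rng.standard_normal((n, n)) ** 2) for _ in range(trials)])

print(f"empirical mean:     {traces.mean():9.1f}   vs  n^2 = {n**2}")
print(f"empirical variance: {traces.var():9.1f}   vs 2n^2 = {2 * n**2}")
```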

Proof. Let $\boldsymbol{A}$ be defined as in Theorem 2. Note that $\mathbb{E}[\text{tr}(\boldsymbol{A})] = \mathbb{E}[\text{tr}(\boldsymbol{G}^\intercal\boldsymbol{G})] = \mathbb{E}[\|\boldsymbol{G}\|_{\rm F}^2] = n^2$. By Theorem 2, after $k$ matrix-vector products, there exists a decomposition

$$ \boldsymbol{V}^\intercal\boldsymbol{A}\boldsymbol{V} = \boldsymbol{\Delta} + \begin{bmatrix}\boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} & \tilde{\boldsymbol{A}}\end{bmatrix}, $$

where $\tilde{\boldsymbol{A}}\in\mathbb{R}^{(n-k) \times (n-k)}$ is independent of the algorithm's queries. By Lemma 1, the lowest possible error any algorithm can achieve is $\mathbb{E}[\text{Var}[\text{tr}(\boldsymbol{A}) \mid \boldsymbol{\Delta},\boldsymbol{V}]]$. Since $\boldsymbol{V}$ is orthogonal, the trace is unchanged by this similarity transform, so

$$ \text{tr}(\boldsymbol{A}) = \text{tr}(\boldsymbol{\Delta}) + \text{tr}(\tilde{\boldsymbol{A}}). $$

Since $\tilde{\boldsymbol{A}}$ is independent of $\boldsymbol{\Delta}$ and $\boldsymbol{V}$, and since $\text{tr}(\tilde{\boldsymbol{A}})=\|\tilde{\boldsymbol{G}}\|_{\rm F}^2$ has a $\chi^2$ distribution with $(n-k)^2$ degrees of freedom, we have

$$ \mathbb{E}[|\tilde{t} - \text{tr}(\boldsymbol{A})|^2] \geq \text{Var}[\text{tr}(\boldsymbol{A}) \mid \boldsymbol{\Delta},\boldsymbol{V}] = \text{Var}[\text{tr}(\tilde{\boldsymbol{A}})] = 2(n-k)^2. $$

So, any estimator that achieves root mean squared error at most $\varepsilon \text{tr}(\boldsymbol{A})$ must satisfy

$$ \sqrt{2}(n-k) \leq \sqrt{\mathbb{E}[|\tilde{t} - \text{tr}(\boldsymbol{A})|^2]} \leq \varepsilon \mathbb{E}[\text{tr}(\boldsymbol{A})] = \varepsilon n^2. $$

Rearranging this inequality yields

$$ k \geq n - \frac{\varepsilon n^2}{\sqrt{2}}. $$

Maximizing the right-hand side over $n$ (the maximum is attained at $n = \frac{1}{\sqrt{2}\varepsilon}$) gives the desired lower bound of $k \geq \frac{1}{2\sqrt{2}\varepsilon}$.

$\blacksquare$
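As an informal sanity check of the final optimization step, the short numpy snippet below (with an arbitrary illustrative choice of $\varepsilon$) confirms that $n - \frac{\varepsilon n^2}{\sqrt{2}}$ is maximized near $n = \frac{1}{\sqrt{2}\varepsilon}$, with maximum value $\frac{1}{2\sqrt{2}\varepsilon}$:

```python
import numpy as np

eps = 0.01                            # illustrative accuracy level (an assumption, not from the proof)
n = np.arange(1, 1000)
bound = n - eps * n**2 / np.sqrt(2)   # lower bound on k as a function of n

best = np.argmax(bound)
print("maximizing n:", n[best], "   prediction 1/(sqrt(2)*eps) =", 1 / (np.sqrt(2) * eps))
print("max of bound:", bound[best], "   prediction 1/(2*sqrt(2)*eps) =", 1 / (2 * np.sqrt(2) * eps))
```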

See Also

There have been many papers that use the Hidden Wishart Theorem to prove matrix-vector complexity lower bounds. The proof here is a special case of an analysis in Meyer and Avron (2023).

Other relevant papers to this article include:

  • Braverman et al. (2020): Introduces the hidden Wishart method to prove lower bounds for linear regression and eigenvalue estimation.

  • Meyer et al. (2021): Has the first optimal $\Omega(1/\varepsilon)$ lower bounds for trace estimation, but uses more complex methods.

  • Amsel et al. (2026): Gives a succinct proof of the hidden Wishart theorem.

  • Jiang et al. (2021): Sharpens the analysis above to get a nearly tight dependence on failure probability.

  • Amsel et al. (2025): Uses the hidden Wishart method to prove lower bounds for learning the diagonal of a matrix.

Let me know if anything is missing; I'm sure this list omits many relevant papers.

References