Introduction

Consider the problem of estimating the trace of a PSD matrix $\boldsymbol{A}$ from a small number of matrix-vector products. There are a variety of algorithms which achieve this goal using merely $\mathcal{O}(1/\varepsilon)$ matrix-vector products. Here, we will show a succinct proof of the complementary lower bound: that $\Omega(1/\varepsilon)$ matrix-vector products are necessary in the worst case.
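To make the query model concrete, below is a minimal numpy sketch (not taken from any of the papers discussed here) of the classic Girard–Hutchinson estimator, which accesses $\boldsymbol{A}$ only through products $\boldsymbol{A}\mathbf{x}$. This simple estimator needs $\mathcal{O}(1/\varepsilon^2)$ products for relative error $\varepsilon$; more sophisticated methods like Hutch++ achieve the $\mathcal{O}(1/\varepsilon)$ rate mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

def hutchinson_trace(matvec, n, k):
    """Estimate tr(A) from k matrix-vector products with A."""
    total = 0.0
    for _ in range(k):
        x = rng.choice([-1.0, 1.0], size=n)  # random sign (Rademacher) query vector
        total += x @ matvec(x)               # E[x^T A x] = tr(A)
    return total / k

# Example: a random PSD matrix A = G^T G, accessed only via products A @ x.
n = 200
G = rng.standard_normal((n, n))
A = G.T @ G
print(hutchinson_trace(lambda x: A @ x, n, k=100), np.trace(A))
```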

Theorem 1: Trace Estimation Lower Bound

Any algorithm that accesses a PSD matrix $\boldsymbol{A}$ via $k$ (possibly adaptive) matrix-vector products and outputs an estimate $\tilde{t}$ of $\text{tr}(\boldsymbol{A})$ such that $\sqrt{\mathbb{E}[|\tilde{t} - \text{tr}(\boldsymbol{A})|^2]} \leq \varepsilon\text{tr}(\boldsymbol{A})$ must use at least $k \geq \frac{1}{2\sqrt{2}\varepsilon}$ matrix-vector products.

To prove this lower bound, we will need two ingredients: the hidden Wishart theorem, and the conditional expectation.

The Hidden Wishart Theorem

We will rely on the following remarkable result, whose proof we omit. This is the core theorem that enables this entire lower bound technique.

Theorem 2: Hidden Wishart Theorem

Let $\boldsymbol{G} \in \mathbb{R}^{n \times n}$ be a random matrix with iid $\mathcal{N}(0,1)$ entries, and let $\boldsymbol{A} = \boldsymbol{G}^\intercal\boldsymbol{G}$. Suppose an algorithm computes $k$ (possibly adaptive) matrix-vector products with $\boldsymbol{A}$, denoted $\mathbf{y}_1 = \boldsymbol{A} \mathbf{x}_1, \ldots, \mathbf{y}_k = \boldsymbol{A} \mathbf{x}_k$.

Then, there exists a matrix $\boldsymbol{\Delta} \in \mathbb{R}^{n \times n}$ and orthogonal matrix $\boldsymbol{V} \in \mathbb{R}^{n \times n}$, each constructed deterministically from the queries $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^k$, such that

$$ \boldsymbol{V}^\intercal\boldsymbol{A}\boldsymbol{V} = \boldsymbol{\Delta} + \begin{bmatrix} \boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} & \tilde{\boldsymbol{A}} \end{bmatrix}, $$

where $\tilde{\boldsymbol{A}}=\tilde{\boldsymbol{G}}^\intercal\tilde{\boldsymbol{G}} \in \mathbb{R}^{(n-k) \times (n-k)}$ and $\tilde{\boldsymbol{G}} \in \mathbb{R}^{(n-k) \times (n-k)}$ has iid $\mathcal{N}(0,1)$ entries independent of $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^k$.

A succinct proof of Theorem 2 can be found in Appendix B.1 of Amsel et al. (2026).

The matrices $\boldsymbol{A}$ and $\tilde{\boldsymbol{A}}$ follow the Wishart distribution, hence the name of the theorem. The result shows that after $k$ matrix-vector products, there remains a large random component $\tilde{\boldsymbol{A}}\in\mathbb{R}^{(n-k) \times (n-k)}$ of the matrix $\boldsymbol{A}\in\mathbb{R}^{n \times n}$ which is completely independent of the algorithm's matrix-vector queries. This is the titular "hidden Wishart".

Notice that this theorem has robbed the matrix-vector algorithm of any and all agency: no matter how the method chooses its (possibly adaptive) queries, there is always a large random component of the matrix about which it has no information. Since we can explicitly characterize this unseen part of $\boldsymbol{A}$, we can use simple statistical tools to prove Theorem 1.

A simple statistical observation

In Theorem 1, we care about minimizing the mean squared error of our algorithms given some data about $\boldsymbol{A}$ (namely, a sequence of matrix-vector products). Classical statistics tells us that the best possible algorithm for this goal is the conditional expectation:

Lemma 1: Conditional Expectation Minimizes MSE

Let $X$ and $Y$ be (possibly dependent) random variables. Suppose an algorithm observes $Y$ and outputs an estimate $\tilde{X}$ of $X$ based on $Y$. Then, the error of $\tilde{X}$ is lower bounded as

$$ \mathbb{E}\big[|\tilde{X} - X|^2\big] \geq \mathbb{E}\big[\text{Var}[X\,|\,Y]\big], $$

and this lower bound is achieved by the conditional expectation $\hat{X} := \mathbb{E}[X\,|\,Y]$.

Proof. By the tower rule,

$$ \mathbb{E}\big[|\tilde{X} - X|^2\big] = \mathbb{E}\left[\mathbb{E}\big[|\tilde{X} - X|^2 ~|~ Y\big]\right]. $$

For any fixed $Y=y$, the inner expectation $\mathbb{E}\big[|\tilde{X} - X|^2 ~|~ Y=y\big]$ is minimized by choosing $\tilde{X} = \mathbb{E}[X\,|\,Y=y]$. Depending on who you ask, this is either the definition of conditional expectation or a basic fact about it. Either way, we have

$$ \mathbb{E}\big[|\tilde{X} - X|^2 ~|~ Y\big] \geq \mathbb{E}\left[\big| X - \mathbb{E}[X\,|\,Y] \big|^2 ~|~ Y\right] = \text{Var}[X\,|\,Y]. $$

Taking the expectation over $Y$ finishes the proof.

$\blacksquare$
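To see Lemma 1 in action, here is a quick Monte Carlo sketch using a hypothetical toy model (not part of the argument above): $X = Z_1 + Z_2$ and $Y = Z_1$ with $Z_1, Z_2$ iid standard normal, so that $\mathbb{E}[X \,|\, Y] = Y$ and $\mathbb{E}[\text{Var}[X \,|\, Y]] = 1$. The conditional mean should attain mean squared error $1$, and any other function of $Y$ should do worse.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 1_000_000
Z1, Z2 = rng.standard_normal(m), rng.standard_normal(m)
X, Y = Z1 + Z2, Z1   # toy model: E[X | Y] = Y and Var[X | Y] = 1

mse_cond_mean = np.mean((Y - X) ** 2)        # the conditional-mean estimator E[X|Y] = Y
mse_other     = np.mean((0.5 * Y - X) ** 2)  # some other function of Y

print(f"MSE of conditional mean: {mse_cond_mean:.3f}  (lower bound E[Var[X|Y]] = 1)")
print(f"MSE of 0.5 * Y:          {mse_other:.3f}  (should be larger)")
```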

Proof of Trace Estimation Lower Bound

From this result, we can now prove the theorem. It will be helpful to keep in mind that if $Z \sim \chi^2_d$ is a chi-squared random variable with $d$ degrees of freedom, then $\mathbb{E}[Z] = d$ and $\text{Var}[Z] = 2d$.
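Here is a small numerical sanity check, not part of the proof, of these facts in the form we will use them: $\text{tr}(\boldsymbol{G}^\intercal\boldsymbol{G}) = \|\boldsymbol{G}\|_{\rm F}^2$ is a sum of $n^2$ squared iid $\mathcal{N}(0,1)$ entries, i.e. a $\chi^2_{n^2}$ random variable, so its mean should be $n^2$ and its variance $2n^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 20, 100_000
# tr(G^T G) = ||G||_F^2 = sum of the n^2 squared iid N(0,1) entries of G
traces = np.array([np.sum(rng.standard_normal((n, n)) ** 2) for _ in range(trials)])

print(f"empirical mean:     {traces.mean():9.1f}   vs  n^2 = {n**2}")
print(f"empirical variance: {traces.var():9.1f}   vs 2n^2 = {2 * n**2}")
```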

Proof. Let $\boldsymbol{A}$ be defined as in Theorem 2. Note that $\mathbb{E}[\text{tr}(\boldsymbol{A})] = \mathbb{E}[\text{tr}(\boldsymbol{G}^\intercal\boldsymbol{G})] = \mathbb{E}[\|\boldsymbol{G}\|_{\rm F}^2] = n^2$. By Theorem 2, after $k$ matrix-vector products, there exists a decomposition

$$ \boldsymbol{V}^\intercal\boldsymbol{A}\boldsymbol{V} = \boldsymbol{\Delta} + \begin{bmatrix}\boldsymbol{0} & \boldsymbol{0} \\ \boldsymbol{0} & \tilde{\boldsymbol{A}}\end{bmatrix}, $$

where $\tilde{\boldsymbol{A}}\in\mathbb{R}^{(n-k) \times (n-k)}$ is independent of the algorithm's queries. By Lemma 1, the lowest possible error any algorithm can achieve is $\mathbb{E}[\text{Var}[\text{tr}(\boldsymbol{A}) \mid \boldsymbol{\Delta},\boldsymbol{V}]]$. Since $\boldsymbol{V}$ is orthogonal, the trace is unchanged by this similarity transform, so

$$ \text{tr}(\boldsymbol{A}) = \text{tr}(\boldsymbol{\Delta}) + \text{tr}(\tilde{\boldsymbol{A}}). $$

Since $\tilde{\boldsymbol{A}}$ is independent of $\boldsymbol{\Delta}$ and $\boldsymbol{V}$, and since $\text{tr}(\tilde{\boldsymbol{A}})=\|\tilde{\boldsymbol{G}}\|_{\rm F}^2$ has a $\chi^2$ distribution with $(n-k)^2$ degrees of freedom, we have

$$ \mathbb{E}[|\tilde{t} - \text{tr}(\boldsymbol{A})|^2] \geq \text{Var}[\text{tr}(\boldsymbol{A}) \mid \boldsymbol{\Delta},\boldsymbol{V}] = \text{Var}[\text{tr}(\tilde{\boldsymbol{A}})] = 2(n-k)^2. $$

So, any estimator that achieves root mean squared error at most $\varepsilon \text{tr}(\boldsymbol{A})$ must satisfy

$$ \sqrt{2}(n-k) \leq \sqrt{\mathbb{E}[|\tilde{t} - \text{tr}(\boldsymbol{A})|^2]} \leq \varepsilon \mathbb{E}[\text{tr}(\boldsymbol{A})] = \varepsilon n^2. $$

Rearranging this inequality yields

$$ k \geq n - \frac{\varepsilon n^2}{\sqrt{2}}. $$

Maximizing the right-hand side over $n$ (the maximum is attained at $n = \frac{1}{\sqrt{2}\varepsilon}$) gives the desired lower bound of $k \geq \frac{1}{2\sqrt{2}\varepsilon}$.

$\blacksquare$
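As an informal sanity check of the final optimization step, the short numpy snippet below (with an arbitrary illustrative choice of $\varepsilon$) confirms that $n - \frac{\varepsilon n^2}{\sqrt{2}}$ is maximized near $n = \frac{1}{\sqrt{2}\varepsilon}$, with maximum value $\frac{1}{2\sqrt{2}\varepsilon}$:

```python
import numpy as np

eps = 0.01                            # illustrative accuracy level (an assumption, not from the proof)
n = np.arange(1, 1000)
bound = n - eps * n**2 / np.sqrt(2)   # lower bound on k as a function of n

best = np.argmax(bound)
print("maximizing n:", n[best], "   prediction 1/(sqrt(2)*eps) =", 1 / (np.sqrt(2) * eps))
print("max of bound:", bound[best], "   prediction 1/(2*sqrt(2)*eps) =", 1 / (2 * np.sqrt(2) * eps))
```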

See Also

There have been many papers that use the Hidden Wishart Theorem to prove matrix-vector complexity lower bounds. The proof here is a special case of an analysis in Meyer and Avron (2023).

Other relevant papers to this article include:

  • Braverman et al. (2020): Introduces the hidden Wishart method to prove lower bounds for linear regression and eigenvalue estimation.

  • Meyer et al. (2021): Has the first optimal $\Omega(1/\varepsilon)$ lower bounds for trace estimation, but uses more complex methods.

  • Amsel et al. (2026): Gives a succinct proof of the hidden Wishart theorem.

  • Jiang et al. (2021): Sharpens the analysis above to get a nearly tight dependence on failure probability.

  • Amsel et al. (2025): Uses the hidden Wishart method to prove lower bounds for learning the diagonal of a matrix.

Let me know if anything is missing; I'm sure this list omits many relevant papers.

References