2021-09-06 | *A variational eigenvalue characterization of principal components*

Principal Component Analysis (PCA) is one of my favorite algorithms because (1) it is foundational and broadly applicable, and (2) it can be derived using approaches from many different mathematical fields.^{[1],[2]}

[1] | The second chapter of Vidal et al. (2015) presents PCA from three different perspectives: statistical, geometric, and rank-minimization. |

[2] | This post synthesizes ideas I encountered in Matrix Analysis (JHU.AMS.792) and Unsupervised Learning (JHU.BME.692). |

One view of PCA can be motivated by the following question:

*Given a set of high-dimensional data points, what are the directions of maximal variance?*

Represent one such high-dimensional data point with a random vector \(x \in \mathbb{R}^D\) and assume the following:

1. \(\mathbb{E}[x] = 0\) (i.e., the vector is *mean-centered*), and
2. \(\mathrm{rank}(\Sigma_x) \geq d\), where \(\Sigma_x\) is the covariance matrix of \(x\) and \(d \ll D\) is the dimensionality of our embedding.

These assumptions lead us to the following definition:

**Definition 1 (Principal Components of \(x\)).** The \(d\) principal components (PCs) of \(x\) are a set of mutually uncorrelated^{[3]} random variables \(\vec{y} \in \mathbb{R}^d\) defined as

\[ y_i = u_i^* x \,, \quad \lVert u_i \rVert_2 = 1 \,, \]

such that the variance of each \(y_i\) is maximized subject to

\[\mathrm{Var}(y_1) \geq \cdots \geq \mathrm{Var}(y_d) > 0 \,.\]

[3] | That is, \(\mathrm{cov}(y_i, y_j) = 0\) for all \(i \neq j\). |

**Lemma 1.** \(\mathrm{Var}(y_i) = u_i^*\Sigma_x u_i \,.\) *Proof:* This follows from the definition of variance:

\[\begin{aligned} \mathrm{Var}(y_i) &= \mathbb{E}[y_i^2] - \cancel{\mathbb{E}[y_i]^2} &&\text{(Given $\mathbb{E}[x] = 0$.)} \\ &= \mathbb{E}[u_i^* x u_i^* x] &&\text{(By definition of $y_i$.)} \\ &= \mathbb{E}[u_i^* x x^* u_i] &&\text{($u_i^* x$ is a scalar, so $u_i^* x = x^* u_i$.)} \\ &= u_i^* \mathbb{E}[x x^*] u_i &&\text{(Expectation is linear.)} \\ &= u_i^*\Sigma_x u_i \,. \end{aligned}\]

Therefore, we are looking for the vectors \(u_i\) that maximize this quadratic form!
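Lemma 1 is easy to sanity-check numerically. The sketch below (using NumPy; the data, dimensions, and variable names are all illustrative) draws mean-centered samples and compares the empirical variance of \(y = u^* x\) against the quadratic form \(u^* \Sigma_x u\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample mean-centered data x in R^D (rows of X are observations).
D, n = 5, 100_000
X = rng.standard_normal((n, D)) @ rng.standard_normal((D, D))
X -= X.mean(axis=0)

# Empirical covariance matrix Sigma_x = E[x x^*].
Sigma = X.T @ X / n

# Pick a unit vector u and compare Var(u^* x) with u^* Sigma_x u.
u = rng.standard_normal(D)
u /= np.linalg.norm(u)

y = X @ u                      # realizations of y = u^* x
var_direct = np.mean(y**2)     # Var(y), since E[y] = 0
var_quadratic = u @ Sigma @ u  # the quadratic form u^* Sigma_x u

assert np.isclose(var_direct, var_quadratic)
```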

We need one final result before we can apply the Courant–Fischer Theorem:

**Lemma 2.** The covariance matrix of a random variable is Hermitian.^{[4]} *Proof:* Since \(\mathbb{E}[x] = 0\), we have \(\Sigma_x = \mathbb{E}[xx^*]\), and \[\Sigma_x^* = \mathbb{E}[xx^*]^* = \mathbb{E}[(xx^*)^*] = \mathbb{E}[xx^*] = \Sigma_x \,.\]

[4] | A matrix \(A\) is Hermitian if \(A = A^*\) (i.e., a more general form of symmetry that applies to complex matrices). |
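Lemma 2 holds even for complex-valued data, as a quick numerical check illustrates (a minimal sketch; the sample size and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)

# Complex-valued, mean-centered samples of x in C^D.
D, n = 3, 10_000
X = rng.standard_normal((n, D)) + 1j * rng.standard_normal((n, D))
X -= X.mean(axis=0)

# Sigma_x = E[x x^*], using the conjugate transpose for complex data.
Sigma = X.conj().T @ X / n

# Hermitian: Sigma equals its own conjugate transpose.
assert np.allclose(Sigma, Sigma.conj().T)
```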

The Courant–Fischer Theorem provides a very useful bound on all possible quadratic forms of a Hermitian matrix over a subspace in terms of the eigenvalues of the matrix.

**Theorem 1 (Courant–Fischer).** Let \(A \in M_n\) be a Hermitian matrix with eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_n\) and corresponding eigenvectors \(u_1, \dots, u_n\). Then,

\[ \lambda_k = \min_{\dim \mathcal{V} = k-1} \; \max_{\substack{w \neq 0 \\ w \perp \mathcal{V}}} \frac{w^* A w}{w^* w} \,. \]

*Proof:* Because \(A\) is Hermitian, it is unitarily diagonalizable, so its eigenvectors \(u_1, \dots, u_n\) may be taken orthonormal. Let \(\mathcal{V}\) be any subspace with \(\dim \mathcal{V} = k - 1\) and let \(\mathcal{U} = \mathrm{span}\{ u_1, \dots, u_k \}\). Then, the intersection of \(\mathcal U\) and \(\mathcal V^\perp\) has a dimension of at least \(1\).^{[5]} Choose a nonzero \(w \in \mathcal{U} \cap \mathcal{V}^\perp\) and write \(w = \sum_{i=1}^k c_i u_i\). Then

\[ \frac{w^* A w}{w^* w} = \frac{\sum_{i=1}^k \lambda_i |c_i|^2}{\sum_{i=1}^k |c_i|^2} \geq \lambda_k \,, \]

[5] | From the dimension formula for subspaces, with \(\dim \mathcal{V}^\perp = n - k + 1\): \[\begin{aligned} \dim (\mathcal{U} \cap \mathcal{V}^\perp) &= \dim \mathcal U + \dim \mathcal V^\perp - \dim(\mathcal U + \mathcal V^\perp) \\ &= k + (n - k + 1) - \dim(\mathcal U + \mathcal V^\perp) \\ &\geq k + (n - k + 1) - n \\ &= 1 \,. \end{aligned}\] |

with equality when \(\mathcal V = \mathrm{span}\{u_1, \dots, u_{k-1}\}\), in which case every \(w \perp \mathcal{V}\) lies in \(\mathrm{span}\{u_k, \dots, u_n\}\) and the maximum is attained at \(w = u_k \,.\)

**Corollary 1.** As a direct consequence of the Courant–Fischer Theorem, for any matrix \(A\) as in Theorem 1,

\[ \max_{\substack{w \neq 0 \\ w \perp \{u_1, \dots, u_{k-1}\}}} \frac{w^* A w}{w^* w} = \lambda_k \,, \]

with the maximum attained at \(w = u_k\,.\)

This vastly simplifies the constrained optimization problem presented in Lemma 1 – no need for Lagrange multipliers!
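The Courant–Fischer bound can be observed directly. In the sketch below (illustrative names; a real symmetric matrix stands in for the general Hermitian case), the Rayleigh quotient of random vectors never exceeds \(\lambda_1\), and restricting to vectors orthogonal to the leading eigenvectors attains \(\lambda_k\) as in Corollary 1:

```python
import numpy as np

rng = np.random.default_rng(1)

# A random Hermitian (here: real symmetric) matrix.
B = rng.standard_normal((6, 6))
A = (B + B.T) / 2

# Eigenvalues in descending order, with matching eigenvectors as columns.
lam, U = np.linalg.eigh(A)
lam, U = lam[::-1], U[:, ::-1]

def rayleigh(w):
    """The quadratic form w^* A w normalized by w^* w."""
    return (w @ A @ w) / (w @ w)

# The Rayleigh quotient of any nonzero vector is bounded by lambda_1 ...
for _ in range(1000):
    w = rng.standard_normal(6)
    assert rayleigh(w) <= lam[0] + 1e-12

# ... and the bound is attained at the leading eigenvector.
assert np.isclose(rayleigh(U[:, 0]), lam[0])

# Restricting to w ⟂ {u_1, ..., u_{k-1}} attains lambda_k (Corollary 1).
k = 3
assert np.isclose(rayleigh(U[:, k - 1]), lam[k - 1])
```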

**Theorem 2 (Principal Component Analysis).** Given a random variable \(x\) with covariance matrix \(\Sigma_x \in M_n\) with eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_n\) and associated normalized eigenvectors \(u_1, \dots, u_n\), the \(i\)-th principal component of \(x\) is

\[ y_i = u_i^* x \,. \]

*Proof:* From Lemma 1, we have that \( \mathrm{Var}(y_i) = u_i^* \Sigma_x u_i \text{ with } \lVert u_i \rVert_2 = 1 \). From the Courant–Fischer Theorem, we have that

\[ \max_{\lVert u \rVert_2 = 1} \mathrm{Var}(y_1) = \max_{u \neq 0} \frac{u^* \Sigma_x u}{u^* u} = \lambda_1 \,. \]

That is, the first principal component of \(x\) is in the direction of the eigenvector corresponding to the largest eigenvalue. Recall that, by definition, the principal components are mutually uncorrelated. Unless \(\Sigma_x\) is the zero matrix, \(\lambda_1 > 0\), which together with uncorrelatedness implies \(u_1 \perp u_2\).^{[6]}

Therefore,

\[ \max_{u \perp \{u_1\}} \mathrm{Var}(y_2) = \max_{u \perp \{u_1\}} \frac{u^* \Sigma_x u}{u^* u} = \lambda_2 \,.\]

By induction, the uncorrelatedness of the principal components implies the orthogonality of their underlying vectors, which lets us keep applying the Courant–Fischer Theorem at each step. Thus, PCA has been derived!
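Theorem 2 suggests a complete numerical check of the derivation: compute the eigendecomposition of an empirical covariance matrix, project the data onto the eigenvectors, and confirm that the resulting components have variances \(\lambda_1 \geq \cdots \geq \lambda_n\) and are mutually uncorrelated. A sketch (data and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

# Mean-centered data in R^D with anisotropic variance.
D, n = 4, 200_000
X = rng.standard_normal((n, D)) @ np.diag([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)
Sigma = X.T @ X / n

# PCA via Theorem 2: eigenvectors of Sigma, eigenvalues in descending order.
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]

Y = X @ U  # columns are the principal components y_i = u_i^* x

# Var(y_i) = lambda_i, sorted in decreasing order ...
assert np.allclose(Y.var(axis=0), lam)
assert np.all(np.diff(lam) <= 0)

# ... and the PCs are mutually uncorrelated (off-diagonal covariances ~ 0).
C = Y.T @ Y / n
assert np.allclose(C, np.diag(lam), atol=1e-10)
```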

[6] | By the definition of uncorrelatedness, \[\begin{aligned} 0 &= \mathrm{Cov}(y_1, y_2) \\ &= \mathbb{E}[u_1^* x u_2^* x] \\ &= \mathbb{E}[u_1^* x x^* u_2] \\ &= u_1^* \mathbb{E}[x x^*] u_2 \\ &= u_1^*\Sigma_x u_2 \\ &= (\Sigma_x u_1)^* u_2 \\ &= \lambda_1 u_1^*u_2 \,. \end{aligned}\] |

What if \(\lambda_i\) is complex? Which eigenvalue is "largest" in this case?

It is true that there is no total ordering on the complex numbers, so there would be no "largest" eigenvalue, as such... However, we have shown that \(\Sigma_x\) is Hermitian, so \(\sigma(\Sigma_x) \subseteq \mathbb{R}\,.\)^{[7]}

[7] | This is because Hermitian matrices are unitarily diagonalizable, and the diagonal matrix in that factorization, whose entries are the eigenvalues, is itself Hermitian. |

Okay, but what if \(\lambda_i < 0\)? That wouldn't be a valid variance.

Any covariance matrix \(\Sigma_x\) is positive semi-definite,^{[8]} so we also know that the eigenvalues of \(\Sigma_x\) are non-negative, i.e., \(\sigma(\Sigma_x) \subseteq \mathbb{R}_{\geq 0}\,.\)

[8] | For any vector \(z\), \[\begin{aligned} z^* \Sigma_x z &= \mathbb{E}[z^* xx^* z] - \mathbb{E}[z^* x]\mathbb{E}[x^* z] \\ &= \mathbb{E}[(z^* x)^2] - \mathbb{E}[z^* x]^2 \\ &= \mathrm{Var}(z^* x) \\ &\geq 0 \,. \end{aligned}\] |

What happens if \(\lambda_i = \lambda_j\) for \(i \neq j\)?

Recall that \(\Sigma_x\) is Hermitian, and therefore is unitarily diagonalizable. This implies that every eigenvalue of \(\Sigma_x\) has equal geometric and algebraic multiplicities.^{[9]} Therefore, for repeated eigenvalues, we can pick any orthonormal eigenvectors from the eigenspace of \(\lambda = \lambda_i = \lambda_j\), meaning the PCs are defined only up to a rotation.
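The rotation ambiguity for repeated eigenvalues can be made concrete. In the sketch below (a hand-picked covariance with a two-dimensional eigenspace; all values are illustrative), rotating the eigenvectors within the repeated eigenspace yields a different orthonormal set that diagonalizes \(\Sigma_x\) just as well:

```python
import numpy as np

# A covariance matrix with a repeated eigenvalue: lambda = 2 on a 2-D eigenspace.
Sigma = np.diag([2.0, 2.0, 1.0])
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]  # descending order

# Rotate the orthonormal basis of the repeated eigenspace by any angle theta.
theta = 0.7
R = np.eye(3)
R[:2, :2] = [[np.cos(theta), -np.sin(theta)],
             [np.sin(theta),  np.cos(theta)]]
V = U @ R  # a different, equally valid set of principal directions

assert np.allclose(V.T @ V, np.eye(3))             # still orthonormal
assert np.allclose(V.T @ Sigma @ V, np.diag(lam))  # still diagonalizes Sigma
```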

[9] | This follows from the Jordan Canonical Form! |

© Vivek Gopakrishnan. Last modified: September 16, 2022. Website built with Franklin.jl and the Julia programming language.