## Deriving PCA from the Courant–Fischer Theorem

2021-09-06
A variational eigenvalue characterization of principal components
Principal Component Analysis (PCA) is one of my favorite algorithms because (1) it is foundational and broadly applicable, and (2) it can be derived using approaches from many different mathematical fields. Typically, these derivations involve optimizing a high-dimensional objective function with Lagrange multipliers. Instead, the Courant–Fischer Theorem, a fundamental theorem from linear algebra, can be used to arrive at a solution to this optimization problem more simply. In this post, we prove the Courant–Fischer Theorem, use it to derive PCA, and finally consider some technical points from the derivation.

## Formulating PCA as a matrix optimization problem

One view of PCA can be motivated by the following question: Given a set of high-dimensional data points, what are the directions along which the data vary the most? Represent one such high-dimensional data point with a random vector \(x \in \mathbb{R}^D\) and assume the following:

- \(\mathbb{E}[x] = 0\) (i.e., the vector is **mean-centered**), and
- \(\mathrm{rank}(\Sigma_x) \geq d\), where \(\Sigma_x\) is the covariance matrix of \(x\) and \(d \ll D\) is the dimensionality of our embedding.
These assumptions lead us to the following definition of the principal components (PCs):
\[ y_i = \langle u_i, x \rangle \in \mathbb{R},\quad u_i \in \mathbb{R}^D \text{ and } \lVert u_i \rVert_2 = 1,\quad i=1,\dots,d \,,\] such that the variance of \(y_i\) is maximized subject to \[ \mathrm{Var}(y_1) \geq \cdots \geq \mathrm{Var}(y_d) > 0 \,. \] We can derive an important characterization of the PCs from this definition:
\[ \require{cancel} \begin{align*} \mathrm{Var}(y_i) &= \mathbb{E}[y_i^2] - \cancel{\mathbb{E}[y_i]^2} &&\text{(Given \(\mathbb{E}[x] = 0\).)} \\ &= \mathbb{E}[u_i^* x u_i^* x] &&\text{(By definition of \(y_i\).)} \\ &= \mathbb{E}[u_i^* x x^* u_i] &&\text{(\(u_i^* x\) is a scalar, so \(u_i^* x = x^* u_i\).)} \\ &= u_i^* \mathbb{E}[x x^*] u_i &&\text{(Expectation is linear.)} \\ &= u_i^*\Sigma_x u_i \,. \end{align*} \] Therefore, we are looking for the vectors \(u_i\) that maximize this quadratic form! We need one final result before we can apply the Courant–Fischer Theorem: \(\Sigma_x\) is Hermitian.
\[ \Sigma_x^* = (\mathbb{E}[xx^*] - \mathbb{E}[x]\mathbb{E}[x^*])^* = \mathbb{E}[(xx^*)^*] - (\mathbb{E}[x]\mathbb{E}[x^*])^* = \mathbb{E}[xx^*] - \mathbb{E}[x]\mathbb{E}[x^*] = \Sigma_x \,. \]

## Proving the Courant–Fischer Theorem

The Courant–Fischer Theorem provides a very useful bound on all possible quadratic forms of a Hermitian matrix in a subspace using the eigenvalues of the matrix. For a Hermitian matrix \(A \in \mathbb{C}^{n \times n}\) with eigenvalues \(\lambda_1 \geq \cdots \geq \lambda_n\) and corresponding orthonormal eigenvectors \(u_1, \dots, u_n\),
\[ \lambda_k = \min_{\substack{\cal{V} \subseteq \mathbb{C}^n \\ \mathrm{dim}\,\cal{V} = k-1}} \max_{\substack{x \in \cal{V}^\perp \\ x \neq \vec{0}}} \frac{x^* A x}{x^* x} \,. \]
Let \(\mathcal{U} = \mathrm{span}\{u_1, \dots, u_k\}\) and let \(\mathcal{V}\) be any subspace with \(\dim \mathcal{V} = k - 1\), so that \(\dim \mathcal{V}^\perp = n - k + 1\). Then, \[ \begin{align*} \dim (\mathcal{U} \cap \mathcal{V}^\perp) &= \dim \mathcal{U} + \dim \mathcal{V}^\perp - \dim(\mathcal{U} + \mathcal{V}^\perp) \\ &= k + (n - k + 1) - \dim(\mathcal{U} + \mathcal{V}^\perp) \\ &\geq k + (n - k + 1) - n \\ &= 1 \,. \end{align*} \] Let \(w \in \mathcal{U} \cap \mathcal{V}^\perp\) be nonzero and express \( w = \sum_{i=1}^k c_i u_i \) (a linear combination of the first \(k\) eigenvectors). Then, \[ \begin{align*} \frac{w^* A w}{w^* w} = \frac{\sum_{i=1}^k \lambda_i |c_i|^2}{\sum_{i=1}^k |c_i|^2} \geq \lambda_k \frac{\sum_{i=1}^k |c_i|^2}{\sum_{i=1}^k |c_i|^2} = \lambda_k \,, \end{align*} \] with equality when \(\mathcal{V} = \mathrm{span}\{u_1, \dots, u_{k-1}\}\) and \(w = u_k\). Taking the minimum over all such \(\mathcal{V}\) therefore yields \(\lambda_k\), proving the theorem.
\[ \lambda_k = \max_{x \perp \{u_1, \dots, u_{k-1}\}} \frac{x^* A x}{x^* x} \,. \] This vastly simplifies the constrained optimization problem from the definition of the PCs – no need for Lagrange multipliers!

## Deriving PCA
Recall the definition of the PCs: \[ y_i = \langle u_i, x \rangle \,. \] Since \(\Sigma_x\) is Hermitian, the Courant–Fischer Theorem applies directly. For \(i = 1\),
\[ \max_{u \in \mathbb{C}^D} \mathrm{Var}(y_1) = \max_{u \in \mathbb{C}^D} \frac{u^* \Sigma_x u}{u^* u} = \lambda_1 \,. \] That is, the first principal component of \(x\) is in the direction of the eigenvector corresponding to the largest eigenvalue. Recall that, by definition, the principal components are mutually uncorrelated. Since \(\mathrm{rank}(\Sigma_x) \geq d\), we have \(\lambda_1 > 0\), and uncorrelatedness then forces \(u_1 \perp u_2\). By the definition of uncorrelatedness, \[ \begin{align*} 0 &= \mathrm{Cov}(y_1, y_2) \\ &= \mathbb{E}[u_1^* x u_2^* x] \\ &= \mathbb{E}[u_1^* x x^* u_2] \\ &= u_1^* \mathbb{E}[x x^*] u_2 \\ &= u_1^*\Sigma_x u_2 \\ &= (\Sigma_x u_1)^* u_2 \\ &= \lambda_1 u_1^*u_2 \,, \end{align*} \] and dividing by \(\lambda_1 > 0\) gives \(u_1 \perp u_2\). Therefore, \[ \max_{u \perp \{u_1\}} \mathrm{Var}(y_2) = \max_{u \perp \{u_1\}} \frac{u^* \Sigma_x u}{u^* u} = \lambda_2 \,. \] By induction, we can show that the uncorrelatedness of the principal components implies the orthogonality of their underlying vectors, continuing to enable use of the Courant–Fischer Theorem. Thus, PCA has been derived!

## Technical Considerations

What if \(\lambda_i\) is complex? Which eigenvalue is “largest” in this case? It is true that there is no total ordering on the complex numbers, so there would be no “largest” eigenvalue, as such… However, we have shown that \(\Sigma_x\) is Hermitian, so \(\sigma(\Sigma_x) \subseteq \mathbb{R}\,.\) This is because Hermitian matrices are unitarily diagonalizable, and the diagonal matrix, which is comprised of the eigenvalues, is itself Hermitian. Okay, but what if \(\lambda_i < 0\)? That wouldn't be a valid variance.
Any covariance matrix \(\Sigma_x\) is positive semi-definite, so the eigenvalues of \(\Sigma_x\) are non-negative, i.e., \(\sigma(\Sigma_x) \subseteq \mathbb{R}_{\geq 0}\,.\) For any vector \(z \neq \vec{0}\), \[ \begin{align*} z^* \Sigma_x z &= \mathbb{E}[z^* xx^* z] - \mathbb{E}[z^* x]\mathbb{E}[x^* z] \\ &= \mathbb{E}[|z^* x|^2] - |\mathbb{E}[z^* x]|^2 \\ &= \mathrm{Var}(z^* x) \\ &\geq 0 \,. \end{align*} \] What happens if \(\lambda_i = \lambda_j\) for \(i \neq j\)? Recall that \(\Sigma_x\) is Hermitian, and therefore is unitarily diagonalizable. This implies that every eigenvalue of \(\Sigma_x\) has equal geometric and algebraic multiplicities (this follows from the Jordan Canonical Form!). Therefore, for repeated eigenvalues, we can pick any orthonormal eigenvectors from the eigenspace of \(\lambda = \lambda_i = \lambda_j\), meaning the PCs are defined only up to a rotation.

## Acknowledgements

This post synthesizes ideas I encountered in Matrix Analysis (JHU.AMS.792) and Unsupervised Learning (JHU.BME.692). The second chapter of Vidal et al. (2015) presents PCA from three different perspectives: statistical, geometric, and rank-minimization.