Vivek Gopalakrishnan

Some Spectral Properties of Markov Chains

2021-11-10 | The power of Gershgorin and his Circles

Markov chains are stochastic models for describing discrete sequences of random behavior. While I normally dislike probability theory because I find it incredibly confusing, I like Markov chains because they can be expressed using linear algebra (which I find much less confusing!). This post explores the spectral properties of Markov chains[1] and relates them to the chains' limiting probabilistic behavior.

Three main results will be proved in this post (the fourth still needs a proof):[2]

  1. The eigenvalues of a Markov chain are bounded within the unit circle

  2. The number of stationary distributions a Markov chain has is related to the geometric multiplicity of $\lambda = 1$

  3. A large class of Markov chains have a unique stationary distribution

  4. If a Markov chain is not diagonalizable, its stationary distributions are calculable using the Jordan Canonical Form

[1] The spectrum of a matrix is the set of its eigenvalues. Spectral properties are then the properties of the eigenvalues that give rise to special behavior.
[2] Said more precisely, Result 2 really characterizes the dimension of the space of stationary distributions.

Gershgorin and His Circles

A Markov chain describes the probability of transitioning from one state to another. These transition probabilities can be summarized using a Markov matrix:

Definition 1 (Markov matrix). A matrix $M \in M_n([0, 1])$ is a Markov matrix if the entries of each column sum to 1. Said another way, if $i$ and $j$ are states in a Markov chain, then $\mathbb P(j \mapsto i) = M_{ij}$.
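
To make this concrete, here's a quick sketch in Python/NumPy (the helper `random_markov_matrix` is just a name I'm making up for this post) that samples a random column-stochastic matrix and checks the column-sum property; later snippets reuse this helper:

```python
import numpy as np

def random_markov_matrix(n, seed=0):
    """Sample a random column-stochastic (Markov) matrix in M_n([0, 1])."""
    rng = np.random.default_rng(seed)
    A = rng.random((n, n))        # entries drawn uniformly from [0, 1)
    return A / A.sum(axis=0)      # normalize each column so it sums to 1

M = random_markov_matrix(4)
assert np.allclose(M.sum(axis=0), 1.0)   # M[i, j] = P(j -> i), columns sum to 1
```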

The eigenvalues of a Markov matrix have a remarkable property: they're never bigger than 1 in magnitude! To prove this, we need an underrated result from linear algebra:

Theorem 1 (Gershgorin disc theorem). For the $i$-th row of a matrix $A$, let the row sum $R_i = \sum_{j \neq i} |A_{ij}|$ be the radius of the disc centered at $A_{ii}$: $G_i(A) = \{ z \in \mathbb C : |z - A_{ii}| \leq R_i \}$. Let the union of these discs be $G(A) = \cup_{i=1}^n G_i(A)$. Then, $\sigma(A) \subseteq G(A)$.[3]
Proof: A good explanation is available on Wikipedia.

[3] Note that $\sigma(A) = \{\lambda \in \mathbb C : Ax = \lambda x \text{ for some nonzero } x\}$ represents the set of all eigenvalues of $A$.
We can augment the Gershgorin disc theorem with a quick lemma connecting the eigenvalues of $A$ and $A^*$:

Lemma 1. Given a real matrix $A \in M_n(\mathbb R)$, $\sigma(A) = \sigma(A^*)$ (for real matrices, $A^*$ is just the transpose).
Proof: The lemma holds since $A$ and $A^*$ have the same characteristic polynomial: $\det(A^* - \lambda I) = \det((A - \lambda I)^*) = \det(A - \lambda I) \,.$

Thanks to Lemma 1, we can equivalently define $R_i$ in the Gershgorin disc theorem to be the $i$-th column sum, $\sum_{j \neq i} |A_{ji}|$, instead of the row sum. Now, the first main result follows:

Corollary 1 (Spectrum of Markov matrices). The eigenvalues of a Markov matrix $M \in M_n([0, 1])$ are bounded within the unit circle.[4]
Proof: Since the columns of $M$ must sum to $1$, for every column $i$ we have $M_{ii} + R_i = 1$. Therefore, the disc $G_i(M)$ is a subset of the closed unit disc that touches the unit circle at the point $(1, 0)$.

Therefore, by Theorem 1, $\sigma(M) \subseteq G(M) \subseteq \{ z \in \mathbb C : |z| \leq 1 \}$.

Here is an illustration of this proof for a Markov matrix in $M_4([0,1])$. Interestingly, this picture also shows us that the only way we can have a complex eigenvalue with magnitude 1 is if one of the $M_{ii} = 0$... more on this later.

[4] Said mathematically, $\sigma(M) \subseteq \{ z \in \mathbb C : |z| \leq 1 \}$.
[Figure: Gershgorin discs of a Markov matrix in $M_4([0,1])$; each disc lies inside the unit circle and touches it at $(1, 0)$.]
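
As a numerical sanity check of Corollary 1 (a sketch reusing the `random_markov_matrix` helper from above), every eigenvalue of a random Markov matrix has modulus at most 1, and a Markov matrix with zeros on its diagonal, like the 3-cycle permutation matrix, really can have complex eigenvalues of modulus 1 besides 1 itself:

```python
import numpy as np

# Every eigenvalue of a Markov matrix lies in the closed unit disc
M = random_markov_matrix(4)
assert np.all(np.abs(np.linalg.eigvals(M)) <= 1 + 1e-12)

# With zeros on the diagonal, eigenvalues of modulus 1 besides 1 can appear:
# the 3-cycle permutation matrix has the three cube roots of unity as eigenvalues.
C = np.array([[0., 0., 1.],
              [1., 0., 0.],
              [0., 1., 0.]])
print(np.round(np.linalg.eigvals(C), 3))   # -> 1, -0.5 + 0.866j, -0.5 - 0.866j
```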

Next, we ask: what is the largest eigenvalue of a Markov matrix?

Corollary 2 (Spectral radius of Markov matrices). The spectral radius of a Markov matrix is $\rho(M) = 1$, and $\rho(M) \in \sigma(M)$.
Proof: Let $\vec{1} \in \mathbb R^n$ be a column vector of all 1s. We can then write the column sum constraint as $M^*\vec{1} = \vec{1}$. Note that this equation shows that $\lambda = 1$ is an eigenvalue of $M^*$ (and, by Lemma 1, also an eigenvalue of $M$). Because the eigenvalues of $M$ are constrained within the unit circle by Corollary 1, $\lambda = 1$ achieves the maximum possible modulus.

Therefore, $\rho(M) = 1$ and $\rho(M) \in \sigma(M)$.
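
And a quick check of Corollary 2 with the same hypothetical helper: the column-sum constraint is exactly $M^* \vec 1 = \vec 1$, and $\lambda = 1$ appears in the spectrum with the largest possible modulus:

```python
import numpy as np

M = random_markov_matrix(5)
ones = np.ones(5)

# The column-sum constraint written as M* 1 = 1 (for real M, M* is the transpose)
assert np.allclose(M.T @ ones, ones)

# lambda = 1 is an eigenvalue of M, and no eigenvalue has larger modulus
eigvals = np.linalg.eigvals(M)
assert np.any(np.isclose(eigvals, 1.0))
assert np.isclose(np.max(np.abs(eigvals)), 1.0)
```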

Enumerating the Stationary Distributions

Remark: Gershgorin gives us a very powerful set of results! They show that if our Markov matrix is diagonalizable (i.e., $M = SDS^{-1}$ for $S \in M_n$ invertible and $D \in M_n$ diagonal), then $M^k = SD^kS^{-1}$ will converge to some matrix as $k \to \infty$ (provided no eigenvalue other than $1$ has modulus $1$), and the steady state for any initial condition will be determined by the eigenvectors of $M$ whose eigenvalues equal the spectral radius.[5]

[5] What happens if $M$ is not diagonalizable? In this case, we have to use the Jordan Canonical Form (JCF) of $M$ (more on this later).
For a cool application of this convergence idea, let's consider a symmetric Markov chain (i.e., one whose matrix satisfies $M_{ij} = M_{ji}$, so the forwards and backwards transition probabilities are equal). The spectral theorem tells us that the eigenvectors of this matrix can be chosen to be orthonormal. Additionally, assume that the geometric multiplicity of the eigenvalue $1$ is $k$ and that every other eigenvalue has modulus strictly less than $1$. Then,

$$
\begin{aligned}
\lim_{m \rightarrow \infty} M^m &= \lim_{m \rightarrow \infty} SD^mS^{-1} \\
&= S (I_k \oplus \mathbf 0_{n-k}) S^{-1} \\
&= \begin{pmatrix} s_1 & \cdots & s_k & \vec 0 & \cdots & \vec 0 \end{pmatrix} \begin{pmatrix} s_1 & \cdots & s_n \end{pmatrix}^* \\
&= \sum_{i=1}^k s_i s_i^* \,.
\end{aligned}
$$

Since the eigenvectors of $M$ are orthonormal, note that $s_i s_i^*$ is a matrix representing the orthogonal projection onto $s_i$. Thus, $P_{S_k} = \sum_{i=1}^k s_i s_i^*$ is a projection matrix onto the span of $\{s_1, \dots, s_k\}$. Finally, we can see that for some initial condition $x_0$,

$$\lim_{m \rightarrow \infty} M^m x_0 = P_{S_k} x_0 \,,$$

meaning the stationary distributions of $M$ are obtained by orthogonally projecting onto the eigenspace of $\lambda = 1$. Said another way, any convex combination of the eigenvectors of $M$ with eigenvalue $1$ is a valid stationary distribution.
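
Here's a small sketch of this convergence using a symmetric Markov matrix I made up from two disconnected blocks, so that the eigenvalue $1$ has geometric multiplicity $k = 2$:

```python
import numpy as np

# A symmetric (hence doubly stochastic) Markov matrix with two disconnected blocks
B = np.array([[0.6, 0.4],
              [0.4, 0.6]])
M = np.block([[B, np.zeros((2, 2))],
              [np.zeros((2, 2)), B]])

# Orthonormal eigenvectors for lambda = 1: the normalized indicator of each block
s1 = np.array([1., 1., 0., 0.]) / np.sqrt(2)
s2 = np.array([0., 0., 1., 1.]) / np.sqrt(2)
P = np.outer(s1, s1) + np.outer(s2, s2)   # projection onto the lambda = 1 eigenspace

# M^m converges to that projection, so M^m x0 converges to the projection of x0
assert np.allclose(np.linalg.matrix_power(M, 50), P)
x0 = np.array([1., 0., 0., 0.])            # start in state 0 with probability 1
print(np.linalg.matrix_power(M, 50) @ x0)  # -> [0.5, 0.5, 0., 0.]
```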

So, in general, when does $M$ have a unique stationary distribution (i.e., when is the geometric multiplicity of $\lambda = 1$ equal to $1$)?

The Perron–Frobenius Theorem

Definition 3 (Positive and non-negative matrices). A matrix $A \in M_n(\mathbb R)$ is positive (written $A > 0$) if all entries of $A$ are positive. Similarly, if all the entries of $A$ are non-negative, then $A$ is non-negative (written $A \geq 0$).

The Perron–Frobenius Theorem gives many results about positive matrices. To study Markov matrices, we focus on one of them:

Theorem 2 (Perron–Frobenius). If $A \in M_n(\mathbb R)$ is positive, then $\rho(A)$ is an eigenvalue of $A$ with algebraic multiplicity 1.

Corollary 3. Theorem 2 still applies if $A$ is non-negative and irreducible.[6]

[6] Irreducibility is related to the concept of the connectedness of a matrix. Specifically, matrices can be represented as entry digraphs (take the matrix, binarize it, and treat that as the adjacency matrix of a directed graph). If the graph is strongly connected (i.e., every vertex can reach every other vertex along a directed path), then the matrix is irreducible.
By definition, Markov matrices are non-negative. If a Markov matrix is irreducible, then the Perron–Frobenius theorem says that $\rho(M) = 1$ is a simple eigenvalue of $M$. Since the geometric multiplicity of an eigenvalue is bounded above by its algebraic multiplicity, there is (up to scaling) only a single eigenvector $x$ such that $Mx = x$. This vector $x$, scaled so that its entries sum to 1, is the unique stationary distribution of $M$.
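
As a sketch of this recipe (again reusing the hypothetical `random_markov_matrix` helper, which produces a strictly positive and hence irreducible matrix), we can verify irreducibility through the strong connectivity of the entry digraph and read off the unique stationary distribution from the $\lambda = 1$ eigenvector:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

M = random_markov_matrix(5)   # entries are positive, so the chain is irreducible

# Irreducibility <=> the entry digraph is strongly connected
n_components, _ = connected_components(M > 0, directed=True, connection='strong')
assert n_components == 1

# The unique stationary distribution: the lambda = 1 eigenvector, rescaled to sum to 1
eigvals, eigvecs = np.linalg.eig(M)
x = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
pi = x / x.sum()
assert np.allclose(M @ pi, pi) and np.all(pi >= 0)
```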

The definition of irreducibility also helps us understand the result proved in the previous section. If a matrix is reducible, its graph has multiple strongly connected components, and the closed ones (those with no transitions leaving them) are communities that, once you walk into, you cannot walk out of.[7] Such a community has an eigenvector with non-zero entries on its own vertices and $0$'s on every other vertex, and this eigenvector is a stationary distribution. Each closed strongly connected component has its own such eigenvector, and their convex hull is the space of all possible stationary distributions. This is exactly what we proved in the previous section, but thanks to this spectral graph theory perspective, we no longer need $M$ to be symmetric to prove it!

[7] You can check out any time you like, but you can never leave.
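
To see this concretely, here's a toy reducible chain I made up: two closed communities $\{0, 1\}$ and $\{2, 3\}$, plus a transient state $4$ that feeds into both. Each closed community carries its own stationary distribution, and any convex combination of the two is stationary as well:

```python
import numpy as np

# Columns give transition probabilities out of each state; states 0-1 and 2-3
# form closed communities, and state 4 leaks into both but is never re-entered.
M = np.array([[0.7, 0.3, 0.0, 0.0, 0.2],
              [0.3, 0.7, 0.0, 0.0, 0.2],
              [0.0, 0.0, 0.5, 0.5, 0.3],
              [0.0, 0.0, 0.5, 0.5, 0.3],
              [0.0, 0.0, 0.0, 0.0, 0.0]])

pi1 = np.array([0.5, 0.5, 0.0, 0.0, 0.0])   # stationary on community {0, 1}
pi2 = np.array([0.0, 0.0, 0.5, 0.5, 0.0])   # stationary on community {2, 3}
assert np.allclose(M @ pi1, pi1) and np.allclose(M @ pi2, pi2)

# ... and so is any convex combination of the two
t = 0.3
pi = t * pi1 + (1 - t) * pi2
assert np.allclose(M @ pi, pi)
```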

Diagonalization with the Jordan Canonical Form

The past results have assumed that $M$ is diagonalizable. What if this is not true?

I'm not sure how the math works out really, but I'll endeavor to prove it some day :)

Future Questions to Answer

These will slowly be addressed in future versions of this post.