Vivek Gopalakrishnan

The Conjugate Gradient Method

2022-07-06 | Geometric underpinnings, eigenvalues, and the convergence rate

The goals of this post are:[1]

  1. To derive the conjugate gradient method,

  2. To implement and benchmark various versions with Julia, and

  3. To explain the connection between eigenvalues and the convergence rate.

The code in this post can be viewed in a Jupyter Notebook here.

[1] Notes from the Medical Vision Group's summer reading of Numerical Optimization by J. Nocedal and S. J. Wright, 1999.

Setup

The goal of the conjugate gradient algorithm is to iteratively solve linear systems of the form $Ax = b$, where we assume that $A \in M_n(\mathbb{R})$ is symmetric positive definite.[2] It turns out that solving such a linear system is equivalent to minimizing the quadratic form

$$f(x) = \tfrac{1}{2} x^T A x - b^T x + c \,.$$

To see this, look at its gradient:

$$\nabla f(x) = Ax - b \,.$$

The gradient equals zero exactly at the $x = x^*$ that zeroes the residual of the linear system, $r(x^*) = Ax^* - b$.[3] Additionally, the Hessian of $f$ is $A$, which is positive definite, so $f$ has exactly one optimum, which is a minimum!

[2] This may seem to be a strong assumption, but we will soon show how to bypass this for any general rectangular matrix.
[3] We will write that the $k$-th iterate $x_k$ has residual $r_k = \nabla f(x_k) = Ax_k - b$.
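
As a quick sanity check (a sketch with hypothetical values, not from the original post), we can verify numerically that the gradient vanishes at $x^* = A^{-1}b$ and that $x^*$ minimizes $f$:

using LinearAlgebra

# Verify that the minimizer of the quadratic form solves the linear system.
A = [3.0 1.0; 1.0 2.0]               # a small symmetric positive definite matrix
b = [1.0, -1.0]
f(x) = 0.5 * x' * A * x - b' * x     # the quadratic form (taking c = 0)
∇f(x) = A * x - b
x_star = A \ b
norm(∇f(x_star)) < 1e-12             # true: the gradient vanishes at x*
f(x_star) < f(x_star + [0.1, -0.2])  # true: a nearby point has a larger value
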
To minimize the quadratic form, we could use steepest descent, which is gradient descent where you take the step along the negative gradient that most decreases the loss. However, for certain linear systems, gradient descent converges very slowly (slowly here means that the number of steps needed for convergence is $\gg n$, the dimension of $A$). Instead, conjugate gradient gives an iterative optimization that is guaranteed to converge in at most $n$ steps...[4]

[4] ...in exact arithmetic – floating point accuracy means it can take a little longer, but it's still faster than gradient descent.

Understanding conjugacy

In essence, the conjugate gradient method is simply a change of basis. However, the basis we seek is very special: specifically, we are looking for a set of basis vectors that are conjugate with respect to $A$. Two vectors $u$ and $v$ are conjugate if they satisfy a special kind of orthogonality:

$$u^T A v = v^T A u = 0 \,.$$

This means that after $u$ (or $v$) is transformed by $A$, it is orthogonal to $v$ (or $u$). This basis is very useful for finding the solution to a linear system, as we'll show below. First, a quick lemma:

Linear independence of conjugate vectors

Lemma 1. If nonzero vectors $u$ and $v$ are conjugate with respect to $A$, then they are also linearly independent.
Proof: Suppose, by way of contradiction, that $u$ and $v$ are not linearly independent. Then there exists some non-zero constant $k$ such that $u = kv$. This implies $0 = u^T A v = k v^T A v \Longrightarrow v^T A v = 0$. However, this is a contradiction because $A$ positive definite and $v \neq 0$ mean $v^T A v > 0$.

By induction, we can also show that any set of mutually conjugate (nonzero) vectors is linearly independent.

Conjugate directions method

Suppose that we have a basis $\{p_1, \dots, p_n\}$ that is conjugate with respect to $A$. Since these vectors are linearly independent, we can express $x^*$ as

$$x^* = \alpha_1 p_1 + \cdots + \alpha_n p_n \Longrightarrow Ax^* = b = \alpha_1 A p_1 + \cdots + \alpha_n A p_n \,.$$

Premultiplying by $p_k^T$, we see that

$$p_k^T b = \alpha_k p_k^T A p_k \,,$$

since the other terms cancel out by conjugacy! Therefore, we have that the coefficients are

$$\alpha_k = \frac{p_k^T b}{p_k^T A p_k} \,,$$

which are all quantities we know how to compute. That is, changing our basis to a conjugate one makes it very easy to solve a linear system.

This simple result tells us two important facts:

  1. If we have a procedure that produces a conjugate basis vector at each step, we can solve a linear system in at most nn steps.

  2. If we have a set of conjugate basis vectors for AA, it is trivial to solve a linear system. The brilliance of the conjugate gradient method is in how we find these vectors.
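
As a concrete illustration (a sketch, not part of the original post), recall that the eigenvectors of a symmetric matrix are orthogonal, so they form one readily available conjugate basis: for eigenvectors $p_i, p_j$ with $i \neq j$, we have $p_i^T A p_j = \lambda_j p_i^T p_j = 0$. The hypothetical helper below uses them to reconstruct $x^*$ one coefficient at a time:

using LinearAlgebra

# Solve Ax = b by expanding x* in a conjugate basis (here, the eigenvectors of A).
function conjugate_directions_solve(A::Matrix{Float64}, b::Vector{Float64})
    P = eigvecs(Symmetric(A))  # columns form a conjugate basis for A
    x = zeros(length(b))
    for k in 1:length(b)
        p = P[:, k]
        x += (p' * b) / (p' * A * p) * p  # αₖ = pₖᵀb / pₖᵀApₖ
    end
    return x
end

A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
conjugate_directions_solve(A, b) ≈ A \ b  # true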

Visualizing conjugate directions and condition number

Before discussing the generating procedure, it's useful to visualize conjugacy and our loss landscape.

Since we are working with positive definite matrices, it's useful to have a function to randomly generate them. The procedure we use leverages the fact that $A^T A$ is guaranteed to be a positive semidefinite matrix (use the definition of positive semidefiniteness to prove this).[5] Therefore, we can almost always generate a positive definite matrix by sampling a random square matrix and premultiplying it by its transpose.

function random_psd(n::Int64=2)
    A = randn(n, n)  # Random square matrix; A'A is almost surely positive definite
    return A' * A
end

[5] Note that $A^T A$ fails to be positive definite if and only if $\mathrm{rank}(A) < n$, and for a randomly sampled $A$ the probability of this happening is zero.
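
For example, we might sanity check the output (a usage sketch):

using LinearAlgebra

A = random_psd(3)
all(eigvals(Symmetric(A)) .> 0)  # true with probability 1
isposdef(A)                      # true
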
Next, we plot the level sets of the quadratic form defined by our positive definite $A$. That is, we visualize multiple elliptical curves along which $x^T A x = c$ for some constant $c$. These let us visualize the loss landscape we are trying to optimize.
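
One way to produce such a plot (a sketch; the plotting package is an assumption, and the post's actual figures come from the linked notebook):

using Plots  # assumed plotting package

# Level sets of the quadratic form xᵀAx for a randomly sampled positive definite A.
A = random_psd(2)
quad(x, y) = [x, y]' * A * [x, y]
xs = ys = range(-2, 2; length=200)
contour(xs, ys, quad; levels=10, aspect_ratio=:equal)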

The final concept we will discuss with these curves is the condition number. For a symmetric positive definite matrix, the condition number is the ratio between the largest and smallest eigenvalues. The smaller this number (i.e., the closer it is to 1), the closer the ellipses are to circles (left). This corresponds to a system that is more amenable to gradient descent: you can pick any direction to descend and make good progress. For a system with a large condition number (right), some directions are much more fruitful than others. This means gradient descent can take a very long time if you choose a poor starting point for the optimization.
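
As a quick numerical check of this characterization (a sketch, with a made-up matrix):

using LinearAlgebra

A = [2.0 0.0; 0.0 10.0]             # eigenvalues 2 and 10
λ = eigvals(Symmetric(A))
maximum(λ) / minimum(λ) ≈ cond(A)   # true: for SPD matrices, κ is the eigenvalue ratio (here 5)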

[Figure: level sets of a well-conditioned quadratic form (left, nearly circular) and a poorly conditioned one (right, elongated ellipses)]

Deriving the conjugate gradient method

Here, we derive conjugate gradient as an iterative method. Note that the first search direction is $p_0 = -r_0$, the negative gradient at the starting point.

Finding the optimal step size $\alpha_k$

Assume we start at some point $x_0$. For a set of conjugate directions $\{p_0, \dots, p_k\}$, we define the update as

$$x_{k+1} = x_k + \alpha_k p_k \,, \quad\quad \text{(Eq. 1)}$$

where $\alpha_k$ is the step length that minimizes $f$ along $p_k$. To find $\alpha_k$, we define $g(\alpha_k) = f(x_k + \alpha_k p_k)$, so that when

$$\frac{\partial g}{\partial \alpha_k} = (x_k + \alpha_k p_k)^T A p_k - b^T p_k = x_k^T A p_k + \alpha_k p_k^T A p_k - b^T p_k = 0 \,,$$

we have that

$$\alpha_k = \frac{p_k^T (b - A x_k)}{p_k^T A p_k} = -\frac{p_k^T r_k}{p_k^T A p_k} \,.$$
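
We can sanity check this formula numerically (a sketch with hypothetical values): the step it prescribes should beat any nearby step length along the same direction.

using LinearAlgebra

A = [3.0 1.0; 1.0 2.0]
b = [1.0, -1.0]
f(x) = 0.5 * x' * A * x - b' * x
x0 = [2.0, 2.0]
r = A * x0 - b                       # r₀
p = -r                               # p₀, the steepest-descent direction
α = -(p' * r) / (p' * A * p)
f(x0 + α * p) < min(f(x0 + 0.9α * p), f(x0 + 1.1α * p))  # true: α is optimal along p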

Finding the next search direction $p_k$

We define the new search direction as $p_k = -r_k + \beta_k p_{k-1}$. Premultiplying by $p_{k-1}^T A$ yields

$$p_{k-1}^T A p_k = -p_{k-1}^T A r_k + \beta_k p_{k-1}^T A p_{k-1} \,,$$

where we require the LHS to be zero so that $p_k$ is conjugate to $p_{k-1}$. Solving for $\beta_k$ yields

$$\beta_k = \frac{p_{k-1}^T A r_k}{p_{k-1}^T A p_{k-1}} \,.$$
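
Again, a quick numerical check (a sketch continuing the made-up example above): the direction built with this $\beta_k$ is indeed conjugate to the previous one.

using LinearAlgebra

A = [3.0 1.0; 1.0 2.0]
b = [1.0, -1.0]
x = [2.0, 2.0]
r = A * x - b
p_prev = -r                                       # p₀
α = -(p_prev' * r) / (p_prev' * A * p_prev)
x += α * p_prev                                   # x₁
r = A * x - b                                     # r₁
β = (p_prev' * A * r) / (p_prev' * A * p_prev)
p = -r + β * p_prev                               # p₁
abs(p_prev' * A * p) < 1e-10                      # true: p₀ᵀAp₁ ≈ 0 (conjugacy)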

With this, we can implement the most basic version of conjugate gradient.

using LinearAlgebra  # provides norm

function slow_conjgrad(
    A::Matrix{Float64},  # Constraint matrix
    b::Vector{Float64},  # Target
    x::Vector{Float64};  # Initial guess
    tol::Float64=10e-6   # Tolerance for early stop criterion
)
    k = 1
    r = A * x - b
    p = -r
    while norm(r) > tol
        Ap = A * p  # Avoid recomputing a matrix-vector product
        α = -(r' * p) / (p' * Ap)
        x = x + α * p
        r = A * x - b
        β = (r' * Ap) / (p' * Ap)
        p = -r + β * p
        k += 1
    end
    return x, k
end
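
For instance (a usage sketch), on a small symmetric positive definite system this reproduces the direct solution:

A = [4.0 1.0; 1.0 3.0]
b = [1.0, 2.0]
x, k = slow_conjgrad(A, b, zeros(2))
x ≈ A \ b  # true
k          # 3: two iterations plus the counter's initial value of 1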

Optimizing the conjugate gradient method

We can exploit properties of our vectors to make the above algorithm faster.

First, we characterize an interesting property of the residuals, $r_k$. Premultiplying (Eq. 1) by $A$ and subtracting $b$ from both sides yields

$$r_{k+1} = r_k + \alpha_k A p_k \,. \quad\quad \text{(Eq. 2)}$$

If we look at the first iterate, we see

$$\begin{aligned} r_1^T p_0 &= r_0^T p_0 + \alpha_0 p_0^T A p_0 \\ &= r_0^T p_0 + \left(-\frac{p_0^T r_0}{p_0^T A p_0}\right) p_0^T A p_0 \\ &= 0 \,, \end{aligned}$$

so, by induction, we can show that $r_k^T p_i = 0$ for all $i = 0, \dots, k-1$. That is, the residual is orthogonal to all previous search directions!

Simplifying $\alpha_k$

Premultiply the search direction update $p_k = -r_k + \beta_k p_{k-1}$ by $-r_k^T$:

$$-p_k^T r_k = (-r_k)^T(-r_k) + \beta_k (-r_k)^T p_{k-1} = r_k^T r_k \,,$$

since the residual $r_k$ and search direction $p_{k-1}$ are orthogonal. Therefore, we can simplify the calculation of $\alpha_k$:

$$\alpha_k = -\frac{p_k^T r_k}{p_k^T A p_k} = \frac{r_k^T r_k}{p_k^T A p_k} \,.$$

Simplifying $\beta_k$

We get the simplification in two steps:

  1. $\alpha_k A p_k = r_{k+1} - r_k \Longrightarrow \alpha_k r_{k+1}^T A p_k = r_{k+1}^T r_{k+1} - r_{k+1}^T r_k = r_{k+1}^T r_{k+1}$, since the residuals are mutually orthogonal.

  2. $\alpha_k p_k^T A p_k = \left(\frac{r_k^T r_k}{p_k^T A p_k}\right) p_k^T A p_k = r_k^T r_k \,.$

Then,

$$\beta_{k+1} = \frac{p_k^T A r_{k+1}}{p_k^T A p_k} = \frac{\alpha_k p_k^T A r_{k+1}}{\alpha_k p_k^T A p_k} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} \,.$$

This yields the most widely used form of the conjugate gradient method. We also store a few products at each iteration to optimize the implementation.

function fast_conjgrad(
    A::Matrix{Float64},  # Constraint matrix
    b::Vector{Float64},  # Target
    x::Vector{Float64};  # Initial guess
    tol::Float64=10e-6   # Tolerance for early stop criterion
)
    k = 1
    r = b - A * x
    p = r
    rsold = r' * r  # Avoid recomputing a vector-vector product
    while sqrt(rsold) > tol
        Ap = A * p  # Avoid recomputing a matrix-vector product
        α = rsold / (p' * Ap)
        x += α * p
        r -= α * Ap
        rsnew = r' * r
        if sqrt(rsnew) < tol  # Add an early-stop condition
            break
        end
        β = rsnew / rsold
        p = r + β * p
        rsold = rsnew
        k += 1
    end
    return x, k
end

Benchmarking versions of the conjugate gradient method

Finally, we can benchmark these two versions of the conjugate gradient method. Additionally, we will compare against a barebones implementation of gradient descent with line search:

function grad_descent(
    A::Matrix{Float64},  # Constraint matrix
    b::Vector{Float64},  # Target
    x::Vector{Float64};  # Initial guess
    tol::Float64=10e-6   # Tolerance for early stop criterion
)
    k = 1
    r = A * x - b
    rsquared = r' * r  # Avoid recomputing a vector-vector product
    while sqrt(rsquared) > tol
        α = -(rsquared) / (r' * A * r)
        x = x + α * r
        r = A * x - b
        rsquared = r' * r
        k += 1
    end
    return x, k
end

To benchmark these algorithms, we will solve a linear system where $A \in M_n$ is the Hilbert matrix,[6] the target is $b = (1, \dots, 1)^T$, and the initial guess is $x = (0, \dots, 0)^T$. We compare the number of iterations required by the slow conjugate gradient, fast conjugate gradient, and standard gradient descent methods ($k_1$, $k_2$, and $k_3$, respectively), as well as the memory requirements, for matrix dimensions $n \in \{2, 4, 6\}$.

[6] That is, $(A)_{i,j} = 1/(i + j - 1)$.
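
The benchmark can be driven with something like the following sketch (the exact script lives in the linked notebook; hilbert is a hypothetical helper implementing footnote [6], and @btime comes from BenchmarkTools.jl):

using BenchmarkTools, LinearAlgebra

hilbert(n) = [1.0 / (i + j - 1) for i in 1:n, j in 1:n]  # footnote [6]

for n in (2, 4, 6)
    A, b, x0 = hilbert(n), ones(n), zeros(n)
    @show n
    @show cond(A)                      # the condition number κ(A)
    @btime slow_conjgrad($A, $b, $x0)
    @btime fast_conjgrad($A, $b, $x0)
    @btime grad_descent($A, $b, $x0)
    k₁ = slow_conjgrad(A, b, x0)[2]
    k₂ = fast_conjgrad(A, b, x0)[2]
    k₃ = grad_descent(A, b, x0)[2]
    @show (k₁, k₂, k₃)
end
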
n = 2
κ(A(n)) = 19.28
  595.201 ns (22 allocations: 1.73 KiB)
  446.126 ns (17 allocations: 1.34 KiB)
  5.812 μs (200 allocations: 15.64 KiB)
(k₁, k₂, k₃) = (3, 2, 40)
n = 4
κ(A(n)) = 15513.74
  1.071 μs (38 allocations: 3.66 KiB)
  814.062 ns (31 allocations: 3.00 KiB)
  6.203 ms (201280 allocations: 18.43 MiB)
(k₁, k₂, k₃) = (5, 4, 40256)
n = 6
κ(A(n)) = 1.495105864e7
  2.583 μs (70 allocations: 7.91 KiB)
  1.842 μs (59 allocations: 6.70 KiB)
  5.621 s (136467940 allocations: 14.23 GiB)
(k₁, k₂, k₃) = (9, 8, 27293588)

The fast conjugate gradient method requires fewer iterations to achieve convergence and allocates less memory. Additionally, the standard gradient descent method requires an absurd number of iterations for a poorly conditioned matrix!

To get a better comparison between the slow and fast versions of the conjugate gradient method, we will use larger matrices. However, anything larger would take gradient descent too long, so we only compare our two versions of conjugate gradient, with $n \in \{5, 8, 12, 20\}$.

n = 5
κ(A(n)) = 476607.25
  1.650 μs (54 allocations: 5.22 KiB)
  1.225 μs (45 allocations: 4.38 KiB)
(k₁, k₂) = (7, 6)
n = 8
κ(A(n)) = 1.525756434817e10
  7.417 μs (198 allocations: 25.19 KiB)
  4.238 μs (136 allocations: 17.44 KiB)
(k₁, k₂) = (25, 19)
n = 12
κ(A(n)) = 2.2872145734707476e16
  12.459 μs (262 allocations: 42.00 KiB)
  5.389 μs (143 allocations: 23.41 KiB)
(k₁, k₂) = (33, 20)
n = 20
κ(A(n)) = 4.377151386345373e16
  50.333 μs (638 allocations: 142.59 KiB)
  12.667 μs (234 allocations: 54.22 KiB)
(k₁, k₂) = (80, 33)

At these larger sizes of AA, the advantage of fast conjugate gradient is clearly appreciable!

The convergence rate and eigenvalue properties

TODO :)