STA 141C - Big Data & High Performance Statistical Computing
Week 7-1: Power method
Disclaimer: My notes may contain errors; distribution outside this class is allowed only with the permission
of the Instructor.
Last time
• PCA and SVD
Today
• Power method
1 Review of singular value decomposition (SVD)
For a rectangular matrix A ∈ Rm×n, let p = min{m,n}; then we have the SVD
A = UΣV′,
where U = (u1, . . . , um) and V = (v1, . . . , vn) are orthogonal matrices and Σ ∈ Rm×n has σ1 ≥ σ2 ≥ · · · ≥ σp ≥ 0 on its diagonal and zeros elsewhere, written Σ = diag(σ1, . . . , σp). The σi are called the singular values, the ui the left singular vectors, and the vi the right singular vectors.
Since Σ is not a square matrix, one can also define the thin SVD, which for m > n factorizes A as
A = UnΣnV′ = ∑_{i=1}^n σi ui vi′,
where Un ∈ Rm×n, Un′Un = In, and Σn = diag(σ1, . . . , σn). If m < n, we instead keep only the first m columns of V, i.e., Vm ∈ Rn×m with Vm′Vm = Im, so that A = UΣmVm′.
The following properties are useful. Writing σ(A) = (σ1, . . . , σp)′, the rank of A is the number of nonzero singular values, denoted ‖σ(A)‖0; the Frobenius norm of A is ‖A‖F = (∑_{i=1}^p σi²)^{1/2} = ‖σ(A)‖2; and the spectral norm of A is ‖A‖2 = σ1 = ‖σ(A)‖∞. Using the fact that U and V are both orthogonal matrices,
A′A = V ΣU′UΣV′ = V Σ²V′,
AA′ = UΣV′V ΣU′ = UΣ²U′.
Last, the eigen-decomposition of a real symmetric matrix B is B = WΛW′, where Λ = diag(λ1, . . . , λn); when B is positive semi-definite (all λi ≥ 0), this eigen-decomposition is also the SVD of B.
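These identities are easy to check numerically. Below is a minimal numpy sketch (the test matrix is made up purely for illustration):

import numpy as np

# a small random rectangular test matrix (made-up example data)
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# thin SVD: U is 6x4, s holds (sigma_1, ..., sigma_4), Vt is 4x4
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# rank of A = number of nonzero singular values
print(np.linalg.matrix_rank(A), np.sum(s > 1e-12))

# Frobenius norm of A = Euclidean norm of the singular values
print(np.linalg.norm(A, 'fro'), np.linalg.norm(s))

# spectral norm of A = largest singular value
print(np.linalg.norm(A, 2), s[0])

# eigenvalues of A'A are the squared singular values
print(np.sort(np.linalg.eigvalsh(A.T @ A))[::-1])
print(s**2)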
2 Power method
To start, assume A ∈ Rn×n is a symmetric and p.s.d. matrix. The power method for obtaining its largest eigenvalue is:
1) Choose a non-zero initial guess q(0);
2) For k = 1, . . . , K, repeat
z(k) = Aq(k−1),
q(k) = z(k)/‖z(k)‖2;
3) Output: λ1 ← q(K)′Aq(K).
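A minimal numpy sketch of this iteration (the function name power_method, the test matrix, and the tolerance-based stopping rule, which anticipates the discussion in the next section, are my own additions for illustration):

import numpy as np

def power_method(A, max_iter=1000, tol=1e-10, seed=0):
    """Leading eigenvalue and eigenvector of a symmetric p.s.d. matrix A."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(A.shape[0])      # non-zero initial guess q^(0)
    q /= np.linalg.norm(q)
    for _ in range(max_iter):
        z = A @ q                            # z^(k) = A q^(k-1)
        q_new = z / np.linalg.norm(z)        # q^(k) = z^(k) / ||z^(k)||_2
        # stop when q^(k) is close to q^(k-1) up to sign
        if min(np.linalg.norm(q_new - q), np.linalg.norm(-q_new - q)) <= tol:
            q = q_new
            break
        q = q_new
    lam = q @ A @ q                          # output: lambda_1 = q^(K)' A q^(K)
    return lam, q

# usage: compare with numpy's eigendecomposition
B = np.random.default_rng(1).standard_normal((5, 5))
A = B @ B.T                                  # symmetric p.s.d. test matrix
lam, v = power_method(A)
print(lam, np.max(np.linalg.eigvalsh(A)))    # the two values agree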
3 Why does the power method work?
Let's understand how the power method works. Before that, we need to recall a few facts:
• The eigenvector vi attached to the i-th eigenvalue λi satisfies Avi = λivi;
• Given A, A = ∑_{i=1}^n λi vi vi′, where λ1 ≥ · · · ≥ λn ≥ 0 and 〈vi, vj〉 = 0 for i ≠ j;
• A^k = ∑_{i=1}^n λi^k vi vi′, why?
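To see the last fact, use the orthonormality of the eigenvectors (vi′vj = 0 for i ≠ j and vi′vi = 1):
A² = (∑_{i=1}^n λi vi vi′)(∑_{j=1}^n λj vj vj′) = ∑_{i=1}^n ∑_{j=1}^n λiλj vi (vi′vj) vj′ = ∑_{i=1}^n λi² vi vi′,
and repeating the same argument gives A^k = ∑_{i=1}^n λi^k vi vi′.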
By inspecting the algorithm, we have
q(k) = A^k q(0) / ‖A^k q(0)‖2.
Now given an initial guess q(0) of unit Euclidean norm, it is possible to express
q(0) = α1v1 + α2v2 + · · · + αnvn,
where α1, . . . , αn are scalars (assume α1 ≠ 0). By the relation Avi = λivi,
A^k q(0) = α1λ1^k ( v1 + ∑_{j=2}^n (αjλj^k)/(α1λ1^k) vj ).
For simplicity, let's denote y(k) = ∑_{j=2}^n (αjλj^k)/(α1λ1^k) vj; note that y(k) → 0 as k → ∞ as long as λ1 > λ2 ≥ · · · ≥ λn. Then
q(k) = A^k q(0) / ‖A^k q(0)‖2 = α1λ1^k (v1 + y(k)) / ‖α1λ1^k (v1 + y(k))‖2 → v1 (up to the sign of α1), as k → ∞.
In practice, k never goes to ∞; the algorithm stops at some K when min{‖q(K) − q(K−1)‖2, ‖−q(K) − q(K−1)‖2} ≤ ε for some small ε > 0.
The output q(K) is a close approximation of v1, the leading eigenvector. How do we obtain the leading eigenvalue λ1? (Hint: use Av1 = λ1v1.)
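Since the dominant term of y(k) is of order (λ2/λ1)^k, the error of q(k) also shrinks at roughly that geometric rate. A small numpy sketch checking this (the spectrum below is made up for illustration):

import numpy as np

# symmetric test matrix with known eigenvalues 3 > 2 > 1 > ...
rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))   # random orthogonal matrix
lams = np.array([3.0, 2.0, 1.0, 0.5, 0.2, 0.1])
A = Q @ np.diag(lams) @ Q.T
v1 = Q[:, 0]                                       # true leading eigenvector

q = rng.standard_normal(6)
q /= np.linalg.norm(q)
for k in range(1, 16):
    q = A @ q
    q /= np.linalg.norm(q)
    err = min(np.linalg.norm(q - v1), np.linalg.norm(q + v1))
    print(k, err, (lams[1] / lams[0]) ** k)        # error vs. (lambda_2/lambda_1)^k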
A few comments:
Figure 6.1: Convergence speed of the power method.
[Source: https://web.mit.edu/18.06/www/Spring17/Power-Method.pdf.]
• The power method works well if λ1 > λ2. It converges slowly if λ1/λ2 ≈ 1.
• The convergence speed of the power method is proportional to (λ2/λ1)^k, the k-th power of the ratio between λ2 and λ1.
• For a general matrix A ∈ Rn×p, we can apply the power method to A′A or AA′ instead. The output is then σ1², the square of the largest singular value of A, so σ1 is obtained by taking a square root (see the small example after this list).
• How to get λ2, · · ·, λn?
• Eigen-decomposition is implemented in LAPACK, see eigen() in R and np.linalg.eig in numpy.
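As a small illustration of the comment on general matrices above (a sketch with made-up data; the loop is the same iteration as in Section 2):

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 3))               # a general rectangular matrix
B = A.T @ A                                   # symmetric p.s.d.; its eigenvalues are sigma_i^2

# power iteration on A'A
q = rng.standard_normal(3)
q /= np.linalg.norm(q)
for _ in range(200):
    z = B @ q
    q = z / np.linalg.norm(z)
sigma1 = np.sqrt(q @ B @ q)                   # square root of the leading eigenvalue of A'A

print(sigma1, np.linalg.svd(A, compute_uv=False)[0])   # both equal the largest singular value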
The power method is the most basic algorithm for SVD. There are other methods such as
• Inverse power method for finding the eigenvalue of smallest absolute value (replace A with A−1 in the power method);
• QR algorithm for symmetric eigen-decomposition (takes 4n³/3 flops for eigenvalues and 8n³/3 for eigenvectors);
• “Golub-Kahan-Reinsch” algorithm (Section 8.6 of Golub and Van Loan); used in the svd function in R (4m²n + 8mn² + 9n³ flops for an m > n matrix);
• Jacobi methods (Section 8.5 of Golub and Van Loan) (suitable for parallel computing).
Concluding remarks on numerical linear algebra:
• Numerical linear algebra forms the building blocks of most computation we do. Most lines of our code
are numerical linear algebra.
• Be flop and memory aware! The form of a mathematical expression and the way the expression should
be evaluated in actual practice may be quite different.
• Be alert to problem structure and make educated choice of software/algorithm — the structure should
be exploited whenever solving a problem.
• Do not write your own matrix computation routines unless for good reason. Utilize BLAS and LAPACK
as much as possible!
• In contrast, for optimization, often we need to devise problem specific optimization routines, or even
“mix and match” them.
4 Ridge regression by SVD
In ridge regression, we minimize
‖y − Xβ‖2² + λ‖β‖2².
If we obtain the SVD of X, X = UΣV′, then the normal equations (X′X + λIp)β = X′y become
(Σ² + λIp)V′β = ΣU′y.
We get
β̂λ = ∑_{i=1}^r (σi ui′y)/(σi² + λ) vi,   r = rank(X).
It is clear that β̂λ → β̂OLS as λ → 0 and that ‖β̂λ‖2 is monotonically decreasing in λ (shrinking toward 0 as λ → ∞).
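A minimal numpy sketch of this formula (the function name ridge_svd and the simulated data are made up for illustration):

import numpy as np

def ridge_svd(X, y, lam):
    """Ridge estimator via the thin SVD of X: sum_i sigma_i (u_i'y) / (sigma_i^2 + lam) * v_i."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)   # thin SVD: X = U diag(s) Vt
    d = s / (s**2 + lam)                               # sigma_i / (sigma_i^2 + lambda)
    return Vt.T @ (d * (U.T @ y))                      # beta_hat_lambda

# usage: agrees with the direct solve of (X'X + lam*I) beta = X'y
rng = np.random.default_rng(3)
X = rng.standard_normal((50, 5))
y = rng.standard_normal(50)
lam = 0.5
beta_svd = ridge_svd(X, y, lam)
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
print(np.allclose(beta_svd, beta_direct))              # True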