Introduction to PCA

This is part 3 of my CS M146 Notes series.

Sometimes we want to reduce the dimensionality of our data, maybe because there are too many features, or maybe because we want to reduce them down to two features so that we can visualize them. We want to map the original dataset 𝑋 from a 𝑑-dimensional feature space to a 𝑘-dimensional feature space.

usually 𝑘 < 𝑑

We can think of the problem as finding a linear transformation matrix 𝑃 that transforms 𝑋 into 𝑍.
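As a sketch of the setup (assuming, as the covariance formula used later suggests, that 𝑋 stores one data point per row):

$$ Z = X P, \qquad X \in \mathbb{R}^{N \times d}, \quad P \in \mathbb{R}^{d \times k}, \quad Z \in \mathbb{R}^{N \times k} $$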

How do we know the transformation 𝑃 we picked is good? We will pick 𝑃 such that the variance of 𝑍 is maximized. Why? Consider the case where 𝑘 is 1.

Intuitively, it seems better to project the data points onto a line along which they stay spread out. If we instead choose a line along which the projected points end up close together, we will have a hard time distinguishing the data points. In other words, the variance is small. Therefore, we want the variance to be as large as possible.

We will start with the case where 𝑘 is 1 (i.e., reducing the dimensionality to 1), and then think about the general case.

This is the same as saying that each data point is projected onto a single direction.
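In that case 𝑃 has a single column; call it 𝑤 (the notation used for the rest of this derivation). Each projected value is then:

$$ z_i = w^\top x_i, \qquad \text{i.e.} \quad z = X w $$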

There is one thing we need to do before running the PCA algorithm, which is to center the data points 𝑋. In other words, subtract the mean from each data point.
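In symbols, writing 𝑥̄ for the mean of the data points:

$$ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad x_i \leftarrow x_i - \bar{x} $$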

Why do we do this? We will see that it makes the math easier later. From now on, whenever we see 𝑋, we will consider it already centered.

So again, our objective is to maximize the variance of projected data points 𝑍.

Recall that the formula for the variance is the following:
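Here 𝑧̄ denotes the mean of the projected values 𝑧ᵢ:

$$ \mathrm{Var}(z) = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^2 $$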

Note that since 𝑋 is already centered, 𝑍 is also centered. In other words, the mean of 𝑍 is zero. Therefore, the variance of 𝑍 is simply the average of the squared values.
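That is, with 𝑧̄ = 0:

$$ \mathrm{Var}(z) = \frac{1}{N} \sum_{i=1}^{N} z_i^2 $$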

We can substitute 𝑥 and 𝑤 back.
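Using 𝑧ᵢ = 𝑤ᵀ𝑥ᵢ:

$$ \mathrm{Var}(z) = \frac{1}{N} \sum_{i=1}^{N} (w^\top x_i)^2 $$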

We can vectorize the expression.
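Stacking the data points as the rows of 𝑋, as before:

$$ \mathrm{Var}(z) = \frac{1}{N} (X w)^\top (X w) = \frac{1}{N} \, w^\top X^\top X \, w $$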

Let 𝐶 be 𝑋ᵀ𝑋/N. 𝐶 is called the covariance matrix. Then our objective is the following:
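$$ \max_{w} \; w^\top C w $$

(Since 𝑋 is centered, 𝑤ᵀ𝐶𝑤 is exactly the variance we derived above.)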

But since there is no constraint on 𝑤, the objective can be made arbitrarily large just by scaling 𝑤, so we need a constraint on 𝑤. Recall that 𝑤 is just the direction vector of the projection line. Thus, we can set the length of 𝑤 to 1 without loss of generality.
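The constrained problem is then:

$$ \max_{w} \; w^\top C w \quad \text{subject to} \quad w^\top w = 1 $$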

We already know how to solve constrained optimization problems from the previous part: that's right, we use the Lagrangian.

Our objective is then the following:
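Writing λ for the Lagrange multiplier:

$$ \mathcal{L}(w, \lambda) = w^\top C w - \lambda \, (w^\top w - 1) $$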

Taking the derivative with respect to λ only gives us the constraint back. But taking it with respect to 𝑤 gives us something interesting.
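Setting the gradient with respect to 𝑤 to zero (using that 𝐶 is symmetric):

$$ \frac{\partial \mathcal{L}}{\partial w} = 2 C w - 2 \lambda w = 0 \quad \Longrightarrow \quad C w = \lambda w $$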

This is saying that 𝑤 is one of the eigenvectors of 𝐶. But which eigenvector? We can plug the result back into the objective.
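Using 𝐶𝑤 = λ𝑤 and 𝑤ᵀ𝑤 = 1:

$$ w^\top C w = w^\top (\lambda w) = \lambda \, w^\top w = \lambda $$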

It turns out the variance is just λ. Since we want the variance to be as large as possible, we pick the eigenvector with the largest eigenvalue.

When 𝑘 is 2, here’s our objective.
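Writing 𝑤₁ and 𝑤₂ for the two projection directions (the columns of 𝑃):

$$ \max_{w_1, w_2} \; w_1^\top C w_1 + w_2^\top C w_2 \quad \text{subject to} \quad w_1^\top w_1 = w_2^\top w_2 = 1 $$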

We simply add the variance of the second column of 𝑍 to the objective. But if we just solve this optimization problem, 𝑤₂ will be the same as 𝑤₁, and so will 𝑧₂ and 𝑧₁. This is bad: it is like reducing the dimensions to ℝ² but projecting the data points onto a line instead of a plane. Since we don't want any redundancy in our second direction, the second eigenvector must be perpendicular to the first one.

In other words, we pick the eigenvector corresponding to the second largest eigenvalue. In general, when we want to reduce the dimensions to 𝑘, 𝑃 is
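That is, stacking the top 𝑘 eigenvectors of 𝐶 as columns:

$$ P = \begin{bmatrix} w_1 & w_2 & \cdots & w_k \end{bmatrix} $$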

where the eigenvectors are sorted such that the corresponding eigenvalues are in descending order.
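To make the procedure concrete, here is a minimal NumPy sketch of the algorithm described above. It assumes 𝑋 is an N × d array with one data point per row; the function and variable names are just illustrative, not a reference implementation.

```python
import numpy as np

def pca(X, k):
    """Project N x d data X onto its top-k principal components."""
    # Center the data so that each feature has zero mean.
    X = X - X.mean(axis=0)

    # Covariance matrix C = X^T X / N (d x d).
    N = X.shape[0]
    C = X.T @ X / N

    # C is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(C)

    # Keep the k eigenvectors with the largest eigenvalues, in descending order.
    order = np.argsort(eigenvalues)[::-1][:k]
    P = eigenvectors[:, order]

    # Project the centered data: Z = X P.
    Z = X @ P
    return Z, P

# Example: reduce 5-dimensional data down to 2 dimensions.
X = np.random.randn(100, 5)
Z, P = pca(X, k=2)
print(Z.shape)  # (100, 2)
```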
