
Principal Component Analysis

One of the most useful and famous techniques for dimensionality reduction is undoubtedly Principal Component Analysis, or in short PCA.

When confronted with a high-dimensional data set, one might want to reduce the dimensionality while preserving as much of the variation in the data as possible. This is precisely what PCA accomplishes: It constructs new variables (the name-giving 'principal components') as linear combinations of the original variables, with the objective of preserving as much of the variance as possible.

To make matters concrete, suppose that we have a set of random variables

$$
X = (X_1, \ldots, X_p).
$$

Let us denote the principal components as $Z = (Z_1, \ldots, Z_p)$, where each component is a linear combination of the elements in $X$. Hence, for any $k \in \{1, \ldots, p \}$, we have

$$
Z_k = a_k^\intercal X = a_{k1} X_1 + \ldots + a_{kp} X_p
$$

Thus, for each principal component $Z_k$, we have a corresponding vector of weights $a_k$ which governs its composition from the original variables. But how are these weights determined?
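Before answering that, a tiny sketch may help fix ideas. It is written in Python with NumPy on simulated data (the data set, the covariance matrix used to simulate it, and the weight vector are all made up purely for illustration); it simply forms the score $Z_k = a_k^\intercal x_i$ for every observation $x_i$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n observations of p = 3 correlated variables (made up for illustration).
n, p = 200, 3
X = rng.multivariate_normal(mean=np.zeros(p),
                            cov=[[4, 2, 1], [2, 3, 1], [1, 1, 2]],
                            size=n)

# An arbitrary weight vector a_k, normalized so that a_k' a_k = 1.
a_k = np.array([0.6, 0.6, 0.5])
a_k = a_k / np.linalg.norm(a_k)

# One score per observation: Z_k = a_k' x_i, i.e. a linear combination of the p variables.
Z_k = X @ a_k
print(Z_k.shape)  # (200,)
```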

As already alluded to, the goal of PCA is to find principal components that account for as much variance in the data as possible. More precisely, each principal component captures as much of the variance not already accounted for by the previous components as possible, which amounts to requiring that the principal components be uncorrelated with each other.

So, to find the first PC, we wish to solve the maximization problem

$$
\max_{a_1} \operatorname{Var}(Z_1) = \operatorname{Var}(a_1^\intercal X) = a_1^\intercal \operatorname{Var}(X) a_1.
$$
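The chain of equalities above, $\operatorname{Var}(a_1^\intercal X) = a_1^\intercal \operatorname{Var}(X)\, a_1$, can be checked numerically. In the sketch below (simulated data again, with NumPy's `np.cov` standing in for $\operatorname{Var}(X)$), the variance of the projected observations matches the quadratic form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.multivariate_normal(mean=np.zeros(p),
                            cov=[[4, 2, 1], [2, 3, 1], [1, 1, 2]],
                            size=n)

# An arbitrary unit-length weight vector.
a1 = np.array([1.0, -1.0, 2.0])
a1 = a1 / np.linalg.norm(a1)

S = np.cov(X, rowvar=False)              # sample estimate of Var(X)
var_projection = np.var(X @ a1, ddof=1)  # Var(a1' X), estimated from the projected data
quadratic_form = a1 @ S @ a1             # a1' Var(X) a1, with Var(X) replaced by S

print(np.isclose(var_projection, quadratic_form))  # True
```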

Given a data set $\mathbf{X}$ of observations $x_i = (x_{i1}, \ldots, x_{ip})$, we can approximate the covariance matrix $\operatorname{Var}(X)$ by the sample covariance matrix

$$
\hat \Sigma = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x) (x_i - \bar x)^\intercal,
$$

in which $\bar x$ is the vector of sample means. However, observe that we could make the objective arbitrarily large simply by choosing very large values for the elements of $a_1$. To rule this out and make the problem well-posed, we additionally require that $a_1^\intercal a_1 = 1$. Hence, we now solve

$$
\max_{a_1} a_1^\intercal \hat \Sigma a_1 \text{ subject to } a_1^\intercal a_1 = 1.
$$
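To make this concrete, the following sketch builds $\hat \Sigma$ directly from the definition above, checks it against NumPy's `np.cov`, and evaluates the constrained objective $a_1^\intercal \hat \Sigma a_1$ at a few random unit vectors. Everything here (the data, the number of candidate vectors) is chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.multivariate_normal(mean=np.zeros(p),
                            cov=[[4, 2, 1], [2, 3, 1], [1, 1, 2]],
                            size=n)

# Sample covariance matrix from the definition: average of centered outer products.
x_bar = X.mean(axis=0)
X_centered = X - x_bar
Sigma_hat = X_centered.T @ X_centered / (n - 1)
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))  # True

# The objective a' Sigma_hat a, evaluated at a few random unit vectors:
# different directions capture different amounts of variance.
for _ in range(3):
    a = rng.standard_normal(p)
    a = a / np.linalg.norm(a)  # enforce the constraint a' a = 1
    print(a @ Sigma_hat @ a)
```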

Introducing a Lagrange multiplier $\lambda$, the corresponding Lagrangian is given by

$$
\Lambda = a_1^\intercal \hat \Sigma a_1 - \lambda (a_1^\intercal a_1 - 1)
$$

Its gradient with respect to $a_1$ is given by

$$
\nabla_{a_1} \Lambda = 2 \hat \Sigma a_1 - 2 \lambda a_1.
$$
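This gradient expression is easy to verify with a finite-difference check. The sketch below uses a random symmetric matrix as a stand-in for $\hat \Sigma$ and an arbitrary value for $\lambda$; both are assumptions made only to have something concrete to differentiate.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
Sigma_hat = A @ A.T   # random symmetric matrix standing in for the sample covariance
lam = 0.7             # arbitrary value for the Lagrange multiplier
a1 = rng.standard_normal(p)

def lagrangian(a):
    return a @ Sigma_hat @ a - lam * (a @ a - 1.0)

# Analytic gradient: 2 Sigma_hat a - 2 lambda a.
grad_analytic = 2 * Sigma_hat @ a1 - 2 * lam * a1

# Central finite differences, one coordinate at a time.
eps = 1e-6
grad_numeric = np.zeros(p)
for j in range(p):
    e = np.zeros(p)
    e[j] = eps
    grad_numeric[j] = (lagrangian(a1 + e) - lagrangian(a1 - e)) / (2 * eps)

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```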

Setting the gradient equal to zero yields the following system of equations:

$$
\hat \Sigma a_1 = \lambda a_1.
$$

As we can see, $a_1$ is an eigenvector of $\hat \Sigma$ with corresponding eigenvalue $\lambda$. Hence, our objective of finding the first principal component has been reduced to finding an eigenvector of $\hat \Sigma$. However, not just any eigenvector suffices, as our objective function evaluates to

$$
a_1^\intercal \hat \Sigma a_1 = a_1^\intercal \lambda a_1 = \lambda,
$$

since $a_1^\intercal a_1 = 1$. Hence, $a_1$ must be the eigenvector with the largest eigenvalue. To calculate the second principal component, we solve essentially the same optimization problem, with the additional constraint that the new weight vector must be orthogonal to the first one, i.e. $a_2^\intercal a_1 = 0$. This yields

$$
\max_{a_2} a_2^\intercal \hat \Sigma a_2 \text{ subject to } a_2^\intercal a_2 = 1 \text{ and } a_2^\intercal a_1 = 0.
$$
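Before solving this second problem, the conclusion for the first component is easy to verify numerically: take the eigenvector of $\hat \Sigma$ with the largest eigenvalue and check that the variance of the projected data equals that eigenvalue. The sketch below again uses simulated data and NumPy's `eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.multivariate_normal(mean=np.zeros(p),
                            cov=[[4, 2, 1], [2, 3, 1], [1, 1, 2]],
                            size=n)

Sigma_hat = np.cov(X, rowvar=False)

eigenvalues, eigenvectors = np.linalg.eigh(Sigma_hat)  # eigenvalues in ascending order
a1 = eigenvectors[:, -1]                               # eigenvector with the largest eigenvalue

Z1 = X @ a1  # scores on the first principal component
print(np.isclose(np.var(Z1, ddof=1), eigenvalues[-1]))   # variance equals the largest eigenvalue
print(np.isclose(a1 @ Sigma_hat @ a1, eigenvalues[-1]))  # the objective attains that eigenvalue
```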

The Lagrangian of this second problem is

$$
\Lambda = a_2^\intercal \hat \Sigma a_2 - \lambda_1 (a_2^\intercal a_2 - 1) - \lambda_2 (a_2^\intercal a_1).
$$

Following the same steps as above, we calculate the gradient and obtain

$$
\nabla_{a_2} \Lambda = 2 \hat \Sigma a_2 - 2 \lambda_1 a_2 - \lambda_2 a_1.
$$

Setting it equal to zero gives

$$
2\hat \Sigma a_2 - 2 \lambda_1 a_2 = \lambda_2 a_1.
$$

Multiplying the above equation from the left by $a_1^\intercal$ gives us

$$
2 a_1^\intercal \hat \Sigma a_2 - 2 \lambda_1 a_1^\intercal a_2 = \lambda_2 a_1^\intercal a_1.
$$

Now observe the following: Since $a_2^\intercal a_1 = 0$, the second term on the left vanishes. Similarly, the right-hand side becomes $\lambda_2$, as $a_1^\intercal a_1 = 1$. Finally, because $\hat \Sigma$ is symmetric and $a_1$ is one of its eigenvectors, we have $a_1^\intercal \hat \Sigma a_2 = (\hat \Sigma a_1)^\intercal a_2 = \lambda a_1^\intercal a_2 = 0$, where $\lambda$ is the eigenvalue belonging to $a_1$.

From this, we can conclude that

$$
\lambda_2 = 0,
$$

and thus we retrieve the same optimality condition for $a_2$:

$$
\hat \Sigma a_2 = \lambda_1 a_2.
$$
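This conclusion can also be confirmed numerically: for the eigenvectors of the symmetric matrix $\hat \Sigma$, the cross term $a_1^\intercal \hat \Sigma a_2$ really does vanish, so $\lambda_2 = 0$ and the optimality condition above holds. In the sketch, a random symmetric matrix once more stands in for $\hat \Sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
Sigma_hat = A @ A.T  # random symmetric stand-in for the sample covariance matrix

eigenvalues, eigenvectors = np.linalg.eigh(Sigma_hat)  # eigenvalues in ascending order
a1 = eigenvectors[:, -1]   # the first weight vector: eigenvector with the largest eigenvalue
a2 = eigenvectors[:, 0]    # any other eigenvector of Sigma_hat

print(np.isclose(a1 @ a2, 0.0))              # the orthogonality constraint holds
print(np.isclose(a1 @ Sigma_hat @ a2, 0.0))  # the cross term vanishes, hence lambda_2 = 0
print(np.allclose(Sigma_hat @ a2, eigenvalues[0] * a2))  # a2 satisfies the optimality condition
```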

Since the eigenvector with the largest eigenvalue has already been used up, we choose $a_2$ to be the eigenvector with the second-largest eigenvalue. The above argument generalizes to all remaining weight vectors: they are the eigenvectors of the sample covariance matrix, ordered by decreasing eigenvalue; a compact sketch of the whole procedure follows below.

I hope this post helped to illustrate some of the mathematical underpinnings of PCA. In a follow-up, I will discuss strategies for picking the number of PCs to retain after running PCA and how to carry it out in the statistical programming language R.
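Finally, as promised, here is a compact sketch of the whole procedure on simulated data: estimate the sample covariance matrix, take its eigendecomposition, order the eigenvectors by decreasing eigenvalue, and project the centered data onto them. This is only an illustrative NumPy implementation of the derivation above, not a substitute for an established library routine.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 3
X = rng.multivariate_normal(mean=np.zeros(p),
                            cov=[[4, 2, 1], [2, 3, 1], [1, 1, 2]],
                            size=n)

# 1. Sample covariance matrix of the data.
Sigma_hat = np.cov(X, rowvar=False)

# 2. Eigendecomposition, with eigenpairs sorted by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(Sigma_hat)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
A = eigenvectors[:, order]  # columns are the weight vectors a_1, ..., a_p

# 3. Principal component scores: project the centered data onto the weight vectors.
Z = (X - X.mean(axis=0)) @ A

# Checks: the scores are uncorrelated with variances equal to the eigenvalues,
# and the weight vectors are orthonormal.
print(np.allclose(np.cov(Z, rowvar=False), np.diag(eigenvalues)))  # True
print(np.allclose(Z.var(axis=0, ddof=1), eigenvalues))             # True
print(np.allclose(A.T @ A, np.eye(p)))                             # True
```

In practice one would usually rely on a library routine (or an SVD of the centered data, which is numerically more stable), but the eigendecomposition mirrors the derivation in this post most directly.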