PCA — Eigenvalue, Eigenvector, Principal Component Explained in Python
PCA is a very popular dimensionality reduction technique, and it keeps getting more intricate the deeper you go into the details. From the outside it looks simple: reduce 10 variables to 2 and the model gives better accuracy. You will also have heard people say that their PCA explains 80% of the variance. But has anyone ever asked you what an eigenvector and an eigenvalue actually are, and how they are related to principal components? And do you know how they relate to the covariance matrix?
Don’t worry, many of these concepts will become clear after going through this article. So let’s start with some dummy data —
For ease of understanding, our data has only 4 data points with 2 variables, X1 and X2. Notice that the correlation between the two variables is very high.
Let’s run PCA in Python on this data —
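A minimal sketch of the setup is below. The exact four data points from the original chart aren’t reproduced here, so the values are illustrative stand-ins (two highly correlated columns); the printed numbers will therefore differ slightly from the figures quoted later in the article.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative dummy data: 4 points, 2 highly correlated variables X1 and X2
# (stand-in values, not necessarily the exact points used in the charts)
x_all = np.array([
    [2.0, 4.0],
    [4.0, 8.0],
    [5.0, 9.0],
    [9.0, 16.0],
])

pca = PCA(n_components=2)        # keep both components; nothing is discarded yet
x_pca = pca.fit_transform(x_all)
```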
After fitting the PCA, there are 3 outputs that we are particularly interested in —
- pca.explained_variance_
- pca.components_
- pca.transform(x_all), stored as x_pca
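If you are following along with the sketch above, these three outputs can be printed directly (the specific values quoted later in the article, e.g. [0.489925, 0.871765], come from the author’s original data points):

```python
print(pca.explained_variance_)   # variance along PC1 and PC2
print(pca.components_)           # two direction vectors, one per row
print(x_pca)                     # the original points expressed as (PC1, PC2)
```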
Follow the below chart carefully —
Now you may be confused by the above chart and wondering what it is. It is nothing but a graphical representation of the 3 outputs that we got from the Python code. Let’s understand it step by step —
- To begin with, forget everything else on the chart and just assume that the 4 blue dots are the original dummy data.
- Next, on the same plot, we introduce orange dots. These are a scatter plot of the output x_pca.
- Thirdly, we have the diagonal blue line, but how? Recall the basics: to plot a line passing through the origin, we only need its slope. If you look at the output of pca.components_, we get two rows: [0.489925, 0.871765] and [0.871765, -0.48993]. Each row can be thought of as a vector from the origin, and interestingly both are unit vectors, i.e. their magnitude is 1.
Proof: sqrt(a² + b²) = sqrt(0.489925² + 0.871765²) = 1, and similarly for the other row. Remember this, because it is not a coincidence.
But where is the slope?
m1 = y/x = 0.871765/0.489925 = 1.779384
m2 = -0.48993/0.871765 = -0.56199
- At last, we are left with pca.explained_variance_. In x_pca we have two output variables, PC1 and PC2. Take their variances and you will get exactly these values. You can check this yourself with =VAR.S() in Excel, or with the quick check below.
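Here is a quick numeric check of the points in this list, continuing the Python sketch from earlier (the exact numbers depend on the illustrative data used there):

```python
# Each row of components_ is a unit vector: sqrt(a^2 + b^2) == 1
print(np.linalg.norm(pca.components_, axis=1))   # -> [1. 1.]

# Slopes of the lines through the origin (m = y / x for each row)
m1 = pca.components_[0, 1] / pca.components_[0, 0]
m2 = pca.components_[1, 1] / pca.components_[1, 0]
print(m1, m2)

# Sample variance (ddof=1, same as Excel's =VAR.S) of PC1 and PC2
print(x_pca.var(axis=0, ddof=1))   # matches pca.explained_variance_
print(pca.explained_variance_)
```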
One last thing before delving into the concepts: let us look at the variance-covariance matrices of the original data and of the principal components.
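With the sketch data, the two matrices can be computed as below; the exact figures will differ from the 0.99 / 8.67 / 27.33 values quoted next (those come from the author’s original data points), but the pattern is the same.

```python
# Variance-covariance and correlation matrices of the raw data
print(np.cov(x_all, rowvar=False))        # variables are columns, hence rowvar=False
print(np.corrcoef(x_all, rowvar=False))   # off-diagonal close to 1: highly correlated

# The same matrices for the principal components
print(np.cov(x_pca, rowvar=False))        # off-diagonal ~0: components are uncorrelated
print(np.corrcoef(x_pca, rowvar=False))
```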
There are two very interesting observations to note:
- In the raw data, the correlation between the two variables is 0.99, which becomes 0 between the principal components.
- In the raw data, the variances of X1 and X2 were 8.67 and 27.33 respectively, a 24%-76% split. In the principal components, most of the variance is captured by PC1, which accounts for 99.9%.
Hence, we can clearly see the advantages of using principal components in place of the raw data.
Let us now relate these practical observations to different concepts and understand them.
Eigenvectors and Eigenvalues
Firstly, we need to understand that eigenvectors and eigenvalues mean nothing without a given matrix. In PCA, the matrix we use is the variance-covariance matrix.
Now, this matrix contains two metrics: variance and covariance. While calculating these metrics we follow a convention so familiar that we tend to ignore it, because there seems to be nothing special about it. That convention is using the X-axis and Y-axis as the base, i.e. the standard basis vectors î = (1, 0) and ĵ = (0, 1).
Can we change these base axes? And if yes, why would we even want to do that?
Taking the first question: yes, we can change the base axes to some other vectors and plot the data points accordingly in the new axis system. To draw the new axes, we are going to use the vectors from pca.components_.
These new axes are nothing but the eigenvectors of the variance-covariance matrix (each with magnitude 1). Hurray!
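You can confirm this with a direct eigendecomposition of the covariance matrix, continuing the sketch; the sign and ordering of eigenvectors may differ between libraries, which is normal.

```python
# Eigendecomposition of the variance-covariance matrix of the raw data
cov = np.cov(x_all, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # eigh: for symmetric matrices

# eigh returns eigenvalues in ascending order with eigenvectors as columns,
# so reverse the order and transpose to compare with sklearn's outputs
print(eigenvalues[::-1])          # compare with pca.explained_variance_
print(eigenvectors[:, ::-1].T)    # compare with pca.components_ (rows, up to sign)
```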
Since we have changed our axes, we need to plot our data points according to these new axes.
Wow! Isn’t that just awesome? It turns out that what we are given in x_pca is simply the original data points re-expressed relative to the new axes (the eigenvectors) instead of the original X-Y axes.
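In code, this change of axes is just centering the data and taking the dot product with each eigenvector; a quick confirmation with the sketch data:

```python
# Center the data, then express each point in the eigenvector axes
x_centered = x_all - x_all.mean(axis=0)
manual_pca = x_centered @ pca.components_.T    # dot product with each new axis

print(np.allclose(manual_pca, x_pca))          # -> True
```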
Now, the last thing is the eigenvalues. In the normal system, if we want to project all the points in the X-Y space onto the X-axis, how would we do that? Simply take each point (x1, x2) and set x2 to 0. It’s that simple: keep the x-coordinates and we have the projections.
Now, what does it mean to project the data points onto PC1? In the new coordinate system (the orange points), we simply keep the PC1 part and set PC2 to 0. Right? If we now compute the variance of these projections, that value is referred to as the eigenvalue of PC1. Note that in the example below, the variance along PC1 is the eigenvalue for PC1.
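A quick check of this with the sketch data: zero out the PC2 coordinate, and the variance of what remains is exactly the first eigenvalue.

```python
# Keep PC1, set PC2 to 0 -> projection of the points onto PC1
projection = x_pca.copy()
projection[:, 1] = 0

# Sample variance along PC1 equals the eigenvalue of PC1
print(projection[:, 0].var(ddof=1))
print(pca.explained_variance_[0])     # the same number: the eigenvalue of PC1
```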
Great! Now we also know what an eigenvalue is. There is another metric that is often reported with PCA: the percentage of variance explained by each principal component.
In Python, PCA has an attribute, pca.explained_variance_ratio_, that directly gives us these numbers. This is nothing but the eigenvalues (the variances along the principal components) converted to percentages. In this scenario, for PC1 the variance percentage = 35.95055*100/(35.95055 + 0.049451) = 99.9%. This indicates that principal component 1 alone is able to explain 99.9% of the variance in the data. Do you remember that in the raw data, even the higher-variance variable X2 explained only 76% of the variance?
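The ratio can also be computed by hand from the eigenvalues (sketch continued; the two match exactly here because all components were kept):

```python
# Percentage of variance explained, straight from the eigenvalues
manual_ratio = pca.explained_variance_ / pca.explained_variance_.sum()
print(manual_ratio)
print(pca.explained_variance_ratio_)   # same values, as reported by sklearn
```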
Here is another interesting observation. People say their PCA covers 80% of the variance, but where does the rest of the variance go? Remember, we only changed the axes. The answer is that in practice we do not keep all the components (otherwise our whole dimensionality-reduction objective would fail); instead we keep the top n components, and in that case those n components account for 80% of the variance in the original data.
Variance-Covariance matrix
Now, one question you might be asking is this: the whole foundation of PCA lies in finding the eigenvectors, and the whole story revolves around them. But what do we get in return?
At the very beginning, I mentioned that eigenvectors and eigenvalues have significance only w.r.t. a given matrix. By calculating the eigenvectors of the variance-covariance matrix, we have achieved two main objectives —
- Most of the variance gets shifted towards a few of the new variables (the principal components)
- Secondly, the correlation between the principal components has been reduced to 0
That’s exactly what we need. You might be curious to know how these eigenvectors are actually calculated. This is an amazing tutorial that explains the process very well.
Bonus!
What will happen if we try to do the PCA on the standardized data? Will the results be similar?
Try the same code, but instead of the raw data use the standardized data.
There is one cool property of PCA on standardized data that you should know. Take any number of random observations of two variables and standardize the data (standard scaler) before applying PCA. You will notice that although the variance along the different components changes from dataset to dataset, the eigenvectors themselves stay the same.
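A small experiment along these lines is sketched below; the random data, the correlation strength, and the use of StandardScaler are my own illustrative choices, not a specific dataset from the article.

```python
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two correlated random variables, standardized before PCA
x1 = rng.normal(size=200)
x2 = 3 * x1 + rng.normal(size=200)           # correlated with x1
data_std = StandardScaler().fit_transform(np.column_stack([x1, x2]))

pca_std = PCA(n_components=2).fit(data_std)
print(pca_std.explained_variance_)   # changes from dataset to dataset
print(pca_std.components_)           # rows ~ ±[0.707, 0.707] and ±[0.707, -0.707]
```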
Can you figure out the reason?
Remember, the eigenvectors belong to the variance-covariance matrix. When we standardize the variables, we bring the variance of every variable to 1, so for two variables the covariance matrix takes the form [[1, r], [r, 1]], where r is the correlation. The eigenvectors of such a matrix always point along the two diagonals, regardless of r. That is why, despite taking random data, the eigenvectors come out the same.
Hope you now have better clarity on PCA! :)
Follow my blog Analytixtime for more such articles.