Making sense of principal component analysis, eigenvectors & eigenvalues

• In today's pattern recognition class my professor talked about PCA, eigenvectors and eigenvalues.

I understood the mathematics of it. If I'm asked to find eigenvalues etc. I'll do it correctly like a machine. But I didn't understand it. I didn't get the purpose of it. I didn't get the feel of it.

I strongly believe in the following quote:

You do not really understand something unless you can explain it to your grandmother. -- Albert Einstein

Well, I can't explain these concepts to a layman or grandma.

1. Why PCA, eigenvectors & eigenvalues? What was the need for these concepts?
2. How would you explain these to a layman?

Good question. I agree with the quote as well. I believe there are many people in statistics and mathematics who are highly intelligent, and can get very deep into their work, but don't deeply understand what they are working on. Or they do, but are incapable of explaining it to others. I go out of my way to provide answers here in plain English, and ask questions demanding plain English answers.

I had imagined a lengthy demo with a bunch of graphs and explanations when I stumbled across this.

This was asked on the Mathematics site in July, but not as well and it didn't get many answers (not surprising, given the different focus there). http://math.stackexchange.com/questions/1146/intuitive-way-to-understand-principal-component-analysis

Similar to explanation by Zuur et al in Analyzing ecological data where they talk about projecting your hand on an overhead projector. You keep rotating your hand so that the projection on the wall looks pretty similar to what you think a hand should look like.

Here is the link to "Analysing ecological data" by Alain F. Zuur, Elena N. Ieno, Graham M. Smith, where the example with the overhead-projector and the hand is given: http://books.google.de/books?id=mmPvf-l7xFEC&;lpg=PA15&ots=b_5iizOr3p&dq=Zuur%20et%20al%20in%20Analyzing%20ecological%20data&hl=en&pg=PA194#v=onepage&q&f=false

A two-page article explaining PCA for biologists: Ringnér. What is principal component analysis?. Nature Biotechnology 26, 303-304 (2008)

This question led me to a good paper, and even though I think that is a great quote, it is not from Einstein. This is a common misattribution, and the more likely original quote is probably this one from Ernest Rutherford, who said, "If you can't explain your physics to a barmaid it is probably not very good physics." All the same, thanks for starting this thread.

Alice Calaprice, _The ultimate quotable Einstein_, Princeton U.P. 2011 flags the quotation here as one of many "Probably not by Einstein". See p.482.

A link to a geometrical account of PCA vs regression vs canonical correlation.

Explanation why PCs maximize variance and why they are orthogonal: http://stats.stackexchange.com/a/110546/3277. And what is "variance" in PCA: http://stats.stackexchange.com/a/22571/3277.

I can't explain anything to my grandmother, because she's dead. Does this mean I don't understand anything?! It might be more fun explaining things to a barmaid anyway though...

Interesting quote, considering that Einstein's mother urged him several times to explain general relativity to her in a way she could understand (and he tried, without success, a number of times).

I think PCA is a hype. You can't find the meaning in data unless you already know dimensions of the data before you start. You're probably getting wrapped up in the massive hype from the industrial internet sector regarding AI and it just doesn't hold water.

6 years ago

Imagine a big family dinner, where everybody starts asking you about PCA. First you explain it to your great-grandmother; then to your grandmother; then to your mother; then to your spouse; finally, to your daughter (who is a mathematician). Each time the next person is less of a layman. Here is how the conversation might go.

Great-grandmother: I heard you are studying "Pee-See-Ay". I wonder what that is...

You: Ah, it's just a method of summarizing some data. Look, we have some wine bottles standing here on the table. We can describe each wine by its colour, by how strong it is, by how old it is, and so on (see this very nice visualization of wine properties taken from here). We can compose a whole list of different characteristics of each wine in our cellar. But many of them will measure related properties and so will be redundant. If so, we should be able to summarize each wine with fewer characteristics! This is what PCA does.

Grandmother: This is interesting! So this PCA thing checks what characteristics are redundant and discards them?

You: Excellent question, granny! No, PCA is not selecting some characteristics and discarding the others. Instead, it constructs some new characteristics that turn out to summarize our list of wines well. Of course these new characteristics are constructed using the old ones; for example, a new characteristic might be computed as wine age minus wine acidity level or some other combination like that (we call them linear combinations).

In fact, PCA finds the best possible characteristics, the ones that summarize the list of wines as well as possible (among all conceivable linear combinations). This is why it is so useful.

Mother: Hmmm, this certainly sounds good, but I am not sure I understand. What do you actually mean when you say that these new PCA characteristics "summarize" the list of wines?

You: I guess I can give two different answers to this question. The first answer is that you are looking for some wine properties (characteristics) that strongly differ across wines. Indeed, imagine that you come up with a property that is the same for most of the wines. This would not be very useful, would it? Wines are very different, but your new property makes them all look the same! This would certainly be a bad summary. Instead, PCA looks for properties that show as much variation across wines as possible.

The second answer is that you look for the properties that would allow you to predict, or "reconstruct", the original wine characteristics. Again, imagine that you come up with a property that has no relation to the original characteristics; if you use only this new property, there is no way you could reconstruct the original ones! This, again, would be a bad summary. So PCA looks for properties that allow you to reconstruct the original characteristics as well as possible.

Surprisingly, it turns out that these two aims are equivalent and so PCA can kill two birds with one stone.

Spouse: But darling, these two "goals" of PCA sound so different! Why would they be equivalent?

You: Hmmm. Perhaps I should make a little drawing (takes a napkin and starts scribbling). Let us pick two wine characteristics, perhaps wine darkness and alcohol content -- I don't know if they are correlated, but let's imagine that they are. Here is what a scatter plot of different wines could look like:

Each dot in this "wine cloud" shows one particular wine. You see that the two properties ($$x$$ and $$y$$ on this figure) are correlated. A new property can be constructed by drawing a line through the center of this wine cloud and projecting all points onto this line. This new property will be given by a linear combination $$w_1 x + w_2 y$$, where each line corresponds to some particular values of $$w_1$$ and $$w_2$$.
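To make the projection step concrete, here is a minimal sketch in plain Python. The wine measurements and the helper names `center` and `project` are made up for illustration; the line is described by its angle, so that $$w_1 = \cos\theta$$ and $$w_2 = \sin\theta$$ form a unit vector along it.

```python
import math

# Hypothetical wine measurements (darkness x, alcohol content y); made-up numbers.
wines = [(1.0, 0.5), (2.0, 1.5), (3.0, 1.0), (4.0, 3.0), (5.0, 2.5)]

def center(points):
    """Subtract the mean of each coordinate so the cloud is centered at the origin."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    return [(x - mx, y - my) for x, y in points]

def project(points, angle_deg):
    """Return the new property w1*x + w2*y for a line at the given angle.

    (w1, w2) = (cos a, sin a) is a unit vector pointing along the line.
    """
    a = math.radians(angle_deg)
    w1, w2 = math.cos(a), math.sin(a)
    return [w1 * x + w2 * y for x, y in points]

centered = center(wines)
scores_0 = project(centered, 0)    # line along the x axis: the new property is just x
scores_45 = project(centered, 45)  # diagonal line: an equal mix of x and y
```

Each choice of angle gives one candidate "new property"; every wine gets a single number, its position along the line.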

Now look here very carefully -- here is what these projections look like for different lines (red dots are projections of the blue dots):

As I said before, PCA will find the "best" line according to two different criteria of what is the "best". First, the variation of values along this line should be maximal. Pay attention to how the "spread" (we call it "variance") of the red dots changes while the line rotates; can you see when it reaches maximum? Second, if we reconstruct the original two characteristics (position of a blue dot) from the new one (position of a red dot), the reconstruction error will be given by the length of the connecting red line. Observe how the length of these red lines changes while the line rotates; can you see when the total length reaches minimum?

If you stare at this animation for some time, you will notice that "the maximum variance" and "the minimum error" are reached at the same time, namely when the line points to the magenta ticks I marked on both sides of the wine cloud. This line corresponds to the new wine property that will be constructed by PCA.

By the way, PCA stands for "principal component analysis" and this new property is called "first principal component". And instead of saying "property" or "characteristic" we usually say "feature" or "variable".

Daughter: Very nice, papa! I think I can see why the two goals yield the same result: it is essentially because of the Pythagoras theorem, isn't it? Anyway, I heard that PCA is somehow related to eigenvectors and eigenvalues; where are they on this picture?

You: Brilliant observation. Mathematically, the spread of the red dots is measured as the average squared distance from the center of the wine cloud to each red dot; as you know, it is called the variance. On the other hand, the total reconstruction error is measured as the average squared length of the corresponding red lines. But as the angle between red lines and the black line is always $$90^\circ$$, the sum of these two quantities is equal to the average squared distance between the center of the wine cloud and each blue dot; this is precisely Pythagoras theorem. Of course this average distance does not depend on the orientation of the black line, so the higher the variance the lower the error (because their sum is constant). This hand-wavy argument can be made precise (see here).
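The Pythagoras argument can be checked numerically. The sketch below uses a small, made-up, already-centered cloud (not the data from the figures) and a hypothetical helper `variance_and_error`; for every line orientation it computes the variance of the projections and the mean squared length of the connecting lines, whose sum should always equal the mean squared distance of the points from the center.

```python
import math

# Hypothetical, already-centered "wine cloud" (made-up numbers).
points = [(-2.0, -1.2), (-1.0, -0.2), (0.0, -0.7), (1.0, 1.3), (2.0, 0.8)]

def variance_and_error(points, angle_deg):
    """Variance of the projections and mean squared reconstruction error
    for a line through the origin at the given angle."""
    a = math.radians(angle_deg)
    w1, w2 = math.cos(a), math.sin(a)
    n = len(points)
    variance = 0.0
    error = 0.0
    for x, y in points:
        t = w1 * x + w2 * y               # position of the red dot along the line
        rx, ry = x - t * w1, y - t * w2   # connecting red line (residual)
        variance += t * t / n
        error += (rx * rx + ry * ry) / n
    return variance, error

# Mean squared distance from the center to each blue dot (orientation-independent).
total = sum(x * x + y * y for x, y in points) / len(points)

results = {angle: variance_and_error(points, angle) for angle in range(0, 180, 15)}
best_variance_angle = max(results, key=lambda a: results[a][0])
best_error_angle = min(results, key=lambda a: results[a][1])
```

Because `variance + error` equals `total` at every angle, the angle with the largest variance is necessarily the angle with the smallest error.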

By the way, you can imagine that the black line is a solid rod and each red line is a spring. The energy of the spring is proportional to its squared length (this is known in physics as Hooke's law), so the rod will orient itself so as to minimize the sum of these squared distances. I made a simulation of what it will look like, in the presence of some viscous friction:

Regarding eigenvectors and eigenvalues. You know what a covariance matrix is; in my example it is a $$2\times 2$$ matrix that is given by $$\begin{pmatrix}1.07 &0.63\\0.63 & 0.64\end{pmatrix}.$$ What this means is that the variance of the $$x$$ variable is $$1.07$$, the variance of the $$y$$ variable is $$0.64$$, and the covariance between them is $$0.63$$. As it is a square symmetric matrix, it can be diagonalized by choosing a new orthogonal coordinate system, given by its eigenvectors (incidentally, this is called the spectral theorem); the corresponding eigenvalues will then be located on the diagonal. In this new coordinate system, the covariance matrix is diagonal and looks like this: $$\begin{pmatrix}1.52 &0\\0 & 0.19\end{pmatrix},$$ meaning that the correlation between the two new coordinates is now zero. It becomes clear that the variance of any projection will be given by a weighted average of the eigenvalues (I am only sketching the intuition here). Consequently, the maximum possible variance ($$1.52$$) will be achieved if we simply take the projection on the first coordinate axis. It follows that the direction of the first principal component is given by the first eigenvector of the covariance matrix. (More details here.)
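The "weighted average" step can be spelled out in a couple of lines. This is only a sketch, and the symbols $$\mathbf C$$, $$\mathbf w$$, $$\mathbf v_i$$, $$\lambda_i$$ and $$c_i$$ are notation introduced here for illustration. Let $$\mathbf C$$ be the covariance matrix, with orthonormal eigenvectors $$\mathbf v_1, \mathbf v_2$$ and eigenvalues $$\lambda_1 \ge \lambda_2$$, and write any unit direction as $$\mathbf w = c_1 \mathbf v_1 + c_2 \mathbf v_2$$ with $$c_1^2 + c_2^2 = 1$$. Then the variance of the projection onto $$\mathbf w$$ is

$$\mathbf w^\top \mathbf C \mathbf w = \lambda_1 c_1^2 + \lambda_2 c_2^2 = 1.52\, c_1^2 + 0.19\, c_2^2 \le 1.52,$$

with equality exactly when $$c_2 = 0$$, that is, when $$\mathbf w = \pm\mathbf v_1$$. The weights $$c_1^2$$ and $$c_2^2$$ are non-negative and sum to one, which is why the result is a weighted average of the eigenvalues.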

You can see this on the rotating figure as well: there is a gray line there orthogonal to the black one; together they form a rotating coordinate frame. Try to notice when the blue dots become uncorrelated in this rotating frame. The answer, again, is that it happens precisely when the black line points at the magenta ticks. Now I can tell you how I found them: they mark the direction of the first eigenvector of the covariance matrix, which in this case is equal to $$(0.81, 0.58)$$.
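The numbers quoted above can be reproduced with the closed-form eigendecomposition of a symmetric $$2\times 2$$ matrix. This is a plain-Python sketch, not the Matlab code used for the animations; the function name `covariance_in_rotated_frame` is made up, and it plays the role of the rotating black/gray frame.

```python
import math

# Covariance matrix from the example: [[1.07, 0.63], [0.63, 0.64]].
cxx, cxy, cyy = 1.07, 0.63, 0.64

# Closed-form eigenvalues of a symmetric 2x2 matrix.
mean = (cxx + cyy) / 2
radius = math.hypot((cxx - cyy) / 2, cxy)
lam1, lam2 = mean + radius, mean - radius

# First eigenvector: solve (C - lam1*I) v = 0, then normalize (valid since cxy != 0).
vx, vy = cxy, lam1 - cxx
norm = math.hypot(vx, vy)
v1 = (vx / norm, vy / norm)

def covariance_in_rotated_frame(angle):
    """Covariance between the coordinate along the black line and the coordinate
    along the gray line when the black line makes `angle` (radians) with the x axis."""
    c, s = math.cos(angle), math.sin(angle)
    # Black direction u = (c, s), gray direction g = (-s, c); this is u^T C g.
    return (cyy - cxx) * s * c + cxy * (c * c - s * s)

print(round(lam1, 2), round(lam2, 2))      # 1.52 0.19
print(round(v1[0], 2), round(v1[1], 2))    # 0.81 0.58

# Angle of the first eigenvector; the covariance in the rotated frame vanishes there.
angle1 = math.atan2(v1[1], v1[0])
print(abs(covariance_in_rotated_frame(angle1)) < 1e-12)
```

At any other angle the function returns a non-zero covariance, which corresponds to the blue dots still looking correlated in the rotating frame.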

Per popular request, I shared the Matlab code to produce the above animations.

+1 Nice tale and illustrations. ...then to your mother; then to your wife; finally, to your daughter (who is a mathematician)... I'd continue: and after the dinner - to yourself. And here you suddenly got stuck...

I absolutely love the illustrations you make for these answers.

@amoeba - This is great! You know, I think that the problem in internalizing PCA goes beyond understanding the geometry, the eigenvectors, the covariance... It's about the fact that the original variables have names (alcohol content, wine color), but the transformation of the data via PCA results in components that are nameless... What do you do with things without names... Do you say that the wines in your data are intoxicatingly red? Do you make up qualities... if so we shouldn't call them 'PC1'... wines call for some more poetic approach...

I normally just browse through Cross Validated to read up on things, but I've never had reason to create an account... mainly because the kinds of questions here are outside of my expertise and I can't really answer any. I usually am on StackOverflow only and I've been on the StackExchange network for about a year now. However, I've only decided to create an account today primarily to upvote your post. This is probably the best exposition of PCA that I've ever read, and I've read many. Thank you for this wonderful post - the excellent storytelling, the graphics, and it's so easy to read! +1

Brilliant, thanks for posting this. Might I ask which program produced these graphs?

@JohnK, the plots and animations were done in Matlab. I am glad you liked them, thanks.

Note for myself: my answer currently has 100 upvotes, JDLong's one has 220 upvotes; if we assume constant growth then mine has 100 upvotes/year and his has 40 upvotes/year. Or rather 55/year if computed since it passed 100 upvotes [got a golden badge] in Jan 2014. This means that I will catch up in 2.5--3 years, around the end of 2018. Let's see :-)

Note for myself cont.: I have to update my estimate. One month later this answer got 18 upvotes vs 5 for JDLong's. This indicates that I might catch up in below a year from now. Interestingly, 5/month is very close to my above estimate of 55/year, but 18/month is more than twice above 100/year. As my answer did not change, it seems that getting to the second place accelerated the upvoting (probably due to the increased visibility).

@amoeba Fantastic answer! I have a small doubt though; in the last line, you said the eigenvector is (0.81,0.58). What do these numbers signify? How did you get the magenta ticks from this? In other words, what is its meaning geometrically?

@amoeba A small suggestion; the text after the final visualization is pretty hard to understand. It'd be great if you could elaborate that a little bit more.

@AnmolSinghJaggi The first eigenvector is (0.81, 0.58), these are just vector coordinates: the vector goes from point (0,0) to point (0.81, 0.58). If you draw this vector and continue it in the same direction, it will go through the magenta tick; as I wrote, the ticks "mark the direction of the first eigenvector". I can see that the last part can be difficult to understand, but I thought that it's not the place to explain eigenvectors and eigenvalues in more detail... I might add something though. Thank you.

@amoeba: Brilliant answer! Thank you so much. I wonder why this is not the best answer!

@amoeba Hey man, you're awesome! I've been struggling with PCA for about a week; I hadn't understood the phrase that says "And the variance of this variable is the maximum among all possible choices of the first axis". With your graphs I've finally got it! I think god loved me because I scrolled down and saw your answer! And once again, you're awesome!

@claws: Honestly, I did not expect you to ever accept any answer in this thread! I am flattered. The wine figure is a good addition too, cheers.

Despite that I was the first in the queue to applaud the answer, I can't find approval for the particular latest addition of the chart with wine glasses (or are they the bladders conditions, the upper row being hematuric?) - just can't see how it can further refine any sober intuition about PCA. I'd recommend to dismiss the family party, for bed at last. Or the answer might worsen.

@ttnphns Thanks for the feedback. It was not me who added the wine glass figure. It was added (earlier today) by the OP claws, who has accepted this answer at the same time. I hesitated whether I should leave this picture or remove it... Perhaps you are right and it's better to remove it after all. It takes a lot of space. Maybe I will replace the image with a link to it.

@ttnphns I asked this question a few years ago when I was in college. Now I'm in a field which has absolutely nothing to do with engineering or the sciences. That is to say, I became a layman for practical purposes. Now when I read both the top-rated answers I could understand this one much more clearly. I also happen to belong to a part of the world which isn't familiar with having wine at dinner; in fact, I haven't ever seen a glass of wine in my life. So when I read the expression "kinds of wines" I had to search Google for images to find out. It is in this context that I added that image.

@ttnphns: As stackexchange is a global site, I thought it would also help other people like me. Feel free to remove it if you think otherwise. Moreover, I feel that if the title of the question is changed to something like "What is principal component analysis in layman's terms?" (or something like that), the laymen who land on this page from a Google search would only find this image more welcoming and accommodating, in contrast to graphs, which are intimidating.

You *have not seen a glass of wine ever in your life*, @claws?! If this is literally true, I am amazed and bewildered. I know many people who never drink wine or any alcohol at all, and I probably know some adults (but very very few) who have *never* tried a glass of wine. But to never have *seen* a glass of wine -- how is it even possible? I cannot imagine what "part of the world" you must be living in then. My best guess would be some religious community in India... I guess you don't want to disclose your location though.

@amoeba: haha! From where I stand, it also amazes me that someone would be amazed or think that it's almost impossible to have not seen a glass of wine. How is that even possible? I have hundreds of friends who haven't seen one. I can safely say that *irrespective of religion*, in my country more people than the population of the US or the whole of Europe *haven't ever seen a glass of wine*. I know a few people (although very very very few) who have tasted it.

@ttnphns I removed the image but inserted the link to it into the text.

@amoeba you truly are a genius.

@amoeba by far one of the best (if not 'the best') answer I have ever read. RESPECT!

Note to myself, update 2: This answer turned the most upvoted one in this thread today (264 upvotes)! Earlier than the projected estimate of December, probably because it was marked as accepted in August (this can increase the upvote rate). The new goal is to get to the 1st place across all CV answers; that's three answers to overtake. The top one currently has 326 upvotes, i.e. is 60 ahead.

@amoeba great answer. What software did you use for the rotating plot? Any chance to post the code? Thanks again.

@amoeba This is on the top 3 best answers that I've ever read on the whole stackexchange network. Chapeau, I hope that you are a teacher or you will become one!

This answer is the most upvoted answer (363) on CrossValidated as of today :-) Just for the record: there are many much better answers on CrossValidated, so this "achievement" mainly reflects the popularity of the PCA topic.

Great answer! If I had to choose my favorite quote on CV, it would probably be "Excellent question, granny!". :-)

This is an amazingly beautiful answer!

You almost lost me at great-grandmother. I have a lot to learn. +1

Went through so many good posts to reach the best.

I am trying to reproduce this, but the distance between my "red dots" on the line and original coordinates are not the shortest perpendicular distance. T = (X * V(:,1)) * V(:,1)'; So, is T supposed to be equivalent to your red dots (given that V(:,1) is the first eigenvector and X is mean centered)?
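For comparison, here is a plain-Python sketch of the same computation as `T = (X * V(:,1)) * V(:,1)'`, on made-up mean-centered data (the helper `first_eigenvector` is hypothetical and relies on the closed-form solution for two variables). If the first eigenvector has unit length and the data are centered, each row of `T` is the foot of the perpendicular from the corresponding point onto the line, so the residuals should be orthogonal to the eigenvector.

```python
import math

# Hypothetical mean-centered data X (each tuple is one point); made-up numbers.
X = [(-2.0, -1.2), (-1.0, -0.2), (0.0, -0.7), (1.0, 1.3), (2.0, 0.8)]

def first_eigenvector(points):
    """Unit-length first eigenvector of the 2x2 sample covariance matrix."""
    n = len(points)
    sxx = sum(x * x for x, _ in points) / (n - 1)
    syy = sum(y * y for _, y in points) / (n - 1)
    sxy = sum(x * y for x, y in points) / (n - 1)
    lam1 = (sxx + syy) / 2 + math.hypot((sxx - syy) / 2, sxy)
    vx, vy = sxy, lam1 - sxx          # valid when sxy != 0
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

v = first_eigenvector(X)

# Analogue of T = (X * V(:,1)) * V(:,1)': score along v, then back to 2D coordinates.
T = [((x * v[0] + y * v[1]) * v[0], (x * v[0] + y * v[1]) * v[1]) for x, y in X]

# Dot product of each residual (blue dot minus red dot) with v; should be ~0.
dots = [(x - tx) * v[0] + (y - ty) * v[1] for (x, y), (tx, ty) in zip(X, T)]
```

If the residuals are numerically orthogonal like this but the plot still looks skewed, one thing worth checking (an assumption, since the plotting code is not shown) is whether the two axes are drawn with equal scaling, because unequal scaling distorts right angles on screen.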

This is a great explanation - the derivation to eigenvectors is excellent.

What assumptions does PCA make about data?

You have a real gift. This is amazing!

I can tell from the animation that when it lines up with the magenta lines that error is minimized, but I don't understand how that also maximizes variance. Variance (at least in the definition used in statistics) should be maximized by having as many points as far away from the mean as possible -- first it's unclear why that's affected by any line we draw since that should be a property of the original data -- second if I take it to instead indicate "most diversity in distances between points and the line" it would be when the line is vertical. Help?

@JosephGarvin It's the variance of the red points (projections) on the black line.

Up to the wine it's fine, but the moment you say "But many of them will measure related properties and so will be redundant. If so, we should......", my grandma would be lost.