Why do we need to normalize data before principal component analysis (PCA)?
I'm doing principal component analysis on my dataset and my professor told me that I should normalize the data before doing the analysis. Why?
- What would happen if I did PCA without normalization?
- Why do we normalize data in general?
- Could someone give a clear and intuitive example that demonstrates the consequences of not normalizing the data before analysis?
If some variables have a large variance and some small, PCA (maximizing variance) will load on the large variances. For example if you change one variable from km to cm (increasing its variance), it may go from having little impact to dominating the first principle component. If you want your PCA to be independent of such rescaling, standardizing the variables will do that. On the other hand, if the specific scale of your variables matters (in that you want your PCA to be in that scale), maybe you don't want to standardize.
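The km-to-cm effect described above can be sketched numerically. This is only an illustration on made-up data: the variables `distance_km` and `weight_kg`, their distributions, and the helpers `first_pc` and `standardize` are assumptions for the example, not anything from the original question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical, independent variables (illustrative values only).
distance_km = rng.normal(5.0, 0.5, n)   # SD of about 0.5 km
weight_kg = rng.normal(70.0, 10.0, n)   # SD of about 10 kg


def first_pc(X):
    """Absolute loadings of the first principal component of X."""
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return np.abs(eigvecs[:, -1])           # eigenvector of the largest one


def standardize(X):
    """Scale each column by (value - mean) / SD."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


X_km = np.column_stack([distance_km, weight_kg])
X_cm = np.column_stack([distance_km * 100_000, weight_kg])  # km -> cm

load_km = first_pc(X_km)   # first PC loads almost entirely on weight
load_cm = first_pc(X_cm)   # first PC loads almost entirely on distance
load_km_std = first_pc(standardize(X_km))
load_cm_std = first_pc(standardize(X_cm))  # same loadings as load_km_std

print("raw, km:", load_km)
print("raw, cm:", load_cm)
print("standardized, km:", load_km_std)
print("standardized, cm:", load_cm_std)
```

On the raw data the first component switches from one variable to the other purely because of the change of units, while the standardized loadings do not depend on the units at all.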
Watch out: normalize in statistics sometimes carries the meaning of transform to be closer to a normal or Gaussian distribution. As @Glen_b exemplifies, it is better to talk of standardizing when what is meant is scaling by (value - mean)/SD (or some other _specified_ standardization).
Ouch, that 'principle' instead of 'principal' in my comment up there is going to drive me crazy every time I look at it.
@Glen_b In principle, you do know how to spell it. Getting it right all the time is the principal difficulty.
These are multiple questions so there is no one exact duplicate, but every one of them is extensively and well discussed elsewhere on this site. A good search to begin with is on pca correl* covariance.
@NickCox The generally accepted definition of normalise is to transform a random variable to one with zero mean and unit standard deviation. This is also what Google gives when you search "define normalise". Therefore it is not better to use a different word for the same thing.
@Robino I agree with your conclusion but I disagree with your assertion. The problem is that there is not a generally accepted meaning across statistics and machine learning. Normalise is used with the sense I mention and with other senses too, e.g. scaling to within [0, 1].
@NickCox Should I use mean normalization, i.e. x-mean/std, or just use feature scaling before applying PCA? I am applying PCA to images whose pixel values vary from 0 to 255.
@Boris I can't possibly advise remotely on what is best for you beyond pointing out that (x $-$ mean) / SD is one possible method and x $-$ mean/SD certainly is not. If all your variables are in [0, 255], it's conceivable that not scaling at all makes as much sense as any other approach.
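The difference between the two expressions is just operator precedence, which a small sketch makes concrete. The array `x` is a made-up set of pixel-like values, and using the sample SD (`ddof=1`) is an assumption chosen for the illustration.

```python
import numpy as np

x = np.array([0.0, 64.0, 128.0, 255.0])  # hypothetical pixel intensities
mean, sd = x.mean(), x.std(ddof=1)

z = (x - mean) / sd       # standardization: zero mean, unit SD
not_z = x - mean / sd     # parsed as x - (mean / sd): only shifts x

print(z.mean(), z.std(ddof=1))   # approximately 0 and 1
print(not_z.std(ddof=1), sd)     # spread is unchanged by the shift
```

Without the parentheses the data are merely shifted by a constant, so their variance, and hence the PCA result, is exactly what it was before.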
Not what I meant. Not knowing which method is best for your data and your project doesn't mean that I am implying that choice of method doesn't matter.
Normalization is important in PCA because PCA is a variance-maximizing exercise: it projects your original data onto the directions that maximize the variance. The first plot below shows the proportion of total variance explained by each principal component when the data have not been normalized. As you can see, component one appears to explain most of the variance in the data.
In the second plot, the data were normalized first. Here it is clear that the other components contribute as well. The reason is that PCA seeks to maximize the variance of each component, and the covariance matrix of this particular dataset is:
                  Murder     Assault    UrbanPop        Rape
    Murder     18.970465    291.0624    4.386204    22.99141
    Assault   291.062367   6945.1657  312.275102   519.26906
    UrbanPop    4.386204    312.2751  209.518776    55.76808
    Rape       22.991412    519.2691   55.768082    87.72916
Given this structure, PCA will project as much as possible onto the direction of Assault, since its variance (about 6945) is far greater than that of any other variable. So, for finding features usable in any kind of model, a PCA without normalization would tend to perform worse than one with normalization.
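The effect can be checked directly from the covariance matrix quoted above, without the raw data. The following is a minimal sketch, assuming NumPy is available: PCA on the raw data corresponds to an eigendecomposition of the covariance matrix, and PCA on standardized data corresponds to an eigendecomposition of the correlation matrix. The function name `explained_variance_ratio` is my own, and where the printed matrix shows two roundings of the same entry I used the more precise one.

```python
import numpy as np

names = ["Murder", "Assault", "UrbanPop", "Rape"]
# Covariance matrix as quoted in the answer.
cov = np.array([
    [18.970465, 291.062367, 4.386204, 22.991412],
    [291.062367, 6945.1657, 312.275102, 519.26906],
    [4.386204, 312.275102, 209.518776, 55.768082],
    [22.991412, 519.26906, 55.768082, 87.72916],
])


def explained_variance_ratio(matrix):
    """Share of total variance per component, largest first."""
    eigvals = np.linalg.eigvalsh(matrix)[::-1]  # sort descending
    return eigvals / eigvals.sum()


# Correlation matrix: covariance of the standardized variables.
sd = np.sqrt(np.diag(cov))
corr = cov / np.outer(sd, sd)

raw_ratio = explained_variance_ratio(cov)
std_ratio = explained_variance_ratio(corr)

print("without normalization:", np.round(raw_ratio, 3))
print("with normalization:   ", np.round(std_ratio, 3))
```

Without normalization, the first component takes well over 90% of the total variance, because Assault alone accounts for most of the trace of the covariance matrix. With normalization, the variance is spread much more evenly across components.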
Great post! Perfectly reproducible with sklearn. BTW, the USArrests dataset can be downloaded from here: https://vincentarelbundock.github.io/Rdatasets/datasets.html