PCA on correlation or covariance?

  • What are the main differences between performing principal component analysis (PCA) on the correlation matrix and on the covariance matrix? Do they give the same results?

    A late reply, but you may find the handouts on multivariate data analysis "à la française" from the bioinformatics department in Lyon very useful. They come from the authors of the R ade4 package. They are in French, though.

  • You tend to use the covariance matrix when the variable scales are similar and the correlation matrix when variables are on different scales.

    Using the correlation matrix is equivalent to standardizing each of the variables (to mean 0 and standard deviation 1). In general, PCA with and without standardizing will give different results, especially when the scales are different.
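
    A minimal sketch of that equivalence, on simulated data (the variables x and y and their scales are assumptions made up for illustration, not part of the heptathlon example): the covariance matrix of the standardized data matches the correlation matrix of the raw data, and prcomp with scale. = TRUE agrees with an eigendecomposition of that correlation matrix up to the sign of each axis.

```r
# Simulated data (an assumption for illustration): two correlated
# variables on very different scales.
set.seed(1)
x <- rnorm(100, mean = 1.8, sd = 0.1)
y <- 120 + 50 * (x - 1.8) + rnorm(100, sd = 8)
X <- cbind(x = x, y = y)

# Covariance of the standardized data equals correlation of the raw data.
Z <- scale(X)                       # center to mean 0, scale to sd 1
all.equal(cov(Z), cor(X))           # TRUE

# PCA with scale. = TRUE matches an eigendecomposition of cor(X).
pc_cor <- prcomp(X, scale. = TRUE)
eig    <- eigen(cor(X))
all.equal(pc_cor$sdev^2, eig$values)                        # same variances
all.equal(abs(unname(pc_cor$rotation)), abs(eig$vectors))   # same axes up to sign
```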

    As an example, take a look at this R heptathlon data set. Some of the variables have an average value of about 1.8 (the high jump), whereas others (the 800 m run) are around 120.

    library(HSAUR)
    heptathlon[,-8]      # look at heptathlon data (excluding 'score' variable)
    

    This outputs:

                       hurdles highjump  shot run200m longjump javelin run800m
    Joyner-Kersee (USA)   12.69     1.86 15.80   22.56     7.27   45.66  128.51
    John (GDR)            12.85     1.80 16.23   23.65     6.71   42.56  126.12
    Behmer (GDR)          13.20     1.83 14.20   23.10     6.68   44.54  124.20
    Sablovskaite (URS)    13.61     1.80 15.23   23.92     6.25   42.78  132.24
    Choubenkova (URS)     13.51     1.74 14.76   23.93     6.32   47.46  127.90
    ...
    

    Now let's do PCA on covariance and on correlation:

    # scale. = TRUE bases the PCA on the correlation matrix;
    # scale. = FALSE (the default) bases it on the covariance matrix
    hep.PC.cor <- prcomp(heptathlon[,-8], scale. = TRUE)
    hep.PC.cov <- prcomp(heptathlon[,-8], scale. = FALSE)
    
    biplot(hep.PC.cov)
    biplot(hep.PC.cor)  
    

    [Figure: biplots of PCA on covariance and on correlation]

    Notice that PCA on covariance is dominated by run800m and javelin: PC1 is almost equal to run800m (and explains $82\%$ of the variance) and PC2 is almost equal to javelin (together they explain $97\%$). PCA on correlation is much more informative and reveals some structure in the data and relationships between variables (but note that the explained variances drop to $64\%$ and $71\%$).
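
    To see why the covariance-based PCA is dominated by those two events, it may help to look at how much of the total variance each raw variable carries. The sketch below uses only the five athletes printed above (the data frame name hep5 is made up here), so the numbers are illustrative and will not reproduce the percentages quoted for the full data set.

```r
# First five athletes from the printout above (not the full data set).
hep5 <- data.frame(
  hurdles  = c(12.69, 12.85, 13.20, 13.61, 13.51),
  highjump = c(1.86, 1.80, 1.83, 1.80, 1.74),
  shot     = c(15.80, 16.23, 14.20, 15.23, 14.76),
  run200m  = c(22.56, 23.65, 23.10, 23.92, 23.93),
  longjump = c(7.27, 6.71, 6.68, 6.25, 6.32),
  javelin  = c(45.66, 42.56, 44.54, 42.78, 47.46),
  run800m  = c(128.51, 126.12, 124.20, 132.24, 127.90)
)

# Share of the total variance carried by each raw variable:
# run800m and javelin have the largest variances.
v <- sapply(hep5, var)
round(sort(v / sum(v), decreasing = TRUE), 3)

# After standardizing, every variable contributes the same share (1/7),
# so no single event can dominate because of its units.
vz <- apply(scale(hep5), 2, var)
round(vz / sum(vz), 3)
```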

    Notice also that the outlying individuals (in this data set) are outliers regardless of whether the covariance or correlation matrix is used.

    What happens if I convert the variables to z-scores first?

    @Jirka-x1 the covariance matrix of standardized variables (i.e. *z* scores) equals the correlation matrix.

    @Alexis Can it therefore be inferred that the covariance matrix of standardised variables equals the correlation matrix of standardised variables?

    @JamieBullock $\mathbf{\Sigma}$ (the covariance matrix) for standardized data equals $\mathbf{R}$ (the correlation matrix). $\mathbf{R}$ is the same whether or not the data are standardized, because correlation is insensitive to linear transformations of the data. For example, if $X$ and $Y$ correlate with $r_{XY}$, then $X^{*} = aX+b$ and $Y^{*} = aY+b$ (with $a \neq 0$) also correlate with $r_{XY}$.
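
    A small sketch of that invariance on simulated data (the values of a and b are arbitrary choices for illustration): the correlation is unchanged by the transformation, while the covariance is rescaled by $a^2$.

```r
# Simulated pair of correlated variables (an assumption for illustration).
set.seed(42)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50)

# Arbitrary constants for the linear transformation, with a != 0.
a <- 3
b <- -7
x_star <- a * x + b
y_star <- a * y + b

# Correlation is unchanged ...
all.equal(cor(x, y), cor(x_star, y_star))            # TRUE
# ... whereas covariance is multiplied by a^2.
all.equal(cov(x_star, y_star), a^2 * cov(x, y))      # TRUE
```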

    One important note: when using covariance in your PCA, your PCs will not be correlated with each other, which does not hold true for correlation-based PCA. This is especially important when intending to perform PCA prior to regression on a multicollinear set of explanatory variables. However, the theory behind this is not clear. Could anyone shed some light on this difference?
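
    A quick numerical check of the claim in the comment above may help. In this sketch (simulated data, so an assumption rather than the heptathlon set), the PC scores come out uncorrelated for both the covariance-based and the correlation-based PCA, since in each case they are projections onto orthogonal eigenvectors of the matrix being decomposed. The helper max_offdiag is a name made up for this example.

```r
# Simulated, collinear variables on different scales (illustrative only).
set.seed(7)
n  <- 200
x1 <- rnorm(n, sd = 1)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- 100 * (x1 - x2) + rnorm(n, sd = 30)
X  <- cbind(x1, x2, x3)

pc_cov <- prcomp(X, scale. = FALSE)   # covariance-based PCA
pc_cor <- prcomp(X, scale. = TRUE)    # correlation-based PCA

# Largest absolute correlation between two different PC score columns.
max_offdiag <- function(scores) {
  r <- cor(scores)
  max(abs(r[upper.tri(r)]))
}

max_offdiag(pc_cov$x)   # numerically zero
max_offdiag(pc_cor$x)   # numerically zero
```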

    So it never hurts to use the correlation matrix, correct? It can only help...

    If the variables are standardized, and so each has a variance of 1, how can we differentiate their relative importance in contributing to the variance? (E.g., for two-dimensional data, won't the direction of z1 always be 45 degrees?)
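
    For the two-variable case raised above, a short sketch can make the geometry concrete. With two standardized variables and a positive correlation r, the correlation matrix has 1 on the diagonal and r off it, and its leading eigenvector points along the 45 degree diagonal whatever r is; what changes with r is how the variance splits between the two components (1 + r and 1 - r). The function name pc1_summary and the values of r are made up for illustration.

```r
# Direction of PC1 and the component variances for two standardized
# variables with correlation r (assumed positive here).
pc1_summary <- function(r) {
  R <- matrix(c(1, r, r, 1), nrow = 2)   # 2 x 2 correlation matrix
  e <- eigen(R)
  v <- abs(e$vectors[, 1])               # ignore the arbitrary sign
  c(angle_deg = atan2(v[2], v[1]) * 180 / pi,
    var_pc1   = e$values[1],
    var_pc2   = e$values[2])
}

# One row per hypothetical correlation value.
t(sapply(c(0.2, 0.5, 0.9), pc1_summary))
```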

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM