What are principal component scores?

  • First, let's define a score.

    John, Mike and Kate get the following percentages in their Maths, Science, English and Music exams:

          Maths    Science    English    Music    
    John  80        85          60       55  
    Mike  90        85          70       45
    Kate  95        80          40       50
    

    In this case there are 12 scores in total. Each score is one person's exam result in one subject, so a score here is simply the value where a row and a column intersect.

    Now let's informally define a Principal Component.

    In the table above, can you easily plot the data in a 2D graph? No, because there are four subjects (which means four variables: Maths, Science, English, and Music), i.e.:

    • You could plot two subjects in the exact same way you would with $x$ and $y$ co-ordinates in a 2D graph.
    • You could even plot three subjects in the same way you would plot $x$, $y$ and $z$ in a 3D graph (though this is generally bad practice, because some distortion is inevitable in the 2D representation of 3D data).

    But how would you plot 4 subjects?

    At the moment we have four variables, each of which represents just one subject. One way around this is to somehow combine the subjects into, say, just two new variables which we can then plot. This is known as multidimensional scaling.

    Principal component analysis is a form of multidimensional scaling. It is a linear transformation of the variables into a lower-dimensional space that retains as much information (variance) about the variables as possible. For example, this would mean we could look at the types of subjects each student is perhaps more suited to.
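    To make "retains as much information as possible" concrete, here is a minimal sketch using the exam table above. The names `pca` and `var_explained` are illustrative choices, not anything prescribed. With only three students, the centered data has rank at most two, so the first two components should account for essentially all of the variance.

```r
# Exam percentages from the table above (rows: John, Mike, Kate)
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))

# Centered, unscaled PCA
pca <- prcomp(DF, scale. = FALSE)

# Share of the total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# Cumulative share: how much is kept if only the first k components are used
print(round(cumsum(var_explained), 3))
```

    The cumulative shares show how little is lost by keeping only the first two components in this small example.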

    A principal component is therefore a combination of the original variables after a linear transformation. In R, this is:

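    # Exam percentages; rows are John, Mike and Kate
    DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80), English = c(60, 70, 40), Music = c(55, 45, 50))
    # Centered but unscaled PCA (the argument is spelled `scale.`)
    prcomp(DF, scale. = FALSE)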
    

    Which will give you something like this (first two principal components only, for the sake of simplicity):

                    PC1         PC2
    Maths    0.27795606  0.76772853 
    Science -0.17428077 -0.08162874 
    English -0.94200929  0.19632732 
    Music    0.07060547 -0.60447104 
    

    The first column here shows the coefficients of the linear combination that defines principal component #1, and the second column shows the coefficients for principal component #2.
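    As a rough sketch, the coefficients can be pulled out of the `prcomp` result through its `rotation` element. The variable name `loadings` is just an illustrative choice. Note that the sign of each column is arbitrary, so a whole column may come out negated compared with the table above.

```r
# Exam percentages from the table above (rows: John, Mike, Kate)
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))
pca <- prcomp(DF, scale. = FALSE)

# Coefficients (one column per component); keep the first two
loadings <- pca$rotation[, 1:2]
print(round(loadings, 3))

# Each column has unit length ...
print(colSums(loadings^2))

# ... and the two columns are orthogonal (off-diagonal entries near zero)
print(round(crossprod(loadings), 10))
```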

    So what is a Principal Component Score?

    It's a score from the table at the end of this post (see below).

    The above output from R means we can now plot each person's score across all subjects in a 2D graph as follows. First, we need to center the original variables by subtracting the column means:

          Maths    Science    English    Music
    John  -8.33       1.66       3.33       5
    Mike   1.66       1.66      13.33      -5
    Kate   6.66      -3.33     -16.66       0
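    A minimal sketch of the centering step in R, assuming the same data as before; `centered` is an illustrative name. The printed values are rounded, so 1.67 appears where the table above truncates to 1.66.

```r
# Exam percentages from the table above, with the students as row names
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))
rownames(DF) <- c("John", "Mike", "Kate")

# Subtract each column's mean; scale = FALSE leaves the units unchanged
centered <- scale(DF, center = TRUE, scale = FALSE)
print(round(centered, 2))

# After centering, every column mean is (numerically) zero
print(round(colMeans(centered), 10))
```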
    

    And then to form linear combinations to get PC1 and PC2 scores:

          x                                                               y
    John  0.28*(-8.33) + (-0.17)*1.66    + (-0.94)*3.33     + 0.07*5      0.77*(-8.33) + (-0.08)*1.66    + 0.19*3.33     + (-0.60)*5
    Mike  0.28*1.66    + (-0.17)*1.66    + (-0.94)*13.33    + 0.07*(-5)   0.77*1.66    + (-0.08)*1.66    + 0.19*13.33    + (-0.60)*(-5)
    Kate  0.28*6.66    + (-0.17)*(-3.33) + (-0.94)*(-16.66) + 0.07*0      0.77*6.66    + (-0.08)*(-3.33) + 0.19*(-16.66) + (-0.60)*0
    

    Which simplifies to:

            x       y
    John   -5.39   -8.90
    Mike  -12.74    6.78
    Kate   18.13    2.12
    

    There are six principal component scores in the table above. You can now plot the scores in a 2D graph to get a sense of the type of subjects each student is perhaps more suited to.
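    A plot along those lines could be sketched as follows. The scores are typed in from the table above, and the axis labels and title are illustrative choices.

```r
# Principal component scores copied from the table above
scores <- data.frame(x = c(-5.39, -12.74, 18.13),
                     y = c(-8.90, 6.78, 2.12),
                     row.names = c("John", "Mike", "Kate"))

# Scatter plot of the students in the plane of the first two components
plot(scores$x, scores$y, pch = 19,
     xlab = "PC1 score", ylab = "PC2 score",
     main = "Students in principal component space",
     xlim = range(scores$x) * 1.2, ylim = range(scores$y) * 1.2)

# Label each point with the student's name, placed just above the point
text(scores$x, scores$y, labels = rownames(scores), pos = 3)

# Dashed axes through the origin, which is the average student
abline(h = 0, v = 0, lty = 2)
```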

    The same output can be obtained in R by typing prcomp(DF, scale = FALSE)$x.
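    As a rough check of that claim, the sketch below multiplies the centered data by the coefficient matrix and compares the result with what `prcomp` stores in `x`. The names `centered` and `manual_scores` are illustrative. Because the sign of a component is arbitrary, a whole column may come out negated relative to the table above, but the manual and `prcomp` versions should agree with each other.

```r
# Exam percentages from the table above, with the students as row names
DF <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                 English = c(60, 70, 40), Music = c(55, 45, 50))
rownames(DF) <- c("John", "Mike", "Kate")
pca <- prcomp(DF, scale. = FALSE)

# Center the columns, then form the linear combinations by matrix product
centered <- scale(DF, center = TRUE, scale = FALSE)
manual_scores <- centered %*% pca$rotation[, 1:2]
print(round(manual_scores, 2))

# Compare with the scores computed by prcomp itself
print(all.equal(unname(manual_scores), unname(pca$x[, 1:2])))
```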

    EDIT 1: Hmm, I probably could have thought up a better example, and there is more to it than what I've put here, but I hope you get the idea.

    EDIT 2: full credit to @drpaulbrewer for his comment in improving this answer.

    Effort is commendable -- BUT -- neither PC1 nor PC2 tells you who did best in all subjects. To do so the PC subject coefficients would all have to be positive. PC1 has positive weights for Maths and Music but negative for Science and English. PC2 has positive weights for Maths and English but negative for Science and Music. What the PCs tell you is where the largest variance in the dataset lies. So by weighting the subjects by the coefficients in PC1, and using that to score the students, you get the biggest variance or spread in student behaviors. It can classify types but not performance.

    +1 good comment, cheers. You are of course correct, I should have written that better and have now edited the offending line to make it clear, I hope.

    You could standardise the vars, then calculate the sum, in order to see who's the best, or if you prefer, in R: `rowSums(scale(dtf))`

    Shouldn't the line "At the moment we have four variables which each represent just one subject" read "At the moment we have THREE variables which each represent just one subject" ?

    @JohnPrior The four variables (columns) are Maths, Science, English and Music, and the rows represent individuals. The term "subject" becomes ambiguous at times because five years ago I chose an awful example for an answer.

    @Tony, I went ahead and edited your answer to center the variables before computing the scores. Now the computed scores match what `prcomp` outputs. Before, they did not.

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM