### Can principal component analysis be applied to datasets containing a mix of continuous and categorical variables?

• I have a dataset that contains both continuous and categorical variables. I am analyzing it with PCA and am wondering whether it is fine to include the categorical variables in the analysis. My understanding is that PCA can only be applied to continuous variables. Is that correct? If it cannot be used for categorical data, what alternatives exist for analyzing them?

• Although a PCA applied to binary data would yield results comparable to those obtained from a Multiple Correspondence Analysis (factor scores and eigenvalues are linearly related), there are more appropriate techniques for dealing with mixed data types, namely factor analysis of mixed data, available in the FactoMineR R package (AFDM(), renamed FAMD() in recent versions). If your variables can be considered as structured subsets of descriptive attributes, then Multiple Factor Analysis (MFA()) is also an option.

The challenge with categorical variables is to find a suitable way to represent distances between variable categories and individuals in the factorial space. To overcome this problem, you can look for a non-linear transformation of each variable--whether it be nominal, ordinal, polynomial, or numerical--with optimal scaling. This is well explained in Gifi Methods for Optimal Scaling in R: The Package homals, and an implementation is available in the corresponding R package homals.

chl, thanks for the pointer to AFDM. I was wondering, though: once I apply AFDM to a data set (obj <- AFDM(x)), I can access the transformed data set easily via obj$ind$coord. However, if I want to apply the _same_ transformation to another data set, how can I do so? (This is necessary, for example, if I have a training set, find the "principal components" from that training set, and then want to look at the test set through those "principal components".) The documentation isn't really clear on this, and the paper the function is based on is in French.

Regarding "Although a PCA applied on binary data would yield results comparable to those obtained from a Multiple Correspondence Analysis": can we not convert a nominal categorical variable (say, with cardinality N) into a collection of N-1 dummy binaries and then perform PCA on that data? (I understand there are more appropriate techniques.)
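To sketch the dummy-coding idea from the comment above, here is a minimal Python/numpy example (the data and the choice of reference level are made up for illustration): a 3-level nominal variable is coded into N-1 = 2 binary columns, combined with a continuous column, and the centered matrix is run through PCA via the SVD.

```python
import numpy as np

# Toy nominal variable with 3 levels, plus one continuous column (made-up data).
levels = np.array(["red", "green", "blue", "green", "red", "blue"])
x_cont = np.array([1.2, 0.7, 3.1, 0.9, 1.5, 2.8])

# Dummy-code the nominal variable into N-1 = 2 binary columns,
# dropping "blue" as the reference category.
categories = ["green", "red"]  # reference level "blue" omitted
dummies = np.column_stack([(levels == c).astype(float) for c in categories])

# Assemble the mixed design matrix, center it, and run PCA via the SVD.
X = np.column_stack([x_cont, dummies])
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                    # principal component scores
explained = s**2 / np.sum(s**2)   # proportion of variance per component
```

Note that whether dropping a reference level, scaling the dummies, or standardizing the continuous column is appropriate depends on the analysis; the dedicated mixed-data methods mentioned in this thread make those choices in a principled way.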

• A Google search for "pca for discrete variables" gives this nice overview by S. Kolenikov (@StasK) and G. Angeles. To add to chl's answer, principal component analysis is really an analysis of the eigenvectors of the covariance matrix. So the problem is how to calculate a "correct" covariance matrix. One of the approaches is to use polychoric correlations.

(+1) Thanks for the link. It is also possible to consider a heterogeneous correlation matrix (see, e.g., hetcor() from the polycor package). Provided the variance-covariance matrix is positive semi-definite, it should do the job, mostly in the spirit of factor analysis. Nominal variables might be dummy coded.

@StasK, kudos :) It seems I am not the only one who found this talk useful; otherwise it would not be at the top of the Google search results. This question pops up from time to time, so maybe you want to write a blog post about it for our community blog?

@StasK, I've edited the post to mention the authors of the overview. My initial intention was to demonstrate that searching on Google can turn up good answers, so there is no explicit need to ask here. But that is no excuse for not citing the authors, given the volatility of the internet.

@mpiktas, thanks. There was a real article aimed at economists produced from this work: http://dx.doi.org/10.1111/j.1475-4991.2008.00309.x, although the editors asked us to cut so much out that I suggest reading the working paper for the details and citing the published one.

• I would suggest having a look at Linting & van der Kooij (2012), "Nonlinear principal components analysis with CATPCA: A tutorial", Journal of Personality Assessment, 94(1).

Abstract

This article is set up as a tutorial for nonlinear principal components analysis (NLPCA), systematically guiding the reader through the process of analyzing actual data on personality assessment by the Rorschach Inkblot Test. NLPCA is a more flexible alternative to linear PCA that can handle the analysis of possibly nonlinearly related variables with different types of measurement level. The method is particularly suited to analyze nominal (qualitative) and ordinal (e.g., Likert-type) data, possibly combined with numeric data. The program CATPCA from the Categories module in SPSS is used in the analyses, but the method description can easily be generalized to other software packages.

• I do not yet have the privilege to comment on someone else's post, so I am adding my comment as a separate answer; please bear with me.

Continuing on what @Martin F commented: I recently came across nonlinear PCAs. I was looking into nonlinear PCA as a possible alternative for when a continuous variable approaches the distribution of an ordinal variable as the data get sparser. (This happens a lot in genetics: as the minor allele frequency of a variable gets lower and lower, you are left with very low counts, for which you can't really justify treating the variable as continuous, and you have to loosen the distributional assumptions by treating it as either ordinal or categorical.) Nonlinear PCA can handle both of these conditions, but after discussing with statistical maestros in the genetics faculty, the consensus was that nonlinear PCAs are not used very often and their behavior has not yet been tested extensively (they may have been referring only to the genetics field, so please take it with a grain of salt). It is indeed a fascinating option. I hope I have added two (fortunately relevant) cents to the discussion.

Welcoming your answer, Mandar. Are you referring to nonlinear PCA via the CATPCA method, or to another nonlinear PCA (if so, which method)? Note also that for _binary_ variables CATPCA is, say, useless or trivial, because a dichotomous scale cannot be quantified as anything other than a dichotomy!

Thank you, @ttnphns. I agree with your point about binary variables: for a binary variable, the scaling assumption does not matter. Otherwise, I was actually referring to a book chapter, "Introduction to nonlinear PCA" (link). It refers mainly to CATPCA and to the PRINQUAL procedure in SAS.

• There is a recently developed approach to such problems: Generalized Low Rank Models.

One of the papers using this technique is even called "PCA on a Data Frame".

PCA can be posed like this:

For an $n \times m$ matrix $M$,

find an $n \times k$ matrix $\hat{X}$ and a $k \times m$ matrix $\hat{Y}$ (this encodes the rank-$k$ constraint implicitly) such that

$\hat{X}, \hat{Y} = \underset{X, Y}{\operatorname{argmin}} \| M - XY \|_F^2$.

The "generalized" in GLRM refers to replacing $\| \cdot \|_F^2$ with other loss functions and adding a regularization term.
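Under the plain squared Frobenius loss, this argmin has a closed-form solution: the truncated SVD (the Eckart-Young theorem). A quick numpy check on random data; the dimensions and $k$ here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 8, 5, 2
M = rng.normal(size=(n, m))

# Truncated SVD gives the optimal rank-k factorization under squared
# Frobenius loss (Eckart-Young theorem).
U, s, Vt = np.linalg.svd(M, full_matrices=False)
X_hat = U[:, :k] * s[:k]   # n x k factor
Y_hat = Vt[:k, :]          # k x m factor

best_err = np.linalg.norm(M - X_hat @ Y_hat) ** 2
# The optimal error equals the sum of the squared discarded singular values.
residual = np.sum(s[k:] ** 2)
```

Once the loss or the regularizer changes (as in GLRM), this closed form is lost and the factors are found by alternating minimization instead.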

This sounds more like reinvention than a new idea. Search for Gifi!

You're not exactly right; it seems that GLRM is a generalization (in fact, the paper I linked cites the Gifi package paper).

• PCAmixdata #Rstats package:

Implements principal component analysis, orthogonal rotation and multiple factor analysis for a mixture of quantitative and qualitative variables.

The example from the vignette shows output for both the continuous and the categorical variables.
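For intuition about what such mixed-data PCA methods do, here is a rough numpy sketch of one common recipe (FAMD-style weighting: standardize the numeric columns; divide each category indicator column by the square root of its category proportion, then center). The data are made up, and this is a simplified illustration, not the exact algorithm of the PCAmixdata package.

```python
import numpy as np

# Made-up mixed data: one numeric column and one 3-level categorical column.
num = np.array([2.0, 3.5, 1.0, 4.0, 2.5, 3.0])
cat = np.array(["a", "b", "a", "c", "b", "a"])

# Numeric part: standardize to zero mean and unit variance.
z = (num - num.mean()) / num.std()

# Categorical part: full indicator matrix, each column divided by the
# square root of its category proportion (FAMD-style weighting), then
# centered, so that rare categories are not drowned out by common ones.
cats = np.unique(cat)
G = np.column_stack([(cat == c).astype(float) for c in cats])
p = G.mean(axis=0)        # category proportions
Gw = G / np.sqrt(p)
Gw = Gw - Gw.mean(axis=0)

# Joint PCA on the combined matrix via the SVD.
Z = np.column_stack([z, Gw])
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s            # coordinates of individuals on the components
```

The point of the weighting is to put chi-square-like distances between categories on a footing comparable to the standardized numeric columns, which is what lets a single PCA treat both variable types at once.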