How to deal with non-binary categorical variables in logistic regression (SPSS)
I have to do binary logistic regression with a lot of independent variables. Most of them are binary, but a few of the categorical variables have more than two levels.
What is the best way to deal with such variables?
For example, for a variable with three possible values, I suppose that two dummy variables have to be created. Then, in a stepwise regression procedure, is it better to test both dummy variables at the same time, or to test them separately?
I will use SPSS, but I do not remember it very well, so: how does SPSS deal with this situation?
Moreover, for an ordinal categorical variable, is it a good idea to use dummy variables that recreate the ordinal scale? (For example, using three dummy variables for a 4-state ordinal variable, put
0-0-0 for level $1$,
1-0-0 for level $2$,
1-1-0 for level $3$ and
1-1-1 for level $4$, instead of
0-0-0, 1-0-0, 0-1-0 and 0-0-1 for the 4 levels.)
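To make the comparison concrete, here is a minimal sketch of both schemes for a 4-level ordinal variable: ordinary indicator coding with level 1 as the reference, and the cumulative coding described in the question. The function names are hypothetical and the levels are assumed to be numbered from 1.

```python
def indicator_coding(level, n_levels):
    """Ordinary dummy coding: level 1 is the reference (all zeros)."""
    return [1 if level == j else 0 for j in range(2, n_levels + 1)]


def cumulative_coding(level, n_levels):
    """Cumulative coding: dummy j is 1 whenever the level is at least j."""
    return [1 if level >= j else 0 for j in range(2, n_levels + 1)]


# Print both codings side by side for the 4 levels.
for level in range(1, 5):
    print(level, indicator_coding(level, 4), cumulative_coding(level, 4))
```

Both schemes use three columns and, together with an intercept, span the same space, so they should give the same fitted probabilities. What changes is the reading of the coefficients: with cumulative coding each coefficient is the step from the previous level rather than the contrast with level 1.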
This is just a partial answer: even when you create the dummies explicitly (rather than using the software's implicit capabilities), keep them together in all analyses. In particular, they should all enter together and all leave together in a stepwise regression, with the p-value computed appropriately for the total number of variables involved. (This is Hosmer & Lemeshow's recommendation, anyway, and it makes a lot of sense.)
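One common way to test a set of dummies together is a likelihood-ratio test that compares the model with all of them against the model with none of them, on as many degrees of freedom as there are dummies. The sketch below assumes you already have the two log-likelihoods from your software; the function names and the numbers in the example are hypothetical, and this is not necessarily the exact test SPSS applies in its stepwise procedures.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Start from the closed form for df = 1 or df = 2, then step up by 2 df
    # at a time using Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / gamma(a + 1).
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(half))
    else:
        a, q = 1.0, math.exp(-half)
    while a < df / 2.0:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q


def lr_test(ll_reduced, ll_full, n_dummies):
    """Likelihood-ratio test for adding n_dummies indicator columns at once."""
    stat = 2.0 * (ll_full - ll_reduced)
    return stat, chi2_sf(stat, n_dummies)


# Hypothetical log-likelihoods for a 3-level variable (2 dummies).
stat, p_value = lr_test(ll_reduced=-120.0, ll_full=-115.0, n_dummies=2)
print(stat, p_value)
```

The point of the design is that the variable is judged as a whole: either both dummies enter (or leave) or neither does, and the p-value reflects that two parameters were spent.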
I wrote a post a while back on multinomial logistic regression resources in SPSS.
You're talking about your independent variables. It's only the dependent variable that needs to be binary for logistic regression.
The UCLA website has a bunch of great tutorials for every procedure broken down by the software type that you're familiar with. Check out Annotated SPSS Output: Logistic Regression -- the SES variable they mention is categorical (and not binary). SPSS will automatically create the indicator variables for you. There's also a page dedicated to Categorical Predictors in Regression with SPSS which has specific information on how to change the default codings and a page specific to Logistic Regression.
Logistic regression is a pretty flexible method. It can readily use categorical variables as independent variables, and most software that fits logistic regression should let you do so.
As an example, let's say one of your categorical variables is temperature, defined in three categories: cold/mild/hot. As you suggest, you could interpret that as three separate dummy variables, each with a value of 1 or 0. But the software should let you use a single categorical variable instead, with the text values cold/mild/hot. The logit regression would then derive a coefficient (or constant) for each of the three temperature conditions. If one is not significant, the software or the user could readily take it out (after observing the t stat and p-value).
The main benefit of grouping the categories into a single categorical variable is model efficiency. A single column in your model can handle as many categories as needed for a single categorical variable. If you instead use a dummy variable for each category of a categorical variable, your model can quickly grow to have numerous columns that are superfluous given the alternative just mentioned.
@Gaetan I do not understand the remark about a single column vs multiple columns. Are you suggesting that categorical variables should be coded as 1, 2, 3, etc. in a single column instead of using dummy variables? I am not sure that makes sense to me, as you are then imposing an implicit constraint that the difference in the effect on the dv between levels 1 and 2 is the same as the difference in the effect on the dv between levels 2 and 3. Perhaps I am missing something.
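The constraint can be written out explicitly. Here $p$ is the probability of the outcome, $x \in \{1, 2, 3\}$ is the single numeric column, and $d_2$, $d_3$ are indicators for levels 2 and 3; the symbols $\beta_0$, $\beta_1$, $\gamma_2$, $\gamma_3$ are notation introduced for this sketch only. With the single numeric column,

$$
\operatorname{logit} p = \beta_0 + \beta_1 x
\quad\Longrightarrow\quad
\operatorname{logit} p\,\big|_{x=2} - \operatorname{logit} p\,\big|_{x=1}
= \operatorname{logit} p\,\big|_{x=3} - \operatorname{logit} p\,\big|_{x=2}
= \beta_1 ,
$$

whereas with dummy variables,

$$
\operatorname{logit} p = \beta_0 + \gamma_2 d_2 + \gamma_3 d_3 ,
$$

so the step from level 1 to level 2 is $\gamma_2$ and the step from level 2 to level 3 is $\gamma_3 - \gamma_2$, which are free to differ.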
@Gaetan I am not sure I follow you. How exactly does XLStat transform the 'text' values of cold, mild or hot into numerical values for the purpose of estimation? If there is a method that will let you estimate the effects of categorical variables without using dummy variables surely that should be independent of the software you use as there should be some underlying conceptual/model based logic.
@Gaetan I don't follow your point unless you mean that your ordinal variable is treated as a continuous one (this sometimes makes sense, although we are then assuming that the variable has the properties of an interval scale, as pointed out by @Srikant). Usually, a variable with $k$ levels is represented in the design matrix as $k-1$ columns, and I think this is quite independent of the software used (surely XLStat takes care of constructing the correct design matrix, as R, SPSS or Stata do).
@Srikant. In its demo, XLStat used as an example a model where the dependent variable was the probability of renewing a subscription. One categorical variable, held in a single column, had 6 different age ranges of subscribers. Using a maximum likelihood algorithm, it can interpret each specific age range as a separate data set, equivalent to a separate dummy variable in its own column. When you choose that specific column, you just have to state that it is a "qualitative" variable (instead of a "quantitative" one). From everyone's comments, I gather this is not something you can code in SPSS.
@Gaetan Ok, in this case, the same can be done in SPSS (you have the choice between numerical/ordinal/nominal for each variable) -- then, the design matrix is constructed accordingly.
@Gaetan @chl To summarize my understanding: the features of SPSS and XLStat whereby you can specify the measurement scale (nominal, ordinal, etc.) decrease the data file size. However, in both cases, the software uses the correct coding scheme (e.g., expanding a nominal variable with J categories into J-1 dummy variables) in the background as part of the estimation process. Would that be a fair assessment of the situation?
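As a rough illustration of that background step, the sketch below expands a single column of text labels with J categories into J-1 indicator columns. The function name is hypothetical, and the choice of reference level (here the first in sorted order unless one is given) is an assumption; actual packages have their own defaults.

```python
def dummy_columns(values, reference=None):
    """Expand one categorical column into J-1 indicator columns."""
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    # Every level except the reference gets its own 0/1 column.
    kept = [lev for lev in levels if lev != reference]
    rows = [[1 if v == lev else 0 for lev in kept] for v in values]
    return kept, rows


temps = ["cold", "mild", "hot", "mild", "cold"]
names, rows = dummy_columns(temps, reference="cold")
print(names)
for label, row in zip(temps, rows):
    print(label, row)
```

The data file keeps one column of labels, while the estimation works on the expanded columns; a row of all zeros stands for the reference level.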
@Srikant, you are summarizing the situation correctly. I may have confused everyone with an earlier comment in which I used "nominal" incorrectly. I thought this adjective related to real numbers, so to speak. I now realize that nominal just means a label, which is what it should be when dealing with qualitative categorical variables.
Student T, I am not sure that is the case. Your example feels like seasonality. When you introduce seasonality variables in a time series model, you do not include dummy variables for all 12 months in the model; you have to leave one out. But let's say you are dealing with 3 different US states: CA, AZ, and TX. In such a case, I don't think you would have to leave one out. I am not entirely sure of my position. You are welcome to correct me if I am wrong. And maybe my 3-state example is a bit different from a 3-level categorical variable. Maybe in the latter, you would do as you suggest.
As far as I understand, it is good to use dummy variables for categorical/nominal data, while for ordinal data we can use a coding of 1, 2, 3 for the different levels. A dummy variable is coded 1 if it is true for a particular observation and 0 otherwise. Also, the number of dummy variables is one less than the number of levels; for example, a binary variable needs 1. An observation with all dummies equal to 0 automatically belongs to the level that has no dummy of its own.