Choosing variables to include in a multiple linear regression model
I am currently working to build a model using a multiple linear regression. After fiddling around with my model, I am unsure how to best determine which variables to keep and which to remove.
My model started with 10 predictors for the DV. When using all 10 predictors, four were considered significant. If I remove only some of the obviously-incorrect predictors, some of my predictors that were not initially significant become significant. Which leads me to my question: How does one go about determining which predictors to include in their model? It seemed to me you should run the model once with all predictors, remove those that are not significant, and then rerun. But if removing only some of those predictors makes others significant, I am left wondering if I am taking the wrong approach to all this.
I believe that this thread is similar to my question, but I am unsure I am interpreting the discussion correctly. Perhaps this is more of an experimental design topic, but maybe someone has some experience they can share.
The answer depends heavily on your goals and requirements: are you looking for simple association, or are you aiming for prediction? How much does interpretability matter to you? Do you have any information on the variables from other publications that could influence the process? What about interactions or transformed versions of the variables: can you include those? You need to give more detail on what you're trying to do to get a good answer.
Based on what you asked, this will be for prediction. Influence on other variables just offers possible association. There are no interactions between them. Only one value needs to be transformed, and it has been done.
Is there a theory that says what predictors you should include? If you have a lot of variables that you have measured, and no theory, I would recommend holding out a set of observations so you can test your model on data that was not used to create it. It is not correct to test and validate a model on the same data.
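A minimal sketch of setting aside a hold-out set, in Python with only the standard library. The function name `holdout_split`, the 20% test fraction, and the row count of 100 are illustrative assumptions, not anything from the question.

```python
import random


def holdout_split(n_rows, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)       # shuffle so the split is random
    n_test = round(n_rows * test_fraction)
    return sorted(idx[n_test:]), sorted(idx[:n_test])


# Hypothetical data set with 100 observations.
train_idx, test_idx = holdout_split(100)
print(len(train_idx), "rows to build the model,", len(test_idx), "rows to test it")
```

All model building, including any variable selection, would then use only the training rows; the test rows are touched once, at the end.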
Cross validation (as Nick Sabbe discusses), penalized methods (Dikran Marsupial), or choosing variables based on prior theory (Michelle) are all options. But note that variable selection is intrinsically a very difficult task. To understand why it is so potentially fraught, it may help to read my answer here: algorithms-for-automatic-model-selection. Lastly, it's worth recognizing that the problem lies with the logical structure of this activity, not with whether the computer does it for you automatically or you do it manually yourself.
Also check out the answers to this post: http://stats.stackexchange.com/questions/34769/regression-detecting-significant-predictors-out-of-300-independent-variables
Based on your reaction to my comment:
You are looking for prediction. Thus, you should not really rely on the (in)significance of the coefficients. You would do better to:
- Pick a criterion that best describes your prediction needs (e.g. misclassification rate, AUC of ROC, some weighted form of these, ...)
- For each model of interest, evaluate this criterion. This can be done e.g. by providing a validation set (if you're lucky or rich), through cross-validation (typically tenfold), or whatever other options your criterion of interest allows. If possible, also find an estimate of the SE of the criterion for each model (e.g. by using the values over the different folds in cross-validation).
- Now you can pick the model with the best value of the criterion, though it is typically advised to pick the most parsimonious model (fewest variables) that is within one SE of the best value.
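The three steps above could be sketched roughly as follows, in Python with only the standard library. Everything specific here is an assumption made for illustration: the data are simulated (5 predictors, of which only the first two matter), the criterion is cross-validated mean squared error (a natural choice for linear regression, where lower is better), and the helper names `fit_ols`, `cv_fold_mse`, and `one_se_choice` are made up for this sketch.

```python
import itertools
import math
import random
import statistics


def fit_ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * t for row, t in zip(X, y))] for i in range(p)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]


def design(rows, cols):
    """Intercept column plus the selected predictor columns."""
    return [[1.0] + [row[c] for c in cols] for row in rows]


def cv_fold_mse(X, y, cols, k=10):
    """Step 2: mean squared prediction error on each of k held-out folds."""
    n = len(y)
    errors = []
    for fold in range(k):
        # Rows are simulated i.i.d., so assigning folds by index is random enough.
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        beta = fit_ols(design([X[i] for i in train], cols), [y[i] for i in train])
        preds = [sum(b * v for b, v in zip(beta, row))
                 for row in design([X[i] for i in test], cols)]
        errors.append(statistics.mean([(y[i] - q) ** 2 for i, q in zip(test, preds)]))
    return errors


def one_se_choice(fold_errors):
    """Step 3: smallest model whose mean error is within one SE of the best."""
    summary = {m: (statistics.mean(e), statistics.stdev(e) / math.sqrt(len(e)))
               for m, e in fold_errors.items()}
    best = min(summary, key=lambda m: summary[m][0])
    threshold = summary[best][0] + summary[best][1]
    eligible = [m for m in summary if summary[m][0] <= threshold]
    chosen = min(eligible, key=lambda m: (len(m), summary[m][0]))
    return best, chosen, summary


# Simulated data (an assumption): only predictors 0 and 1 affect y.
rng = random.Random(0)
n, p = 200, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 1) for row in X]

# Every non-empty subset of the predictors is a candidate model.
models = [cols for size in range(1, p + 1)
          for cols in itertools.combinations(range(p), size)]
fold_errors = {cols: cv_fold_mse(X, y, cols) for cols in models}
best, chosen, summary = one_se_choice(fold_errors)
print("lowest CV error:", best)
print("one-SE choice:  ", chosen)
```

The SE here is simply the standard deviation of the per-fold errors divided by the square root of the number of folds, which is the usual rough estimate rather than an exact one.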
Wrt each model of interest: herein lies quite a catch. With 10 potential predictors, that is a truckload of potential models (2^10 = 1024 subsets, before you even consider interactions or transformations). If you've got the time or the processors for this (or if your data is small enough that models get fit and evaluated fast enough): have a ball. If not, you can go about this by educated guesses, forward or backward selection (but using the criterion instead of significance), or better yet: use some algorithm that picks a reasonable set of models. One algorithm that does this is penalized regression, in particular Lasso regression. If you're using R, just load the glmnet package and you're about ready to go.
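To give a feel for how the Lasso "picks a reasonable set of models", here is a rough coordinate-descent sketch in Python with only the standard library. This is not glmnet and not how you would do it in practice; it only illustrates the idea that, as the penalty `lam` shrinks, more predictors enter the model, so the sequence of penalties traces out a short path of candidate models. The simulated data (10 predictors, only the first two relevant), the function names, and the chosen penalty values are all assumptions made for illustration.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)*||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.

    Assumes the columns of X and the response y have been centred,
    so no intercept is needed.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, starting from beta = 0
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of column j with the residual that ignores predictor j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]


# Simulated data (an assumption): 10 predictors, only 0 and 1 affect y.
rng = random.Random(1)
n, p = 200, 10
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [2.0 * row[0] - 1.5 * row[1] + rng.gauss(0, 1) for row in X]

cols = [center([row[j] for row in X]) for j in range(p)]
Xc = [[cols[j][i] for j in range(p)] for i in range(n)]
yc = center(y)

# Smallest penalty at which every coefficient is exactly zero.
lam_max = max(abs(sum(Xc[i][j] * yc[i] for i in range(n))) / n for j in range(p))

for lam in (lam_max, 0.5, 0.05):
    beta = lasso_cd(Xc, yc, lam)
    kept = [j for j, b in enumerate(beta) if b != 0.0]
    print(f"lambda={lam:.3f} keeps predictors {kept}")
```

Each penalty value then plays the role of a "model of interest" to be scored by cross-validation. In glmnet, `cv.glmnet` does this scoring for you and reports both `lambda.min` and `lambda.1se`, the latter being the one-SE choice described above.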
+1, but could you explain why exactly you would "pick the most parsimonious model (fewest variables) that is within one SE of the best value"?
Parsimony is, in most situations, a desirable property: it improves interpretability and reduces the number of measurements you need to make on a new subject in order to use the model. The other side of the story is that what you get for your criterion is only an estimate, with its own SE: I've seen quite a few plots of the criterion estimates against some tuning parameter where the 'best' value was just an exceptional peak. As such, the 1 SE rule (which is arbitrary, but an accepted practice) protects you from this, with the added benefit of more parsimony.