When is it ok to remove the intercept in a linear regression model?
I am running linear regression models and wondering what the conditions are for removing the intercept term.
In comparing results from two different regressions where one has the intercept and the other does not, I notice that the $R^2$ of the function without the intercept is much higher. Are there certain conditions or assumptions I should be following to make sure the removal of the intercept term is valid?
@chl thanks for editing my question. Are there things that I should be clarifying or rewording in any future questions?
Your question is well stated. @chl kindly improved some formatting, that's all. It involved TeXifying the "R^2" (it was turned into $\$$R^2$\$$, which renders as $R^2$).
What would the intercept mean in your model? From the information in your question, it seems it would be the expected value of your response when sqft=0 and lotsize=0 and baths=0. Is that ever going to occur in reality?
The intercept has no meaning. So when writing up a formula for the expected value of a house, can I leave it out?
**NB**: Some of these comments and replies address essentially the same question (framed in the context of a housing price regression) which was merged with this one as a duplicate.
The shortest answer: never, unless you are sure that your linear approximation of the data generating process (the linear regression model) is forced to go through the origin, for theoretical or other reasons. If it is not, the other regression parameters will be biased even if the intercept is statistically insignificant (strange, but it is so; consult Brooks, Introductory Econometrics, for instance). Finally, as I often explain to my students, by keeping the intercept term you ensure that the residual term is zero-mean.
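Both claims can be checked on simulated data. The following is only a minimal sketch, assuming a made-up data generating process $y = 5 + 2x + \varepsilon$ (the coefficients, sample size and seed are illustrative choices, not taken from the question). It fits the model with and without an intercept by ordinary least squares and compares the slope estimates and the residual means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generating process: y = 5 + 2*x + noise,
# so the true intercept is NOT zero (assumed for illustration).
n = 500
x = rng.uniform(1.0, 10.0, size=n)
y = 5.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)

# Fit with an intercept: the design matrix has a column of ones.
X_with = np.column_stack([np.ones(n), x])
beta_with, *_ = np.linalg.lstsq(X_with, y, rcond=None)
resid_with = y - X_with @ beta_with

# Fit through the origin: the design matrix contains only x.
X_without = x.reshape(-1, 1)
beta_without, *_ = np.linalg.lstsq(X_without, y, rcond=None)
resid_without = y - X_without @ beta_without

slope_with = beta_with[1]
slope_without = beta_without[0]
mean_resid_with = resid_with.mean()
mean_resid_without = resid_without.mean()

print("slope with intercept:         ", slope_with)
print("slope without intercept:      ", slope_without)
print("mean residual with intercept: ", mean_resid_with)
print("mean residual without:        ", mean_resid_without)
```

Under these assumptions the slope from the fit through the origin drifts away from 2 because it has to absorb the omitted constant, and only the fit with the intercept has residuals that average to zero (up to floating point error).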
For your two-model case we need more context. It may be that a linear model is not suitable here. For example, you need to log-transform first if the model is multiplicative. With exponentially growing processes it may occasionally happen that $R^2$ for the model without the intercept is "much" higher.
Screen the data and test the model with the RESET test or any other linear specification test; this may help to see whether my guess is true. Also, when building models, the highest $R^2$ is one of the last statistical properties I really care about, but it is nice to present to people who are not so familiar with econometrics (there are many dirty tricks to make the coefficient of determination close to 1 :)).
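One further possible reason for a much higher $R^2$ without the intercept is purely computational. Some software (R's `summary.lm`, for example) switches to the uncentered total sum of squares $\sum y_i^2$ when the intercept is omitted, so the two $R^2$ values are not measured against the same baseline. Whether this applies to your case depends on the software you used. The sketch below uses made-up data (a large mean response and a weak slope, both assumed) to show how different the two conventions can be.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: large mean response, weak dependence on x (assumed).
n = 200
x = rng.uniform(1.0, 10.0, size=n)
y = 100.0 + 0.5 * x + rng.normal(0.0, 5.0, size=n)


def rss(X, y):
    """Residual sum of squares of the least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)


rss_with = rss(np.column_stack([np.ones(n), x]), y)  # with intercept
rss_without = rss(x.reshape(-1, 1), y)               # through the origin

tss_centered = float(((y - y.mean()) ** 2).sum())    # baseline: mean of y
tss_uncentered = float((y ** 2).sum())               # baseline: zero

r2_with = 1.0 - rss_with / tss_centered
r2_without_uncentered = 1.0 - rss_without / tss_uncentered
r2_without_centered = 1.0 - rss_without / tss_centered

print("R^2 with intercept (centered):      ", r2_with)
print("R^2 without intercept (uncentered): ", r2_without_uncentered)
print("R^2 without intercept (centered):   ", r2_without_centered)
```

The fit through the origin has the larger residual sum of squares, yet its uncentered $R^2$ comes out far higher, simply because it is compared against predicting zero rather than predicting the mean.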
@Curious, "never" is written with "unless"; the examples below just show the exceptions when it is legitimate to remove the intercept. When you don't know the data generating process or theory, or are not forced to go through the origin by standardization or any other special model, keep it. Keeping the intercept is like using a trash bin to collect all the distortions caused by linear approximation and other simplifications. P.S. In practice, the response shows that you read just the shortest answer :) Thanks a lot to Joshua (+1) for the extended examples.
You missed the point of Joshua's Example 1 and still seem to ignore it completely. In models with a categorical covariate, removing the intercept results in the same model with just a different parametrization. This is a legitimate case in which the intercept can be removed.
@Curious, in Joshua's Example 1, you need to add a new dummy variable for the level of the categorical variable you previously considered as the baseline, and this new dummy variable will take the value of the intercept, so you are NOT removing the intercept, just renaming it and reparameterizing the rest of the parameters of the categorical covariate. Therefore Dmitrij's argument holds.
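The reparameterization discussed in these last two comments can be checked numerically. This is a minimal sketch on made-up data (a single three-level factor with assumed level means): it fits the model once with an intercept plus two dummies and once with no intercept and three dummies, then compares fitted values and coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical categorical covariate with three levels (0, 1, 2)
# and level-specific means 10, 12 and 15 (assumed for illustration).
n = 90
levels = rng.integers(0, 3, size=n)
y = np.array([10.0, 12.0, 15.0])[levels] + rng.normal(0.0, 1.0, size=n)

# Full set of indicator columns, one per level.
dummies = np.eye(3)[levels]

# Parametrization A: intercept plus dummies for levels 1 and 2
# (level 0 is the baseline absorbed by the intercept).
X_a = np.column_stack([np.ones(n), dummies[:, 1:]])

# Parametrization B: no intercept, one dummy for every level.
X_b = dummies

beta_a, *_ = np.linalg.lstsq(X_a, y, rcond=None)
beta_b, *_ = np.linalg.lstsq(X_b, y, rcond=None)

fitted_a = X_a @ beta_a
fitted_b = X_b @ beta_b

print("A (intercept, contrasts):", beta_a)
print("B (level means):         ", beta_b)
print("same fitted values:      ", np.allclose(fitted_a, fitted_b))
```

The fitted values coincide, the intercept in A equals the coefficient of the baseline dummy in B, and each remaining coefficient in B is the intercept plus the corresponding contrast in A. In that sense both comments describe the same fact: the model is unchanged, and the intercept has only been renamed.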