When conducting multiple regression, when should you center your predictor variables & when should you standardize them?

  • In some literature, I have read that a regression with multiple explanatory variables needs to be standardized if the variables are in different units. (Standardizing consists of subtracting the mean and dividing by the standard deviation.) In which other cases do I need to standardize my data? Are there cases in which I should only center my data (i.e., without dividing by standard deviation)?

    A related post in Andrew Gelman's blog.

    In addition to the great answers already given, let me mention that when using penalization methods such as ridge regression or lasso the result is no longer invariant to standardization. It is, however, often recommended to standardize. In this case not for reasons directly related to interpretations, but because the penalization will then treat different explanatory variables on a more equal footing.
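    A minimal sketch of the point above, using a closed-form ridge estimate in base R rather than any particular package. The helper `ridge_coef`, the simulated data, and the choice `lambda = 50` are assumptions made for illustration only.

```r
# Closed-form ridge estimate on centered data (intercept left unpenalized).
# `ridge_coef` is a hypothetical helper, not a library function.
ridge_coef <- function(X, y, lambda) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  yc <- y - mean(y)
  drop(solve(crossprod(Xc) + lambda * diag(ncol(Xc)), crossprod(Xc, yc)))
}

set.seed(42)
n  <- 200
x1 <- rnorm(n)                    # unit scale
x2 <- rnorm(n, sd = 1000)         # comparable signal, much larger units
y  <- 1 + 2 * x1 + 0.002 * x2 + rnorm(n)
X  <- cbind(x1 = x1, x2 = x2)

lambda <- 50
raw <- ridge_coef(X, y, lambda)          # penalty acts on original units
std <- ridge_coef(scale(X), y, lambda)   # penalty acts per standard deviation

# Express both fits as "change in y per one SD of the predictor".
sds        <- apply(X, 2, sd)
per_sd_raw <- raw * sds
per_sd_std <- std
print(rbind(per_sd_raw, per_sd_std))

# With lambda = 0 (plain OLS) the two parameterizations agree;
# with lambda > 0 they do not.
ols_gap   <- max(abs(ridge_coef(X, y, 0) * sds - ridge_coef(scale(X), y, 0)))
ridge_gap <- max(abs(per_sd_raw - per_sd_std))
```

    In this sketch the large-unit predictor is barely shrunk on the raw scale, while both predictors are shrunk by a similar amount after standardizing, which is the "more equal footing" mentioned above.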

    Welcome to the site @mathieu_r! You've posted two very popular questions. Please consider upvoting/accepting some of the excellent answers you've received to both questions ;)

    When I read this Q&A it reminded me of a usenet site I stumbled on many years ago http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-16.html This gives in simple terms some of the issues and considerations when one wants to normalize/standardize/rescale the data. I didn't see it mentioned anywhere on the answers here. It treats the subject from more of a machine learning perspective, but it might help someone coming here.

  • In regression, it is often recommended to center the variables so that the predictors have mean $0$. This makes it so the intercept term is interpreted as the expected value of $Y_i$ when the predictor values are set to their means. Otherwise, the intercept is interpreted as the expected value of $Y_i$ when the predictors are set to 0, which may not be a realistic or interpretable situation (e.g. what if the predictors were height and weight?). Another practical reason for scaling in regression is when one variable has a very large scale, e.g. if you were using population size of a country as a predictor. In that case, the regression coefficients may be on a very small order of magnitude (e.g. $10^{-6}$) which can be a little annoying when you're reading computer output, so you may convert the variable to, for example, population size in millions. The convention that you standardize predictors primarily exists so that the units of the regression coefficients are the same.

    As @gung alludes to and @MånsT shows explicitly (+1 to both, btw), centering/scaling does not affect your statistical inference in regression models - the estimates are adjusted appropriately and the $p$-values will be the same.
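    A small check of these claims on simulated data (the `height` and `weight` values below are made up for illustration): the slope t statistics and p-values are identical across raw, centered, and standardized fits, and after centering the intercept equals the mean of the response.

```r
set.seed(1)
n      <- 100
height <- rnorm(n, mean = 170, sd = 10)   # cm (made-up data)
weight <- rnorm(n, mean = 70,  sd = 12)   # kg (made-up data)
y      <- 5 + 0.3 * height - 0.2 * weight + rnorm(n, sd = 3)

fit_raw <- lm(y ~ height + weight)
fit_ctr <- lm(y ~ I(height - mean(height)) + I(weight - mean(weight)))
fit_std <- lm(y ~ scale(height) + scale(weight))

# t statistics and p-values of the two slopes (rows 2 and 3 of the table)
slopes <- function(fit) {
  unname(coef(summary(fit))[2:3, c("t value", "Pr(>|t|)")])
}

slopes(fit_raw)
slopes(fit_ctr)
slopes(fit_std)

# Intercept of the centered fit versus the sample mean of y
c(intercept = unname(coef(fit_ctr)[1]), mean_y = mean(y))
```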

    Other situations where centering and/or scaling may be useful:

    • when you're trying to sum or average variables that are on different scales, perhaps to create a composite score of some kind. Without scaling, it may be the case that one variable has a larger impact on the sum due purely to its scale, which may be undesirable.

    • To simplify calculations and notation. For example, the sample covariance matrix of a matrix of values centered by their sample means is simply $X'X$. Similarly, if a univariate random variable $X$ has been mean centered, then ${\rm var}(X) = E(X^2)$ and the variance can be estimated from a sample by looking at the sample mean of the squares of the observed values.

    • Related to the aforementioned, PCA can only be interpreted as the singular value decomposition of a data matrix when the columns have first been centered by their means.

    Note that scaling is not necessary in the last two bullet points I mentioned and centering may not be necessary in the first bullet I mentioned, so the two do not need to go hand in hand at all times.
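    The matrix identities in the last two bullets can be checked numerically. This is a sketch on made-up data; note that the cross-product of the centered matrix needs the usual $1/(n-1)$ factor to match the sample covariance matrix, and that standardizing as well turns the same cross-product into the correlation matrix.

```r
set.seed(2)
X  <- cbind(a = rnorm(50, 10, 2), b = rnorm(50, -3, 5), c = runif(50))
n  <- nrow(X)
Xc <- scale(X, center = TRUE, scale = FALSE)   # subtract column means only

# Sample covariance matrix from the centered cross-product
S <- crossprod(Xc) / (n - 1)

# Correlation matrix from the centered and standardized cross-product
R <- crossprod(scale(X)) / (n - 1)

# PCA standard deviations from the SVD of the centered matrix
sv <- svd(Xc)$d / sqrt(n - 1)
pc <- prcomp(X, center = TRUE, scale. = FALSE)$sdev

print(rbind(from_svd = sv, from_prcomp = pc))
```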

    +1, these are good points I didn't think of. For clarity, let me list some concrete examples where a researcher might want to combine explanatory variables prior to running a regression, & thus need to standardize. One case might be for research into children's behavioral disorders; researchers might get ratings from both parents & teachers, & then want to combine them into a single measure of maladjustment. Another case could be a study on the activity level at a nursing home w/ self-ratings by residents & the number of signatures on sign-up sheets for activities.
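    To make the composite-score case concrete, here is a sketch with invented ratings on two different scales. Without scaling, the sum is driven mostly by the variable with the larger spread; after standardizing, both components contribute equally.

```r
# Made-up ratings on very different scales
parent_rating  <- c(2, 4, 3, 5, 1, 4)          # 1-5 scale
teacher_rating <- c(40, 85, 55, 90, 30, 60)    # 0-100 scale

raw_sum <- parent_rating + teacher_rating
z_sum   <- as.numeric(scale(parent_rating)) + as.numeric(scale(teacher_rating))

# How strongly does each composite track each component?
raw_cor <- c(parent  = cor(raw_sum, parent_rating),
             teacher = cor(raw_sum, teacher_rating))
z_cor   <- c(parent  = cor(z_sum, parent_rating),
             teacher = cor(z_sum, teacher_rating))
print(rbind(raw_cor, z_cor))
```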

    But shouldn't we in theory use the population mean and standard deviation for centering/scaling? In practice, is it as simple as using the sample mean/SD or is there more to it?

    For the sake of completeness, let me add to this nice answer that $X'X$ of the centered and standardized $X$ is the correlation matrix.

    @AlefSin: you may actually want to use something else than the population mean/sd, see my answer. But your point that we should think what to use for centering/scaling is very good.

    @AlefSin, all of my comments were made assuming you were using the sample mean/SD. If you center by the sample means the interpretation of the intercept is still the same, except it's the expected value of $Y_{i}$ when the predictors are set to their **sample means**. The information in my three bullet points still applies when you center/scale by sample quantities. It's also worth noting that if you center by the sample mean, the result is a variable with mean 0 but scaling by the sample standard deviation does not, in general, produce a result with standard deviation 1 (e.g. the t-statistic).

    @cbeleites which implies lowering correlation between estimates $\left\vert\mathrm{corr}(\beta_i, \beta_j)\right\vert$

    Is it a good idea to standardize variables that are very skewed or is it better just to standardize symmetrically distributed variables? Should we standardize only the input variables or also the outcomes?

    Can you explain the last bullet? I get that when you center the data matrix first and then do the SVD ($U\Sigma V^T$), the U matrix is just the eigenvectors of the covariance matrix of the centered data matrix. But if you don't center it, you can still interpret the SVD as the singular value decomposition vectors, just that the matrix V contains those vectors now. Am I missing something here? Here is an article that I think nicely shows how without centering the data matrix you can still interpret it as such: https://jeremykun.com/2016/05/16/singular-value-decomposition-part-2-theorem-proof-algorithm/

  • You have come across a common belief. However, in general, you do not need to center or standardize your data for multiple regression. Different explanatory variables are almost always on different scales (i.e., measured in different units). This is not a problem; the betas are estimated such that they convert the units of each explanatory variable into the units of the response variable appropriately. One thing that people sometimes say is that if you have standardized your variables first, you can then interpret the betas as measures of importance. For instance, if $\beta_1=.6$, and $\beta_2=.3$, then the first explanatory variable is twice as important as the second. While this idea is appealing, unfortunately, it is not valid. There are several issues, but perhaps the easiest to follow is that you have no way to control for possible range restrictions in the variables. Inferring the 'importance' of different explanatory variables relative to each other is a very tricky philosophical issue. None of that is to suggest that standardizing is bad or wrong, just that it typically isn't necessary.

    The only case I can think of off the top of my head where centering is helpful is before creating power terms. Let's say you have a variable, $X$, that ranges from 1 to 2, but you suspect a curvilinear relationship with the response variable, and so you want to create an $X^2$ term. If you don't center $X$ first, your squared term will be highly correlated with $X$, which could muddy the estimation of the beta. Centering first addresses this issue.

    (Update added much later:) An analogous case that I forgot to mention is creating interaction terms. If an interaction / product term is created from two variables that are not centered on 0, some amount of collinearity will be induced (with the exact amount depending on various factors). Centering first addresses this potential problem. For a fuller explanation, see this excellent answer from @Affine: Collinearity diagnostics problematic only when the interaction term is included.
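    A sketch of the interaction case on a made-up balanced grid. In this particular design the correlation between a predictor and the product term drops to zero after centering; with other data the reduction will generally be partial, as noted above.

```r
# Balanced made-up grid: x1 and x2 each take the values 11..15
x1 <- rep(11:15, times = 5)
x2 <- rep(11:15, each  = 5)

# Product term built from uncentered variables
cor_raw <- cor(x1, x1 * x2)

# Product term built after centering each variable
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
cor_ctr <- cor(x1c, x1c * x2c)

print(c(uncentered = cor_raw, centered = cor_ctr))
```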

    If anyone is interested, I also talk about the mistaken idea of using standardized betas to infer relative 'importance' here: multiple-linear-regression-for-hypothesis-testing

    Belsley, Kuh, and Welsch have a thoughtful analysis of this situation in their 1980 book *Regression Diagnostics.* (See Appendix 3B for details.) They conclude you are incorrect that rescaling doesn't help. Their analysis is in terms of *numerical stability* of the solution procedure, which is measured in terms of the condition number of the data matrix $X$. That condition number can be very high when variables are measured on scales with disparate ranges. Rescaling will then absorb most of the "badness" in $X$ within the scale factors. The resulting problem will be much better conditioned.
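    The conditioning argument can be illustrated with base R's `kappa()`. The sketch below uses invented predictors and plain standardization as one possible rescaling; it is not meant to reproduce the specific scaling analyzed in the book.

```r
# Made-up predictors on wildly different scales
population <- c(3.1e5, 5.4e6, 1.2e7, 6.7e7, 8.3e7, 3.3e8, 1.4e9)   # persons
rate       <- c(0.012, 0.019, 0.008, 0.015, 0.011, 0.017, 0.009)   # proportion

X_raw <- cbind(1, population, rate)                 # design matrix, raw units
X_std <- cbind(1, scale(population), scale(rate))   # standardized columns

kappa_raw <- kappa(X_raw, exact = TRUE)   # 2-norm condition number
kappa_std <- kappa(X_std, exact = TRUE)

print(c(raw = kappa_raw, standardized = kappa_std))
```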

    About beta1=0.6 and beta2=0.3, I'm not sure whether saying beta1 is twice as important as beta2 is appropriate, but I thought that since they're standardised they're on the same 'scale', i.e. units are standard deviations from the mean. Having said that, the response of Y will be twice as large for beta1 (holding x2 constant) as for beta2 (holding x1 constant). Right? Or have I misunderstood something along the way?

    @chao, you haven't really gotten rid of the units that are intrinsic to the 2 variables; you've just hidden them. Now, the units of X1 are per 13.9 cm, and the units of X2 are per 2.3 degrees Celsius.

    Some regression libraries such as lme4 ask you to standardize the variables when there are convergence problems.

    This answer should mention that standardizing is needed when using regularization, don't you think? It gives the impression that it's not.

    This answer on when R^2 is useful seems to say that scaling is useful to make the R^2 more accurately represent the variance in residuals: https://stats.stackexchange.com/a/13317/184050. Seems like another reason to scale?

    @skeller88, if you want to scale, go ahead. It doesn't really change anything. If you want the variance of the residuals, you could try to better approximate that with something else, or you could just get the variance of the residuals. Either way, do as you like.

  • In addition to the remarks in the other answers, I'd like to point out that the scale and location of the explanatory variables do not affect the validity of the regression model in any way.

    Consider the model $y=\beta_0+\beta_1x_1+\beta_2x_2+\ldots+\epsilon$.

    The least squares estimators of $\beta_1, \beta_2,\ldots$ are not affected by shifting. The reason is that these are the slopes of the fitting surface - how much the surface changes if you change $x_1,x_2,\ldots$ one unit. This does not depend on location. (The estimator of $\beta_0$, however, does.)

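    By looking at the equations for the estimators you can see that scaling $x_1$ with a factor $a$ scales $\hat{\beta}_1$ by a factor $1/a$. To see this, note that

    $$\hat{\beta}_1(x_1)=\frac{\sum_{i=1}^n r_{1,i}\,y_i}{\sum_{i=1}^n r_{1,i}^2},$$

    where $r_{1,i}$ is the $i$th residual from regressing $x_1$ on the intercept and the other explanatory variables. Replacing $x_1$ by $ax_1$ replaces each $r_{1,i}$ by $ar_{1,i}$, so

    $$\hat{\beta}_1(ax_1)=\frac{a\sum_{i=1}^n r_{1,i}\,y_i}{a^2\sum_{i=1}^n r_{1,i}^2}=\frac{\hat{\beta}_1(x_1)}{a}.$$
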
    By looking at the corresponding formula for $\hat{\beta}_2$ (for instance) it is (hopefully) clear that this scaling doesn't affect the estimators of the other slopes.

    Thus, scaling simply corresponds to scaling the corresponding slopes.

    As gung points out, some people like to rescale by the standard deviation in hopes that they will be able to interpret how "important" the different variables are. While this practice can be questioned, it can be noted that this corresponds to choosing $a_i=1/s_i$ in the above computations, where $s_i$ is the standard deviation of $x_i$ (which is a strange thing to say to begin with, since the $x_i$ are assumed to be deterministic).
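    The shifting and scaling claims in this answer are easy to verify numerically. The data and the factor `a = 100` below are arbitrary choices for illustration.

```r
set.seed(3)
n  <- 80
x1 <- rnorm(n, 50, 10)
x2 <- rnorm(n, 5, 2)
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(n)

a <- 100                      # e.g. converting metres to centimetres
x1_scaled <- a * x1
x1_shift  <- x1 + 1000

b  <- unname(coef(lm(y ~ x1 + x2)))          # original fit
bs <- unname(coef(lm(y ~ x1_scaled + x2)))   # x1 multiplied by a
bh <- unname(coef(lm(y ~ x1_shift + x2)))    # x1 shifted by 1000

# Columns: intercept, slope of x1, slope of x2
print(rbind(original = b, scaled = bs, shifted = bh))
```

    Scaling divides the slope of `x1` by `a` and leaves everything else alone; shifting changes only the intercept.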

    Is it a good idea to standardize variables that are very skewed or is it better just to standardize symmetrically distributed variables? Should we standardize only the input variables or also the outcomes?

  • In case you use gradient descent to fit your model, standardizing covariates may speed up convergence (because when you have unscaled covariates, the corresponding parameters may inappropriately dominate the gradient). To illustrate this, some R code:

    > objective <- function(par){ par[1]^2+par[2]^2}  #quadratic function in two variables with a minimum at (0,0)
    > optim(c(10,10), objective, method="BFGS")$counts  #returns the number of times the function and its gradient had to be evaluated until convergence
        function gradient 
              12        3 
    > objective2 <- function(par){ par[1]^2+0.1*par[2]^2}  #a transformation of the above function, corresponding to unscaled covariates
    > optim(c(10,10), objective2, method="BFGS")$counts
    function gradient 
          19       10 
    > optim(c(10,1), objective2, method="BFGS")$counts  #scaling of initial parameters doesn't get you back to original performance
    function gradient 
          12        8
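    The run above uses BFGS, a quasi-Newton method. As a complement, here is a minimal sketch of plain fixed-step gradient descent on the same two objectives. The helper `gd_iters`, the step size 0.4, and the tolerance are assumptions chosen for illustration.

```r
# Fixed-step gradient descent; returns the number of iterations until the
# gradient norm drops below `tol`. `gd_iters` is a hypothetical helper.
gd_iters <- function(grad, start, step, tol = 1e-8, max_iter = 1e5) {
  par <- start
  for (i in seq_len(max_iter)) {
    g <- grad(par)
    if (sqrt(sum(g^2)) < tol) return(i - 1)
    par <- par - step * g
  }
  max_iter
}

grad1 <- function(par) c(2 * par[1], 2 * par[2])     # gradient of par1^2 + par2^2
grad2 <- function(par) c(2 * par[1], 0.2 * par[2])   # gradient of par1^2 + 0.1 * par2^2

iters_scaled   <- gd_iters(grad1, c(10, 10), step = 0.4)
iters_unscaled <- gd_iters(grad2, c(10, 10), step = 0.4)

print(c(scaled = iters_scaled, unscaled = iters_unscaled))
```

    With a single step size shared by both parameters, the flatter direction of the second objective is what slows convergence.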

    Also, for some applications of SVMs, scaling may improve predictive performance: Feature scaling in support vector data description.

  • I prefer "solid reasons" for both centering and standardization (they exist very often). In general, they have more to do with the data set and the problem than with the data analysis method.

    Very often, I prefer to center (i.e. shift the origin of the data) to other points that are physically/chemically/biologically/... more meaningful than the mean (see also Macro's answer), e.g.

    • the mean of a control group

    • blank signal

    Numerical stability is an algorithm-related reason to center and/or scale data.

    Also, have a look at the similar question about standardization, which also covers "center only".
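    As a sketch of centering on a meaningful reference point instead of the overall mean, the example below shifts a predictor by the control-group mean. All data and variable names are invented for illustration.

```r
# Made-up data: a signal measured in a control group and a treated group
group  <- rep(c("control", "treated"), each = 5)
signal <- c(9.8, 10.1, 10.0, 9.9, 10.2, 12.1, 12.4, 11.9, 12.3, 12.0)
y      <- c(1.0, 1.2, 1.1, 0.9, 1.3, 2.1, 2.6, 1.9, 2.4, 2.2)

ref      <- mean(signal[group == "control"])   # reference: control-group mean
signal_c <- signal - ref

fit_c   <- lm(y ~ signal_c)
fit_raw <- lm(y ~ signal)

# The intercept of the shifted fit is the fitted response at the reference point.
intercept_c <- coef(fit_c)[["(Intercept)"]]
pred_at_ref <- unname(predict(fit_raw, newdata = data.frame(signal = ref)))
print(c(intercept = intercept_c, prediction_at_ref = pred_at_ref))
```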

  • To illustrate the numerical stability issue mentioned by @cbeleites, here is an example from Simon Wood on how to "break" lm(). First we'll generate some simple data and fit a simple quadratic curve.

    set.seed(1); n <- 100
    xx <- sort(runif(n))
    y <- .2*(xx-.5)+(xx-.5)^2 + rnorm(n)*.1
    x <- xx+100
    b <- lm(y ~ x+I(x^2))
    plot(x, y)  # open a plot first; lines() only adds to an existing one
    lines(x, predict(b), col='red')


    But if we add 900 to x, then the result should be pretty much the same except shifted to the right, no? Unfortunately not...

    X <- x + 900
    B <- lm(y ~ X+I(X^2))
    plot(X, y)  # open a plot first; lines() only adds to an existing one
    lines(X, predict(B), col='blue')


    Edit to add to the comment by @Scortchi - if we look at the object returned by lm() we see that the quadratic term has not been estimated and is shown as NA.

    > B
    lm(formula = y ~ X + I(X^2))
    (Intercept)            X       I(X^2)  
      -139.3927       0.1394           NA  

    And indeed as suggested by @Scortchi, if we look at the model matrix and try to solve directly, it "breaks".

    > X <- model.matrix(b) ## get same model matrix used above
    > beta.hat <- solve(t(X)%*%X,t(X)%*%y) ## direct solution of ‘normal equations’
    Error in solve.default(t(X) %*% X, t(X) %*% y) : 
      system is computationally singular: reciprocal condition number = 3.9864e-19

    However, lm() does not give me any warning or error message other than the NAs on the I(X^2) line of summary(B) in R-3.1.1. Other algorithms can of course be "broken" in different ways with different examples.
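    One way to see what goes wrong, and one possible remedy, is to compare the condition number of the design matrix before and after centering the predictor. This sketch regenerates the same data as above; centering before squaring is offered as an illustration, not as the only fix.

```r
set.seed(1); n <- 100
xx <- sort(runif(n))
y  <- .2 * (xx - .5) + (xx - .5)^2 + rnorm(n) * .1
X  <- xx + 1000                      # same shifted predictor as above

Xc <- X - mean(X)                    # center before squaring

# Condition numbers of the two quadratic design matrices
kappa_raw <- kappa(cbind(1, X,  X^2),  exact = TRUE)
kappa_ctr <- kappa(cbind(1, Xc, Xc^2), exact = TRUE)
print(c(raw = kappa_raw, centered = kappa_ctr))

# Refit with the centered predictor and inspect the coefficients
B_ctr <- lm(y ~ Xc + I(Xc^2))
print(coef(B_ctr))
```

    With the raw predictor, the columns for `X` and `X^2` are almost perfectly collinear at this offset, so the design matrix is close to singular in floating-point arithmetic.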

    (+1) Note `lm` fails to estimate a coefficient for the quadratic term, & gives a warning about a singular design matrix - perhaps more directly illustrative of the problem than these plots.

    How can we understand the reason behind this "break"? Is it just due to rounding error / floating-point arithmetic?

  • I doubt seriously whether centering or standardizing the original data could really mitigate the multicollinearity problem when squared terms or other interaction terms are included in regression, as some of you, gung in particular, have recommended above.

    To illustrate my point, let's consider a simple example.

    Suppose the true specification takes the following form such that

    $$y_i=b_0+b_1x_i+b_2x_i^2+u_i$$

    Thus the corresponding OLS equation is given by

    $$y_i=\hat{y_i}+\hat{u_i}=\hat{b_0}+\hat{b_1}x_i+\hat{b_2}x_i^2+\hat{u_i}$$

    where $\hat{y_i}$ is the fitted value of $y_i$, $\hat{u_i}$ is the residual, and $\hat{b_0}$-$\hat{b_2}$ denote the OLS estimates for $b_0$-$b_2$, the parameters that we are ultimately interested in. For simplicity, let $z_i=x_i^2$ thereafter.

    Usually, we know $x$ and $x^2$ are likely to be highly correlated and this would cause the multicollinearity problem. To mitigate this, a popular suggestion would be centering the original data by subtracting mean of $y_i$ from $y_i$ before adding squared terms.

    It is fairly easy to show that the mean of $y_i$ is given as follows: $$\bar{y}=\hat{b_0}+\hat{b_1} \bar{x}+\hat{b_2} \bar{z}$$ where $\bar{y}$, $\bar{x}$, $\bar{z}$ denote means of $y_i$, $x_i$ and $z_i$, respectively.

    Hence, subtracting $\bar{y}$ from $y_i$ gives

    $$y_i-\bar{y}=\hat{b_1}(x_i-\bar{x})+\hat{b_2}(z_i-\bar{z})+\hat{u_i}$$

    where $y_i-\bar{y}$, $x_i-\bar{x}$, and $z_i-\bar{z}$ are centered variables. $\hat{b_1}$ and $\hat{b_2}$ – the parameters to be estimated, remain the same as those in the original OLS regression.

    However, it is clear that in my example, centered RHS-variables $x$ and $x^2$ have exactly the same covariance/correlation as the uncentered $x$ and $x^2$, i.e. $\text{corr}(x, z)=\text{corr}(x-\bar{x}, z-\bar{z})$.

    In summary, if my understanding of centering is correct, then I do not think centering data would help to mitigate the MC-problem caused by including squared terms or other higher order terms in the regression.

    I'd be happy to hear your opinions!

    Thanks for your contribution, @rudi0086021. You may be right, but I see a couple of issues here. 1st, centering is about subtracting the mean of **x**, not about subtracting the mean of **y**; 2nd, you need to center first, centering afterwards has no effect as you note. Consider: `x = c(1,2,3); x2 = x^2; cor(x, x2); # [1] 0.9897433; xc = c(-1,0,1); xc2 = xc^2; cor(xc, xc2) # [1] 0`.

    Thank you for your reply, @gung. Here are my thoughts. Firstly, personally I see no convincing reason to treat dependent and independent variables differently, that is, to center the independent variables but not the dependent variable.

    Secondly, as you said, perhaps we should center the data before creating squared terms. Such a practice will mitigate the MC problem. However, it could lead to biased estimates, or more concretely, the omitted variable bias (OVB). To illustrate, see the following example: suppose the true specification is: y=b0+b1*x+b2*x^2+u. Centering the data beforehand will give: y=b0+b1*(x-xbar)+b2*(x-xbar)^2+v, where the new error term v=u+b1*xbar-b2*xbar^2+2b2*xbar*x. It is clear that cov(x-xbar, v)!=0. Thus, unfortunately, centering data beforehand would lead to biased estimates.

    @rudi0086021 It looks like in your last comment you assume that you would get the same coefficients when fitting the centered data as you would have when fitting the uncentered data. But centering before taking the square isn't a simple shift by a constant, so one shouldn't expect to get the same coefficients. The best fit after centering is given by B0 + B1*(x-xbar) + B2*(x-xbar)^2 where B0 = b0 + b1*xbar + b2*xbar^2, B1 = b1 + 2*b2*xbar and B2 = b2. Thus, v = u. Sorry to respond to this comment so belatedly, but there could always be others like me who see it for the first time today.
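    The reparameterization in this last comment can be checked directly. The sketch below uses simulated data (the true coefficients and sample size are arbitrary) and compares the centered fit with the values implied by the uncentered fit.

```r
set.seed(4)
x    <- runif(60, 1, 2)
y    <- 3 + 2 * x - 1.5 * x^2 + rnorm(60, sd = 0.1)
xbar <- mean(x)
xc   <- x - xbar

fit_raw <- lm(y ~ x  + I(x^2))     # estimates b0, b1, b2
fit_ctr <- lm(y ~ xc + I(xc^2))    # estimates B0, B1, B2

b <- unname(coef(fit_raw))
B <- unname(coef(fit_ctr))

# Coefficients of the centered fit implied by the uncentered fit
implied_B <- c(b[1] + b[2] * xbar + b[3] * xbar^2,   # B0
               b[2] + 2 * b[3] * xbar,               # B1
               b[3])                                 # B2

print(rbind(centered_fit = B, implied = implied_B))

# Largest difference between the two sets of residuals (v versus u)
max_resid_diff <- max(abs(resid(fit_raw) - resid(fit_ctr)))
```

    Both fits span the same column space, so the fitted values and residuals coincide and only the parameterization differs.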

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM