Why square the difference instead of taking the absolute value in standard deviation?

  • In the definition of standard deviation, why do we have to square the differences from the mean before taking the expectation (E), only to take the square root back at the end? Can't we simply take the absolute value of each difference instead and take the expected value (mean) of those, and wouldn't that also show the variation of the data? The number is going to differ from the squared method (the absolute-value method will be smaller), but it should still show the spread of the data. Does anybody know why we take this squaring approach as a standard?

    The definition of standard deviation:

    $\sigma = \sqrt{E\left[\left(X - \mu\right)^2\right]}.$

    Can't we just take the absolute value instead and still have a good measure of variation?

    $\sigma = E\left[|X - \mu|\right]$
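
    A minimal numerical sketch of the two proposals (assuming NumPy; the data set is made up for illustration):

    ```python
    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data
    mu = x.mean()

    sd = np.sqrt(np.mean((x - mu) ** 2))   # square, average, take the root
    mad = np.mean(np.abs(x - mu))          # average the absolute deviations

    print(sd, mad)  # 2.0 and 1.5 -- both summarize spread, in the original units
    ```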

    In a way, the measure you propose is widely used in error (model-quality) analysis -- there it is called the MAE, "mean absolute error".

    In accepting an answer it seems important to me that we pay attention to whether the answer is circular. The normal distribution is based on these measurements of variance from squared error terms, but that isn't in and of itself a justification for using $(X-\mu)^2$ over $|X-\mu|$.

    Do you think the term standard means this is THE standard today? Isn't it like asking why principal components are "principal" and not secondary?

    My understanding of this question is that it could be shortened to something like: what is the difference between the MAE and the RMSE? Otherwise it is difficult to deal with.

    Related question: http://stats.stackexchange.com/q/354/919 ("Bias towards natural numbers in the case of least squares.")

    *"the absolute-value method will be smaller"*, actually, it'll be bigger for small variances - it'll always be closer to 1 though (unless it is 1 or 0)

    Despite the antiquity of this question, I've posted a new answer, which says something that I think is worth knowing about.

    The following article has a pictorial, easy-to-understand explanation: http://www.mathsisfun.com/data/standard-deviation.html Thanks, Rajesh.

    Every answer offered so far is circular. They focus on ease of mathematical calculations (which is nice but by no means fundamental) or on properties of the Gaussian (Normal) distribution and OLS. Around 1800 Gauss *started* with least squares and variance and from those *derived* the Normal distribution--there's the circularity. A truly fundamental reason that has not been invoked in any answer yet is the *unique* role played by the variance in the Central Limit Theorem. Another is the importance in decision theory of minimizing quadratic loss.

    +1 @whuber: Thanks for pointing this out, which was bothering me as well. Now, though, have to go and read up on the Central Limit Theorem! Oh well. ;-)
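
    For anyone in the same position, a quick simulation sketch of the CLT at work (assuming NumPy; the exponential distribution is just one arbitrary non-normal choice):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 1000, 10_000

    # A skewed, decidedly non-normal distribution with mean 1 and variance 1.
    samples = rng.exponential(scale=1.0, size=(reps, n))

    # Standardized sample means: sqrt(n) * (X_bar - mu) / sigma.
    z = np.sqrt(n) * (samples.mean(axis=1) - 1.0) / 1.0

    # They are approximately standard normal -- and it is the variance
    # (not the mean absolute deviation) that sets the limiting scale.
    print(z.mean(), z.std())  # close to 0 and 1
    ```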

    Taleb makes the case at Edge.org (http://edge.org/response-detail/25401) for retiring standard deviation and using mean absolute deviation.

    @c4il will you please cite the source for the formula of S.D. quoted by you? I do think that it is incorrect.

    @rpierce would you please check the correctness of the formula of s.d. under the definition given in the question?

    @subhash c. davar, the notation isn't in a form I'm familiar with. However, OP defines E as the process of getting the mean, so IMO the equations check out.

    @subhashc.davar: The missing definition is for the expectation of the random variable $X$, $\mu=\operatorname{E}[X]$. (It's so commonly used that it's no more than a venial sin to let us guess what it means from the context.) Wikipedia will serve as a reference for the definition of standard deviation: https://en.wikipedia.org/wiki/Standard_deviation#Definition_of_population_values. Note the distinction between the standard deviation of a distribution/population & an estimate of it that may be calculated from a sample.

    @whuber Could you clarify relation to *CLT* specifically? Is variance the only non-zero functional $f$ s.t. $f(\sqrt n (\bar X_n-EX))=f(X)$?

    @A.S. Sure--I have answered this question in some detail at http://stats.stackexchange.com/a/3904. Briefly, there are infinitely many such functionals--but they must all asymptotically converge to the variance.

    @whuber What do you mean by "asymptotically converge"? Are you considering convergence of separate $f_n$ (defined for each $n$) rather than a single $f$ that satisfies the above for all $n$? // I'll read the post.

    Finding out that the variance uses squared differences by definition satisfied me. The moments of a distribution are measurements defined by powers of the differences: mean (^1), variance (^2), skewness (^3), and kurtosis (^4). The variance can be particularly useful (many of the reasons are mentioned in this post; numbers further away carry more weight, etc.).
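
    For reference, the central moments mentioned here have the uniform definition (standard textbook notation, not specific to this thread):

    $\mu_k = E\left[\left(X - \mu\right)^k\right],$

    with the variance being $\mu_2$, skewness $\mu_3/\mu_2^{3/2}$, and kurtosis $\mu_4/\mu_2^2$ (the mean itself is the first raw moment, $E[X]$, not a central moment).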

    @whuber "Another is the importance in decision theory of minimizing quadratic loss." What's the significance of quadratic loss in particular?

    @user76284 In a neighborhood of any local minimum of a continuously differentiable function, the function is closely approximated by a quadratic. Thus, many properties associated with a purely quadratic loss either hold exactly or at least approximately for a huge class of losses.
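
    A worked version of that approximation (just the second-order Taylor expansion, not specific to this thread): if a loss $L$ is twice continuously differentiable with a local minimum at $\theta^*$, then $L'(\theta^*) = 0$, so

    $L(\theta) \approx L(\theta^*) + \tfrac{1}{2} L''(\theta^*)\left(\theta - \theta^*\right)^2,$

    i.e., near its minimum every smooth loss looks quadratic.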

    @whuber: Have you considered writing your own answer about the role of variance in the CLT?

  • If the goal of the standard deviation is to summarise the spread of a symmetrical data set (i.e. in general how far each datum is from the mean), then we need a good method of defining how to measure that spread.

    The benefits of squaring include:

    • Squaring always gives a positive value, so the sum will not be zero.
    • Squaring emphasizes larger differences—a feature that turns out to be both good and bad (think of the effect outliers have).

    Squaring, however, does have a problem as a measure of spread: the units are all squared, whereas we might prefer the spread to be in the same units as the original data (think of squared pounds, squared dollars, or squared apples). Hence taking the square root allows us to return to the original units.

    I suppose you could say that absolute difference assigns equal weight to the spread of data, whereas squaring emphasises the extremes. Technically though, as others have pointed out, squaring makes the algebra much easier to work with and offers properties that the absolute method does not (for example, the variance is equal to the expected value of the square of the variable minus the square of the mean of the variable).
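
    A one-line derivation of that identity, by expanding the square and using linearity of expectation (with $E[X] = \mu$):

    $E\left[\left(X - \mu\right)^2\right] = E\left[X^2\right] - 2\mu\,E\left[X\right] + \mu^2 = E\left[X^2\right] - \mu^2.$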

    It is important to note, however, that there's no reason you couldn't take the absolute difference if that is how you prefer to view 'spread' (sort of how some people see 5% as some magical threshold for $p$-values, when in fact it is situation dependent). Indeed, there are several competing methods for measuring spread.

    My view is to use the squared values because I like to think of how it relates to the Pythagorean Theorem of Statistics, $c = \sqrt{a^2 + b^2}$. This also helps me remember that when working with independent random variables, variances add, standard deviations don't. But that's just my personal subjective preference, which I mostly use only as a memory aid; feel free to ignore this paragraph.
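
    A quick numerical sketch of the "variances add, standard deviations don't" point (assuming NumPy; the two normal distributions are arbitrary choices):

    ```python
    import numpy as np

    rng = np.random.default_rng(42)
    n = 1_000_000

    # Two independent random variables.
    x = rng.normal(loc=0.0, scale=3.0, size=n)  # sd 3, variance 9
    y = rng.normal(loc=0.0, scale=4.0, size=n)  # sd 4, variance 16

    s = x + y
    print(s.var())  # ~25 = 9 + 16: variances add
    print(s.std())  # ~5  = sqrt(3**2 + 4**2), not 3 + 4: Pythagoras at work
    ```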

    A much more in-depth analysis can be read here.

    "Squaring always gives a positive value, so the sum will not be zero." and so does absolute values.

    @robin girard: That is correct, which is why I preceded that point with "The benefits of squaring include". I wasn't implying anything about absolute values in that statement. I take your point though; I'll consider removing/rephrasing it if others feel it is unclear.

    Much of the field of robust statistics is an attempt to deal with the excessive sensitivity to outliers that is a consequence of choosing the variance as a measure of data spread (technically, scale or dispersion). http://en.wikipedia.org/wiki/Robust_statistics

    The article linked to in the answer is a godsend.

    I think the paragraph about Pythagoras is spot on. You can think of the error as a vector in $n$ dimensions, with $n$ being the number of samples. The size in each dimension is the difference from the mean for that sample: $[(x_1-\mu), (x_2-\mu), \ldots, (x_n-\mu)]$. The length of that vector (Pythagoras) is the root of the summed squares -- up to a factor of $\sqrt{n}$, the standard deviation.

    @ArneBrasseur But if you think of that `n`-dimensional vector, you should use an n-distance, not a 2D one then...

    @Guimoute Euclidean distance (the "2-distance") is natural in general n-dimensional space (perhaps notwithstanding the curse of high dimensionality). EDIT: see the second figure here: https://stats.stackexchange.com/a/427611/11472
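
    A sketch of the vector picture described in this thread (assuming NumPy; this computes the population standard deviation, i.e. dividing by $n$):

    ```python
    import numpy as np

    x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data
    deviations = x - x.mean()            # the n-dimensional "error vector"

    # Its Euclidean length, rescaled by sqrt(n), is the standard deviation.
    sd_from_norm = np.linalg.norm(deviations) / np.sqrt(len(x))
    print(sd_from_norm, x.std())         # both 2.0
    ```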

Licensed under CC-BY-SA with attribution

