### What's the difference between Normalization and Standardization?

At work we were discussing this because my boss had never heard of normalization. In Linear Algebra, Normalization seems to refer to dividing a vector by its length. And in statistics, Standardization seems to refer to subtracting a mean and then dividing by the SD. But the two terms seem to be used interchangeably, along with other possibilities as well.

When creating some kind of universal score that combines two different metrics, which have different means and different SDs, would you Normalize, Standardize, or something else? One person told me it's just a matter of taking each metric and dividing it by its SD, individually, then summing the two. And that will result in a universal score that can be used to judge both metrics.

For instance, say you had the number of people who take the subway to work (in NYC) and the number of people who drove to work (in NYC).

$$\text{Train} \longrightarrow x$$ $$\text{Car} \longrightarrow y$$

If you wanted to create a universal score to quickly report traffic fluctuations, you can't just add $\text{mean}(x)$ and $\text{mean}(y)$ because there will be a LOT more people who ride the train. There's 8 million people living in NYC, plus tourists. That's millions of people taking the train every day versus hundreds of thousands of people in cars. So they need to be transformed to a similar scale in order to be compared.

If $\text{mean}(x) = 8,000,000$

and $\text{mean}(y) = 800,000$

Would you normalize $x$ & $y$ then sum? Would you standardize $x$ & $y$ then sum? Or would you divide each by its respective SD then sum? The goal is a single number whose fluctuations represent total traffic fluctuations.

Any article or chapters of books for reference would be much appreciated. THANKS!

Also here's another example of what I'm trying to do.

Imagine you're a college dean, and you're discussing admission requirements. You may want students with at least a certain GPA and a certain test score. It'd be nice if they were both on the same scale because then you could just add the two together and say, "anyone with at least a 7.0 can get admitted." That way, if a prospective student has a 4.0 GPA, they could get as low as a 3.0 test score and still get admitted. Conversely, if someone had a 3.0 GPA, they could still get admitted with a 4.0 test score.

But it's not like that. The ACT is on a 36-point scale and most GPAs are on 4.0 (some are 4.3, yes, annoying). Since I can't just add an ACT and a GPA to get some kind of universal score, how can I transform them so they can be added, thus creating a universal admission score? Then as a Dean, I could just automatically accept anyone with a score above a certain threshold. Or even automatically accept everyone whose score is within the top 95%... those sorts of things.
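For illustration, here's a minimal sketch of the kind of combined score I mean, using standardization and equal weights (all applicant numbers are made up, and the sample SD is used as the estimate):

```python
import statistics as st

# Hypothetical applicant pool (all numbers made up for illustration).
gpas = [3.2, 3.8, 2.9, 4.0, 3.5, 3.0, 3.7, 2.6]   # 4.0 scale
acts = [24, 30, 22, 33, 28, 35, 26, 20]           # 36-point scale

def standardize(xs):
    # z-score: subtract the mean, divide by the (sample) SD
    mu, sd = st.mean(xs), st.stdev(xs)
    return [(x - mu) / sd for x in xs]

# Combined admission score: equal-weight sum of z-scores, so a strong
# ACT can offset a weaker GPA and vice versa.
scores = [g + a for g, a in zip(standardize(gpas), standardize(acts))]

# Admit everyone above this pool's 75th percentile (a stand-in for
# "accept anyone with a score above a certain threshold").
cutoff = sorted(scores)[int(0.75 * len(scores))]
admitted = [i for i, s in enumerate(scores) if s >= cutoff]
```

The threshold here is relative to the current pool; a real admissions rule would presumably fix it against historical data instead.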

Would that be normalization? standardization? or just dividing each by their SD then summing?

Could you please clarify the subway example? Why would "there's 8 million people living in NYC, plus tourists" lead to $\text{mean}(x) = 8,000,000$?

Normalization rescales the values into a range of [0,1]. This might be useful in some cases where all parameters need to have the same positive scale. However, the outliers from the data set are lost.

$$ X_{changed} = \frac{X - X_{min}}{X_{max}-X_{min}} $$

Standardization rescales data to have a mean ($\mu$) of 0 and standard deviation ($\sigma$) of 1 (unit variance).

$$ X_{changed} = \frac{X - \mu}{\sigma} $$

For most applications standardization is recommended.
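A minimal sketch of the two formulas above, assuming the sample SD is used as the estimate of $\sigma$ (the train/car numbers are made up in the spirit of the question):

```python
import statistics as st

def normalize(xs):
    # Min-max normalization: rescale into [0, 1].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    # Z-score standardization: mean 0, SD 1 (sample SD as estimate of sigma).
    mu, sd = st.mean(xs), st.stdev(xs)
    return [(x - mu) / sd for x in xs]

# Illustrative daily counts on very different scales (made-up numbers):
train = [7.9e6, 8.1e6, 8.0e6, 7.8e6, 8.2e6]
car = [7.9e5, 8.1e5, 8.0e5, 7.8e5, 8.2e5]

# After standardizing, the two series are on the same scale and can be
# summed into a single "traffic index":
traffic_index = [a + b for a, b in zip(standardize(train), standardize(car))]
```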

Could you please explain why "the outliers from the data set are lost" upon normalization of data?

Outliers in this kind of re-scaling would affect the result, not get lost.

@learner Imagine you have [1 2 3 4 5 1000 2 4 5 2000 ...]. The normalized value of the 1000 data point would become smaller because the maximum is 2000.

@COLDICE I think it depends on the normalization algorithm you use. For instance, if you divided every number in your dataset by the max value (e.g. 2000), they would range between 0 and 1, and it wouldn't affect outliers.

I think this doesn't affect outliers at all; otherwise this wouldn't be done in anomaly detection software.

Let's say the outliers won't be lost. Instead, the precision and distinction between numbers would blur, and they would seem less sparse after reconstructing them by multiplying with the normalization factor.

The outliers are not lost. "Outlier" typically means a sample beyond some percentile of the distribution. All min-max normalization does is scale the values; the relative distances between samples stay the same.
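This is easy to check directly: min-max scaling is an affine map, so it preserves both the ordering and the relative gaps between samples. A small sketch using the toy data from the comments above:

```python
# Toy data with two large outliers (1000 and 2000).
data = [1, 2, 3, 4, 5, 1000, 2, 4, 5, 2000]

lo, hi = min(data), max(data)
scaled = [(x - lo) / (hi - lo) for x in data]   # min-max normalization

# The ordering of the samples is unchanged...
order_before = sorted(range(len(data)), key=data.__getitem__)
order_after = sorted(range(len(scaled)), key=scaled.__getitem__)
assert order_before == order_after

# ...and so are relative distances: the gap from 1000 to 2000 is still
# the same multiple of the gap from 5 to 1000 after scaling.
ratio_before = (data[9] - data[5]) / (data[5] - data[4])
ratio_after = (scaled[9] - scaled[5]) / (scaled[5] - scaled[4])
assert abs(ratio_before - ratio_after) < 1e-9
```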

In the business world, "normalization" typically means that the range of values is "normalized to be from 0.0 to 1.0". "Standardization" typically means that the values are "standardized" to measure how many standard deviations each value lies from its mean. However, not everyone would agree with that. It's best to explain **your definitions** before you use them. In any case, your transformation needs to provide something useful.

In your train/car example, do you gain anything from knowing how many standard deviations each value lies from its mean? If you plot those "standardized" measures against each other as an x-y plot, you might see a correlation (see the first graph on the right):

http://en.wikipedia.org/wiki/Correlation_and_dependence

If so, does that mean anything to you?

As far as your second example goes, if you want to "equate" a GPA from one scale to another scale, what do these scales have in common? In other words, how could you transform those minimums to be equivalent, and the maximums to be equivalent?


Once you get your GPA and ACT scores in an interchangeable form, does it make sense to weigh the ACT and GPA scores differently? If so, what weighting means something to you?

Edit 1 (05/03/2011) ==========================================

First, I would check out the links suggested by **whuber** above. The bottom line is, in both of your two-variable problems, you are going to have to come up with an "equivalence" of one variable versus the other, and a way to differentiate one variable from the other. In other words, even if you can simplify this to a simple linear relationship, you'll need "weights" to differentiate one variable from the other.

Here's an example of a two-variable problem:

From the last page, if you can say that standardized train traffic `U1(x)` versus standardized car traffic `U2(y)` is "additively independent", then you might be able to get away with a simple equation such as:

`U(x, y) = k1*U1(x) + (1 - k1)*U2(y)`

where `k1 = 0.5` means you're indifferent to standardized car/train traffic, and a higher `k1` means train traffic `U1(x)` is more important.

However, if these two variables are not "additively independent", then you'll have to use a more complicated equation. One possibility is shown on page 1:

`U(x, y) = k1*U1(x) + k2*U2(y) + (1 - k1 - k2)*U1(x)*U2(y)`

In either case, you'll have to come up with a utility `U(x, y)` that makes sense. The same general weighting/comparison concepts hold for your GPA/ACT problem, even if the scores are "normalized" rather than "standardized".
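A minimal sketch of the two utility forms (the `k` values are arbitrary placeholders, and `u1`, `u2` are assumed to be standardized or otherwise rescaled attribute values):

```python
def additive_utility(u1, u2, k1=0.5):
    # Additively independent case: U = k1*U1 + (1 - k1)*U2
    return k1 * u1 + (1 - k1) * u2

def interaction_utility(u1, u2, k1=0.4, k2=0.4):
    # Non-independent case: U = k1*U1 + k2*U2 + (1 - k1 - k2)*U1*U2
    return k1 * u1 + k2 * u2 + (1 - k1 - k2) * u1 * u2

# With k1 = 0.5, the additive form is indifferent between the attributes:
assert additive_utility(1.0, 0.0) == additive_utility(0.0, 1.0) == 0.5

# The interaction term rewards both attributes being high at the same time:
assert abs(interaction_utility(1.0, 1.0) - 1.0) < 1e-9
assert interaction_utility(1.0, 0.0) == 0.4
```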

One last issue. I know you're not going to like this, but the definition of the term "additively independent" is on page 4 of the following link. I looked for a less geeky definition, but I couldn't find one. You might look around to find something better.

Quoting the link:

> Intuitively, the agent prefers being both healthy and wealthy more than might be suggested by considering the two attributes separately. It thus displays a preference for probability distributions in which health and wealth are positively correlated.

As suggested at the top of this response, if you plot standardized train traffic versus standardized car traffic on an x-y plot, you might see a correlation. If so, then you're stuck with the above non-linear utility equation or something similar.

Ok. You're right. It is best to explain my definitions. And in thinking about it again, it's not the definitions that I need. What I need is the appropriate method for creating one universal score, whether that's an admission score or a traffic score. How does one go about creating a universal metric that's a function of other variables, which were transformed to put them on a similar scale? And don't worry about the weights. I understand that even just straight summing is weighting the metrics 1:1. But that's less of a concern for me right now.

@Chris, I added my answer as an edit above.

(+1) Good edit. @Chris: you might be interested in the notes to a short set of PowerPoint slides here: this is a presentation on the subject I gave to non-technical people. I mention it because it has some illustrations and guidance for how to "create a universal metric."

Multi-Attribute Utilities link is dead, article can be found here https://web.archive.org/web/20090530032248/http://www.doc.ic.ac.uk/~frk/frank/da/6.%20multiple%20utility.pdf

The answer is simple, but you're not going to like it: it depends. If you value one standard deviation from both scores equally, then standardization is the way to go (note: in fact, you're studentizing, because you're dividing by an *estimate* of the SD of the population). If not, it is likely that standardization will be a good first step, after which you can give more weight to one of the scores by multiplying by a well-chosen factor.

So, you're saying at least start with what I described as Standardization (studentizing), then adjust the weights to best fit the data/scenario? That makes sense. I just don't understand why I would divide by the SD. And in researching I found something called the Standardized Mean Difference.... and I've just been confusing myself. It seems like it should be simple. You either put them both on Scale-A, or one on the same scale as the other, then sum. But no. Instead I'm confused and all Wiki'd out for the moment.

To solve the GPA/ACT or train/car problem, why not use the **Geometric Mean**?

$$\sqrt[n]{a_1 \times a_2 \times \dots \times a_n}$$

where $a_i$ is a value from the distribution and $n$ is the number of values. The geometric mean makes sure that each value, despite its scale, contributes equally to the mean value. See more at Geometric Mean.
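A minimal sketch of that formula (Python 3.8+ also ships `statistics.geometric_mean`); computing via logs avoids overflow on large products:

```python
import math

def geometric_mean(values):
    # nth root of the product, computed as exp(mean(log x)) for stability;
    # requires strictly positive values.
    return math.exp(sum(math.log(v) for v in values) / len(values))

train, car = 8_000_000, 800_000   # the question's made-up daily counts
base = geometric_mean([train, car])

# Doubling either attribute scales the result by the same factor (2^(1/n)),
# no matter how different the raw scales are -- the "equal contribution"
# property claimed above.
lift_train = geometric_mean([2 * train, car]) / base
lift_car = geometric_mean([train, 2 * car]) / base
assert abs(lift_train - lift_car) < 1e-9
```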

I don't see that the geometric mean would be appropriate for the situations the OP describes.

I agree with gung. Geometric mean is not a solution of this problem.

The geometric mean prevents the contribution of smaller numbers from being drowned out. Hence it may be an alternative to standardization or normalization when unequal scales have to be combined.

In my field, data science, normalization is a transformation of data which allows easy comparison of the data downstream. There are many types of normalization, scaling being one of them. You can also log-transform the data, or do anything else you want. The type of normalization you use depends on the outcome you want, since all normalizations transform the data into something else.

Here are some of what I consider normalization examples:

- Scaling normalization
- Quantile normalization
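As a sketch of the second example, here is a minimal quantile normalization of two equal-length samples (a common rank-based variant; real implementations also handle ties and many samples at once):

```python
def quantile_normalize(a, b):
    # Sort each sample, average the sorted values rank by rank, then put
    # each rank's mean back at the position the original value came from.
    ranks_a = sorted(range(len(a)), key=a.__getitem__)
    ranks_b = sorted(range(len(b)), key=b.__getitem__)
    rank_means = [(x + y) / 2 for x, y in zip(sorted(a), sorted(b))]
    out_a, out_b = [0.0] * len(a), [0.0] * len(b)
    for r, m in enumerate(rank_means):
        out_a[ranks_a[r]] = m
        out_b[ranks_b[r]] = m
    return out_a, out_b

a = [5, 2, 3]
b = [400, 100, 300]          # same shape, wildly different scale
qa, qb = quantile_normalize(a, b)

# Both samples now share exactly the same set of values (the rank means),
# so they are directly comparable downstream.
assert sorted(qa) == sorted(qb) == [51.0, 151.5, 202.5]
```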

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM

whuber 9 years ago

The last part of the question sounds like you are trying to create a *valuation* out of *multiple attributes.* For more on that, see the question and replies at http://stats.stackexchange.com/q/9137 and http://stats.stackexchange.com/q/9358 . In particular, note that neither normalization nor standardization has any direct relevance to the Dean's problem.