### How to check for normal distribution using Excel for performing a t-test?

I want to know

**how to check a data set for normality in Excel, just to verify that the requirements for using a t-test are being met**. For the right tail, is it appropriate to just calculate a mean and standard deviation, add 1, 2, and 3 standard deviations from the mean to create a range, and then compare that to the normal 68/95/99.7 rule for the standard normal distribution, after using the NORM.DIST function in Excel to test each standard deviation value?
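To make that concrete, here is the kind of check I have in mind, sketched in Python rather than Excel (the numbers are made up for illustration):

```python
# Empirical-rule check: what fraction of the data falls within 1, 2, and 3
# sample standard deviations of the mean, versus the Normal 68/95/99.7 rule.
from statistics import mean, stdev

data = [4.8, 5.1, 3.9, 6.2, 5.5, 4.4, 5.9, 5.0, 4.1, 6.8, 5.3, 4.7]

m, s = mean(data), stdev(data)
for k, expected in [(1, 0.683), (2, 0.954), (3, 0.997)]:
    within = sum(m - k * s <= x <= m + k * s for x in data) / len(data)
    print(f"within {k} SD: {within:.3f}  (Normal: {expected:.3f})")
```

With only a dozen points the observed fractions are coarse (multiples of 1/12), which is part of why I am asking whether there is a better way.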

Or is there a better way to test for normality?

You have the right idea. This can be done systematically, comprehensively, and with relatively simple calculations. A graph of the results is called a *normal probability plot* (or sometimes a Q-Q plot). From it you can see *much* more detail than appears in other graphical representations, especially histograms, and with a little practice you can even learn to determine ways to re-express your data to make them closer to Normal in situations where that is warranted.

Here is an example:

Data are in column `A` (and named `Data`). The rest is all calculation, although you can control the "hinge rank" value used to fit a reference line to the plot.

This plot is a scatterplot comparing the data to values that would be attained by numbers drawn independently from a standard Normal distribution. When the points line up along the diagonal, they are close to Normal; horizontal departures (along the data axis) indicate departures from normality. In this example the points are remarkably close to the reference line; the largest departure occurs at the highest value, which is about $1.5$ units to the left of the line. Thus we see at a glance that these data are very close to Normally distributed but perhaps have a slightly "light" right tail. This is perfectly fine for applying a t-test.

The comparison values on the vertical axis are computed in two steps. First each data value is ranked from $1$ through $n$, the amount of data (shown in the `Count` field in cell `F2`). These ranks are proportionally converted to values in the range $0$ to $1$. A good formula to use is $\left(\text{rank}-1/6\right)/\left(n+2/3\right)$. (See http://www.quantdec.com/envstats/notes/class_02/characterizing_distributions.htm for where that comes from.) Then these are converted to standard Normal values via the `NormSInv` function. These values appear in the `Normal score` column. The plot at the right is an XY scatterplot of `Normal Score` against the data. (In some references you will see the transpose of this plot, which perhaps is more natural, but Excel prefers to place the leftmost column on the horizontal axis and the rightmost column on the vertical axis, so I have let it do what it prefers.)

(As you can see, I simulated these data with independent random draws from a Normal distribution with mean $5$ and standard deviation $2$. It is therefore no surprise that the probability plot looks so nice.)

There really are only two formulas to type in, which you propagate downward to match the data: they appear in cells `B2:C2` and rely on the `Count` value computed in cell `F2`. That's really all there is to it, apart from the plotting.

The rest of this sheet is not necessary, but it is helpful for judging the plot: it provides a robust estimate of a reference line. This is done by picking two points equally far in from the left and right of the plot and connecting them with a line. In the example these points are the third lowest and third highest, as determined by the $3$ in the `Hinge Rank` cell, `F3`. As a bonus, the line's slope and intercept are robust estimates of the standard deviation and mean of the data, respectively. To plot the reference line, two extreme points are computed and added to the plot: their calculation occurs in columns `I:J`, labeled `X` and `Y`.

For the formulas in Col B, would you explain the reason to add 1 and divide by 6 and 3 ("+ 1/6" and "+ 1/3")? Also, is there a reason you chose to divide by 6 in the Hinge Rank cell?

@Michael Good questions. $1/6$ is one simple, well-known way to establish probability plotting points. I recall John Tukey recommending this in his book, *EDA*. The hinge rank formula is mysterious: I should have explained that I am picking two points equally far in from the ends at the $100\times 1/6$ and $100\times 5/6$ percentiles. Any multiplier substantially greater than $0$ and less than $1/2$ would work. $1/4$ is popular: it corresponds to the quartiles. So is $0.16$, corresponding to 1 SD for a Normal distribution.

The formula (rank + 1/6)/(n + 1/3) doesn't seem to be symmetric as we might anticipate. E.g. with $n=3$, the middle observation has rank $2$, and this would suggest a corresponding percentile of $0.65$ rather than what would seem natural for the middle observation ($0.5$). Did I miss something obvious? [I've seen Tukey use a few different formulas in different places, including $(i-1/3)/(n+1/3)$. The formula at your link fits into the common $(i-a)/(n+1-2a)$ scheme, but the formula you give in your answer doesn't.]

@Glen_b Thank you for catching that. I had misquoted the formula. My source is (my own page!) at http://www.quantdec.com/envstats/notes/class_02/characterizing_distributions.htm, where the correct formula is given and justified: generally one uses $$\frac{r-a}{n+1-2a}$$ where $r$ is the rank (from $1$ through $n$) and $a$ typically is a number between $0$ and $1$, often $1/6$. That yields $(r-1/6)/(n+2/3)$ as the desired formula. ($a=1/3$ gives the Tukey formula you quote.) I have fixed the formula and the Excel illustration.
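For anyone who wants to check the arithmetic outside Excel, here is a minimal Python sketch of the same two steps plus the hinge-based reference line (the data here are invented; `NormalDist().inv_cdf` plays the role of `NormSInv`):

```python
# Plotting positions (r - 1/6)/(n + 2/3), their normal scores, and a robust
# reference line through the hinge-rank points, mirroring the spreadsheet.
from statistics import NormalDist

data = sorted([3.1, 7.4, 5.0, 4.2, 6.1, 5.8, 4.9, 6.6, 3.9, 5.3])
n = len(data)

# Normal scores: values an ideal standard-Normal sample of size n would take.
scores = [NormalDist().inv_cdf((r - 1/6) / (n + 2/3)) for r in range(1, n + 1)]

# Hinge rank 3: fit a line through the 3rd-lowest and 3rd-highest points.
h = 3
slope = (data[n - h] - data[h - 1]) / (scores[n - h] - scores[h - 1])
intercept = data[h - 1] - slope * scores[h - 1]
print(f"robust SD estimate   ~ {slope:.3f}")
print(f"robust mean estimate ~ {intercept:.3f}")
```

Plotting `data` against `scores` (in either orientation) gives the probability plot; the slope and intercept of the hinge line estimate the SD and mean as described in the answer.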

You could plot a histogram using the data analysis toolpack in Excel. Graphical approaches are more likely to communicate the degree of non-normality, which is typically more relevant for assumption testing (see this discussion of normality).

The data analysis toolpack in Excel will also give you skewness and kurtosis if you ask for descriptive statistics and choose the "summary statistics" option. You might, for example, consider values of skewness above $+1$ or below $-1$ to be a form of substantive non-normality.

That said, the assumption behind t-tests is that the residuals are normally distributed, not the variable itself. Furthermore, t-tests are quite robust, such that even with fairly large amounts of non-normality, p-values are still fairly valid.

This question borders on statistics theory too - testing for normality with limited data may be questionable (although we all have done this from time to time).

As an alternative, you can look at the kurtosis and skewness coefficients. **Hahn and Shapiro: Statistical Models in Engineering** provides some background on the properties Beta1 and Beta2 (pages 42 to 49), along with Fig 6-1 on page 197. Additional theory behind this can be found on Wikipedia (see Pearson distribution).

Basically, you need to calculate the so-called properties Beta1 and Beta2. Beta1 = 0 and Beta2 = 3 suggest that the data set approaches normality. This is a rough test, but with limited data it could be argued that any test is a rough one.

Beta1 is related to moments 2 and 3, i.e. the variance and skewness: $\beta_1 = \mu_3^2/\mu_2^3$, where $\mu_k$ is the $k$-th central moment. Excel's SKEW function already standardizes the third moment (it divides by the cube of the standard deviation), so where ... is your data array the formula is approximately:

`Beta1 = SKEW(...)^2`

Beta2 is related to moments 2 and 4, i.e. the variance and kurtosis: $\beta_2 = \mu_4/\mu_2^2$. Excel's KURT function returns *excess* kurtosis (it subtracts 3), so the formula is approximately:

`Beta2 = KURT(...) + 3`

(Both functions also apply small-sample corrections, hence "approximately".)

Then you can check these against the values of 0 and 3, respectively. This has the advantage of potentially identifying other distributions (including Pearson Distributions I, I(U), I(J), II, II(U), III, IV, V, VI, VII). For example, many of the commonly used distributions such as Uniform, Normal, Student's t, Beta, Gamma, Exponential, and Log-Normal can be indicated from these properties:

```
Where: 0 <= Beta1 <= 4,  1 <= Beta2 <= 10

Uniform:     [0, 1.8]                                    [point]
Exponential: [4, 9]                                      [point]
Normal:      [0, 3]                                      [point]
Students-t:  (0, 3)   to [0, 10]                         [line]
Lognormal:   (0, 3)   to [3.6, 10]                       [line]
Gamma:       (0, 3)   to (4, 9)                          [line]
Beta:        (0, 3)   to (4, 9),    (0, 1.8) to (4, 9)   [area]
Beta J:      (0, 1.8) to (4, 9),    (0, 1.8) to [4, 6*]  [area]
Beta U:      (0, 1.8) to (4, 6),    [0, 1]   to [4.5)    [area]
Impossible:  (0, 1)   to (4.5),     (0, 1)   to (4, 1]   [area]
Undefined:   (0, 3)   to (3.6, 10), (0, 10)  to (3.6, 10)[area]

Values of (Beta1, Beta2), where brackets mean:
  [ ] : includes (closed)
  ( ) : approaches but does not include (open)
   *  : approximate
```

These are illustrated in Hahn and Shapiro Fig 6-1.
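As a rough illustration of the same calculation in Python (using population central moments, without Excel's small-sample corrections; the data are invented):

```python
# Pearson's beta1 = mu3^2 / mu2^3 and beta2 = mu4 / mu2^2 from central
# moments. A Normal sample should give beta1 near 0 and beta2 near 3.
from statistics import mean

def pearson_betas(xs):
    m = mean(xs)
    mu2 = mean([(x - m) ** 2 for x in xs])
    mu3 = mean([(x - m) ** 3 for x in xs])
    mu4 = mean([(x - m) ** 4 for x in xs])
    return mu3 ** 2 / mu2 ** 3, mu4 / mu2 ** 2

# A symmetric, flat sample: beta1 is exactly 0 and beta2 comes out near the
# uniform distribution's 1.8 rather than the Normal's 3.
b1, b2 = pearson_betas([1, 2, 3, 4, 5])
print(b1, b2)  # -> 0.0 1.7
```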

Granted this is a very rough test (with some issues) but you may want to consider it as a preliminary check before going to a more rigorous method.

There are also adjustment mechanisms to the calculation of Beta1 and Beta2 where data is limited - but that is beyond this post.

A lot of this material can work well for large datasets and I agree with your initial assessment that normality testing can be limited or questionable with small datasets. But given the *huge* variability of skewness and kurtosis, it would seem that any effort to identify the type of underlying distribution based on these statistics would be even more questionable and less certain. Consequently, wouldn't this approach be (at best) misleading even as a preliminary check?

Perhaps it is best to qualify the method further: Hahn and Shapiro (as referenced above) advise that caution should be exercised, especially when the sample size is less than 200, and recommend that this be followed by further verification, such as a frequency table that compares the fitted distribution with the actual data. But in my view it is a useful method that *can suggest* where the data might lie within a spectrum of possibilities. I have used it on data sets no smaller than roughly 3000 and have built it into computer simulation software, where it has proved useful.

I can see your method giving useful information with datasets of 3000 or greater. However, then there is no need to perform distributional testing to assess the applicability of a t-test of the mean.

Whether one views this as a useful technique, as I do, or otherwise, as appears to be your view, it nonetheless is a quick and long-established (by Pearson) alternative to testing for normality (and Students-t application) in the context of this thread. Please don't get me wrong, I acknowledge and do agree with your concerns. But we would both agree, would we not, that without prior information, trying to establish whether an entire population can be modelled on a Gaussian from a very small data sample is a shot in the dark at best with any method, and at worst is dangerous.

That's right. All I am saying is that if it is dangerous to try, from a small sample, to test whether the population is Gaussian, then it must be at least as dangerous to use the skewness and kurtosis to identify what the underlying distribution might be! In fact, it seems like such an attempt would actually be worse because it relies on unstable statistics like the kurtosis. Although Pearson's system can be a powerful guide to help people identify possible distributions, it provides less insight than even limited graphical displays like histograms.

It can be argued that the Student's-t should not be applied unless there is prior information that the population is in fact Gaussian, as it is designed for less than 30 samples anyway. The Students-t is in essence a way to 'narrow the spread' of a predicted population as a small number of samples is increased, but the assumption is that the population must be Gaussian to start with.

That argument continues to take place. Going back at least to Box 50 years ago, many have pointed out that the t-test actually is an approximation to a non-parametric (permutation) test: it does not require Normality of the population, but only approximate Normality of the *sampling distribution* of the mean. Simulations (and some theory) indicate the t-test may break down with skewed distributions, but even then it can be surprisingly robust.
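To make the permutation-test connection concrete, here is a small Python sketch of a two-sample permutation test on the difference of means, the reference analysis the t-test approximates (the data and groups are invented):

```python
# Permutation test: repeatedly shuffle the group labels and ask how often a
# relabelled difference of means is as extreme as the observed difference.
import random
from statistics import mean

random.seed(0)
a = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7, 5.2, 4.6]
b = [4.2, 4.9, 4.4, 5.0, 4.1, 4.7, 4.3, 4.5]

observed = mean(a) - mean(b)
pooled = a + b
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
    if abs(diff) >= abs(observed):
        extreme += 1

p_value = extreme / trials  # two-sided permutation p-value
print(f"observed difference {observed:.3f}, permutation p ~ {p_value:.4f}")
```

No Normality assumption about the population enters anywhere; the t-test earns its keep by approximating this reference distribution cheaply.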

@whuber, I surrender. You've just exceeded my depth of knowledge on this subject ;-}

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM


See also: https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot