How to perform a test using R to see if data follows a normal distribution
I have a data set with the following structure:
a word | number of occurrence of a word in a document | a document id
How can I perform a test for normal distribution in R? It is probably an easy question, but I am an R newbie.
@Skarab Maybe I'm totally off, but wouldn't you expect that the frequency of any word will be inversely proportional to its rank in the frequency table of words, according to Zipf's law (http://j.mp/9er2lv)? In this case, check out the `zipfR` package.
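A rough way to look for the Zipf-like shape mentioned above, without relying on any particular package, is to plot frequency against rank on log-log axes and fit a straight line. This is only a sketch in base R: the `counts` vector is simulated stand-in data, and a least-squares fit on the log-log scale is a quick diagnostic rather than a rigorous estimate of the exponent.

```r
set.seed(1)

# Hypothetical stand-in for per-word totals: frequency roughly proportional to 1/rank
counts <- round(10000 / (1:500)) + rpois(500, lambda = 1)

# Rank the words from most to least frequent
freq <- sort(counts, decreasing = TRUE)
rank <- seq_along(freq)

# Under Zipf's law, log(frequency) is roughly linear in log(rank) with slope near -1
fit   <- lm(log(freq) ~ log(rank))
slope <- unname(coef(fit)[2])
slope

plot(log(rank), log(freq), xlab = "log(rank)", ylab = "log(frequency)")
abline(fit, col = 2)
```

With real data, `counts` would be the total number of occurrences of each word across the corpus.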
I agree with @chl - it would be a minor miracle if your data were normally distributed. Perhaps another question about what you want to do with the data would be worthwhile. Don't reinvent the wheel!
How could your data be distributed according to a model that gives nonzero probability to negative occurrence counts?
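To make that point concrete: a Normal model fitted to counts always puts some probability below zero, a value counts can never take. A small sketch, using simulated Poisson counts as a hypothetical stand-in for word occurrences:

```r
set.seed(1)

# Hypothetical occurrence counts: non-negative integers
occ <- rpois(200, lambda = 2)

# Probability that a Normal with the same mean and sd assigns to values below zero
p_neg <- pnorm(0, mean = mean(occ), sd = sd(occ))
p_neg
```

The smaller the typical count relative to its spread, the larger this impossible probability mass becomes.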
I want to assess whether the large output of an Information Extraction run is correct. I want to check whether the distribution of the entities found in the text follows my expectations (I know the domain and the text corpus).
If I understand your question correctly, then to test whether word occurrences in a set of documents follow a Normal distribution you can just use a Shapiro-Wilk test and some Q-Q plots. For example,
    ## Generate two data sets
    ## First Normal, second from a t-distribution
    words1 = rnorm(100); words2 = rt(100, df=3)

    ## Have a look at the densities
    plot(density(words1));plot(density(words2))

    ## Perform the test
    shapiro.test(words1); shapiro.test(words2)

    ## Plot using a qqplot
    qqnorm(words1);qqline(words1, col = 2)
    qqnorm(words2);qqline(words2, col = 2)
The qqplot commands give:
You can see from the heavy tails that the second data set is clearly not Normal.
In the Shapiro-Wilk normality test, the p-value is large for the first data set (>.9) but very small for the second data set (<.01). This leads you to reject the null hypothesis of normality for the second data set, but not for the first.
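To apply the same test to data shaped like the table in the question, something along these lines may work. The data frame `docs` and its column names `word`, `count` and `doc_id` are made up for illustration, and the counts are simulated. `shapiro.test` returns a list whose `p.value` element can be compared with a chosen significance level.

```r
set.seed(42)

# Hypothetical data in the question's layout: word | count | document id
docs <- data.frame(
  word   = rep(c("alpha", "beta"), each = 50),
  count  = c(rpois(50, lambda = 3), rpois(50, lambda = 1)),
  doc_id = rep(1:50, times = 2)
)

# Run the Shapiro-Wilk test on the counts of each word separately
p_values <- sapply(split(docs$count, docs$word),
                   function(x) shapiro.test(x)$p.value)
p_values

# Words for which normality is rejected at the 5% level
rejected <- names(p_values)[p_values < 0.05]
rejected
```

Note that `shapiro.test` only accepts between 3 and 5000 observations, so a very large extraction result would need to be sampled or summarised first.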
I think the closer the plotted points lie to the bisector of quadrants I and III, the closer the data are to a normal distribution.
@HermanToothrot The second data set is not Normal: in the second plot there is a very large divergence in the tail values. The QQ plot is a graph of the theoretical quantiles (what we would expect if the data were Normal) versus the sample quantiles (from the data). If the sample data are Normal, we expect the observations to be close to the line, as they are in the first plot. Also note the very different scales on the y axis of those plots.
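The construction described above can be written out by hand, which may make the plot easier to read. This sketch simulates a heavy-tailed sample with `rt`, as in the answer, and uses `ppoints`, which supplies the plotting positions that `qqnorm` uses by default.

```r
set.seed(1)

# Heavy-tailed sample, as in the answer
x <- rt(100, df = 3)
n <- length(x)

theoretical <- qnorm(ppoints(n))  # quantiles expected if the data were Normal
sample_q    <- sort(x)            # ordered observations

plot(theoretical, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")

# qqnorm computes the same coordinates (returned in the original data order)
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theoretical)
all.equal(sort(qq$y), sample_q)
```

Points far above or below the reference line at the extremes of the x axis are the heavy tails showing up.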