How to perform a test using R to see if data follow a normal distribution

  • I have a data set with the following structure:

    a word | number of occurrences of the word in a document | a document id
    

    How can I perform a test for normal distribution in R? It is probably an easy question, but I am an R newbie.

    @Skarab Maybe I'm totally off, but wouldn't you expect that the frequency of any word will be inversely proportional to its rank in the frequency table of words, according to Zipf's law (http://j.mp/9er2lv)? In this case, check out the `zipfR` package.
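
    The Zipf relationship mentioned above can be eyeballed in base R before reaching for a dedicated package. This is only a minimal sketch: the toy text and the variable names are made up for illustration, and a corpus this small is far too short to show Zipf's law reliably. The idea is that log(frequency) should be roughly linear in log(rank), with a negative slope.

    ```r
    # Toy corpus; the text itself is made up for illustration.
    text <- "the cat sat on the mat the dog sat on the log the cat saw the dog"
    tokens <- strsplit(text, " ", fixed = TRUE)[[1]]

    # Frequency table sorted from most to least frequent word.
    freq <- sort(table(tokens), decreasing = TRUE)
    word_rank <- seq_along(freq)
    log_freq <- log(as.numeric(freq))
    log_rank <- log(word_rank)

    # Under Zipf's law, log(frequency) is roughly linear in log(rank).
    fit <- lm(log_freq ~ log_rank)
    zipf_slope <- unname(coef(fit)[2])

    plot(log_rank, log_freq, xlab = "log(rank)", ylab = "log(frequency)")
    abline(fit, col = 2)
    ```

    On a real corpus, a slope close to -1 would be in line with Zipf's law; a strongly curved log-log plot would suggest a different model.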

    I agree with @chl - it would be a minor miracle if your data were normally distributed. Perhaps another question about what you want to do with the data would be worthwhile. Don't reinvent the wheel!

    How could your data be distributed according to a model that gives non zero probability to negative occurrence?
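
    The point about negative occurrences can be made concrete. A minimal sketch, assuming some made-up per-document counts for a single word (the numbers are hypothetical, not from the question): fit a Normal by matching the mean and standard deviation, then ask how much probability it places below zero.

    ```r
    # Hypothetical per-document counts of one word (made-up numbers).
    counts <- c(0, 0, 1, 0, 2, 0, 0, 5, 1, 0)

    # Fit a Normal by matching the sample mean and standard deviation.
    mu <- mean(counts)
    sigma <- sd(counts)

    # Probability the fitted Normal assigns to impossible negative counts.
    p_negative <- pnorm(0, mean = mu, sd = sigma)
    p_negative
    ```

    For sparse counts like these, a sizeable share of the fitted Normal lies below zero, which is one reason count data are usually modelled with discrete distributions instead.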

    What is the reason for doing this test?

    I want to estimate if the huge result of the Information Extraction is correct. I want to check if the distribution of the entities found in the text follows my expectations (I know the domain and the text corpus).

    @chi I needed to check data related to Information Extraction, and that is why I needed the test. Of course, the word frequency follows Zipf.

  • If I understand your question correctly, then to test whether word occurrences in a set of documents follow a Normal distribution you can just use a Shapiro-Wilk test and some qqplots. For example,

    ## Generate two data sets:
    ## the first Normal, the second from a t-distribution
    words1 <- rnorm(100)
    words2 <- rt(100, df = 3)
    
    ## Have a look at the densities
    plot(density(words1))
    plot(density(words2))
    
    ## Perform the test
    shapiro.test(words1)
    shapiro.test(words2)
    
    ## Plot using a qqplot
    qqnorm(words1); qqline(words1, col = 2)
    qqnorm(words2); qqline(words2, col = 2)
    

    The qqplot commands give: [figure: normal Q-Q plots for words1 and words2]

    You can see that the second data set is clearly not Normal from its heavy tails.

    In the Shapiro-Wilk normality test, the p-value is large for the first data set (>.9) but very small for the second data set (<.01). This will lead you to reject the null hypothesis of normality for the second data set.
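
    To use the p-values in a script rather than reading them off the printed output, they can be pulled from the object that `shapiro.test` returns. A minimal sketch: the seed and the 0.05 threshold are assumptions of this example, and the exact p-values will differ from the ones quoted above because the data are random.

    ```r
    set.seed(42)  # arbitrary seed, only so the run is repeatable
    words1 <- rnorm(100)
    words2 <- rt(100, df = 3)

    alpha <- 0.05  # conventional threshold, an assumption of this sketch

    # shapiro.test() returns an "htest" object; p.value is one of its elements.
    p_values <- sapply(list(words1 = words1, words2 = words2),
                       function(x) shapiro.test(x)$p.value)

    decision <- ifelse(p_values < alpha,
                       "reject normality", "do not reject normality")
    data.frame(p_value = p_values, decision = decision)
    ```

    Note that failing to reject is not evidence that the data are Normal; it only means the test found no strong evidence against it at this sample size.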

    Why is it clearly not Normal?

    I think the closer the data are to a normal distribution, the closer the plotted points should lie to the I-III quadrant bisector.

    More generally (mean != 0), the `qqline` should have slope 1 and intercept *mu*, assuming unit standard deviation; otherwise the slope is *sigma*.
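
    The slope and intercept of the reference line can be checked numerically. A minimal sketch, assuming a Normal sample with an illustrative mean of 5 and standard deviation of 2 (both chosen here, not taken from the thread): `qqline` draws the line through the first and third quartiles, so for Normal data its slope lands near the standard deviation and its intercept near the mean. A slope of 1 therefore corresponds to unit standard deviation.

    ```r
    set.seed(1)  # arbitrary seed for repeatability
    x <- rnorm(1000, mean = 5, sd = 2)  # mean and sd chosen for illustration

    # Reproduce by hand the line that qqline() draws by default:
    # it passes through the first and third quartiles.
    probs <- c(0.25, 0.75)
    sample_q <- quantile(x, probs, names = FALSE)
    theory_q <- qnorm(probs)

    slope <- diff(sample_q) / diff(theory_q)
    intercept <- sample_q[1] - slope * theory_q[1]

    c(slope = slope, intercept = intercept)  # expected near sd = 2 and mean = 5

    qqnorm(x)
    abline(intercept, slope, col = 2)  # same construction as qqline(x)
    ```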

    @HermanToothrot it is not Normal when looking at the second plot, as there is a very large divergence in the tail values. The QQ plot is a graph of the theoretical quantiles (if the data were Normal) versus the sample quantiles (from the data). If the sample data is Normal, we expect the observations to be close to the line, as they are in the first plot. Also note the very different scale on the y axis for those plots.
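
    The description of the QQ plot above can be turned into a few lines of R that build the plot by hand. A minimal sketch, assuming a heavy-tailed sample like the one in the answer (the seed is arbitrary): sort the data to get the sample quantiles and pair them with the Normal quantiles at the plotting positions from `ppoints`.

    ```r
    set.seed(7)  # arbitrary seed for repeatability
    words2 <- rt(100, df = 3)  # heavy-tailed sample, as in the answer

    n <- length(words2)
    theoretical <- qnorm(ppoints(n))  # quantiles expected if the data were Normal
    observed <- sort(words2)          # sample quantiles

    plot(theoretical, observed,
         xlab = "Theoretical Quantiles", ylab = "Sample Quantiles")
    qqline(words2, col = 2)

    # qqnorm() computes the same coordinates; plot.it = FALSE just returns them.
    qq <- qqnorm(words2, plot.it = FALSE)
    ```

    Points that bend away from the line at both ends, below it on the left and above it on the right, are the heavy-tail pattern discussed in the comments.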

Licensed under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM