Degrees of freedom for Chi-squared test

  • I am facing the following dilemma. I know how to work with the one-sided chi-squared distribution, but I am unsure how to count the degrees of freedom. Let me clarify with an example.

    I have the following observed and expected values:

    [Observed data]
    
    #Periods       CountryI   CountryII   CountryIII
    #(1900-1950)     100        150          20
    #(1951-2000)      59        160          50
    
    [Expected data]
    
    #Periods       CountryI   CountryII   CountryIII
    #(1900-1950)    118.4        52          40
    #(1951-2000)     80.5        90          25
    

    My question is: since this is a one-sided chi-squared test, are the degrees of freedom given by the formula (columns-1)(rows-1), in which case I would have $(6-1)(2-1) = 5$?

    Or is it really just the three countries (CountryI, CountryII, CountryIII) that matter, so that d.f. would be 3-1=2?

    I ask because d.f. is usually defined as the number of terms in the chi-squared sum, here 6, minus 1.

    Please help me out with this one.

    Could you please clarify how you got the expected values and what hypothesis you are trying to test? The correct degrees of freedom depend on that. Most answers assume that you are interested in independence (is the effect of period the same for each country?).

  • The number of variables in your cross-classification, and the number of levels of each, determine the degrees of freedom of your $\chi^2$-test. In your case, you are actually cross-classifying two variables (period and country) in a 2-by-3 table.

    So the dof are $(2-1)\times (3-1)=2$ (see e.g., Pearson's chi-square test for justification of its computation). I don't see where you got the $6$ in your first formula, and your expected frequencies are not correct, unless I misunderstood your dataset.

    A quick check in R gives me:

    > my.tab <- matrix(c(100, 59, 150, 160, 20, 50), ncol=3)
    > my.tab
         [,1] [,2] [,3]
    [1,]  100  150   20
    [2,]   59  160   50
    > chisq.test(my.tab)
    
        Pearson's Chi-squared test
    
    data:  my.tab 
    X-squared = 23.7503, df = 2, p-value = 6.961e-06
    
    > chisq.test(my.tab)$expected
            [,1]     [,2]     [,3]
    [1,] 79.6475 155.2876 35.06494
    [2,] 79.3525 154.7124 34.93506
    
  • Degrees of Freedom are (r-1)(c-1).

    You have

    2 rows: 1900-1950 and 1951-2000

    3 columns: CountryI CountryII CountryIII

    Thus (2-1)(3-1) = 2

    The product r x c gives the number of cells in the table, which is six (not the total number of observations, which is 539). However, the cell count is not used directly in your calculation of the df.

    Edit: If you were doing a goodness-of-fit test then yes, it would be n-1 (with n the number of categories), but you have a contingency table (r x c) where neither r nor c equals 1, so you have to use (r-1)(c-1).

    Edit #2 for dimbo (I can't comment): Expected values should be calculated as (row total)(column total) / (total # of observations). Thus the expected value for the r1,c1 position is (270)(159) / (539) = 79.65, which matches the values chl gave you.
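
    As a rough illustration of the formula in Edit #2, here is a minimal Python sketch using only the standard library. The names `observed` and `expected_counts` are my own choices for illustration and do not come from the thread; the counts are the observed table from the question.

```python
# Observed counts from the question: rows are periods, columns are countries.
observed = [
    [100, 150, 20],  # 1900-1950
    [59, 160, 50],   # 1951-2000
]


def expected_counts(table):
    """Expected counts under independence: (row total)(column total) / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(x, 4) for x in row])
```

    The printed rows should agree, up to rounding, with the expected table in the R and SAS output in this thread.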

    Edit #: SAS code confirming Chi

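    /* Read the six cell counts; a indexes the period (row), b the country (column) */
    data question;
      do a=1 to 2;
        do b=1 to 3;
          input var @@;
          output;
        end;
      end;
    datalines;
    100 150 20
    59  160 50
    ;
    run;
    
    /* Weight each (a, b) cell by its count; request chi-square statistics and expected counts */
    proc freq data = question;
      weight var;
      tables a*b /
        chisq expected norow nocol;
    run;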
    

    Output

          Frequency|
          Expected |
          Percent  |       1|       2|       3|  Total
          ---------+--------+--------+--------+
                 1 |    100 |    150 |     20 |    270
                   | 79.647 | 155.29 | 35.065 |
                   |  18.55 |  27.83 |   3.71 |  50.09
          ---------+--------+--------+--------+
                 2 |     59 |    160 |     50 |    269
                   | 79.353 | 154.71 | 34.935 |
                   |  10.95 |  29.68 |   9.28 |  49.91
          ---------+--------+--------+--------+
          Total         159      310       70      539
                      29.50    57.51    12.99   100.00
    
    
      Statistics for Table of a by b
    
      Statistic                     DF       Value      Prob
      ------------------------------------------------------
      Chi-Square                     2     23.7503    <.0001
      Likelihood Ratio Chi-Square    2     24.2964    <.0001
      Mantel-Haenszel Chi-Square     1     23.3700    <.0001
      Phi Coefficient                       0.2099
      Contingency Coefficient               0.2054
      Cramer's V                            0.2099
    
                                                      Sample Size = 539
    

    Nothing in the question is indicative of a GOF test because it would require the OP to supply theoretical proportions rather than observed counts (and we couldn't get such a table of expected counts this way).

    Yup, I was saying that his way of calculating the expected values, by turning everything into a table with 1 row or 1 column and then just using n-1, was more like the process for a GOF test.

  • Wait a minute, I think Sandra means 5 rather than 6.

    Maybe chl can correct me on this, but I think it should be right. If we take the definition that $\chi^2$ is evaluated as follows,

    $$\chi^2 = \sum_{i=1}^{\#\text{Rows}} \frac{(\text{observed}_i - \text{expected}_i)^2}{\text{expected}_i}$$

    and arrange the data as follows:

    Observed [O] | Expected [E] | (O-E)^2/E
        100      |    118.4     |
        150      |     52       |
         20      |     40       |
         59      |     80.5     |
        160      |     90       |
         50      |     25       |
    

    Thus, the total number of terms for calculating $\chi^2$ is 6 (we are adding up the final column of terms, which has 6 rows). By definition, d.f. = number of rows (that is, of expected frequencies) - 1. Thus we obtain 5.

    By the way, why are you saying that Sandra's expected values are incorrect?

    mpiktas, can you please explain, step by step, how I would get the p-value using the definition I stated above?

    I still don't really understand the other answers. Could someone who is in a position to do so take the definition I provided above and give a step-by-step procedure for calculating the $\chi^2$ value and then the p-value?

    For testing the independence of the two variables, the summation in your formula should run over all observed cells, not only over the rows. What you describe corresponds to a GOF test, which is not obvious from the question. Since we are speculating anyway, we could imagine variations on RxC tables in which one of the variables is treated as an ordinal outcome, like in this question, but I believe we have to wait for further clarification :)
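
    For the step-by-step calculation requested above, here is a minimal Python sketch under the independence hypothesis, summing over all six cells. It uses only the standard library; the variable names are mine. One shortcut to be aware of: for exactly 2 degrees of freedom the upper-tail probability of the chi-squared distribution has the closed form exp(-x/2), so the sketch relies on that instead of a statistics library and refuses to run for any other df.

```python
import math

# Observed counts from the question: rows are periods, columns are countries.
observed = [[100, 150, 20], [59, 160, 50]]

# Step 1: margins and expected counts under independence.
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Step 2: sum (O - E)^2 / E over all six cells, not only over the rows.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Step 3: degrees of freedom for an r x c contingency table.
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Step 4: upper-tail p-value. The closed form exp(-x / 2) holds only for df = 2;
# any other df would need a proper chi-squared survival function.
if df != 2:
    raise ValueError("closed-form p-value in this sketch is valid only for df = 2")
p_value = math.exp(-chi2 / 2)

print(f"X-squared = {chi2:.4f}, df = {df}, p-value = {p_value:.3e}")
```

    The result should agree with the `chisq.test` output quoted earlier in the thread to the printed precision.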

    Reflect on how you computed the expected values: you had to use two row means and three column means, which is *five* numbers. However, there is one (linear) relation among them (because both sets are tied to the grand mean). That means you used 5-1 = 4 degrees of freedom in computing the expected values. Because there are 6 observations, you only have 6 - 4 = 2 degrees of freedom left. Note how your method of placing all the data into a single column obscures the information about degrees of freedom, whereas the standard tabular layout implicitly gives this information.
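
    The counting argument in this comment can be written out for a general table with $r$ rows and $c$ columns. This is only a restatement of the reasoning above, with $r$ and $c$ as the symbols already used in the other answers:

    $$\text{df} = rc - \bigl[(r-1) + (c-1) + 1\bigr] = rc - r - c + 1 = (r-1)(c-1),$$

    so that for the 2-by-3 table here $\text{df} = 6 - \bigl[(2-1) + (3-1) + 1\bigr] = 6 - 4 = 2$.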

  • The degrees of freedom for a chi-square test on a contingency table are determined by the number of expected counts that must be estimated independently. In your 2x3 table the row and column totals are already known, so you need to estimate just two expected counts using the formula (row total)*(column total)/N. The remaining expected counts can be found by subtraction from the row or column totals. For example, if you estimate the first two cells of the first row, the third cell of that row is the first row total minus those two estimates; and once the first row is known, the second row follows from the column totals, which are also known.
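
    A small Python sketch of that subtraction argument, using the margins of the observed table (row totals 270 and 269, column totals 159, 310 and 70). The variable names are hypothetical; only two cells are computed from the formula and the other four are filled in from the totals.

```python
# Margins of the observed 2x3 table from the question.
row_totals = [270, 269]
col_totals = [159, 310, 70]
n = sum(row_totals)  # 539

# Only these two expected counts are computed from the formula.
e11 = row_totals[0] * col_totals[0] / n
e12 = row_totals[0] * col_totals[1] / n

# The third cell of the first row follows from the first row total.
e13 = row_totals[0] - e11 - e12

# The whole second row follows from the column totals.
first_row = [e11, e12, e13]
second_row = [c - e for c, e in zip(col_totals, first_row)]

expected = [first_row, second_row]
for row in expected:
    print([round(x, 4) for x in row])
```

    Two free cells determine the whole table, which is one way to see why the test has 2 degrees of freedom.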

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM