How to perform an ANCOVA in R

  • I want to perform an ANCOVA analysis of data on the density of plant epiphytes. First, I would like to know whether plant density differs between two slopes, one N and one S, but I also have other data such as altitude, canopy openness and height of the host plant. I believe the two slopes (N and S) would have to be my grouping factor. I built this model, which runs in R, but I have no idea whether it is specified correctly. I would also like to know what the difference is between using the symbol + and the symbol *.

    model1 <- aov(density ~ slope + altitude + canopy + height)

    + will fit main effects only; * will fit the main effects plus the interaction between the terms it connects. ANCOVA frameworks usually estimate only a main effect for the continuous covariate, but may include interactions among the grouping factors.
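    A minimal sketch of the difference, using the built-in mtcars data as a stand-in for the epiphyte data (am plays the role of the two-level slope factor, hp that of a continuous covariate):

```r
# '+' = main effects only; '*' = main effects plus their interaction
mtcars$am <- factor(mtcars$am)

main_only <- aov(mpg ~ am + hp, data = mtcars)  # am + hp
with_int  <- aov(mpg ~ am * hp, data = mtcars)  # am + hp + am:hp

attr(terms(main_only), "term.labels")  # "am" "hp"
attr(terms(with_int),  "term.labels")  # "am" "hp" "am:hp"
```

    The only difference between the two fits is the extra am:hp interaction term produced by *.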

  • The basic tool for this is lm; note that aov is a wrapper for lm.

    In particular, if you have some grouping variable (factor), $g$, and a continuous covariate $x$, the model y ~ x + g would fit a main effects ANCOVA model, while y ~ x * g would fit a model which includes interaction with the covariate. aov will take the same formulas.

    Pay particular attention to the Note in the help on aov.

    As for + vs *, russellpierce pretty much covers it, but I'd recommend you look at ?lm and ?formula, and most especially section 11.1 of the manual An Introduction to R that ships with R (find it via the "Help" menu in R or RStudio, or online).

    Suppose I have two grouping factors $g_1,g_2$ and two covariates $x_1,x_2$, with my model being $$y_{ij}=\mu+\alpha_i+\eta_j+x_{ij1}\gamma_1+x_{ij2}\gamma_2+\epsilon_{ij}$$ Does y ~ g_1 + g_2 + x_1 + x_2 do the same trick? Do the F values obtained for x_1 and x_2 test $\gamma_1=0$ and $\gamma_2=0$ respectively?

    Not sure how I missed this. Yes. And if you want to test both at once, fit the model with and without them and pass the two fitted lm objects to `anova()` (you'll soon see if you give them in the wrong order, because some SS will be negative if you do).
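    Sketched on mtcars, with cyl and gear standing in for the grouping factors and hp and wt for the covariates (all names here are illustrative, not from the original question):

```r
mtcars$cyl  <- factor(mtcars$cyl)
mtcars$gear <- factor(mtcars$gear)

# Reduced model imposes gamma1 = gamma2 = 0; full model adds both covariates
fit_without <- lm(mpg ~ cyl + gear,           data = mtcars)
fit_with    <- lm(mpg ~ cyl + gear + hp + wt, data = mtcars)

# Joint F test of both covariates; reduced model first keeps the SS positive
anova(fit_without, fit_with)
```

    The second row of the anova() table reports a 2-df F test of the two covariates jointly.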

  • I recommend getting and reading Discovering Statistics using R by Field. He has a nice section on ANCOVA.

    To run an ANCOVA in R, load the following packages (Anova() below comes from car, and glht()/mcp() from multcomp):

    library(car)
    library(multcomp)

    If you are using lm or aov (I use aov), make sure that you set the contrasts using the contrasts() function before calling aov or lm. R uses non-orthogonal contrasts by default, which can mess everything up in an ANCOVA. If you want to set orthogonal contrasts use:

    contrasts(dataname$factorvariable) <- contr.poly(3)  # 3 = number of factor levels

    then run your model as

    model.1 <- aov(dv ~ covariate + factorvariable, data = dataname)

    To view the model use:

    Anova(model.1, type="III") 

    Make sure you use the capital-"A" Anova() (from the car package) here and not base anova(). This will give results using Type III SS.

    summary.lm(model.1) will give another summary and includes the R-sq. output.

    posth=glht(model.1, linfct=mcp(factorvariable="Tukey"))  ##gives the post-hoc Tukey analysis
    summary(posth) ##shows the output in a nice format.

    If you want to test for homogeneity of regression slopes you can also include an interaction term for the IV and covariate. That would be:

    model <- aov(dv ~ covariate + IV + covariate:IV, data = dataname)

    If the interaction term is significant then you do not have homogeneity of regression slopes.
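    The whole recipe, sketched on the built-in ToothGrowth data (len as the DV, dose as the covariate, supp as the two-level factor; assumes the car and multcomp packages are installed):

```r
library(car)       # for Anova(..., type = "III")
library(multcomp)  # for glht() and mcp()

tg <- ToothGrowth
contrasts(tg$supp) <- contr.poly(2)           # orthogonal contrasts (2 levels)

model.1 <- aov(len ~ dose + supp, data = tg)
Anova(model.1, type = "III")                  # Type III SS table
summary.lm(model.1)                           # coefficients plus R-squared

posth <- glht(model.1, linfct = mcp(supp = "Tukey"))
summary(posth)                                # post-hoc comparison of supp levels

# Homogeneity-of-slopes check: a significant dose:supp term violates it
model.int <- aov(len ~ dose + supp + dose:supp, data = tg)
summary(model.int)
```

    With a two-level factor the Tukey set reduces to a single comparison, but the same calls scale to factors with more levels.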

    Why do non-orthogonal contrasts mess everything up?

    To answer the question above about why non-orthogonal contrasts mess everything up: R defaults to treatment (dummy) contrasts, which compare each level to a baseline and are not orthogonal, and this can cause problems if you want to see the contribution of each IV separately. When we specify orthogonal contrasts we tell R that we want the SS for the IVs to be completely partitioned and non-overlapping. In this way we can see the variation attributed to each predictor cleanly and clearly. If you do not specify them, R keeps the default baseline coding, whose SS are harder to interpret in a Type III analysis.
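    A quick way to see the difference: the columns of an orthogonal contrast matrix sum to zero (they are orthogonal to the intercept), while R's default treatment coding is not:

```r
contr.treatment(3)                  # default dummy coding, baseline = level 1
contr.poly(3)                       # orthogonal linear and quadratic contrasts

colSums(contr.treatment(3))         # 1 1 -> not orthogonal to the intercept
round(colSums(contr.poly(3)), 10)   # 0 0 -> orthogonal to the intercept
```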

    Why the interest in type III SS?

  • Here is complementary documentation of the procedure suggested by @Butorovich. In addition, my observation is that when the covariate is binary, summary(lm.object) gives the same IV estimate as generated by Anova(lm.object, type="III").

    It isn't clear that this is supposed to be an answer. Is it? If so, please edit to clarify. If it is a question, please ask by clicking the `ASK QUESTION` at the top & asking it there. Then we can help you properly.

    Agreed. The message has been revised as a (complementary) answer to the previous one.

  • We use regression analysis to create models that describe the effect of variation in predictor variables on the response variable. Sometimes, if we have a categorical variable with values like Yes/No or Male/Female, a simple regression analysis gives multiple results, one for each value of the categorical variable. In such a scenario, we can study the effect of the categorical variable by using it along with the predictor variable and comparing the regression lines for each level of the categorical variable. Such an analysis is termed Analysis of Covariance, or ANCOVA.

    Consider the R built-in data set mtcars. In it we observe that the field am represents the type of transmission (auto or manual). It is a categorical variable with values 0 and 1. The miles-per-gallon value (mpg) of a car can also depend on it, besides the value of horse power (hp). We study the effect of the value of am on the regression between mpg and hp. This is done by using the aov() function followed by the anova() function to compare the two regressions.

    Input Data
    Create a data frame containing the fields mpg, hp and am from the data set mtcars. Here we take mpg as the response variable, hp as the predictor variable and am as the categorical variable.

    input <- mtcars[,c("am","mpg","hp")]
    head(input)

    When we execute the above code, it produces the following result:

                      am  mpg  hp
    Mazda RX4          1 21.0 110
    Mazda RX4 Wag      1 21.0 110
    Datsun 710         1 22.8  93
    Hornet 4 Drive     0 21.4 110
    Hornet Sportabout  0 18.7 175
    Valiant            0 18.1 105

    ANCOVA Analysis
    We create a regression model taking hp as the predictor variable and mpg as the response variable, taking into account the interaction between am and hp.

    Model with interaction between categorical variable and predictor variable

    Create regression model1

    result1 <- aov(mpg ~ hp * am, data = mtcars)
    summary(result1)

    When we execute the above code, it produces the following result:

                Df Sum Sq Mean Sq F value   Pr(>F)    
    hp           1  678.4   678.4  77.391 1.50e-09 ***
    am           1  202.2   202.2  23.072 4.75e-05 ***
    hp:am        1    0.0     0.0   0.001    0.981    
    Residuals   28  245.4     8.8                     
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    This result shows that both horse power and transmission type have a significant effect on miles per gallon, as the p-value in both cases is less than 0.05. But the interaction between these two variables is not significant, as its p-value is more than 0.05.

    Model without interaction between categorical variable and predictor variable

    Create the regression model2

    result2 <- aov(mpg ~ hp + am, data = mtcars)
    summary(result2)

    When we execute the above code, it produces the following result:

                Df Sum Sq Mean Sq F value   Pr(>F)    
    hp           1  678.4   678.4   80.15 7.63e-10 ***
    am           1  202.2   202.2   23.89 3.46e-05 ***
    Residuals   29  245.4     8.5                     
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    This result shows that both horse power and transmission type have a significant effect on miles per gallon, as the p-value in both cases is less than 0.05.

    Comparing the two models
    Now we can compare the two models to conclude whether the interaction between the variables is truly insignificant. For this we use the anova() function:

    anova(result1, result2)

     Model 1: mpg ~ hp * am
     Model 2: mpg ~ hp + am
       Res.Df    RSS Df  Sum of Sq     F Pr(>F)
     1     28 245.43                           
     2     29 245.44 -1 -0.0052515 6e-04 0.9806

    As the p-value is greater than 0.05, we conclude that the interaction between horse power and transmission type is not significant. So miles per gallon depends on horse power in a similar manner in both auto and manual transmission modes.

    So which came first, this answer or this post on tutorials point? Should this answer be deleted as plagiarism? Or did tutorials point just copy from here?

License under CC-BY-SA with attribution
