How to summarize data by group in R?

  • I have an R data frame like this:

            age group
    1   23.0883     1
    2   25.8344     1
    3   29.4648     1
    4   32.7858     2
    5   33.6372     1
    6   34.9350     1
    7   35.2115     2
    8   35.2115     2
    9   35.2115     2
    10  36.7803     1
    ...
    

    I need to get data frame in the following form:

    group mean     sd
    1     34.5     5.6
    2     32.3     4.2
    ...
    

    The number of groups may vary; their names and count can be obtained by calling `levels(factor(data$group))`.

    What manipulations should be done with the data to get the result?

    Do the commas in the result data frame mean something special, or are they just decimal points?

    @mpiktas Thank you for noticing. Corrected. These were locale issues (I am Russian) - we use a comma as the decimal separator.

    I suspected that. All of Europe uses the comma except the British.

    Despite not being British, I prefer the dot as a decimal separator.

    See `aggregate`, `tapply`, and then http://stackoverflow.com for any subsequent coding questions of this type.
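
    For completeness, a minimal base-R sketch of those two suggestions, assuming the data frame is called `data` with the `age` and `group` columns from the question:

    # aggregate: one call per statistic, then merge the results by group
    means <- aggregate(age ~ group, data = data, FUN = mean)
    sds   <- aggregate(age ~ group, data = data, FUN = sd)
    res   <- merge(means, sds, by = "group")
    names(res) <- c("group", "mean", "sd")

    # tapply: returns one named vector per statistic, ordered by group level
    data.frame(group = levels(factor(data$group)),
               mean  = tapply(data$age, data$group, mean),
               sd    = tapply(data$age, data$group, sd))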

  • mpiktas (correct answer, 10 years ago)

    Here is the plyr one-line variant using `ddply`:

    library(plyr)
    dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T))
    ddply(dt,~group,summarise,mean=mean(age),sd=sd(age))
    

    Here is another one-line variant using the new **data.table** package.

    library(data.table)
    dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T)))
    dt <- data.table(dtf)
    dt[,list(mean=mean(age),sd=sd(age)),by=group]
    

    This one is faster, though the difference is noticeable only on tables with around 100k rows. Timings on my MacBook Pro with a 2.53 GHz Core 2 Duo processor and R 2.11.1:

    > system.time(aa <- ddply(dtf,~group,summarise,mean=mean(age),sd=sd(age)))
       user  system elapsed 
      0.513   0.180   0.692 
    > system.time(aa <- dt[,list(mean=mean(age),sd=sd(age)),by=group])
       user  system elapsed 
      0.087   0.018   0.103 
    

    Further savings are possible if we use `setkey`:

    > setkey(dt,group)
    > system.time(dt[,list(mean=mean(age),sd=sd(age)),by=group])
       user  system elapsed 
      0.040   0.007   0.048 
    

    @chl, it gave me a chance to try out this new **data.table** package. It looks really promising.

    +6000 for data.table. It really is so much faster than `ddply`, even for me on datasets smaller than 100k rows (I have one with just 20k rows). It must be something to do with the functions I am applying, but where `ddply` takes minutes, `data.table` takes a few seconds.

    Simple typo: I think you meant `dt <- data.table(dtf)` instead of `dt <- data.table(dt)` in the second code block. That way, you are creating the data table from a data frame rather than from the `dt` function in the `stats` package. I tried editing it, but edits of fewer than six characters are not allowed.

    In my (not humble in this case) opinion, `data.table` is the best way to aggregate data, and this answer is great but still only scratches the surface. Aside from being syntactically superior, it is also extremely flexible and has many advanced features involving joins and its internal mechanics. Check out the FAQ, the GitHub page, or the course for more info.
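
    To give a flavour of the join features mentioned above (a hypothetical sketch, not part of the original answer): once the key is set on `group`, another table keyed on the same column can be joined and aggregated in one pass. The `labels` lookup table below is invented purely for illustration.

    library(data.table)

    # Invented lookup table: one descriptive label per group
    labels <- data.table(group = factor(1:10), label = paste("cohort", 1:10))
    setkey(labels, group)

    # Keyed join (dt is already keyed on group above), then the usual aggregation
    joined <- dt[labels]   # adds `label` to every row of dt
    joined[, list(mean = mean(age), sd = sd(age)), by = list(group, label)]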
