How to summarize data by group in R?
I have R data frame like this:
age group 1 23.0883 1 2 25.8344 1 3 29.4648 1 4 32.7858 2 5 33.6372 1 6 34.9350 1 7 35.2115 2 8 35.2115 2 9 35.2115 2 10 36.7803 1 ...
I need to get data frame in the following form:
group mean sd 1 34.5 5.6 2 32.3 4.2 ...
Group number may vary, but their names and quantity could be obtained by calling
What manipulations should be done with the data to get the result?
the commas in the result data frame mean something special, or is it just the decimal point?
@mpiktas Thank you for noting. Corrected. These were locale issues (I am russian) - we we use comma for decimal separation.
Here is the plyr one line variant using ddply:
dt <- data.frame(age=rchisq(20,10),group=sample(1:2,20,rep=T)) ddply(dt,~group,summarise,mean=mean(age),sd=sd(age))
Here is another one line variant using new package data.table.
dtf <- data.frame(age=rchisq(100000,10),group=factor(sample(1:10,100000,rep=T))) dt <- data.table(dtf) dt[,list(mean=mean(age),sd=sd(age)),by=group]
This one is faster, though this is noticeable only on table with 100k rows. Timings on my Macbook Pro with 2.53 Ghz Core 2 Duo processor and R 2.11.1:
> system.time(aa <- ddply(dtf,~group,summarise,mean=mean(age),sd=sd(age))) utilisateur système écoulé 0.513 0.180 0.692 > system.time(aa <- dt[,list(mean=mean(age),sd=sd(age)),by=group]) utilisateur système écoulé 0.087 0.018 0.103
Further savings are possible if we use
> setkey(dt,group) > system.time(dt[,list(mean=mean(age),sd=sd(age)),by=group]) utilisateur système écoulé 0.040 0.007 0.048
@chl, it gave me a chance to try out this new **data.table** package. It looks really promising.
+6000 for data.table. It really is so much faster than ddply, even for me on datasets smaller than 100k (I have one with just 20k rows). Must be something to do with the functions I am applying, but ddply will take minutes and data.table a few seconds.
Simple typo: I think you meant `dt <- data.table(dtf)` instead of `dt <- data.table(dt)` in the second code block. That way, you are creating the data table from a data frame instead of from the `dt` function from the `stats` package. I tried editing it, but I cannot do edits under six characters.
In my (not humble in this case) opinion `data.table` is the best way to aggregate data and this answer is great, but still only scratches the surface. Aside from being syntactically superior, it's also extremely flexible and has many advanced features that involve joins and internal mechanics. Check out the FAQ, github page, or course for more info.