By user5243421

2009-11-02 09:01:28 8 Comments

Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second has numbers which represent the number of times I saw "First".

For example:

Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3

I want to sort the data by Category and sum the Frequencies:

Category     Frequency
First        30
Second       5
Third        34

How would I do this in R?


@docendo discimus 2014-12-03 08:02:19

More recently, you can also use the dplyr package for that purpose:

x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34

Or, for multiple summary columns (works with one column too):

x %>% 
  group_by(Category) %>% 
  summarise_each(funs(sum))
Update for dplyr >= 0.5: summarise_each has been replaced by the summarise_all, summarise_at and summarise_if family of functions in dplyr.
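In those newer dplyr versions, the multi-column summary above can be written with summarise_all or summarise_at (a sketch using the question's data; in dplyr >= 1.0 both are in turn superseded by summarise(across(...))):

```r
library(dplyr)

x <- data.frame(Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
                Frequency = c(10, 15, 5, 2, 14, 20, 3))

# summarise_all() applies the function to every non-grouping column
x %>%
  group_by(Category) %>%
  summarise_all(sum)

# summarise_at() restricts it to the columns you name
x %>%
  group_by(Category) %>%
  summarise_at(vars(Frequency), sum)

# dplyr >= 1.0 equivalent
x %>%
  group_by(Category) %>%
  summarise(across(everything(), sum))
```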

Or, if you have multiple columns to group by, you can specify all of them in the group_by separated with commas:

mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

For more information, including the %>% operator, see the introduction to dplyr.

@asieira 2015-01-23 14:35:58

How fast is it when compared to the data.table and aggregate alternatives presented in other answers?

@docendo discimus 2015-01-23 14:50:59

@asieira, Which is fastest, and how big the difference is (or whether it is noticeable at all), will always depend on your data size. Typically, for large data sets, e.g. several GB, data.table will most likely be fastest. On smaller data, data.table and dplyr are often close, also depending on the number of groups. Both data.table and dplyr will be quite a lot faster than the base functions, however (easily 100-1000 times faster for some operations).
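A quick way to settle this for your own data is to time the main approaches side by side. A sketch on a 300k-row toy table (the timings you get will depend entirely on your machine):

```r
library(dplyr)
library(data.table)

n  <- 100000
df <- data.frame(Category  = rep(c("First", "Second", "Third"), n),
                 Frequency = rep(c(10, 15, 5), n))
dt <- as.data.table(df)

# time each approach on identical data
system.time(r_base  <- aggregate(Frequency ~ Category, df, sum))
system.time(r_dplyr <- df %>% group_by(Category) %>% summarise(Frequency = sum(Frequency)))
system.time(r_dt    <- dt[, .(Frequency = sum(Frequency)), by = Category])

# all three agree on the totals; only the speed differs
identical(r_base$Frequency, r_dplyr$Frequency)
```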

@dalloliogm 2009-11-02 12:15:52

Just to add a third option:

library(doBy)  # summaryBy() comes from the doBy package

summaryBy(Frequency ~ Category, data = yourdataframe, FUN = sum)

EDIT: this is a very old answer. Now I would recommend the use of group_by and summarise from dplyr, as in @docendo discimus' answer.

@David Arenburg 2015-09-10 13:36:01

Several years later, just to add another simple base R solution that isn't present here for some reason: xtabs

xtabs(Frequency ~ Category, df)
# Category
# First Second  Third 
#    30      5     34 

Or if you want a data.frame back:

as.data.frame(xtabs(Frequency ~ Category, df))
#   Category Freq
# 1    First   30
# 2   Second    5
# 3    Third   34

@Csd 2018-11-18 14:11:14

You could use the function group.sum from package Rfast.

Category <- Rfast::as_integer(Category, result.sort = FALSE) # convert character to integer codes; base R's as.numeric() would produce NAs
result <- Rfast::group.sum(Frequency, Category)
names(result) <- Rfast::Sort(unique(Category))
# 30 5 34

Rfast has many group functions and group.sum is one of them.

@digEmAll 2018-12-10 16:55:13

I find ave very helpful (and efficient) when you need to apply different aggregation functions on different columns (and you must or want to stick to base R):


Given this input :

DF <- data.frame(Categ1  = c("A", "A", "B", "B", "A", "B", "A"),
                 Categ2  = c("X", "Y", "X", "X", "X", "Y", "Y"),
                 Samples = c(1, 2, 4, 3, 5, 6, 7),
                 Freq    = c(10, 30, 45, 55, 80, 65, 50))

> DF
  Categ1 Categ2 Samples Freq
1      A      X       1   10
2      A      Y       2   30
3      B      X       4   45
4      B      X       3   55
5      A      X       5   80
6      B      Y       6   65
7      A      Y       7   50

we want to group by Categ1 and Categ2 and compute the sum of Samples and mean of Freq.
Here's a possible solution using ave :

# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]

# add sum of Samples by Categ1,Categ2 to DF2 
# (ave repeats the sum of the group for each row in the same group)
DF2$GroupTotSamples <- ave(DF$Samples,DF2,FUN=sum)

# add mean of Freq by Categ1,Categ2 to DF2 
# (ave repeats the mean of the group for each row in the same group)
DF2$GroupAvgFreq <- ave(DF$Freq,DF2,FUN=mean)

# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]

Result :

> DF2
  Categ1 Categ2 GroupTotSamples GroupAvgFreq
1      A      X               6           45
2      A      Y               9           40
3      B      X               7           50
6      B      Y               6           65

@dmca 2018-12-03 00:53:58

The recently added dplyr::tally() now makes this easier than ever; to sum Frequency rather than count rows, pass it as the wt (weight) argument:

x %>% group_by(Category) %>% tally(wt = Frequency)

Category     n
First        30
Second       5
Third        34

@rcs 2009-11-02 12:52:46

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3), by=list(Category=x$Category), FUN=sum)

(Embedding @thelatemail's comment:) aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum)

Or if you want to aggregate multiple columns, you could use the . notation (works for one column too)

aggregate(. ~ Category, x, sum)

or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34 

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")),
                Frequency=c(10, 15, 5, 2, 14, 20, 3))

@r2evans 2016-12-19 04:35:12

@AndrewMcKinlay, R uses the tilde to define symbolic formulae, for statistics and other functions. It can be interpreted as "model Frequency by Category" or "Frequency depending on Category". Not all languages use a special operator to define a symbolic function, as done in R here. Perhaps with that "natural-language interpretation" of the tilde operator, it becomes more meaningful (and even intuitive). I personally find this symbolic formula representation better than some of the more verbose alternatives.

@Dodecaphone 2018-10-28 10:42:06

Being new to R (and asking the same sorts of questions as the OP), I would benefit from some more detail of the syntax behind each alternative. For instance, if I have a larger source table and want to subselect just two dimensions plus summed metrics, can I adapt any of these methods? Hard to tell.
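For the subselect-two-dimensions-plus-summed-metrics case asked about above, the formula interface of aggregate adapts directly: put the metric columns on the left of the ~ (wrapped in cbind) and the grouping dimensions on the right; every other column is simply ignored. A sketch with made-up column names (d1, d2, m1, m2 and bigtable are all hypothetical):

```r
# hypothetical wide table: two dimensions, two metrics, plus a column we don't care about
bigtable <- data.frame(d1    = c("a", "a", "b", "b", "a"),
                       d2    = c("x", "y", "x", "y", "x"),
                       m1    = c(1, 2, 3, 4, 5),
                       m2    = c(10, 20, 30, 40, 50),
                       other = letters[1:5])

# sum m1 and m2 within each (d1, d2) combination; 'other' is dropped automatically
aggregate(cbind(m1, m2) ~ d1 + d2, data = bigtable, FUN = sum)
```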

@Grant Shannon 2018-02-25 15:43:56

Using cast instead of recast (note 'Frequency' is now 'value'):

df  <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
                  , value = c(10,15,5,2,14,20,3))


library(reshape)  # cast() comes from the reshape package
result <- cast(df, Category ~ ., fun.aggregate = sum)

to get:

Category (all)
First     30
Second    5
Third     34

@asieira 2013-09-08 17:50:16

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative using the data.table package:

library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10, 15, 5, 2, 14, 20, 3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 

Let's compare that to the same thing using a data.frame and the aggregate call from above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10, 15, 5, 2, 14, 20, 3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 

And if you want to keep the original column name, this is the syntax:

data[, list(Frequency=sum(Frequency)), by = Category]

#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

The difference will become more noticeable with larger datasets, as the code below demonstrates:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rep(c(10, 15, 5), 100000))
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rep(c(10, 15, 5), 100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 

For multiple aggregations, you can combine lapply and .SD as follows

data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34

@Matt Dowle 2013-09-09 10:05:06

+1 But 0.296 vs 0.059 isn't particularly impressive. The data size needs to be much bigger than 300k rows, and with more than 3 groups, for data.table to shine. We'll try and support more than 2 billion rows soon for example, since some data.table users have 250GB of RAM and GNU R now supports length > 2^31.

@asieira 2013-10-23 23:22:16

True. Turns out I don't have all that RAM though, and was simply trying to provide some evidence of data.table's superior performance. I'm sure the difference would be even larger with more data.

@zazu 2015-11-14 19:10:11

I had 7 million observations; dplyr took 0.3 seconds and aggregate() took 22 seconds to complete the operation. I was going to post it on this topic and you beat me to it!

@Stophface 2017-02-22 11:47:22

There is an even shorter way to write this than data[, sum(Frequency), by = Category]: you could use .N, which substitutes for the sum() call, as in data[, .N, by = Category]. Here is a useful cheatsheet:…

@asieira 2017-03-01 13:26:41

Using .N would be equivalent to sum(Frequency) only if all the values in the Frequency column were equal to 1, because .N counts the number of rows in each aggregated set (.SD). And that is not the case here.
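The distinction is easy to check on the question's data (a sketch; .N yields per-group row counts, which differ from the summed frequencies here):

```r
library(data.table)

dt <- data.table(Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
                 Frequency = c(10, 15, 5, 2, 14, 20, 3))

dt[, .N, by = Category]               # row counts per group: 3, 2, 2
dt[, sum(Frequency), by = Category]   # sums per group: 30, 5, 34
```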

@joemienko 2016-05-17 12:12:56

While I have recently become a convert to dplyr for most of these types of operations, the sqldf package is still really nice (and IMHO more readable) for some things.

Here is an example of how this question can be answered with sqldf

library(sqldf)

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")),
                Frequency=c(10, 15, 5, 2, 14, 20, 3))

sqldf("select Category
             ,sum(Frequency) as Frequency
        from x
       group by Category")

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34

@Shane 2009-11-02 17:42:07

This is somewhat related to this question.

You can also just use the by() function:

x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind, as.list(x2))

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.

@learnr 2009-11-02 09:44:34

library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))

@Rob Hyndman 2009-11-02 09:38:41

If x is a dataframe with your data, then the following will do what you want:

library(reshape2)  # recast() lives in the reshape2 (formerly reshape) package
recast(x, Category ~ ., fun.aggregate = sum)
