#### [SOLVED] How to sum a variable by group?

Let's say I have two columns of data. The first contains categories such as "First", "Second", "Third", etc. The second has numbers which represent the number of times I saw "First".

For example:

``````Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3
``````

I want to sort the data by Category and sum the Frequencies:

``````Category     Frequency
First        30
Second       5
Third        34
``````

How would I do this in R?

#### @Csd 2018-11-18 14:11:14

You could use the function `group.sum` from package Rfast.

``````Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric. R's as.numeric produce NAs.
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category)
# 30 5 34
``````

Rfast has many group functions and `group.sum` is one of them.

#### @digEmAll 2018-12-10 16:55:13

I find `ave` very helpful (and efficient) when you need to apply different aggregation functions on different columns (and you must/want to stick on base R) :

e.g.

Given this input :

``````DF <-
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
Categ2=factor(c('X','Y','X','X','X','Y','Y')),
Samples=c(1,2,4,3,5,6,7),
Freq=c(10,30,45,55,80,65,50))

> DF
Categ1 Categ2 Samples Freq
1      A      X       1   10
2      A      Y       2   30
3      B      X       4   45
4      B      X       3   55
5      A      X       5   80
6      B      Y       6   65
7      A      Y       7   50
``````

we want to group by `Categ1` and `Categ2` and compute the sum of `Samples` and mean of `Freq`.
Here's a possible solution using `ave` :

``````# create a copy of DF (only the grouping columns)
DF2 <- DF[,c('Categ1','Categ2')]

# add sum of Samples by Categ1,Categ2 to DF2
# (ave repeats the sum of the group for each row in the same group)
DF2\$GroupTotSamples <- ave(DF\$Samples,DF2,FUN=sum)

# add mean of Freq by Categ1,Categ2 to DF2
# (ave repeats the mean of the group for each row in the same group)
DF2\$GroupAvgFreq <- ave(DF\$Freq,DF2,FUN=mean)

# remove the duplicates (keep only one row for each group)
DF2 <- DF2[!duplicated(DF2),]
``````

Result :

``````> DF2
Categ1 Categ2 GroupTotSamples GroupAvgFreq
1      A      X               6           45
2      A      Y               9           40
3      B      X               7           50
6      B      Y               6           65
``````

#### @dmca 2018-12-03 00:53:58

The recently added `dplyr::tally()` now makes this easier than ever:

``````tally(x, Category)

Category     n
First        30
Second       5
Third        34
``````

#### @rcs 2009-11-02 12:52:46

Using `aggregate`:

``````aggregate(x\$Frequency, by=list(Category=x\$Category), FUN=sum)
Category  x
1    First 30
2   Second  5
3    Third 34
``````

In the example above, multiple dimensions can be specified in the `list`. Multiple aggregated metrics of the same data type can be incorporated via `cbind`:

``````aggregate(cbind(x\$Frequency, x\$Metric2, x\$Metric3) ...
``````

(embedding @thelatemail comment), `aggregate` has a formula interface too

``````aggregate(Frequency ~ Category, x, sum)
``````

Or if you want to aggregate multiple columns, you could use the `.` notation (works for one column too)

``````aggregate(. ~ Category, x, sum)
``````

or `tapply`:

``````tapply(x\$Frequency, x\$Category, FUN=sum)
First Second  Third
30      5     34
``````

Using this data:

``````x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
``````

#### @r2evans 2016-12-19 04:35:12

@AndrewMcKinlay, R uses the tilde to define symbolic formulae, for statistics and other functions. It can be interpreted as "model Frequency by Category" or "Frequency depending on Category". Not all languages use a special operator to define a symbolic function, as done in R here. Perhaps with that "natural-language interpretation" of the tilde operator, it becomes more meaningful (and even intuitive). I personally find this symbolic formula representation better than some of the more verbose alternatives.

#### @Dodecaphone 2018-10-28 10:42:06

Being new to R (and asking the same sorts of questions as the OP), I would benefit from some more detail of the syntax behind each alternative. For instance, if I have a larger source table and want to subselect just two dimensions plus summed metrics, can I adapt any of these methods? Hard to tell.

#### @Grant Shannon 2018-02-25 15:43:56

using `cast` instead of `recast` (note `'Frequency'` is now `'value'`)

``````df  <- data.frame(Category = c("First","First","First","Second","Third","Third","Second")
, value = c(10,15,5,2,14,20,3))

install.packages("reshape")

result<-cast(df, Category ~ . ,fun.aggregate=sum)
``````

to get:

``````Category (all)
First     30
Second    5
Third     34
``````

#### @asieira 2013-09-08 17:50:16

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost there is a faster alternative:

``````library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed
# 0.008     0.001     0.009
``````

Let's compare that to the same thing using data.frame and the above above:

``````data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data\$Frequency, by=list(Category=data\$Category), FUN=sum))
# user    system   elapsed
# 0.008     0.000     0.015
``````

And if you want to keep the column this is the syntax:

``````data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
``````

The difference will become more noticeable with larger datasets, as the code below demonstrates:

``````data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed
# 0.055     0.004     0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( aggregate(data\$Frequency, by=list(Category=data\$Category), FUN=sum) )
# user    system   elapsed
# 0.287     0.010     0.296
``````

For multiple aggregations, you can combine `lapply` and `.SD` as follows

``````data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
``````

#### @Matt Dowle 2013-09-09 10:05:06

+1 But 0.296 vs 0.059 isn't particularly impressive. The data size needs to be much bigger than 300k rows, and with more than 3 groups, for data.table to shine. We'll try and support more than 2 billion rows soon for example, since some data.table users have 250GB of RAM and GNU R now supports length > 2^31.

#### @asieira 2013-10-23 23:22:16

True. Turns out I don't have all that RAM though, and was simply trying to provide some evidence of data.table's superior performance. I'm sure the difference would be even larger with more data.

#### @zazu 2015-11-14 19:10:11

I had 7 mil observations dplyr took .3 seconds and aggregate() took 22 seconds to complete the operation. I was going to post it on this topic and you beat me to it!

#### @Stophface 2017-02-22 11:47:22

There is a even shorter way to write this `data[, sum(Frequency), by = Category]`. You could use `.N` which substitutes the `sum()` function. `data[, .N, by = Category]`. Here is a useful cheatsheet: s3.amazonaws.com/assets.datacamp.com/img/blog/…

#### @asieira 2017-03-01 13:26:41

Using .N would be equivalent to sum(Frequency) only if all the values in the Frequency column were equal to 1, because .N counts the number of rows in each aggregated set (.SD). And that is not the case here.

#### @docendo discimus 2014-12-03 08:02:19

More recently, you can also use the dplyr package for that purpose:

``````library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34
``````

Or, for multiple summary columns (works with one column too):

``````x %>%
group_by(Category) %>%
summarise_each(funs(sum))
``````

Update for dplyr >= 0.5: `summarise_each` has been replaced by `summarise_all`, `summarise_at` and `summarise_if` family of functions in dplyr.

Or, if you have multiple columns to group by, you can specify all of them in the `group_by` separated with commas:

``````mtcars %>%
group_by(cyl, gear) %>%                            # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns
``````

For more information, including the `%>%` operator, see the introduction to dplyr.

#### @asieira 2015-01-23 14:35:58

How fast is it when compared to the data.table and aggregate alternatives presented in other answers?

#### @docendo discimus 2015-01-23 14:50:59

@asieira, Which is fastest and how big the difference (or if the difference is noticeable) is will always depend on your data size. Typically, for large data sets, for example some GB, data.table will most likely be fastest. On smaller data size, data.table and dplyr are often close, also depending on the number of groups. Both data,table and dplyr will be quite a lot faster than base functions, however (can well be 100-1000 times faster for some operations). Also see here

#### @joemienko 2016-05-17 12:12:56

While I have recently become a convert to `dplyr` for most of these types of operations, the `sqldf` package is still really nice (and IMHO more readable) for some things.

Here is an example of how this question can be answered with `sqldf`

``````x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))

sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34
``````

#### @David Arenburg 2015-09-10 13:36:01

Several years later, just to add another simple base R solution that isn't present here for some reason- `xtabs`

``````xtabs(Frequency ~ Category, df)
# Category
# First Second  Third
#    30      5     34
``````

Or if want a `data.frame` back

``````as.data.frame(xtabs(Frequency ~ Category, df))
#   Category Freq
# 1    First   30
# 2   Second    5
# 3    Third   34
``````

#### @dalloliogm 2009-11-02 12:15:52

Just to add a third option:

``````require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
``````

EDIT: this is a very old answer. Now I would recommend the use of group_by and summarise from dplyr, as in @docendo answer.

#### @Shane 2009-11-02 17:42:07

This is somewhat related to this question.

You can also just use the by() function:

``````x2 <- by(x\$Frequency, x\$Category, sum)
do.call(rbind,as.list(x2))
``````

Those other packages (plyr, reshape) have the benefit of returning a data.frame, but it's worth being familiar with by() since it's a base function.

#### @learnr 2009-11-02 09:44:34

``````library(plyr)
ddply(tbl, .(Category), summarise, sum = sum(Frequency))
``````

#### @Rob Hyndman 2009-11-02 09:38:41

If `x` is a dataframe with your data, then the following will do what you want:

``````require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)
``````

### [SOLVED] How to make a great R reproducible example

• 2011-05-11 11:12:02
• Andrie
• 244214 View
• 2481 Score
• Tags:   r r-faq