By MikeTP


2012-03-15 15:44:54 8 Comments

From a data frame, is there an easy way to aggregate (sum, mean, max, etc.) multiple variables simultaneously?

Below are some sample data:

library(lubridate)
days = 365*2
date = seq(as.Date("2000-01-01"), length = days, by = "day")
year = year(date)
month = month(date)
x1 = cumsum(rnorm(days, 0.05)) 
x2 = cumsum(rnorm(days, 0.05))
df1 = data.frame(date, year, month, x1, x2)

I would like to simultaneously aggregate the x1 and x2 variables from the df1 data frame by year and month. The following code aggregates the x1 variable, but is it also possible to aggregate the x2 variable at the same time?

### aggregate variables by year month
df2=aggregate(x1 ~ year+month, data=df1, sum, na.rm=TRUE)
head(df2)

Any suggestions would be greatly appreciated.

6 comments

@Jozef 2018-12-27 15:18:36

Interestingly, the data.frame method of base R's aggregate is not showcased here; the answers above use the formula interface. So, for completeness:

aggregate(
  x = df1[c("x1", "x2")],
  by = df1[c("year", "month")],
  FUN = sum, na.rm = TRUE
)

More generic use of aggregate's data.frame method:

Since we are providing

  • a data.frame as x and
  • a list (a data.frame is also a list) as by,

this is very useful when the call has to be built dynamically: the columns to aggregate, the columns to aggregate by, and even a custom-made aggregation function can each be swapped by changing a single variable.

For example:

colsToAggregate <- c("x1")
aggregateBy <- c("year", "month")
dummyaggfun <- function(v, na.rm = TRUE) {
  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}

aggregate(df1[colsToAggregate], by = df1[aggregateBy], FUN = dummyaggfun)
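One thing to be aware of with a vector-returning function such as dummyaggfun: aggregate stores the results for each aggregated column as a matrix column, so the output has a single x1 column holding a two-column matrix. A minimal sketch of flattening it with do.call(data.frame, ...), using a small made-up data frame (toy is an assumption for illustration, not the df1 from the question):

```r
# Toy data, invented for illustration only
toy <- data.frame(
  year  = c(2000, 2000, 2001, 2001),
  month = c(1, 1, 1, 1),
  x1    = c(1, 2, 3, 5)
)

dummyaggfun <- function(v, na.rm = TRUE) {
  c(sum = sum(v, na.rm = na.rm), mean = mean(v, na.rm = na.rm))
}

res <- aggregate(toy["x1"], by = toy[c("year", "month")], FUN = dummyaggfun)
ncol(res)  # year, month, and one matrix column named x1

# Flatten the matrix column into ordinary columns x1.sum and x1.mean
flat <- do.call(data.frame, res)
names(flat)
```

After flattening, the summaries can be used like any other column, e.g. flat$x1.mean.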

@britt 2018-08-15 16:22:53

Late to the party, but recently found another way to get the summary statistics.

library(psych)
describe(data)

This will output the mean, min, max, standard deviation, n, standard error, kurtosis, skewness, median, and range for each variable.

@EDi 2012-03-15 15:56:53

Where is this year() function from?

You could also use the reshape2 package for this task:

require(reshape2)
df_melt <- melt(df1, id = c("date", "year", "month"))
dcast(df_melt, year + month ~ variable, sum)
#   year month         x1           x2
# 1 2000     1  -80.83405 -224.9540159
# 2 2000     2 -223.76331 -288.2418017
# 3 2000     3 -188.83930 -481.5601913
# 4 2000     4 -197.47797 -473.7137420
# 5 2000     5 -259.07928 -372.4563522

@Jaap 2016-05-13 06:17:18

The recast function (also from reshape2) combines the melt and dcast steps in one call for tasks like this: recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

@Jaap 2015-10-16 10:19:12

With the dplyr package, you can use summarise_all, summarise_at or summarise_if functions to aggregate multiple variables simultaneously. For the example dataset you can do this as follows:

library(dplyr)
# summarising all non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_all(sum)

# summarising a specific set of non-grouping variables
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(x1, x2), sum)
df2 <- df1 %>% group_by(year, month) %>% summarise_at(vars(-date), sum)

# summarising a specific set of non-grouping variables based on condition (class)
df2 <- df1 %>% group_by(year, month) %>% summarise_if(is.numeric, sum)

The result of the latter two options:

    year month        x1         x2
   <dbl> <dbl>     <dbl>      <dbl>
1   2000     1 -73.58134  -92.78595
2   2000     2 -57.81334 -152.36983
3   2000     3 122.68758  153.55243
4   2000     4 450.24980  285.56374
5   2000     5 678.37867  384.42888
6   2000     6 792.68696  530.28694
7   2000     7 908.58795  452.31222
8   2000     8 710.69928  719.35225
9   2000     9 725.06079  914.93687
10  2000    10 770.60304  863.39337
# ... with 14 more rows

Note: summarise_each is deprecated in favor of summarise_all, summarise_at and summarise_if.
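In dplyr 1.0.0 and later, the scoped verbs above are themselves superseded by across(). A hedged sketch of the equivalent calls on a small invented data frame (toy is an assumption for illustration, not the df1 from the question); passing a named list of functions also covers the case of several summaries per column:

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  year  = c(2000, 2000, 2000, 2001),
  month = c(1, 1, 2, 1),
  x1    = c(1, 2, 3, 4),
  x2    = c(10, 20, 30, 40)
)

# Same idea as summarise_at(vars(x1, x2), sum)
sums <- toy %>%
  group_by(year, month) %>%
  summarise(across(c(x1, x2), sum), .groups = "drop")

# Several functions per column; result columns are named x1_mean, x1_max, ...
stats <- toy %>%
  group_by(year, month) %>%
  summarise(across(c(x1, x2), list(mean = mean, max = max)), .groups = "drop")
```

The .groups = "drop" argument returns an ungrouped result; omit it if the grouping should be kept for further steps.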


As mentioned in my comment above, you can also use the recast function from the reshape2-package:

library(reshape2)
recast(df1, year + month ~ variable, sum, id.var = c("date", "year", "month"))

which will give you the same result.

@Andrie 2012-03-15 15:50:01

Yes, in your formula, you can cbind the numeric variables to be aggregated:

aggregate(cbind(x1, x2) ~ year + month, data = df1, sum, na.rm = TRUE)
   year month         x1          x2
1  2000     1   7.862002   -7.469298
2  2001     1 276.758209  474.384252
3  2000     2  13.122369 -128.122613
...
23 2000    12  63.436507  449.794454
24 2001    12 999.472226  922.726589

See ?aggregate, the formula argument and the examples.

@pdb 2015-11-13 05:29:28

Is it possible for the cbind to use dynamic variables?
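One possible way to do this is to build the formula from character vectors. A minimal sketch, where toy, vars and groups are names made up for this example:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  g  = c("a", "a", "b"),
  x1 = c(1, 2, 3),
  x2 = c(10, 20, 30)
)

vars   <- c("x1", "x2")  # columns to aggregate
groups <- "g"            # columns to group by

# Assemble "cbind(x1, x2) ~ g" as text, then convert it to a formula
f <- as.formula(paste0(
  "cbind(", paste(vars, collapse = ", "), ") ~ ",
  paste(groups, collapse = " + ")
))

res <- aggregate(f, data = toy, FUN = sum)
res
```

The data.frame method shown in another answer, aggregate(toy[vars], by = toy[groups], FUN = sum), avoids building a formula altogether.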

@pdb 2015-11-13 06:19:09

It's worth noting that when any of the variables that is in the cbind has an NA the row will be dropped for every variable in the cbind. This is not the behavior I was expecting.
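This comes from the formula method's default na.action = na.omit, which removes incomplete rows before FUN is ever called, so na.rm = TRUE never sees the NA. A small sketch of the difference, using an invented data frame (toy is an assumption for illustration); na.action = na.pass keeps the rows and leaves NA handling to FUN:

```r
# Toy data, invented for illustration only: one NA in x2
toy <- data.frame(
  g  = c("a", "a", "b", "b"),
  x1 = c(1, 2, 3, 4),
  x2 = c(10, NA, 30, 40)
)

# Default: the second row is dropped for x1 as well, so x1 for "a" is 1
res_default <- aggregate(cbind(x1, x2) ~ g, data = toy, FUN = sum, na.rm = TRUE)

# na.pass keeps every row; na.rm = TRUE then skips NAs column by column
res_pass <- aggregate(cbind(x1, x2) ~ g, data = toy, FUN = sum,
                      na.rm = TRUE, na.action = na.pass)

res_default
res_pass
```

With na.pass, x1 for group "a" is summed over both rows, while x2 for that group uses only its non-missing value.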

@Clock Slave 2016-03-16 11:22:07

What if, instead of x1 and x2, I want to use all the remaining variables (other than year and month)?

@A5C1D2H2I1M1N2O1R2T1 2016-03-21 03:53:44

@ClockSlave, then you need to just use . on the LHS. aggregate(. ~ year + month, df1, sum, na.rm = TRUE). In this example, sum for "date" doesn't make sense though....
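Since summing a Date column is not meaningful, one option is to drop it from the data before using . on the left-hand side. A minimal sketch on an invented data frame (toy is an assumption for illustration):

```r
# Toy data, invented for illustration only
toy <- data.frame(
  date  = as.Date(c("2000-01-01", "2000-01-02", "2000-02-01", "2000-02-02")),
  year  = c(2000, 2000, 2000, 2000),
  month = c(1, 1, 2, 2),
  x1    = c(1, 2, 3, 4),
  x2    = c(10, 20, 30, 40)
)

# Exclude "date" so that . expands to x1 and x2 only
res <- aggregate(. ~ year + month,
                 data = toy[setdiff(names(toy), "date")],
                 FUN = sum)
res
```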

@skan 2016-04-14 19:15:13

What if I don't want two variables but two functions? For example, mean and sd.

@DatamineR 2017-06-23 16:03:17

In the case of NAs this approach is really problematic. Setting na.rm = TRUE does not affect anything and the NA cases are ignored...

@lmo 2017-07-13 02:05:15

@andrie. The use of . in the formula interface mentioned recently in the comments is probably worth adding to the answer.

@theforestecologist 2018-04-30 18:50:55

Is there a way to perform different functions (e.g., mean, max, min ,etc.) to each of the different variables in cbind?
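One base R approach, sketched here under the assumption that each variable needs exactly one function, is to aggregate each variable separately and merge the results on the grouping column. The names toy, x1_mean and x2_max are made up for this example:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  g  = c("a", "a", "b", "b"),
  x1 = c(1, 3, 5, 7),
  x2 = c(10, 40, 20, 30)
)

x1_mean <- aggregate(x1 ~ g, data = toy, FUN = mean)  # mean of x1 per group
x2_max  <- aggregate(x2 ~ g, data = toy, FUN = max)   # max of x2 per group

# Combine the two summaries on the grouping column
res <- merge(x1_mean, x2_max, by = "g")
res
```

With several grouping columns, pass all of them to by in the merge call.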

@numbercruncher 2012-03-15 23:00:07

Using the data.table package, which is fast (useful for larger datasets)

https://github.com/Rdatatable/data.table/wiki

library(data.table)
df2 <- setDT(df1)[, lapply(.SD, sum), by=.(year, month), .SDcols=c("x1","x2")]
setDF(df2) # convert back to dataframe

Using the plyr package

require(plyr)
df2 <- ddply(df1, c("year", "month"), function(x) colSums(x[c("x1", "x2")]))

Using summarize() from the Hmisc package (column headings are messy in my example though)

# need to detach plyr because plyr and Hmisc both have a summarize()
detach(package:plyr)
require(Hmisc)
df2 <- with(df1, summarize( cbind(x1, x2), by=llist(year, month), FUN=colSums))

@Bulat 2018-10-13 12:00:09

Why not do this for the data.table option: dt[, .(x1.sum = sum(x1), x2.sum = sum(x2)), by = .(year, month)]?
