By Ianthe


2014-02-24 08:56:59 8 Comments

I have a data.frame and I need to calculate the mean per group (i.e. per Month, below).

Name     Month  Rate1     Rate2
Aira       1      12        23
Aira       2      18        73
Aira       3      19        45
Ben        1      53        19
Ben        2      22        87
Ben        3      19        45
Cat        1      22        87
Cat        2      67        43
Cat        3      45        32

My desired output is like below, where the values for Rate1 and Rate2 are the group means. Please disregard the value, I have made it up for the example.

Name       Rate1       Rate2
Aira        23.21       12.2
Ben         45.23       43.9
Cat         33.22       32.2

8 comments

@Sam Firke 2015-02-23 15:25:45

Or use group_by & summarise_at from the dplyr package:

library(dplyr)

d %>%
  group_by(Name) %>%
  summarise_at(vars(-Month), funs(mean(., na.rm=TRUE)))

# A tibble: 3 x 3
  Name  Rate1 Rate2
  <fct> <dbl> <dbl>
1 Aira   16.3  47.0
2 Ben    31.3  50.3
3 Cat    44.7  54.0

See ?summarise_at for the many ways to specify the variables to act on. Here, vars(-Month) says all variables except Month.

@user1883491 2017-12-04 22:22:25

summarise _each is deprecated in new version. What should I use intead? I tried summarise_at or _if or_all. but didn't work .

@CER 2018-01-19 19:51:12

try d %>% group_by(Name) %>% summarise_at(.vars = names(.)[3:4],.funs = c(mean="mean"))

@Becky 2016-02-05 01:58:42

You could also use the generic function cbind() and lm() without the intercept:

cbind(lm(d$Rate1~-1+d$Name)$coef,lm(d$Rate2~-1+d$Name)$coef)
>               [,1]     [,2]
>d$NameAira 16.33333 47.00000
>d$NameBen  31.33333 50.33333
>d$NameCat  44.66667 54.00000

@duHaas 2014-02-25 03:57:17

A third great alternative is using the package data.table, which also has the class data.frame, but operations like you are looking for are computed much faster.

library(data.table)
mydt <- structure(list(Name = c("Aira", "Aira", "Aira", "Ben", "Ben", "Ben", "Cat", "Cat", "Cat"), Month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Rate1 = c(15.6396600443877, 2.15649279424609, 6.24692918928743, 2.37658797276116, 34.7500663272292, 3.28750138697048, 29.3265553981065, 17.9821839334431, 10.8639802575958), Rate2 = c(17.1680489538369, 5.84231656330206, 8.54330866437461, 5.88415184986176, 3.02064294862551, 17.2053351400752, 16.9552950199166, 2.56058000170089, 15.7496228048122)), .Names = c("Name", "Month", "Rate1", "Rate2"), row.names = c(NA, -9L), class = c("data.table", "data.frame"))

Now to take the mean of Rate1 and Rate2 for all 3 months, for each person (Name): First, decide which columns you want to take the mean of

colstoavg <- names(mydt)[3:4]

Now we use lapply to take the mean over the columns we want to avg (colstoavg)

mydt.mean <- mydt[,lapply(.SD,mean,na.rm=TRUE),by=Name,.SDcols=colstoavg]

 mydt.mean
   Name     Rate1     Rate2
1: Aira  8.014361 10.517891
2:  Ben 13.471385  8.703377
3:  Cat 19.390907 11.755166

@joemienko 2016-05-17 12:30:02

You can also accomplish this using the sqldf package as shown below:

library(sqldf)

x <- read.table(text='Name     Month  Rate1     Rate2
Aira       1      12        23
                Aira       2      18        73
                Aira       3      19        45
                Ben        1      53        19
                Ben        2      22        87
                Ben        3      19        45
                Cat        1      22        87
                Cat        2      67        43
                Cat        3      45        32', header=TRUE)

sqldf("
select 
  Name
  ,avg(Rate1) as Rate1_float
  ,avg(Rate2) as Rate2_float
  ,avg(Rate1) as Rate1
  ,avg(Rate2) as Rate2
from x
group by 
  Name
")

#  Name Rate1_float Rate2_float Rate1 Rate2
#1 Aira    16.33333    47.00000    16    47
#2  Ben    31.33333    50.33333    31    50
#3  Cat    44.66667    54.00000    44    54

I am a recent convert to dplyr as shown in other answers, but sqldf is nice as most data analysts/data scientists/developers have at least some fluency in SQL. In this way, I think it tends to make for more universally readable code than dplyr or other solutions presented above.

UPDATE: In responding to the comment below, I attempted to update the code as shown above. However, the behavior was not as I expected. It seems that the column definition (i.e. int vs float) is only carried through when the column alias matches the original column name. When you specify a new name, the aggregate column is returned without rounding.

@PatrickT 2016-06-19 07:23:13

could you comment on the rounding?

@joemienko 2016-06-21 16:21:02

Hi @partickt (and sorry for the delay), this is a nuance of SQL that is described here: stackoverflow.com/questions/18493976/…. As you can see, since the SQL is starting with an int, it is keeping it as an int. avg(cast(Ratei as float)) should get a decimal value that could wrap with round as needed. avg(Ratei*1.0) should also coerce to a float...have not run either

@PatrickT 2016-06-22 09:27:11

Thanks joemienko! I tried your suggestions but I couldn't make them work. However, please see my edit and consider if you'd like to keep it. I have used different indentations as yours, I'm not familiar with sqldf so if there is a standard way to indent, please edit. Thanks!

@joemienko 2016-06-22 13:37:24

@partickt, I was mistaken! See the updated answer above.

@joemienko 2016-06-22 22:12:49

pesky commas, try it now!

@PatrickT 2016-06-23 05:54:08

It's working! You could comment that getting a float is as easy as appending _float to the variable name, it's a convenient feature!

@Mark Miller 2016-06-20 03:55:33

Here are a variety of ways to do this in base R including an alternative aggregate approach. The examples below return means per month, which I think is what you requested. Although, the same approach could be used to return means per person:

Using ave:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

Rate1.mean <- with(my.data, ave(Rate1, Month, FUN = function(x) mean(x, na.rm = TRUE)))
Rate2.mean <- with(my.data, ave(Rate2, Month, FUN = function(x) mean(x, na.rm = TRUE)))

my.data <- data.frame(my.data, Rate1.mean, Rate2.mean)
my.data

Using by:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

by.month <- as.data.frame(do.call("rbind", by(my.data, my.data$Month, FUN = function(x) colMeans(x[,3:4]))))
colnames(by.month) <- c('Rate1.mean', 'Rate2.mean')
by.month <- cbind(Month = rownames(by.month), by.month)

my.data <- merge(my.data, by.month, by = 'Month')
my.data

Using lapply and split:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

ly.mean <- lapply(split(my.data, my.data$Month), function(x) c(Mean = colMeans(x[,3:4])))
ly.mean <- as.data.frame(do.call("rbind", ly.mean))
ly.mean <- cbind(Month = rownames(ly.mean), ly.mean)

my.data <- merge(my.data, ly.mean, by = 'Month')
my.data

Using sapply and split:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')
my.data

sy.mean <- t(sapply(split(my.data, my.data$Month), function(x) colMeans(x[,3:4])))
colnames(sy.mean) <- c('Rate1.mean', 'Rate2.mean')
sy.mean <- data.frame(Month = rownames(sy.mean), sy.mean, stringsAsFactors = FALSE)
my.data <- merge(my.data, sy.mean, by = 'Month')
my.data

Using aggregate:

my.data <- read.table(text = '
     Name     Month  Rate1     Rate2
     Aira       1      12        23
     Aira       2      18        73
     Aira       3      19        45
     Ben        1      53        19
     Ben        2      22        87
     Ben        3      19        45
     Cat        1      22        87
     Cat        2      67        43
     Cat        3      45        32
', header = TRUE, stringsAsFactors = FALSE, na.strings = 'NA')

my.summary <- with(my.data, aggregate(list(Rate1, Rate2), by = list(Month), 
                   FUN = function(x) { mon.mean = mean(x, na.rm = TRUE) } ))

my.summary <- do.call(data.frame, my.summary)
colnames(my.summary) <- c('Month', 'Rate1.mean', 'Rate2.mean')
my.summary

my.data <- merge(my.data, my.summary, by = 'Month')
my.data

@user6376316 2016-05-25 17:21:06

I describe two ways to do this, one based on data.table and the other based on reshape2 package . The data.table way already has an answer, but I have tried to make it cleaner and more detailed.

The data is like this:

 d <- structure(list(Name = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 
3L, 3L), .Label = c("Aira", "Ben", "Cat"), class = "factor"), 
    Month = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), Rate1 = c(12L, 
    18L, 19L, 53L, 22L, 19L, 22L, 67L, 45L), Rate2 = c(23L, 73L, 
    45L, 19L, 87L, 45L, 87L, 43L, 32L)), .Names = c("Name", "Month", 
"Rate1", "Rate2"), class = "data.frame", row.names = c(NA, -9L
))
head(d)
  Name Month Rate1 Rate2
1 Aira     1    12    23
2 Aira     2    18    73
3 Aira     3    19    45
4  Ben     1    53    19
5  Ben     2    22    87
6  Ben     3    19    45


library("reshape2")
mym <- melt(d, id = c("Name"))
res <- dcast(mym, Name ~ variable, mean)
res
#Name Month    Rate1    Rate2
#1 Aira     2 16.33333 47.00000
#2  Ben     2 31.33333 50.33333
#3  Cat     2 44.66667 54.00000

Using data.table:

# At first, I convert the data.frame to data.table and then I group it 
setDT(d)
d[, .(Rate1 = mean(Rate1), Rate2 = mean(Rate2)), by = .(Name)]
#   Name    Rate1    Rate2
#1: Aira 16.33333 47.00000
#2:  Ben 31.33333 50.33333
#3:  Cat 44.66667 54.00000

There is another way of doing it by avoiding to write many argument for j in data.table using a .SD

d[, lapply(.SD, mean), by = .(Name)]
#   Name Month    Rate1    Rate2
#1: Aira     2 16.33333 47.00000
#2:  Ben     2 31.33333 50.33333
#3:  Cat     2 44.66667 54.00000

if we only want to have Rate1 and Rate2 then we can use the .SDcols as follows:

d[, lapply(.SD, mean), by = .(Name), .SDcols = 3:4]
#  Name    Rate1    Rate2
#1: Aira 16.33333 47.00000
#2:  Ben 31.33333 50.33333
#3:  Cat 44.66667 54.00000

@Arun 2016-05-25 18:41:22

Just realised there's already an answer though: stackoverflow.com/a/22004271/559784

@jbaums 2014-02-24 09:01:06

This type of operation is exactly what aggregate was designed for:

d <- read.table(text='Name     Month  Rate1     Rate2
Aira       1      12        23
Aira       2      18        73
Aira       3      19        45
Ben        1      53        19
Ben        2      22        87
Ben        3      19        45
Cat        1      22        87
Cat        2      67        43
Cat        3      45        32', header=TRUE)

aggregate(d[, 3:4], list(d$Name), mean)

  Group.1    Rate1    Rate2
1    Aira 16.33333 47.00000
2     Ben 31.33333 50.33333
3     Cat 44.66667 54.00000

Here we aggregate columns 3 and 4 of data.frame d, grouping by d$Name, and applying the mean function.


Or, using a formula interface:

aggregate(. ~ Name, d[-2], mean)

@Ianthe 2014-02-24 09:16:58

Thanks jbaums. Can I change the Group.1 to Name?

@jbaums 2014-02-24 09:17:28

Yes, by changing it to aggregate(d[, 3:4], list(Name=d$Name), mean)

@jdepypere 2015-05-11 23:12:50

When using aggregate(as.numeric(matrix$value), list(matrix$hour), mean) I am getting some NaN-values for some reason. Checking my data.table for is.nan() and is.na() however does not display any results. Any thoughts?

@jbaums 2015-05-11 23:25:51

@jdepypere - can't think why, but it might pay to take a careful look at the elements of split(matrix$value, matrix$hour) that correspond to those that return NaN from your aggregate (i.e. split(matrix$value, matrix$hour)[is.nan(aggregate(as.numeric(matrix$value), list(matrix$hour), mean)[, 2])])

@jdepypere 2015-05-15 10:51:03

@jbaums - apparantly something was wrong in my code and there was indeed some NaN-values. Thanks for the quick fix!

@Superbest 2015-11-09 20:51:07

What is the . in the formula version?

@jbaums 2015-11-09 20:58:51

@Superbest the . represents all elements (columns of the data frame) other than the thing on the right hand side of the ~.

@Superbest 2015-11-09 21:00:36

Thanks! Also, can this be used to get the number of rows instead of their mean?

@jbaums 2015-11-09 21:02:34

@Superbest yes, substitute length for mean.

@PatrickT 2016-06-19 07:04:17

Great answer. Either aggregate(d[, 3:4], list(d[,1]), mean) or aggregate(d[, c('Rate1', 'Rate2')], list(d[, c('Name')]), mean) is more "consistent", imho.

@Jinhua Wang 2019-04-20 13:57:14

Thanks! This looks very elegant!

@Zbynek 2014-02-24 09:06:00

You can also use package plyr, which is somehow more versatile:

library(plyr)

ddply(d, .(Name), summarize,  Rate1=mean(Rate1), Rate2=mean(Rate2))

  Name    Rate1    Rate2
1 Aira 16.33333 47.00000
2  Ben 31.33333 50.33333
3  Cat 44.66667 54.00000

Related Questions

Sponsored Content

16 Answered Questions

[SOLVED] Remove rows with all or some NAs (missing values) in data.frame

5 Answered Questions

[SOLVED] Pass a data.frame column name to a function

  • 2010-04-14 23:04:18
  • kmm
  • 87671 View
  • 92 Score
  • 5 Answer
  • Tags:   r dataframe r-faq

13 Answered Questions

[SOLVED] Group By Multiple Columns

3 Answered Questions

[SOLVED] data.table vs dplyr: can one do something well the other can't or does poorly?

  • 2014-01-29 15:21:45
  • BrodieG
  • 101177 View
  • 666 Score
  • 3 Answer
  • Tags:   r data.table dplyr

9 Answered Questions

[SOLVED] Grouping functions (tapply, by, aggregate) and the *apply family

4 Answered Questions

[SOLVED] mean of n rows by grouping another column in r

1 Answered Questions

[SOLVED] Replace the calculated group mean back into dataframe

  • 2016-12-28 10:36:08
  • Hardik gupta
  • 45 View
  • 0 Score
  • 1 Answer
  • Tags:   r aggregate

5 Answered Questions

[SOLVED] Mean per group in a column, result per row

  • 2015-05-12 02:02:04
  • apples-oranges
  • 498 View
  • 0 Score
  • 5 Answer
  • Tags:   r dataframe mean

1 Answered Questions

[SOLVED] Find the mean of y per grouped value of x

  • 2015-01-23 00:21:29
  • Hans
  • 674 View
  • 0 Score
  • 1 Answer
  • Tags:   r

1 Answered Questions

[SOLVED] Function to store data.frames and calculate mean?

Sponsored Content