By Ed Swindelles


2014-06-05 21:45:16 8 Comments

I wish to (1) group data by one variable (State), (2) within each group find the row of minimum value of another variable (Employees), and (3) extract the entire row.

(1) and (2) are easy one-liners, and I feel like (3) should be too, but I can't get it.

Here is a sample data set:

> data
  State Company Employees
1    AK       A        82
2    AK       B       104
3    AK       C        37
4    AK       D        24
5    RI       E        19
6    RI       F       118
7    RI       G        88
8    RI       H        42

data <- structure(list(State = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
        2L), .Label = c("AK", "RI"), class = "factor"), Company = structure(1:8, .Label = c("A", 
        "B", "C", "D", "E", "F", "G", "H"), class = "factor"), Employees = c(82L, 
        104L, 37L, 24L, 19L, 118L, 88L, 42L)), .Names = c("State", "Company", 
        "Employees"), class = "data.frame", row.names = c(NA, -8L))

Calculate min by group is easy, using aggregate:

> aggregate(Employees ~ State, data, function(x) min(x))
  State Employees
1    AK        24
2    RI        19

...or data.table:

> library(data.table)
> DT <- data.table(data)
> DT[ , list(Employees = min(Employees)), by = State]
   State Employees
1:    AK        24
2:    RI        19

But how do I extract the entire row corresponding to these min values, i.e. also including Company in the result?

5 comments

@David Arenburg 2017-01-24 20:27:00

As this is Google's top hit, I thought I would add some additional options which I find useful to know. The idea is basically to arrange once by Employees and then just take the uniques per State

Either using data.table

library(data.table)
unique(setDT(data)[order(Employees)], by = "State")
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Alternatively, we could also first order and then subset .SD. Both of those operations were optimized in the resent data.table versions and order is seemingly triggers data.table:::forderv, while .SD[1L] triggers Gforce

setDT(data)[order(Employees), .SD[1L], by = State, verbose = TRUE] # <- Added verbose
# order optimisation is on, i changed from 'order(...)' to 'forder(DT, ...)'.
# i clause present and columns used in by detected, only these subset: State 
# Finding groups using forderv ... 0 sec
# Finding group sizes from the positions (can be avoided to save RAM) ... 0 sec
# Getting back original order ... 0 sec
# lapply optimization changed j from '.SD[1L]' to 'list(Company[1L], Employees[1L])'
# GForce optimized j to 'list(`g[`(Company, 1L), `g[`(Employees, 1L))'
# Making each group and running j (GForce TRUE) ... 0 secs
#    State Company Employees
# 1:    RI       E        19
# 2:    AK       D        24

Or dplyr

library(dplyr)
data %>% 
  arrange(Employees) %>% 
  distinct(State, .keep_all = TRUE)
#   State Company Employees
# 1    RI       E        19
# 2    AK       D        24

Another interesting idea borrowed from @Khashaas awesome answer (with a small modification in form of mult = "first" in order to handle multiple matches) is to first find minimum per group and then perform a binary join back. The advantage of this is both the utilization of data.tables gmin function (which skips the evaluation overhead) and the binary join feature

tmp <- setDT(data)[, .(Employees = min(Employees)), by = State]
data[tmp, on = .(State, Employees), mult = "first"]
#    State Company Employees
# 1:    AK       D        24
# 2:    RI       E        19

Some benchmarks

library(data.table)
library(dplyr)
library(plyr)
library(stringi)
library(microbenchmark)

set.seed(123)
N <- 1e6
data <- data.frame(State = stri_rand_strings(N, 2, '[A-Z]'),
                   Employees = sample(N*10, N, replace = TRUE))
DT <- copy(data)
setDT(DT)
DT2 <- copy(DT)
str(DT)
str(DT2)

microbenchmark("(data.table) .SD[which.min]: " = DT[ , .SD[which.min(Employees)], by = State],
               "(data.table) .I[which.min]: " = DT[DT[ , .I[which.min(Employees)], by = State]$V1],
               "(data.table) order/unique: " = unique(DT[order(Employees)], by = "State"),
               "(data.table) order/.SD[1L]: " = DT[order(Employees), .SD[1L], by = State],
               "(data.table) self join (on):" = {
                 tmp <- DT[, .(Employees = min(Employees)), by = State]
                 DT[tmp, on = .(State, Employees), mult = "first"]},
               "(data.table) self join (setkey):" = {
                 tmp <- DT2[, .(Employees = min(Employees)), by = State] 
                 setkey(tmp, State, Employees)
                 setkey(DT2, State, Employees)
                 DT2[tmp, mult = "first"]},
               "(dplyr) slice(which.min): " = data %>% group_by(State) %>% slice(which.min(Employees)),
               "(dplyr) arrange/distinct: " = data %>% arrange(Employees) %>% distinct(State, .keep_all = TRUE),
               "(dplyr) arrange/group_by/slice: " = data %>% arrange(Employees) %>% group_by(State) %>% slice(1),
               "(plyr) ddply/which.min: " = ddply(data, .(State), function(x) x[which.min(x$Employees),]),
               "(base) by: " = do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ])))


# Unit: milliseconds
#                             expr        min         lq       mean     median         uq       max neval      cld
#    (data.table) .SD[which.min]:   119.66086  125.49202  145.57369  129.61172  152.02872  267.5713   100    d    
#     (data.table) .I[which.min]:    12.84948   13.66673   19.51432   13.97584   15.17900  109.5438   100 a       
#      (data.table) order/unique:    52.91915   54.63989   64.39212   59.15254   61.71133  177.1248   100  b      
#     (data.table) order/.SD[1L]:    51.41872   53.22794   58.17123   55.00228   59.00966  145.0341   100  b      
#     (data.table) self join (on):   44.37256   45.67364   50.32378   46.24578   50.69411  137.4724   100  b      
# (data.table) self join (setkey):   14.30543   15.28924   18.63739   15.58667   16.01017  106.0069   100 a       
#       (dplyr) slice(which.min):    82.60453   83.64146   94.06307   84.82078   90.09772  186.0848   100   c     
#       (dplyr) arrange/distinct:   344.81603  360.09167  385.52661  379.55676  395.29463  491.3893   100     e   
# (dplyr) arrange/group_by/slice:   367.95924  383.52719  414.99081  397.93646  425.92478  557.9553   100      f  
#         (plyr) ddply/which.min:   506.55354  530.22569  568.99493  552.65068  601.04582  727.9248   100       g 
#                      (base) by:  1220.38286 1291.70601 1340.56985 1344.86291 1382.38067 1512.5377   100        h

@C8H10N4O2 2017-02-07 20:44:12

Nicely done. Just for completeness / curiosity, I provided a correct plyr solution. Feel free to include in benchmark... I don't expect it will stand up to data.table though.

@David Arenburg 2017-02-13 13:24:13

@C8H10N4O2 Updated.

@C8H10N4O2 2017-02-07 20:42:21

Corrected plyr solution:

ddply(df, .(State), function(x) x[which.min(x$Employees),])
#   State Company Employees
# 1    AK       D        24
# 2    RI       E        19

thanks to @joel.wilson

@Señor O 2014-06-05 21:50:08

Slightly more elegant:

library(data.table)
DT[ , .SD[which.min(Employees)], by = State]

   State Company Employees
1:    AK       D        24
2:    RI       E        19

Slighly less elegant than using .SD, but a bit faster (for data with many groups):

DT[DT[ , .I[which.min(Employees)], by = State]$V1]

Also, just replace the expression which.min(Employees) with Employees == min(Employees), if your data set has multiple identical min values and you'd like to subset all of them.

See also Subset by group with data.table.

@Abhishek Singh 2017-05-24 05:16:32

How to get it if the number should return the minimum but based on the number of frequency that number has been used?

@agstudy 2014-06-05 22:08:00

A dplyr solution :

library(dplyr)    
data %>% 
    group_by(State) %>% 
    slice(which.min(Employees))

@MrFlick 2014-06-05 21:55:19

The base function by is often useful for working with block data in data.frames. For example

by(data, data$State, function(x) x[which.min(x$Employees), ] )

It does return the data in a list, but you can collapse that with

do.call(rbind, by(data, data$State, function(x) x[which.min(x$Employees), ] ))

Related Questions

Sponsored Content

3 Answered Questions

[SOLVED] How do I label the group for consecutive pattern in R?

  • 2019-08-12 01:00:17
  • Juhyeon
  • 59 View
  • 2 Score
  • 3 Answer
  • Tags:   r label

1 Answered Questions

[SOLVED] How do I join repeatedly the tables in r?

  • 2019-07-17 02:18:34
  • Juhyeon
  • 52 View
  • 1 Score
  • 1 Answer
  • Tags:   r join left-join

16 Answered Questions

[SOLVED] Remove rows with all or some NAs (missing values) in data.frame

1 Answered Questions

[SOLVED] R- How to use map() into map()

  • 2018-08-20 15:00:48
  • Arthur Vaz
  • 133 View
  • 0 Score
  • 1 Answer
  • Tags:   r dplyr purrr

3 Answered Questions

[SOLVED] Data transformation: from dyadic to observational data in R

1 Answered Questions

[SOLVED] Aggregate with max and factors

1 Answered Questions

2 Answered Questions

[SOLVED] What is the purpose of setting a key in data.table?

  • 2013-11-18 02:56:49
  • Wet Feet
  • 35836 View
  • 111 Score
  • 2 Answer
  • Tags:   r data.table

Sponsored Content