I'm relatively familiar with
data.table, not so much with
dplyr. I've read through some
dplyr vignettes and examples that have popped up on SO, and so far my conclusions are that:
dplyrare comparable in speed, except when there are many (i.e. >10-100K) groups, and in some other circumstances (see benchmarks below)
dplyrhas more accessible syntax
dplyrabstracts (or will) potential DB interactions
- There are some minor functionality differences (see "Examples/Usage" below)
In my mind 2. doesn't bear much weight because I am fairly familiar with it
data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with
data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested about here).
What I want to know is:
- Are there analytical tasks that are a lot easier to code with one or the other package for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing).
- Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. another.
One recent SO question got me thinking about this a bit more, because up until that point I didn't think
dplyr would offer much beyond what I can already do in
data.table. Here is the
dplyr solution (data at end of Q):
dat %.% group_by(name, job) %.% filter(job != "Boss" | year == min(year)) %.% mutate(cumu_job2 = cumsum(job2))
Which was much better than my hack attempt at a
data.table solution. That said, good
data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored single statement over the strictly most optimal solution):
setDT(dat)[, .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], by=list(id, job) ]
The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to
data.table (i.e. doesn't use some of the more esoteric tricks).
Ideally what I'd like to see is some good examples were the
data.table way is substantially more concise or performs substantially better.
dplyrdoes not allow grouped operations that return arbitrary number of rows (from eddi's question, note: this looks like it will be implemented in dplyr 0.5, also, @beginneR shows a potential work-around using
doin the answer to @eddi's question).
data.tablesupports rolling joins (thanks @dholstius) as well as overlap joins
data.tableinternally optimises expressions of the form
DT[col == value]or
DT[col %in% values]for speed through automatic indexing which uses binary search while using the same base R syntax. See here for some more details and a tiny benchmark.
dplyroffers standard evaluation versions of functions (e.g.
summarize_each_) that can simplify the programmatic use of
dplyr(note programmatic use of
data.tableis definitely possible, just requires some careful thought, substitution/quoting, etc, at least to my knowledge)
- I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K) at which point
data.tablebecomes substantially faster.
- @Arun ran some benchmarks on joins, showing that
data.tablescales better than
dplyras the number of groups increase (updated with recent enhancements in both packages and recent version of R). Also, a benchmark when trying to get unique values has
- (Unverified) has
data.table75% faster on larger versions of a group/apply/sort while
dplyrwas 40% faster on the smaller ones (another SO question from comments, thanks danas).
- Matt, the main author of
data.table, has benchmarked grouping operations on
pandason up to 2 billion rows (~100GB in RAM).
- An older benchmark on 80K groups has
This is for the first example I showed in the question section.
dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", "name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, -16L))