data.table vs dplyr: can one do something well the other can't or does poorly?
We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.
My intent is to cover each of these as clearly as possible from a data.table perspective.
Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.
The data.table syntax is consistent in its form: DT[i, j, by]. Keeping i, j and by together is by design. Keeping related operations together allows operations to be easily optimised for speed and, more importantly, memory usage, and also provides some powerful features, all while maintaining consistency in syntax.
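The general form reads as: take DT, subset rows using i, then compute j, grouped by by. A minimal illustration on toy data (the data here is assumed, just for demonstration):

library(data.table)
DT <- data.table(x = 1:6, g = c("a", "a", "b", "b", "c", "c"))
DT[x > 1, sum(x), by = g]   ## subset rows (i), compute (j), grouped by (by)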
1. Speed
Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing that data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) with 100 to 10 million groups and varying grouping columns, which also compare pandas. See also the updated benchmarks, which include Spark and pydatatable as well.
On benchmarks, it would be great to cover these remaining aspects as well (a minimal sketch of the first follows this list):
- Grouping operations involving a subset of rows, i.e., DT[x > val, sum(y), by = z] type operations.
- Other operations, such as update and joins.
- The memory footprint of each operation, in addition to runtime.
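As a rough illustration, a subset-grouping micro-benchmark of that kind might look like the sketch below (sizes, group counts and the 0.5 threshold are arbitrary assumptions):

library(data.table)
library(dplyr)
library(microbenchmark)

N  <- 1e7
DT <- data.table(x = runif(N), y = runif(N), z = sample(1e4L, N, TRUE))
DF <- as.data.frame(DT)

microbenchmark(
  data.table = DT[x > 0.5, sum(y), by = z],
  dplyr      = DF %>% filter(x > 0.5) %>% group_by(z) %>% summarise(sum(y)),
  times = 10L
)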
2. Memory usage
Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post. Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.
The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).
# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]

But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))

A concern here is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this post and this one for interesting cases. And we want to keep it.
Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable not to modify the input data.table within a function, one can then do:

foo <- function(DT) {
DT = shallow(DT) ## shallow copy DT
DT[, newcol := 1L] ## does not affect the original DT
DT[x > 2L, newcol := 2L] ## no need to copy (internally), as this column exists only in shallow copied DT
DT[x > 2L, x := 3L] ## have to copy (like base R / dplyr does always); otherwise original DT will
## also get modified.
}

By not using shallow(), the old functionality is retained:

bar <- function(DT) {
DT[, newcol := 1L] ## old behaviour, original DT gets updated by reference
DT[x > 2L, x := 3L] ## old behaviour, update column x in original DT.
}

By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also copying the columns you modify only when absolutely necessary. When implemented, this should settle the referential transparency issue altogether, while providing the user with both possibilities.

Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables. But it will still lack many features that data.table provides, including (sub)-assignment by reference.
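In the meantime, the conventional workaround for referential transparency is an explicit deep copy via copy(), which guarantees the caller's data.table is untouched at the cost of copying all columns up front (a minimal sketch):

library(data.table)
foo_safe <- function(DT) {
  DT <- copy(DT)       ## deep copy; the caller's DT is never modified
  DT[, newcol := 1L]   ## safe: operates on the local copy only
  DT[]                 ## return (and print when called interactively)
}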
Aggregate while joining:
Suppose you have two data.tables as follows:
DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
# x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
# x y mul
# 1: 1 a 4
# 2: 2 b 3

And you would like to get sum(z) * mul for each row in DT2 while joining by columns x, y. We can either:

1) aggregate DT1 to get sum(z), 2) perform the join and 3) multiply (or)

# data.table way
DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
# dplyr equivalent
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
right_join(DF2) %>% mutate(z = z * mul)
2) do it all in one go (using the by = .EACHI feature):

DT1[DT2, list(z = sum(z) * mul), by = .EACHI]
What is the advantage?
We don't have to allocate memory for the intermediate result.
We don't have to group/hash twice (once for aggregation and once for joining).
And more importantly, the operation we wanted to perform is clear from looking at j in (2).
Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go. Have a look at this post, this one and this one for real usage scenarios.
In dplyr you would have to join and aggregate, or aggregate first and then join, neither of which is as efficient in terms of memory (which in turn translates to speed).

Update and joins:
Consider the data.table code shown below:

DT1[DT2, col := i.mul]

adds/updates DT1's column col with mul from DT2 on those rows where DT2's key columns match DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., one that avoids a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary. Check this post for a real usage scenario.
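For contrast, the closest dplyr rendering I can think of would be something like the sketch below (an assumption on my part, not an established equivalent; it materialises the full join and copies all of DF1):

ans <- DF1 %>% left_join(select(DF2, x, y, col = mul), by = c("x", "y"))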
To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!
3. Syntax
Let's now look at syntax. Hadley commented here:
Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...
I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.
We will work with the dummy data shown below:
DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
Basic aggregation/update operations.
# case (a)
DT[, sum(y), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])

data.table syntax is compact and dplyr's is quite verbose. Things are more or less equivalent in case (a).
In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table, however, we express both operations with the same logic: operate on rows where x > 2, but in the first case get sum(y), whereas in the second case update those rows of y with its cumulative sum.

This is what we mean when we say the DT[i, j, by] form is consistent.

Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip the rest, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.

While it returns the same result, using filter() here makes the actual operation less obvious.

It might very well be possible to use filter() in the first case as well (it does not seem obvious to me), but my point is that we should not have to.
Aggregation / update on multiple columns
# case (a)
DT[, lapply(.SD, sum), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))

In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

data.table's := requires column names to be provided, whereas dplyr generates them automatically.

In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is return a list in j; each element of the list becomes a column in the result. So we can, once again, use the familiar base function c() to concatenate .N to a list, which returns a list.
Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use the base functions c(), as.list(), lapply(), list() etc. to accomplish this, without having to learn any new functions.

You will need to learn just the special variables, .N and .SD, at least. Their equivalents in dplyr are n() and . (the dot).
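A quick illustration of those two special variables on the dummy data from above:

DT[, .N, by = z]        ## .N  = number of rows in each group
DT[, .SD[1L], by = z]   ## .SD = subset of data for each group; here, its first row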
Joins
dplyr provides separate functions for each type of join, whereas data.table allows joins using the same DT[i, j, by] syntax (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

setkey(DT1, x, y)
# 1. normal join
DT1[DT2] ## data.table syntax
left_join(DT2, DT1) ## dplyr syntax
# 2. select columns while join
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
# 3. aggregate while join
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
# 4. update while join
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
??
# 5. rolling join
DT1[DT2, roll = -Inf]
??
# 6. other arguments to control output
DT1[DT2, mult = "first"]
??
Some might find a separate function for each join much nicer (left, right, inner, anti, semi, etc.), whereas others might like data.table's DT[i, j, by], or merge(), which is similar to base R.

However, dplyr joins do just that. Nothing more. Nothing less.

data.tables can select columns while joining (2); in dplyr you will need to select() on both data.frames first before joining, as shown above. Otherwise you would materialise the join with unnecessary columns, only to remove them later, and that is inefficient.

data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?
data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.
data.table also has the mult = argument which selects first, last or all matches (6).

data.table has the allow.cartesian = TRUE argument to protect from accidental invalid joins.
Once again, the syntax is consistent with DT[i, j, by], with additional arguments allowing further control of the output.
do()...

dplyr's summarise() is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

DT[, list(x[1], y[1]), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
.SD's equivalent is . (the dot).

In data.table, you can throw pretty much anything in j - the only thing to remember is that it should return a list, so that each element of the list gets converted to a column.

In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are as to whether your function will always return a single value. And it is quite slow.
Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.
Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...
To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about its "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about the most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.
data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.
But one should also consider the number of features that dplyr lacks in comparison to data.table.
4. Features
I have pointed out most of the features here and also in this post. In addition:
- fread - a fast file reader - has been available for a long time now.
- fwrite - a parallelised fast file writer - is now available. See this post for a detailed explanation of the implementation and #1664 for keeping track of further developments.
- Automatic indexing - another handy feature that optimises base R syntax as-is, internally.
- Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.
- Numerous advantages of data.table joins (for speed / memory efficiency and syntax), mentioned above.
- Non-equi joins: allows joins using other operators (<=, <, >, >=) along with all the other advantages of data.table joins (a small sketch follows this list).
- Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.
- The setorder() function in data.table allows really fast reordering of data.tables by reference.
- dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.
- data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal - with an additional all argument (as in SQL).
- data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; e.g. here and here.
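A minimal sketch of two of the features above - the fread/fwrite round trip and a non-equi join (the file name and toy tables are assumptions for illustration; non-equi joins require data.table >= 1.9.8):

library(data.table)

# fast write / read round trip (hypothetical file name)
DT <- data.table(x = 1:3, y = letters[1:3])
fwrite(DT, "tmp.csv")
DT2 <- fread("tmp.csv")

# non-equi join: tag each query value with the range it falls into
limits  <- data.table(lo = c(1L, 6L), hi = c(5L, 10L), tag = c("low", "high"))
queries <- data.table(val = c(2L, 7L))
limits[queries, on = .(lo <= val, hi >= val)]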
Finally:
On databases - there is no reason why data.table cannot provide similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature.. not sure.
On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).
- Progress is being made currently (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.
What can you do with a data.frame that you can't with a data.table?
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[, 1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).

Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame, and that package can use [.data.frame syntax on the data.table.

We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
What are the smaller syntax differences between data.frame and data.table?
- DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column.
- DT[3, ] == DT[3], but DF[, 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent).
- For this reason we say the comma is optional in DT, but not optional in DF.
- DT[[3]] == DF[, 3] == DF[[3]].
- DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset, which returns a vector.
- DT[, j], where j is a single integer, returns a one-column data.table, unlike DF[, j], which returns a vector by default.
- DT[, "colA"][[1]] == DF[, "colA"].
- DT[, colA] == DF[, "colA"] (currently in data.table v1.9.8, but about to change; see release notes).
- DT[, list(colA)] == DF[, "colA", drop = FALSE].
- DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]; [.data.table diverts to this probable intention automatically, for convenience.
- DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA.
- DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ].
- data.frame(list(1:2, "k", 1:4)) creates 3 columns; data.table creates one list column.
- check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
- stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
- Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table, with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
- In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single-column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.
When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
Small caveat
There will possibly be cases where some packages use code that falls down when given a data.table; however, given that data.table is constantly being maintained to avoid such problems, any problems that may arise will be fixed promptly.
For example, see this question and prompt response. From the NEWS for v 1.8.2:
- base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
- An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added.
ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Converting ITime to POSIXct for ggplot2 is one approach.
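A minimal sketch of that conversion (the column name and date are assumptions for illustration; data.table provides an as.POSIXct method for ITime that takes a date argument):

library(data.table)
DT <- data.table(tm = as.ITime(c("08:30:00", "17:45:00")), val = 1:2)
DT[, tm_ct := as.POSIXct(tm, date = as.Date("2016-01-01"))]  ## arbitrary assumed date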
data.table difference set of columns from another column
You can subset .SD using mycols and subtract a:
dt[, paste0("diff", mycols) := .SD[, mycols, with = FALSE] - a ]
# a b d diffb diffd
# 1: 1 11 21 10 20
# 2: 2 12 22 10 20
# 3: 3 13 23 10 20
# 4: 4 14 24 10 20
# 5: 5 15 25 10 20
# 6: 6 16 26 10 20
# 7: 7 17 27 10 20
# 8: 8 18 28 10 20
# 9: 9 19 29 10 20
#10: 10 20 30 10 20
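For reference, a setup that would reproduce the output above (assumed, since the question's data isn't shown):

library(data.table)
dt     <- data.table(a = 1:10, b = 11:20, d = 21:30)
mycols <- c("b", "d")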
As Frank pointed out in the comments, this works too:
dt[, paste0("diff", mycols) := .SD - dt$a, .SDcols=mycols]
Not sure what's better practice, though.
dplyr on data.table, am I really using data.table?
There is no straightforward/simple answer because the philosophies of both these packages differ in certain aspects. So some compromises are unavoidable. Here are some of the concerns you may need to address/consider.
Operations involving i (== filter() and slice() in dplyr)
Assume DT with say 10 columns. Consider these data.table expressions:
DT[a > 1, .N] ## --- (1)
DT[a > 1, mean(b), by=.(c, d)] ## --- (2)
(1) gives the number of rows in DT where column a > 1. (2) returns mean(b) grouped by c, d for the same expression in i as (1).
Commonly used dplyr expressions would be:
DT %>% filter(a > 1) %>% summarise(n()) ## --- (3)
DT %>% filter(a > 1) %>% group_by(c, d) %>% summarise(mean(b)) ## --- (4)
Clearly, the data.table codes are shorter. In addition, they are also more memory efficient (see the note below). Why? Because in both (3) and (4), filter() returns rows for all 10 columns first, when in (3) we just need the number of rows, and in (4) we just need columns b, c, d for the successive operations. To overcome this, we have to select() columns a priori:
DT %>% select(a) %>% filter(a > 1) %>% summarise(n()) ## --- (5)
DT %>% select(a,b,c,d) %>% filter(a > 1) %>% group_by(c,d) %>% summarise(mean(b)) ## --- (6)
It is essential to highlight a major philosophical difference between the two packages:

In data.table, we like to keep these related operations together; that allows us to look at the j-expression (from the same function call) and realise there's no need for any columns in (1). The expression in i gets computed, and .N is just the sum of that logical vector, which gives the number of rows; the entire subset is never materialised. In (2), just columns b, c, d are materialised in the subset; other columns are ignored.

But in dplyr, the philosophy is to have a function do precisely one thing well. There is (at least currently) no way to tell if the operation after filter() needs all those columns we filtered. You'll need to think ahead if you want to perform such tasks efficiently. I personally find it counter-intuitive in this case.
Note that in (5) and (6), we still subset column a, which we don't require. But I'm not sure how to avoid that. If filter() had an argument to select the columns to return, we could avoid this issue, but then the function would not do just one task (which is also a dplyr design choice).
Sub-assign by reference
dplyr will never update by reference. This is another huge (philosophical) difference between the two packages.
For example, in data.table you can do:
DT[a %in% some_vals, a := NA]
which updates column a by reference on just those rows that satisfy the condition. At the moment, dplyr deep copies the entire data.table internally to add a new column. @BrodieG already mentioned this in his answer.
But the deep copy can be replaced by a shallow copy when FR #617 is implemented. Also relevant: dplyr: FR#614. Note that, even then, the column you modify will always be copied (and is therefore a tad slower / less memory efficient). There will be no way to update columns by reference.
Other functionalities
In data.table, you can aggregate while joining, which is more straightforward to understand and memory efficient, since the intermediate join result is never materialised. Check this post for an example. You can't (at the moment?) do that using dplyr's data.table/data.frame syntax.
data.table's rolling joins feature is not supported in dplyr's syntax either.
We recently implemented overlap joins in data.table to join over interval ranges (here's an example). It is a separate function, foverlaps(), at the moment, and therefore could be used with the pipe operators (magrittr / pipeR? - never tried it myself). But ultimately, our goal is to integrate it into [.data.table so that we can harvest the other features like grouping, aggregating while joining, etc., which will have the same limitations outlined above.

Since 1.9.4, data.table implements automatic indexing using secondary keys for fast binary-search-based subsets with regular R syntax. E.g., DT[x == 1] and DT[x %in% some_vals] will automatically create an index on the first run, which will then be used on successive subsets of the same column for fast subsetting using binary search.
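A small sketch of that automatic indexing behaviour (sizes are arbitrary; indices() reports the secondary indices created on a data.table):

library(data.table)
DT <- data.table(x = sample(1e6L), y = runif(1e6))
system.time(DT[x %in% 1:10])  ## first run: builds the secondary index on 'x'
system.time(DT[x %in% 1:10])  ## later runs: reuse the index via binary search
indices(DT)                   ## shows the index created, e.g. "x"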