Data.Table VS Dplyr: Can One Do Something Well the Other Can't or Does Poorly

data.table vs dplyr: can one do something well the other can't or does poorly?

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.

The data.table syntax is consistent in its form - DT[i, j, by]. To keep i, j and by together is by design. By keeping related operations together, it allows to easily optimise operations for speed and more importantly memory usage, and also provide some powerful features, all while maintaining the consistency in syntax.

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already showing data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 - 10 million groups and varying grouping columns, which also compares pandas. See also updated benchmarks, which include Spark and pydatatable as well.

On benchmarks, it would be great to cover these remaining aspects as well:

  • Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.

  • Benchmark other operations such as update and joins.

  • Also benchmark memory footprint for each operation in addition to runtime.

2. Memory usage

  1. Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.

    Note that Hadley's comment talks about speed (that dplyr is plentiful fast for him), whereas the major concern here is memory.

  2. data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).

     # sub-assign by reference, updates 'y' in-place
    DT[x >= 1L, y := NA]

    But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

     # copies the entire 'y' column
    ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))

    A concern for this is referential transparency. Updating a data.table object by reference, especially within a function may not be always desirable. But this is an incredibly useful feature: see this and this posts for interesting cases. And we want to keep it.

    Therefore we are working towards exporting shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

     foo <- function(DT) {
    DT = shallow(DT) ## shallow copy DT
    DT[, newcol := 1L] ## does not affect the original DT
    DT[x > 2L, newcol := 2L] ## no need to copy (internally), as this column exists only in shallow copied DT
    DT[x > 2L, x := 3L] ## have to copy (like base R / dplyr does always); otherwise original DT will
    ## also get modified.

    By not using shallow(), the old functionality is retained:

     bar <- function(DT) {
    DT[, newcol := 1L] ## old behaviour, original DT gets updated by reference
    DT[x > 2L, x := 3L] ## old behaviour, update column x in original DT.

    By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that while also ensuring to copy columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilties.

    Also, once shallow() is exported dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.

    But it will still lack many features that data.table provides, including (sub)-assignment by reference.

  3. Aggregate while joining:

    Suppose you have two data.tables as follows:

     DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
    # x y z
    # 1: 1 a 1
    # 2: 1 a 2
    # 3: 1 b 3
    # 4: 1 b 4
    # 5: 2 a 5
    # 6: 2 a 6
    # 7: 2 b 7
    # 8: 2 b 8
    DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
    # x y mul
    # 1: 1 a 4
    # 2: 2 b 3

    And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

      1. aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

        data.table way

        DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]

        dplyr equivalent

        DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
        right_join(DF2) %>% mutate(z = z * mul)

      1. do it all in one go (using by = .EACHI feature):

        DT1[DT2, list(z=sum(z) * mul), by = .EACHI]

    What is the advantage?

    • We don't have to allocate memory for the intermediate result.

    • We don't have to group/hash twice (one for aggregation and other for joining).

    • And more importantly, the operation what we wanted to perform is clear by looking at j in (2).

    Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.

    Have a look at this, this and this posts for real usage scenarios.

    In dplyr you would have to join and aggregate or aggregate first and then join, neither of which are as efficient, in terms of memory (which in turn translates to speed).

  4. Update and joins:

    Consider the data.table code shown below:

     DT1[DT2, col := i.mul]

    adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without avoiding a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary.

    Check this post for a real usage scenario.

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side-by-side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF =
  1. Basic aggregation/update operations.

     # case (a)
    DT[, sum(y), by = z] ## data.table syntax
    DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
    DT[, y := cumsum(y), by = z]
    ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

    # case (b)
    DT[x > 2, sum(y), by = z]
    DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
    DT[x > 2, y := cumsum(y), by = z]
    ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))

    # case (c)
    DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
    DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
    DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
    DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
    • data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

    • In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum.

      This is what we mean when we say the DT[i, j, by] form is consistent.

    • Similarly in case (c), when we have if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition satisfies and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise because summarise() always expects a single value.

      While it returns the same result, using filter() here makes the actual operation less obvious.

      It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

  2. Aggregation / update on multiple columns

     # case (a)
    DT[, lapply(.SD, sum), by = z] ## data.table syntax
    DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
    DT[, (cols) := lapply(.SD, sum), by = z]
    ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

    # case (b)
    DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
    DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

    # case (c)
    DT[, c(.N, lapply(.SD, sum)), by = z]
    DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
    • In case (a), the codes are more or less equivalent. data.table uses familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

    • data.table's := requires column names to be provided, whereas dplyr generates it automatically.

    • In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

    • In case (c) though, dplyr would return n() as many times as many columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So, we can use, once again, the familiar base function c() to concatenate .N to a list which returns a list.

    Note: Once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions.

    You will need to learn just the special variables - .N and .SD at least. The equivalent in dplyr are n() and .

  3. Joins

    dplyr provides separate functions for each type of join where as data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent function as an alternative.

     setkey(DT1, x, y)

    # 1. normal join
    DT1[DT2] ## data.table syntax
    left_join(DT2, DT1) ## dplyr syntax

    # 2. select columns while join
    DT1[DT2, .(z, i.mul)]
    left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

    # 3. aggregate while join
    DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
    DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)

    # 4. update while join
    DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]

    # 5. rolling join
    DT1[DT2, roll = -Inf]

    # 6. other arguments to control output
    DT1[DT2, mult = "first"]
  • Some might find a separate function for each joins much nicer (left, right, inner, anti, semi etc), whereas as others might like data.table's DT[i, j, by], or merge() which is similar to base R.

  • However dplyr joins do just that. Nothing more. Nothing less.

  • data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before to join as shown above. Otherwise you would materialiase the join with unnecessary columns only to remove them later and that is inefficient.

  • data.tables can aggregate while joining (3) and also update while joining (4), using by = .EACHI feature. Why materialse the entire join result to add/update just a few columns?

  • data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.

  • data.table also has mult = argument which selects first, last or all matches (6).

  • data.table has allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by] with additional arguments allowing for controlling the output further.

  1. do()...

    dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand about all your functions return value.

     DT[, list(x[1], y[1]), by = z]                 ## data.table syntax
    DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
    DT[, list(x[1:2], y[1]), by = z]
    DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

    DT[, quantile(x, 0.25), by = z]
    DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
    DT[, quantile(x, c(0.25, 0.75)), by = z]
    DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

    DT[, as.list(summary(x)), by = z]
    DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
  • .SD's equivalent is .

  • In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list so that each element of the list gets converted to a column.

  • In dplyr, cannot do that. Have to resort to do() depending on how sure you are as to whether your function would always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforward using dplyr's syntax...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out that we are attempting to fix). We are also attempting to improve data.table's joins as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

  • fread - fast file reader has been available for a long time now.

  • fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.

  • Automatic indexing - another handy feature to optimise base R syntax as is, internally.

  • Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not be always desirable.

  • Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.

  • Non-equi joins: Allows joins using other operators <=, <, >, >= along with all other advantages of data.table joins.

  • Overlapping range joins was implemented in data.table recently. Check this post for an overview with benchmarks.

  • setorder() function in data.table that allows really fast reordering of data.tables by reference.

  • dplyr provides interface to databases using the same syntax, which data.table does not at the moment.

  • data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal with additional all argument (as in SQL).

  • data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.


  • On databases - there is no reason why data.table cannot provide similar interface, but this is not a priority now. It might get bumped up if users would very much like that feature.. not sure.

  • On parallelism - Everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe).

    • Progress is being made currently (in v1.9.7 devel) towards parallelising known time consuming parts for incremental performance gains using OpenMP.

data.table vs dplyr memory use revisited

UPDATE: Following @jangorecki's suggestion, I redid the analysis using the cgmemtime shell utility. The numbers are far closer — even with multithreading enabled — and data.table now edges out dplyr w.r.t to .high-water RSS+CACHE memory usage.


$ ./cgmemtime Rscript ~/mem-comp-dplyr.R
Child user: 0.526 s
Child sys : 0.033 s
Child wall: 0.455 s
Child high-water RSS : 128952 KiB
Recursive and acc. high-water RSS+CACHE : 118516 KiB


$ ./cgmemtime Rscript ~/mem-comp-dt.R
Child user: 0.510 s
Child sys : 0.056 s
Child wall: 0.464 s
Child high-water RSS : 129032 KiB
Recursive and acc. high-water RSS+CACHE : 118320 KiB

Bottom line: Accurately measuring memory usage from within R is complicated.

I'll leave my original answer below because I think it still has value.


Okay, so in the process of writing this out I realised that data.table's default multi-threading behaviour appears to be the major culprit. If I re-run the latter chunk, but this time turn of multi-threading, the two results are much more comparable:

library(dplyr, warn.conflicts = FALSE)
library(data.table, warn.conflicts = FALSE)

DT = data.table(x = rep(1:10, times = 1e5),
y = sample(LETTERS[1:10], 10e5, replace = TRUE),
z = rnorm(1e6))

DT[x > 7, mean(z), by = y]
#> y V1
#> 1: F -0.0056834238
#> 2: I -0.0016755202
#> 3: J 0.0066061660
#> 4: G -0.0034436348
#> 5: B -0.0070242788
#> 6: E -0.0070462070
#> 7: H 0.0005525803
#> 8: D -0.0043024627
#> 9: A -0.0033609302
#> 10: C 0.0029146372

#> current max
#> 589MB 612MB

Created on 2020-04-22 by the reprex package (v0.3.0)

Still, I'm surprised that they're this close. The data.table memory performance actually gets comparably worse if I try with a larger data set — despite using a single thread — which makes me suspicious that I'm still not measuring memory usage correctly...

Data.table instead of dplyr

I think akrun's answer provides a great expression-for-expression translation. If you don't need the steps replicated, though, you can try this:

dt1 <-
cols <- grep("^DRM", colnames(dt1), value = TRUE)
dt1_results_median <-
dt1[, (paste0(cols, "_PV")) := DR1 - .SD, .SDcols = cols
][, lapply(.SD, median), by = .(Id, Category, Week), .SDcols = paste0(cols, "_PV") ]

Relative performance, showing a small improvement (33-41%) in both answers:

bench::mark(OP = {...}, akrun = {...}, r2evans = {...}, check = FALSE, iterations = 10)
# Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# # A tibble: 3 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <list>
# 1 OP 1.61s 1.72s 0.583 2.01GB 3.15 10 54 17.2s <NULL> <Rprofmem~ <benc~ <tibbl~
# 2 akrun 1.24s 1.29s 0.773 2.29GB 1.47 10 19 12.9s <NULL> <Rprofmem~ <benc~ <tibbl~
# 3 r2evans 1.19s 1.21s 0.823 1.88GB 1.65 10 20 12.2s <NULL> <Rprofmem~ <benc~ <tibbl~

I recognize that akrun's answer may be intended more as a teaching moment, translating from one dialect of R to another, so the speed difference is "moot". At this comparative run-time, I think it's more important to go with the code that is more readable and understandable, which makes it more maintainable and easier to troubleshoot if/when you need to change your methods.

when is plyr better than data.table?

They are different packages with different purposes. One is not a substitute for the other, despite there being a small subset of functionality for which they overlap.

Here is the brief summary of each package, from the packages themselves:

The plyr package is a set of clean and consistent tools that implement the split-apply-combine pattern in R. This is an extremely common pattern in data analysis: you solve a complex problem by breaking it down into small pieces, doing something to each piece and then combining the results back together again.


data.table ... offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by A[B] syntax in R where A is a matrix and B is a 2-column matrix.

Where they overlap is in the "fast grouping" which plyr also does by splitting data.frames, operating on pieces, and recombining them into a single data.frame. data.table has many other features which make operations on data.frame like structures fast; plyr has features which apply the split-apply-combine paradigm to other data structures such as lists and arrays (both as inputs and outputs).

So, really, they are two different tools that happen to have a small area of overlap which address the same problem domain, but each does much more than that and if you want/need that additional functionality, then that package should be used.

Dangers of mixing [tidyverse] and [data.table] syntax in R?

I came across the same problem on a few occasions, which led me to avoid mixing dplyr with data.table syntax, as I didn't take the time to find out the reason. So thanks for providing a MRE.

Looks like dplyr::arrange is interfering with data.table auto-indexing :

  • index will be used when subsetting dataset with ==
    or %in%
    on a single variable
  • by default if index for a variable is not present on filtering, it is automatically created and used
  • indexes are lost if you change the order of data
  • you can check if you are using index with options(datatable.verbose=TRUE)

If we explicitely set auto-indexing :


DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ALB Albania UMIC
MOZ Mozambique LIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options( = TRUE)

DT <- distinct(DT) %>%

# Index creation because %in% is used for the first time
DT[iso3c %in% codes,verbose=T]
#> Creating new index 'iso3c'
#> Creating index iso3c done in ... forder.c received 3 rows and 3 columns
#> forder took 0 sec
#> 0.060s elapsed (0.060s cpu)
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> Reordering 2 rows after bmerge done in ... forder.c received a vector type 'integer' length 2
#> 0 secs
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC

# Index mixed up by arrange
DT <- DT %>% arrange(iso3c) %>%

# this is wack because data.table possibly still uses the old index whereas row/references were rearranged:
DT[iso3c %in% codes,verbose=T]
#> Optimized subsetting with index 'iso3c'
#> forder.c received 2 rows and 1 columns
#> forder took 0 sec
#> x is already ordered by these columns, no need to call reorder
#> i.iso3c has same type (character) as x.iso3c. No coercion needed.
#> on= matches existing index, using index
#> Starting bmerge ...
#> bmerge done in 0.000s elapsed (0.000s cpu)
#> Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
#> iso3c country income
#> 1: ALB Albania UMIC

# this works because (...) prevents the parser to use auto-index
DT[(iso3c %in% codes)]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC

To avoid this problem, you can disable auto-indexing:


DT <- fread(
"iso3c country income
MOZ Mozambique LIC
ALB Albania UMIC
MOZ Mozambique LIC
ALB Albania UMIC")
codes <- c("ALB", "ZMB")

options( = FALSE) # Disabled

DT <- distinct(DT) %>%

# No automatic index creation
DT[iso3c %in% codes,verbose=T]
#> iso3c country income
#> 1: ZMB Zambia LMIC
#> 2: ALB Albania UMIC

DT <- DT %>% arrange(iso3c) %>%

# This now works because auto-indexing is off:
DT[iso3c %in% codes,verbose=T]
#> iso3c country income
#> 1: ALB Albania UMIC
#> 2: ZMB Zambia LMIC

I reported this issue on data.table/issues/5042 and on dtplyr/issues/259 : integrated in 1.4.11 milestone.

data.table or dplyr - data manipulation

df %.%
arrange(Date) %.%
filter(!duplicated(Col1)) %.%
group_by(Date) %.%
summarise(Count=n()) %.% # n() <=> length(Date)
mutate(Count = cumsum(Count))
# Source: local data frame [3 x 2]
# Date Count
# 1 2014-01-01 3
# 2 2014-01-02 5
# 3 2014-01-03 6

dt <- data.table(df, key="Date")
dt <- unique(dt, by="Col1")
(dt <- dt[, list(Count=.N), by=Date][, Count:=cumsum(Count)])
# Date Count
# 1: 2014-01-01 3
# 2: 2014-01-02 5
# 3: 2014-01-03 6


dt <- data.table(df, key="Date")
dt <- unique(dt, by="Col1")
dt[, .N, by=Date][, Count:=cumsum(N)]

.N is named N (no dot) automatically for convenience in chained operations like this, so you can use both .N and N together in the next operation if need be.

Related Topics

Leave a reply