What You Can Do with a Data.Frame That You Can't with a Data.Table

What can you do with a data.frame that you can't with a data.table?

From the data.table FAQ

FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?

As FAQ 1.1 highlights, j in [.data.table is fundamentally
different from j in [.data.frame. Even something as simple as
DF[,1] would break existing code in many packages and user code.
This is by design, and we want it to work this way for more
complicated syntax to work. There are other differences, too (see FAQ
2.17).

Furthermore, data.table inherits from data.frame. It is a
data.frame, too. A data.table can be passed to any package that
only accepts data.frame and that package can use [.data.frame
syntax on the data.table.
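The inheritance claim can be checked directly. The following is a minimal sketch, assuming the data.table package is installed; `first_col_mean` is a hypothetical stand-in for a function that only knows about data.frame, not something from the FAQ.

```r
library(data.table)

DT <- data.table(x = 1:3, y = c("a", "b", "c"))

# A data.table carries both classes, so data.frame checks succeed
class(DT)
is.data.frame(DT)

# Hypothetical data.frame-only helper (illustration only)
first_col_mean <- function(df) {
  stopifnot(is.data.frame(df))
  mean(df[[1]])   # [[ ]] extracts a column as a vector in both classes
}

first_col_mean(DT)
```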

We have proposed enhancements to R wherever possible, too. One of
these was accepted as a new feature in R 2.12.0:

unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked
encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements
to the way the hash code is generated in unique.c.

A second proposal was to use memcpy in duplicate.c, which is much
faster than a for loop in C. This would improve the way that R copies
data internally (on some measures by 13 times). The thread on r-devel
is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.

What are the smaller syntax differences between data.frame and data.table?

DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column

DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)

For this reason we say the comma is optional in DT, but not optional in DF

DT[[3]] == DF[, 3] == DF[[3]]

DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.

DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default

DT[ , "colA"][[1]] == DF[ , "colA"].

DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)

DT[ , list(colA)] == DF[ , "colA", drop = FALSE]

DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.

DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns an NA row for each NA.

DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]

data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.

check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.

stringsAsFactors is by default TRUE in data.frame (before R 4.0.0, which changed the default to FALSE) but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are pointers to the single cached string and there is no longer a performance benefit of converting to factor.

Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.

In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.

When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
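Several of the indexing differences listed above can be seen side by side. This is a small sketch on made-up data (the columns `a`, `b` and `c` are assumptions for illustration), assuming data.table 1.9.8 or later; the data.frame NA cases use the row-indexing form `DF[i, ]`.

```r
library(data.table)

DF <- data.frame(a = c(1, 2, NA), b = c(1, 5, NA), c = c("x", "y", "z"))
DT <- as.data.table(DF)

# A single index selects a row in data.table but a column in data.frame
dim(DT[3])   # 1 row, 3 columns
dim(DF[3])   # 3 rows, 1 column

# [[ ]] extracts a column as a plain vector in both
identical(DT[[3]], DF[[3]])

# DF[, "a"] drops to a vector by default; DT[, "a"] stays a data.table
is.numeric(DF[, "a"])
is.data.table(DT[, "a"])

# NA in a logical i: data.table treats it as FALSE, data.frame keeps an NA row
nrow(DT[c(TRUE, NA, FALSE)])     # 1
nrow(DF[c(TRUE, NA, FALSE), ])   # 2

# So a comparison involving NA needs no extra is.na() guards in data.table
nrow(DT[a == b])                 # 1
nrow(DF[DF$a == DF$b, ])         # 2, one of them all NA
```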

Small caveat

There may be cases where some packages use code that falls down when given a data.table. However, data.table is constantly being maintained to avoid such problems, so any that arise tend to be fixed promptly.

For example, see this question and prompt response.

From the NEWS for v 1.8.2:

base::unname(DT) now works again, as needed by plyr::melt(). Thanks to
Christoph Jaeckel for reporting. Test added.

An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2
without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added.
ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2
doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.

order() in data.frame and data.table

When used inside of a data.table operation, order(..) uses data.table:::forder. According to the Introduction to data.table vignette:

order() is internally optimised
We can use "-" on character columns within the frame of a data.table to sort in decreasing order.
In addition, order(...) within the frame of a data.table uses data.table's internal fast radix order forder(). This sort provided such a compelling improvement over R's base::order that the R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0, see ?sort and the R Release NEWS.

The key to seeing the difference is that it uses a "fast radix order". base::order, on the other hand, has a method= argument, documented as follows:

  method: the method to be used: partial matches are allowed.  The
          default ('"auto"') implies '"radix"' for short numeric
          vectors, integer vectors, logical vectors and factors.
          Otherwise, it implies '"shell"'.  For details of methods
          '"shell"', '"quick"', and '"radix"', see the help for 'sort'.

Since the second column is a character vector (not numeric, integer, logical, or factor), base::order uses the "shell" method, which collates according to the current locale. Radix sorting always collates character vectors in the C-locale, which is why the two orderings differ.

However, if we force base::order to use method="radix", we get the same result.

order(A$two)
# [1] 1 2 3
order(A$two, method="radix")
# [1] 2 1 3

A[order(A$one, A$two, method = "radix"),]
#   one   two
# 2   k 31_60
# 1   k  3_28
# 3   k 48_68

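Conversely, you can reproduce the data.frame ordering on a data.table (here B) by calling base::order explicitly, which bypasses the internal forder optimisation: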

B[base::order(B$one,B$two),]
#       one    two
#    <char> <char>
# 1:      k   3_28
# 2:      k  31_60
# 3:      k  48_68
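To try this end to end, the snippet below rebuilds the two objects. The definitions of `A` and `B` are an assumption reconstructed from the printed output above, since the original question's data is not shown here. The default `base::order` result is left without an expected value because it depends on the session locale.

```r
library(data.table)

# Assumed reconstruction of the question's data
A <- data.frame(one = "k", two = c("3_28", "31_60", "48_68"),
                stringsAsFactors = FALSE)
B <- as.data.table(A)

# Inside [.data.table, order() is replaced by the internal radix forder,
# which collates in the C-locale
dt_sorted <- B[order(one, two)]$two

# Forcing method = "radix" in base::order gives the same ordering
df_radix <- A[order(A$one, A$two, method = "radix"), "two"]

# Default method for character vectors is locale-dependent
df_default <- A[order(A$one, A$two), "two"]

identical(dt_sorted, df_radix)
```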

transform data.table in r

For a data.table solution using melt then tstrsplit:

setDT(data1)
melt(data1, value.name = "event")[, c("x", "y") := tstrsplit(variable, "_")][, .(x, y, event)]
        x         y event
 1:  long customers  TRUE
 2:  long customers FALSE
 3:  long customers FALSE
 4:  long customers  TRUE
 5:  long  partners FALSE
 6:  long  partners  TRUE
 7:  long  partners FALSE
 8:  long  partners FALSE
 9: short customers FALSE
10: short customers  TRUE
11: short customers  TRUE
12: short customers FALSE
13: short  partners FALSE
14: short  partners FALSE
15: short  partners    NA
16: short  partners    NA
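For a self-contained version, the sketch below uses a hypothetical `data1` whose column names and values are inferred from the printed result above; the original input is not shown, so treat it as an assumption. It passes `measure.vars` explicitly so that melt does not have to guess which columns to stack.

```r
library(data.table)

# Hypothetical input, reconstructed from the printed result
data1 <- data.frame(
  long_customers  = c(TRUE, FALSE, FALSE, TRUE),
  long_partners   = c(FALSE, TRUE, FALSE, FALSE),
  short_customers = c(FALSE, TRUE, TRUE, FALSE),
  short_partners  = c(FALSE, FALSE, NA, NA)
)
setDT(data1)   # convert to data.table by reference

# Stack all four columns into (variable, event) pairs
long <- melt(data1, measure.vars = names(data1), value.name = "event")

# Split "long_customers" into x = "long" and y = "customers"
long[, c("x", "y") := tstrsplit(as.character(variable), "_", fixed = TRUE)]

result <- long[, .(x, y, event)]
result
```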

Intersect 2 columns in data.table R

I used this command, and it seems to work:

mapply(function(x, y) paste0(intersect(x, y), collapse = " "), strsplit(data$col1, '\\s'), strsplit(data$col2, '\\s'))
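Here is a small worked example of that command on made-up data. The table contents and the `common` column name are assumptions for illustration, and the columns are assumed to hold whitespace-separated tokens.

```r
library(data.table)

# Hypothetical two-column table of space-separated tokens
data <- data.table(col1 = c("a b c", "x y"),
                   col2 = c("b c d", "z"))

# Split both columns row by row, intersect the token sets,
# and paste the shared tokens back together
data[, common := mapply(
  function(x, y) paste0(intersect(x, y), collapse = " "),
  strsplit(col1, "\\s"),
  strsplit(col2, "\\s")
)]

data$common
```

Rows with no shared tokens yield an empty string, because pasting a zero-length vector with `collapse` returns "".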
