What can you do with a data.frame that you can't do with a data.table?
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, j in [.data.table is fundamentally different from j in [.data.frame. Even something as simple as DF[,1] would break existing code in many packages and user code. This is by design, and we want it to work this way for more complicated syntax to work. There are other differences, too (see FAQ 2.17).
Furthermore, data.table inherits from data.frame. It is a data.frame, too. A data.table can be passed to any package that only accepts data.frame and that package can use [.data.frame syntax on the data.table.
We have proposed enhancements to R wherever possible, too. One of these was accepted as a new feature in R 2.12.0:
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much faster than a for loop in C. This would improve the way that R copies data internally (on some measures by 13 times). The thread on r-devel is here: http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
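The inheritance described in the FAQ can be checked directly. The following is a minimal sketch, assuming the data.table package is installed; the column names and values are made up for illustration, and lm() stands in for any function that knows nothing about data.table.

```r
library(data.table)

# Hypothetical data, invented for this illustration.
DT <- data.table(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# A data.table carries both classes, so it passes data.frame checks.
cls   <- class(DT)            # "data.table" "data.frame"
is_df <- is.data.frame(DT)    # TRUE

# stats::lm() was written for data.frame input and accepts DT unchanged.
fit <- lm(y ~ x, data = DT)
print(coef(fit))
```

Because the check is done through class inheritance, code that tests is.data.frame() or inherits(x, "data.frame") should treat a data.table as a regular data.frame.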
What are the smaller syntax differences between data.frame and data.table?
- DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
- DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
- For this reason we say the comma is optional in DT, but not optional in DF
- DT[[3]] == DF[, 3] == DF[[3]]
- DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
- DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
- DT[ , "colA"][[1]] == DF[ , "colA"].
- DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
- DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
- DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
- DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA
- DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
- data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
- check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
- stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
- Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
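A few of the subsetting differences in the list above can be reproduced with a small sketch. The data below is hypothetical. The row form DF[NA, ] is used for the data.frame side because it shows the recycling of a logical NA without depending on how a given R version handles the single-index form.

```r
library(data.table)

# Hypothetical three-row table, invented for this illustration.
DF <- data.frame(a = 1:3, b = c("x", "y", "z"), c = c(10, 20, 30))
DT <- as.data.table(DF)

# A single number selects a row in DT but a column in DF.
dt_row <- DT[3]   # 1 row, 3 columns
df_col <- DF[3]   # 3 rows, 1 column

# Logical NA: DT returns a single all-NA row, while the data.frame
# recycles NA across every row.
dt_na      <- DT[NA]
df_na_rows <- DF[NA, ]

# NA inside a logical row filter: treated as FALSE by DT, kept as an NA row by DF.
dt_lgl <- DT[c(TRUE, NA, FALSE)]
df_lgl <- DF[c(TRUE, NA, FALSE), ]

print(dim(dt_row)); print(dim(df_col))
print(nrow(dt_lgl)); print(nrow(df_lgl))
```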
In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.
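The drop behaviour described here can be seen in a short sketch. It assumes a recent data.table (1.9.8 or later, where selecting a column by name in j returns a one-column data.table); the data is hypothetical.

```r
library(data.table)

# Hypothetical two-column table, invented for this illustration.
DF <- data.frame(a = 1:3, b = 4:6)
DT <- as.data.table(DF)

df_vec <- DF[, "a"]                # integer vector: the data.frame was dropped
df_one <- DF[, "a", drop = FALSE]  # one-column data.frame
dt_one <- DT[, "a"]                # one-column data.table, no drop argument
dt_vec <- DT[["a"]]                # ask for the vector explicitly when needed

print(class(df_vec)); print(class(dt_one))
```

Keeping the return type independent of the number of selected columns is what removes the edge case: code that selects columns programmatically gets a table back whether one column or several are chosen.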
When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
Small caveat
There will possibly be cases where some packages use code that falls down when given a data.table. However, given that data.table is constantly being maintained to avoid such problems, any problems that may arise will be fixed promptly.
For example, see this question and prompt response.
From the NEWS for v 1.8.2
- base::unname(DT) now works again, as needed by plyr::melt(). Thanks to Christoph Jaeckel for reporting. Test added.
- An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2 without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added. ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2 doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.
order() in data.frame and data.table
When used inside of a data.table operation, order(..) uses data.table:::forder. According to the Introduction to data.table vignette:
order() is internally optimised
We can use "-" on character columns within the frame of a data.table to sort in decreasing order.
In addition, order(...) within the frame of a data.table uses data.table's internal fast radix order forder(). This sort provided such a compelling improvement over R's base::order that the R project adopted the data.table algorithm as its default sort in 2016 for R 3.3.0, see ?sort and the R Release NEWS.
The key to the difference is that it uses a "fast radix order". base::order, by contrast, has a method= argument, which its help page describes as follows:
method: the method to be used: partial matches are allowed. The
default ('"auto"') implies '"radix"' for short numeric
vectors, integer vectors, logical vectors and factors.
Otherwise, it implies '"shell"'. For details of methods
'"shell"', '"quick"', and '"radix"', see the help for 'sort'.
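Since the second column of your data.table is a character vector, not one of numeric, integer, logical, or factor, base::order uses the "shell" method for sorting. That method sorts character vectors by the current locale's collation rather than in the C locale, which produces different results.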
However, if we force base::order to use method="radix", we get the same result.
order(A$two)
# [1] 1 2 3
order(A$two, method="radix")
# [1] 2 1 3
A[order(A$one, A$two, method = "radix"),]
# one two
# 2 k 31_60
# 1 k 3_28
# 3 k 48_68
You can get the locale-based ordering inside a data.table by calling base::order explicitly:
B[base::order(B$one,B$two),]
# one two
# <char> <char>
# 1: k 3_28
# 2: k 31_60
# 3: k 48_68
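The snippets above do not show how A and B were built. The following self-contained sketch reconstructs them as an assumption from the printed output (a data.frame and a data.table with columns one and two) and compares the radix ordering with the ordering data.table uses internally.

```r
library(data.table)

# Assumed reconstruction of the data, based on the printed output above.
A <- data.frame(one = "k", two = c("3_28", "31_60", "48_68"),
                stringsAsFactors = FALSE)
B <- as.data.table(A)

# Radix sorting compares strings as in the C locale, where "1" sorts
# before "_", so "31_60" comes before "3_28".
radix_idx <- order(A$two, method = "radix")

# Inside [.data.table, order() is optimised to forder(), which also
# sorts character columns in C-locale order.
dt_sorted <- B[order(one, two)]

# Without method=, base::order follows the session locale for character
# vectors, so this result may differ from one machine to another.
locale_idx <- base::order(A$two)

print(radix_idx)
print(dt_sorted)
```

No expected value is given for locale_idx because it depends on the collation rules of the locale R is running in.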
transform data.table in r
For a data.table solution using melt then tstrsplit:
setDT(data1)
melt(data1, value.name = "event")[, c("x", "y") := tstrsplit(variable, "_")][, .(x, y, event)]
x y event
1: long customers TRUE
2: long customers FALSE
3: long customers FALSE
4: long customers TRUE
5: long partners FALSE
6: long partners TRUE
7: long partners FALSE
8: long partners FALSE
9: short customers FALSE
10: short customers TRUE
11: short customers TRUE
12: short customers FALSE
13: short partners FALSE
14: short partners FALSE
15: short partners NA
16: short partners NA
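The input data1 is not shown in the answer. As a minimal sketch, assume it is a wide table with one logical column per "x_y" combination; the column names and values below are hypothetical and much smaller than the original. measure.vars and variable.factor are set explicitly here to avoid relying on melt's guessing of id and measure columns.

```r
library(data.table)

# Hypothetical wide input: one logical column per "<x>_<y>" pair.
data1 <- data.table(
  long_customers = c(TRUE, FALSE),
  short_partners = c(FALSE, NA)
)

# Stack every column into (variable, event) pairs.
long <- melt(data1, measure.vars = names(data1),
             value.name = "event", variable.factor = FALSE)

# Split "long_customers" into x = "long" and y = "customers".
long[, c("x", "y") := tstrsplit(variable, "_", fixed = TRUE)]
result <- long[, .(x, y, event)]

print(result)
```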
Intersect 2 columns in data.table R
I used this command, and it seems to work:
mapply(function(x, y) paste0(intersect(x, y), collapse = " "), strsplit(data$col1, '\\s'), strsplit(data$col2, '\\s'))
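To make the command reproducible, here is a small sketch with hypothetical input. It assumes col1 and col2 hold whitespace-separated words, which is what the strsplit() calls imply.

```r
# Hypothetical data: two columns of space-separated words.
data <- data.frame(
  col1 = c("a b c", "x y"),
  col2 = c("b c d", "z"),
  stringsAsFactors = FALSE
)

# For each row, keep the words present in both columns and re-join them.
common <- mapply(
  function(x, y) paste0(intersect(x, y), collapse = " "),
  strsplit(data$col1, "\\s"),
  strsplit(data$col2, "\\s")
)

print(common)
```

Rows with no shared words come back as an empty string, because paste0() with collapse on a zero-length vector returns "".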