how to suppress output when using `:=` in R {data.table}, prior to v1.8.3?
Since `<-.data.table` doesn't make a copy, you can use `<-`:
Create a data.table object:
library(data.table)
di <- data.table(iris)
Create a new column:
di <- di[, z:=1:nrow(di)]
di
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species z
# [1,] 5.1 3.5 1.4 0.2 setosa 1
# [2,] 4.9 3.0 1.4 0.2 setosa 2
# [3,] 4.7 3.2 1.3 0.2 setosa 3
# [4,] 4.6 3.1 1.5 0.2 setosa 4
# [5,] 5.0 3.6 1.4 0.2 setosa 5
# [6,] 5.4 3.9 1.7 0.4 setosa 6
# [7,] 4.6 3.4 1.4 0.3 setosa 7
# [8,] 5.0 3.4 1.5 0.2 setosa 8
# [9,] 4.4 2.9 1.4 0.2 setosa 9
# [10,] 4.9 3.1 1.5 0.1 setosa 10
# First 10 rows of 150 printed.
It is also worth remembering that R only prints the value of an object in interactive mode.
So, in batch mode, you can simply use:
di[, z:=1:nrow(di)]
This will not produce any output when run as a script in batch mode.
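For an interactive session on a version before v1.8.3, one workaround is to wrap the query in `invisible()`. This is only a minimal sketch: it assumes the data.table package is installed, and it uses `seq_len()` in place of `1:nrow(di)`.

```r
library(data.table)

di <- data.table(iris)

# The outermost call is invisible(), so the REPL has nothing to auto-print;
# the column is still added by reference.
invisible(di[, z := seq_len(nrow(di))])

"z" %in% names(di)
```

The final line only confirms that the update happened even though nothing was printed by the assignment itself.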
Further info from Matthew Dowle:
Also see FAQ 2.21 and 2.22 :
2.21 Why does `DT[i,col:=value]` return the whole of `DT`? I expected either no visible value (consistent with `<-`), or a message or return value containing how many rows were updated. It isn't obvious that the data has indeed been updated by reference.
So that compound syntax can work; e.g., `DT[i,done:=TRUE][,sum(done)]`. The number of rows updated is returned when verbosity is on, either on a per query basis or globally using `options(datatable.verbose=TRUE)`.
2.22 Ok, but can't the return value of `DT[i,col:=value]` be returned invisibly, then?
- We tried to but R internally forces visibility on for `[`. The value of FunTab's eval column (see src/main/names.c) for `[` is `0` meaning force `R_Visible` on (see R-Internals section 1.6). Therefore, when we tried `invisible()` or setting `R_Visible` to `0` directly ourselves, `eval` in src/main/eval.c would force it on again.
- After getting used to this behaviour, you might grow to prefer it (we have). After all, how many times do we subassign using `<-` and then immediately look at the data to check it's ok?
- We can mix `:=` into a `j` which also returns data; a mixed update and select in one query. To detect whether `j` solely updates (and then behave differently) could be confusing.
Second update from Matthew Dowle:
We have now found a solution and v1.8.3 no longer prints the result when `:=` is used. We will update FAQ 2.21 and 2.22.
knitr gets tricked by data.table `:=` assignment
Update Oct 2014. Now in data.table v1.9.5:
`:=` no longer prints in knitr for consistency with behaviour at the prompt, #505. Output of a test `knit("knitr.Rmd")` is now in data.table's unit tests.
and related:
`if (TRUE) DT[,LHS:=RHS]` now doesn't print (thanks to Jureiss, #869). Test added. To get this to work we've had to live with one downside: if a `:=` is used inside a function with no `DT[]` before the end of the function, then the next time `DT` is typed at the prompt, nothing will be printed. A repeated `DT` will print. To avoid this: include a `DT[]` after the last `:=` in your function. If that is not possible (e.g., it's not a function you can change) then `print(DT)` and `DT[]` at the prompt are guaranteed to print. As before, adding an extra `[]` on the end of a `:=` query is a recommended idiom to update and then print; e.g. `> DT[,foo:=3L][]`
Previous answer kept for posterity (the `.global$depthtrigger` business is no longer done as from data.table v1.9.5 so this is no longer true) ...
Just to be clear I understand then: `knitr` is printing when you don't want it to.
Try increasing `data.table:::.global$depthtrigger` a little bit at the start of the script.
This will be 3 for you currently:
data.table:::.global$depthtrigger
[1] 3
I don't know how much eval depth `knitr` adds to the stack. But try changing the trigger to 4 first; i.e.
assign("depthtrigger", 4, data.table:::.global)
and at the end of the `knitr` script be sure to set it back to 3. If 4 doesn't work, try 5, then 6. If you get to 10 give up and I'll think again. ;-P
Why might this work?
See NEWS from v1.8.4:
`DT[,LHS:=RHS,...]` no longer prints `DT`. This implements #2128 "Try again to get `DT[i,j:=value]` to return invisibly". Thanks to discussions here:
how to suppress output when using `:=` in R {data.table}, prior to v1.8.3?
http://r.789695.n4.nabble.com/Avoiding-print-when-using-tp4643076.html
FAQs 2.21 and 2.22 have been updated.
FAQ 2.21 Why does `DT[i,col:=value]` return the whole of `DT`? I expected either no visible value (consistent with `<-`), or a message or return value containing how many rows were updated. It isn't obvious that the data has indeed been updated by reference.
This has changed in v1.8.3 to meet your expectations. Please upgrade. The whole of `DT` is returned (now invisibly) so that compound syntax can work; e.g., `DT[i,done:=TRUE][,sum(done)]`. The number of rows updated is returned when verbosity is on, either on a per query basis or globally using `options(datatable.verbose=TRUE)`.
FAQ 2.22 Ok, thanks. What was so difficult about the result of `DT[i,col:=value]` being returned invisibly?
R internally forces
visibility on for [. The value of FunTab's eval column (see
src/main/names.c) for [ is 0 meaning force R_Visible on (see
R-Internals section 1.6). Therefore, when we tried invisible() or
setting R_Visible to 0 directly ourselves, eval in src/main/eval.c
would force it on again. To solve this problem, the key was to stop
trying to stop the print method running after a :=. Instead, inside :=
we now (from v1.8.3) set a global flag which the print method uses to
know whether to actually print or not.
That global flag is `data.table:::.global$print`. At the top of `data.table:::print.data.table` you'll see it looking at it. That's because there is no known way to suppress printing from `[` (as FAQ 2.22 explains).
So, inside `:=` inside `[.data.table` it looks to see how "deep" this call is:
if (Cstack_info()[["eval_depth"]] <= .global$depthtrigger) {
suppPrint = function(x) { .global$print=FALSE; x }
# Suppress print when returns ok not on error, bug #2376.
# Thanks to: https://stackoverflow.com/a/13606880/403310
# All appropriate returns following this point are
# wrapped i.e. return(suppPrint(x)).
}
Essentially that's just saying: if `DT[,x:=y]` is running at the prompt, then I know the REPL is going to call the `print` method on my result, beyond my control. Ok, so given the `print` method is going to run, I'm going to suppress it inside that `print` method by setting a flag (since the `print` method that runs, i.e. `print.data.table`, is something I can control).
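The flag idea can be illustrated with a toy S3 class in base R. This is a sketch of the general technique only, not data.table's actual code: the names `.flag`, `new_toy` and `update_toy` are all hypothetical.

```r
# Toy illustration of "let print run, but make it a no-op once".
.flag <- new.env()
.flag$print <- TRUE

new_toy <- function(x) structure(list(x = x), class = "toy")

# Stand-in for `:=`: does its work, then asks the next print to stay silent.
update_toy <- function(obj, value) {
  obj$x <- value
  .flag$print <- FALSE
  obj
}

# The print method is the one place we control, so the check lives here.
print.toy <- function(x, ...) {
  if (!.flag$print) {
    .flag$print <- TRUE   # reset so that the following print works again
    return(invisible(x))
  }
  cat("<toy> x =", x$x, "\n")
  invisible(x)
}

obj <- new_toy(1)
silent <- capture.output(print(update_toy(obj, 2)))  # print runs, emits nothing
shown  <- capture.output(print(obj))                 # flag was reset, so this prints
```

The same sketch shows where a swallowed print can come from: if the flag is set somewhere that is not followed by a print call, the next unrelated print consumes it.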
In `knitr`'s case it's simulating the REPL in a clever way. It isn't really a script, iiuc, otherwise `DT[,x:=y]` wouldn't print anyway for that reason. But because it's simulating the REPL via an `eval` there is an extra level of `eval` depth for code run from `knitr`. Or something similar (I don't know `knitr`).
Which is why I'm thinking increasing the `depthtrigger` might do the trick.
Hacky/crufty, I agree. But if it works, and you let me know which value works, I can change `data.table` to be `knitr` aware and change the `depthtrigger` automatically. Or any better solutions are most welcome.
Writing functions (procedures) for data.table objects
Yes, the addition, modification, deletion of columns in `data.table`s is done by reference. In a sense, it is a good thing because a `data.table` usually holds a lot of data, and it would be very memory and time consuming to reassign it all every time a change to it is made. On the other hand, it is a bad thing because it goes against the no-side-effect functional programming approach that R tries to promote by using pass-by-value by default. With no-side-effect programming, there is little to worry about when you call a function: you can rest assured that your inputs or your environment won't be affected, and you can just focus on the function's output. It's simple, hence comfortable.
Of course it is ok to disregard John Chambers's advice if you know what you are doing. About writing "good" data.tables procedures, here are a couple rules I would consider if I were you, as a way to limit complexity and the number of side-effects:
- a function should not modify more than one table, i.e., modifying that table should be the only side-effect,
- if a function modifies a table, then make that table the output of the function. Of course, you won't want to re-assign it: just run `do.something.to(table)` and not `table <- do.something.to(table)`. If instead the function had another ("real") output, then when calling `result <- do.something.to(table)`, it is easy to imagine how you may focus your attention on the output and forget that calling the function had a side effect on your table.
While "one output / no-side-effect" functions are the norm in R, the above rules allow for "one output or side-effect". If you agree that a side-effect is somehow a form of output, then you'll agree I am not bending the rules too much by loosely sticking to R's one-output functional programming style. Allowing functions to have multiple side-effects would be a little more of a stretch; not that you can't do it, but I would try to avoid it if possible.
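A minimal sketch of a procedure that follows both rules might look like this. The function `add_total` and the columns `x`, `y` and `total` are invented for illustration, and returning the table invisibly is one possible choice, not something the rules require.

```r
library(data.table)

# Hypothetical procedure: modifies exactly one table, by reference,
# and returns that same table so the side effect is also the output.
add_total <- function(dt) {
  stopifnot(is.data.table(dt))
  dt[, total := x + y]
  invisible(dt)
}

sales <- data.table(x = 1:3, y = c(10L, 20L, 30L))
add_total(sales)   # no re-assignment needed; `sales` now has a `total` column
sales[]
```

Because the table itself is the return value, the call can still be chained, e.g. `add_total(sales)[, sum(total)]`.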
understanding the reference properties of data.table in R
When you create `newDT` in the second example, you are evaluating `i` (not `j`). `:=` assigns by reference within the `j` argument. There is no equivalent in the `i` argument, as the self reference over-allocates the columns, but not the rows.
A `data.table` is a list. Its length equals the number of columns, but it is over-allocated so you can add more columns without copying the entire table (e.g. using `:=` in `j`).
If we inspect the data.table, we can see the `truelength` (`tl = 100`), that is, the number of column pointer slots:
.Internal(inspect(DT))
@1427d6c8 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=1, tl=100)
@b249a30 13 INTSXP g0c4 [NAM(2)] (len=10, tl=0) 1,2,3,4,5,...
Within the data.table each element has length 10 and `tl=0`. Currently there is no method to increase the `truelength` of the columns to allow appending extra rows by reference.
From ?truelength
Currently, it's just the list vector of column pointers that is
over-allocated (i.e. truelength(DT)), not the column vectors
themselves, which would in future allow fast row insert()
When you evaluate `i`, `data.table` doesn't check whether you have simply returned all rows in the same order as in the original (and skip the copy only in that case); it simply returns a copy.
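The difference between updating in `j` and subsetting in `i` can be checked with a small experiment. This is a sketch that assumes the data.table package is installed; the table and column names are illustrative.

```r
library(data.table)

DT <- data.table(a = 1:10)
length(DT)        # number of columns in use
truelength(DT)    # number of allocated column pointer slots

addr_before <- address(DT)
DT[, b := a * 2L]                        # add a column in j, by reference
same_object <- identical(addr_before, address(DT))

newDT <- DT[a > 0]                       # i keeps every row, yet returns a copy
newDT[, c := 1L]                         # modifies the copy only
copy_is_separate <- !("c" %in% names(DT))

same_object
copy_is_separate
```

If `newDT` should share nothing with `DT` regardless of how it was created, `copy(DT)` makes that intent explicit.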
Get a number after a word pattern in R
We could use `sub` to capture the digits (`\\d+`) that follow 'rating', allowing for characters such as `:` or spaces in between, and convert to numeric with `as.numeric`:
library(data.table)
y[, num := as.numeric(sub(".*rating[^0-9]*(\\d+)\\b.*", "\\1",
status, ignore.case = TRUE))]
y
# status num
#1: client rating 01 approved 1
#2: John Rating: 2 reproved 2
#3: Customer rating9 9
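The same pattern can be tried in base R on a plain character vector. The sketch below wraps it in a hypothetical helper, `extract_rating`, and adds a guard for strings with no rating. That guard is an addition for illustration: without it `sub` returns the input unchanged and `as.numeric` yields `NA` with a warning.

```r
# Hypothetical helper around the sub() pattern shown above.
extract_rating <- function(status) {
  pattern <- ".*rating[^0-9]*(\\d+)\\b.*"
  hit <- grepl(pattern, status, ignore.case = TRUE)
  out <- rep(NA_real_, length(status))
  out[hit] <- as.numeric(sub(pattern, "\\1", status[hit], ignore.case = TRUE))
  out
}

status <- c("client rating 01 approved",
            "John Rating: 2 reproved",
            "Customer rating9",
            "no score given")
nums <- extract_rating(status)
nums
```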
Assign multiple columns using := in data.table, by group
This now works in v1.8.3 on R-Forge. Thanks for highlighting it!
x <- data.table(a = 1:3, b = 1:6)
f <- function(x) {list("hi", "hello")}
x[ , c("col1", "col2") := f(), by = a][]
# a b col1 col2
# 1: 1 1 hi hello
# 2: 2 2 hi hello
# 3: 3 3 hi hello
# 4: 1 4 hi hello
# 5: 2 5 hi hello
# 6: 3 6 hi hello
x[ , c("mean", "sum") := list(mean(b), sum(b)), by = a][]
# a b col1 col2 mean sum
# 1: 1 1 hi hello 2.5 5
# 2: 2 2 hi hello 3.5 7
# 3: 3 3 hi hello 4.5 9
# 4: 1 4 hi hello 2.5 5
# 5: 2 5 hi hello 3.5 7
# 6: 3 6 hi hello 4.5 9
mynames = c("Name1", "Longer%")
x[ , (mynames) := list(mean(b) * 4, sum(b) * 3), by = a][]
# a b col1 col2 mean sum Name1 Longer%
# 1: 1 1 hi hello 2.5 5 10 15
# 2: 2 2 hi hello 3.5 7 14 21
# 3: 3 3 hi hello 4.5 9 18 27
# 4: 1 4 hi hello 2.5 5 10 15
# 5: 2 5 hi hello 3.5 7 14 21
# 6: 3 6 hi hello 4.5 9 18 27
x[ , get("mynames") := list(mean(b) * 4, sum(b) * 3), by = a][] # same
# a b col1 col2 mean sum Name1 Longer%
# 1: 1 1 hi hello 2.5 5 10 15
# 2: 2 2 hi hello 3.5 7 14 21
# 3: 3 3 hi hello 4.5 9 18 27
# 4: 1 4 hi hello 2.5 5 10 15
# 5: 2 5 hi hello 3.5 7 14 21
# 6: 3 6 hi hello 4.5 9 18 27
x[ , eval(mynames) := list(mean(b) * 4, sum(b) * 3), by = a][] # same
# a b col1 col2 mean sum Name1 Longer%
# 1: 1 1 hi hello 2.5 5 10 15
# 2: 2 2 hi hello 3.5 7 14 21
# 3: 3 3 hi hello 4.5 9 18 27
# 4: 1 4 hi hello 2.5 5 10 15
# 5: 2 5 hi hello 3.5 7 14 21
# 6: 3 6 hi hello 4.5 9 18 27
Older version using the `with` argument (we discourage this argument when possible):
x[ , mynames := list(mean(b) * 4, sum(b) * 3), by = a, with = FALSE][] # same
# a b col1 col2 mean sum Name1 Longer%
# 1: 1 1 hi hello 2.5 5 10 15
# 2: 2 2 hi hello 3.5 7 14 21
# 3: 3 3 hi hello 4.5 9 18 27
# 4: 1 4 hi hello 2.5 5 10 15
# 5: 2 5 hi hello 3.5 7 14 21
# 6: 3 6 hi hello 4.5 9 18 27
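Another way to assign several columns by group is the functional form of `:=`, which pairs each new column name with its value. A minimal sketch, with illustrative column names `grp_mean` and `grp_sum` and with the grouping column written out explicitly using `rep()`:

```r
library(data.table)

x <- data.table(a = rep(1:3, 2), b = 1:6)

# Functional form: `:=`(name1 = value1, name2 = value2), evaluated per group.
x[, `:=`(grp_mean = mean(b), grp_sum = sum(b)), by = a][]
```

This form keeps each name next to its expression, which can be easier to read than two parallel vectors when there are many columns.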