Using setDT inside a function
Great question! The warning message should say: ... and fixed by taking a shallow copy of the whole table .... Will fix this.
setDT does two things:
- sets the class to data.table from data.frame/list
- uses alloc.col to over-allocate columns (so that := can be used directly)
The second step requires a shallow copy if the input is not already a data.table. This is why we assign the value back to the symbol in its environment (setDT's parent frame). But the parent frame for setDT is your function f(). Therefore the setDT(df) within your function has gone through smoothly, but the df that resides in the global environment will only have its class changed, not the over-allocation (as the shallow copy severed the link).
In the next step, := detects that and shallow copies once again to over-allocate.
The idea so far is to use setDT to convert to data.tables before providing them to a function. But I'd like these cases to be resolved (will take a look).
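As a minimal sketch of that advice (the names df and add_flag are made up for illustration), the conversion happens in the calling frame first, so the over-allocation is attached to the caller's symbol and := inside the function should not need another shallow copy:

```r
library(data.table)

# Hypothetical example data; any data.frame would do
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Convert in the frame that owns the symbol, before calling the function
setDT(df)

# Hypothetical helper that adds a column by reference
add_flag <- function(dt) {
  dt[, flag := x > 1L]
  invisible(dt)
}

add_flag(df)

# df was over-allocated by setDT, so it has spare column slots
truelength(df) > length(df)
names(df)
```

Because the table is modified by reference, the new column is visible in df after the call without reassigning the result.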
Thanks a bunch!
When should I use setDT() instead of data.table() to create a data.table?
Update:
@Roland makes some good points in the comments section, and the post is better for them. While I originally focused on memory overflow issues, he pointed out that even if this doesn't happen, memory management of various copies takes substantial time, which is a more common everyday concern. Examples of both issues have now been added as well.
I like this question because I think it is really about avoiding running out of memory in R when dealing with larger data sets. Those who are unfamiliar with the data.table family of set operations may benefit from this discussion!
One should use setDT() when working with larger data sets that take up a considerable amount of RAM, because the operation modifies the object in place, conserving memory. For data that is a very small percentage of RAM, the copy made by data.table() is fine.
The creation of the setDT function was actually inspired by the following thread on Stack Overflow, which is about working with a large data set (several GB). You will see Matt Dowle chime in and suggest the 'setDT' name.
Convert a data frame to a data.table without copy
A bit more depth:
With R, data is stored in memory. This speeds things up considerably because RAM is much faster to access than storage devices. However, a problem can arise when one's data set is a large portion of RAM. Why? Because base R has a tendency to make copies of a data.frame when some operations are applied to it. This has improved after version 3.1, but addressing that is beyond the scope of this post. If one is pulling multiple data.frames or lists into one data.frame or data.table, memory usage will expand rather quickly because at some point during the operation, multiple copies of the data exist in RAM. If the data set is big enough, you may run out of memory while the copies are being produced, and the allocation will fail. See the example of this below. We get an error, and the original memory address and class of the object do not change.
> N <- 1e8
> P <- 1e2
> data <- as.data.frame(rep(data.frame(rnorm(N)), P))
>
> pryr::object_size(data)
800 MB
>
> tracemem(data)
[1] "<0000000006D2DF18>"
>
> data <- data.table(data)
Error: cannot allocate vector of size 762.9 Mb
>
> tracemem(data)
[1] "<0000000006D2DF18>"
> class(data)
[1] "data.frame"
>
The ability to just modify the object in place without copying is a big deal. That is what setDT does when it takes a list or data.frame and returns a data.table. The same example as above, using setDT, now works fine and without error. Both class and memory address change, and no copies take place.
> tracemem(data)
[1] "<0000000006D2DF18>"
> class(data)
[1] "data.frame"
>
> setDT(data)
>
> tracemem(data)
[1] "<0000000006A8C758>"
> class(data)
[1] "data.table" "data.frame"
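Another way to check the same thing on a small scale is to compare the address of a column vector before and after each conversion. This is only a sketch on made-up data (df, addr_before and dt_copy are illustrative names); it uses address() from data.table and assumes the column is an ordinary numeric vector:

```r
library(data.table)

# Small illustrative data.frame
df <- data.frame(x = rnorm(10), y = rnorm(10))

# Remember where the column vector x currently lives
addr_before <- address(df$x)

# data.table() builds a new table and copies the columns
dt_copy <- data.table(df)
identical(address(dt_copy$x), addr_before)

# setDT() converts in place and keeps the existing column vectors
setDT(df)
identical(address(df$x), addr_before)
```

The first comparison should report that the column was copied, while the second should report that setDT left the column where it was.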
@Roland points out that for most people, the bigger concern is speed, which suffers as a side effect of such intensive memory management. Here is an example with smaller data that does not run out of memory, and illustrates how much faster setDT is for this job. Notice the output of 'tracemem' after data <- data.table(data), which reports copies of data being made. Contrast that with setDT(data), which doesn't print a single copy. We then have to call tracemem(data) to see the new memory address.
> N <- 1e5
> P <- 1e2
> data <- as.data.frame(rep(data.frame(rnorm(N)), P))
> pryr::object_size(data)
808 kB
> # data.table method
> tracemem(data)
[1] "<0000000019098438>"
> data <- data.table(data)
tracemem[0x0000000019098438 -> 0x0000000007aad7d8]: data.table
tracemem[0x0000000007aad7d8 -> 0x0000000007c518b8]: copy as.data.table.data.frame as.data.table data.table
tracemem[0x0000000007aad7d8 -> 0x0000000018e454c8]: as.list.data.frame as.list vapply copy as.data.table.data.frame as.data.table data.table
> class(data)
[1] "data.table" "data.frame"
>
> # setDT method
> # back to data.frame
> data <- as.data.frame(data)
> class(data)
[1] "data.frame"
> tracemem(data)
[1] "<00000000125BE1A0>"
> setDT(data)
> tracemem(data)
[1] "<00000000125C2840>"
> class(data)
[1] "data.table" "data.frame"
>
How does this impact timing? As we can see, setDT is much faster.
> # timing example
> data <- as.data.frame(rep(data.frame(rnorm(N)), P))
> microbenchmark(setDT(data), data <- data.table(data))
Unit: microseconds
                     expr       min         lq        mean    median        uq        max neval
              setDT(data)    49.948    55.7635    69.66017    73.553    79.198    100.238   100
 data <- data.table(data) 54594.289 61238.8830 81545.64432 64179.131 68647.917 611632.427   100
Set functions can be used in many areas, not just when converting objects to data.tables. You can find more information on reference semantics and how to apply them elsewhere by opening the vignette on the subject.
library(data.table)
vignette("datatable-reference-semantics")
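As a small illustration of other set functions, here is a sketch on a made-up table (dt and its columns are hypothetical). Each call below modifies dt by reference, so no assignment back to dt is needed:

```r
library(data.table)

# Hypothetical toy table
dt <- data.table(id = 1:3, value = c(10, 20, 30))

# Rename a column by reference
setnames(dt, "value", "amount")

# Reorder columns by reference
setcolorder(dt, c("amount", "id"))

# Overwrite a single cell by reference
set(dt, i = 2L, j = "amount", value = 99)

dt
```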
This is a great question, and those thinking of using R with larger data sets, or who just want to speed up data manipulation activities, can benefit from being familiar with the significant performance improvements of data.table reference semantics.
Error in setDT(dat): while creating a function
You can use the get function to fix your problem, as shown below. The problem is that you are passing the var1 and var2 parameters as strings, which are not translated into column references inside your function. You may use parse with eval (non-standard evaluation) to fix this, or you can use get.
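library(data.table)  # dcast, rbindlist
library(dplyr)       # %>%, mutate

Single_chile <- function(data, var1, var2) {
  # Cross-tabulate counts; get() resolves the column names passed as strings
  Tab <- dcast(data, get(var1) ~ get(var2), fun.aggregate = length)
  # Row totals across all count columns
  Tab1 <- Tab %>% mutate("Todo el Mercado" = rowSums(Tab[, 2:ncol(Tab)]))
  # Footer row holding the number of responses per column
  ALL <- as.list(c(var1 = "Número_de_Respuestas", colSums(Tab1[, 2:ncol(Tab1)])))
  # Convert counts to column proportions, then to percentage strings
  Tab1[, 2:ncol(Tab1)] <- sapply(Tab1[, 2:ncol(Tab1)], prop.table)
  Tab1[, 2:ncol(Tab1)] <- sapply(Tab1[, 2:ncol(Tab1)], function(x) paste0(round(x * 100, 0), "%"))
  Tab2 <- rbindlist(l = list(Tab1, ALL))
  Tab2
}
Single_chile(test, "Q27", "Q12_1_TEXT")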
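A different way to pass column names as strings, shown here only as a sketch and not as the method used above, is to build the formula from the strings with as.formula. The table toy, its columns, and the helper count_by are made up for illustration:

```r
library(data.table)

# Hypothetical toy data standing in for the survey table
toy <- data.table(
  Q27 = c("a", "a", "b"),
  Q12 = c("x", "y", "x")
)

# Hypothetical helper: cross-tabulate two columns given as strings
count_by <- function(data, var1, var2) {
  f <- as.formula(paste(var1, "~", var2))
  # value.var is set explicitly so dcast does not have to guess it
  dcast(data, f, fun.aggregate = length, value.var = var2)
}

res <- count_by(toy, "Q27", "Q12")
res
```

One practical difference is that the resulting columns keep the original column names, which may make later steps easier to read.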
I hope this solves your problem.
Thanks
lapply data.table setDT in nested lists does not work or is not idempotent?
I struggled a bit getting lapply to work too. I could get it to turn the data frames into data tables, but it refused to keep the row names.
I found a simple double loop works. It's probably making copies of the data frames before overwriting them, so I don't know if this will be fast enough for your needs. It seems to take about 6 milliseconds on your data using my machine.
for(i in 1:3)
for(j in 1:2)
top_list[[i]][[j]] <- as.data.table(top_list[[i]][[j]], keep.rownames = "Sample")
This gives
top_list
#> $`aa`
#> $`aa`$`train_set`
#> Sample x y z
#> 1: Observation_1 2 0 Factor1
#> 2: Observation_2 2 1 Factor1
#> 3: Observation_3 2 2 Factor1
#> 4: Observation_4 2 3 Factor1
#> 5: Observation_5 2 4 Factor1
#> 6: Observation_6 2 5 Factor1
#> 7: Observation_7 2 6 Factor1
#> 8: Observation_8 2 7 Factor1
#> 9: Observation_9 2 8 Factor1
#> 10: Observation_10 2 9 Factor1
#>
#> $`aa`$test_set
#> Sample x y z
#> 1: Observation_1 1 0 Factor2
#> 2: Observation_2 1 1 Factor2
#> 3: Observation_3 1 2 Factor2
#> 4: Observation_4 1 3 Factor2
#> 5: Observation_5 1 4 Factor2
#> 6: Observation_6 1 5 Factor2
#> 7: Observation_7 1 6 Factor2
#> 8: Observation_8 1 7 Factor2
#> 9: Observation_9 1 8 Factor2
#> 10: Observation_10 1 9 Factor2
#> 11: Observation_11 1 10 Factor2
#> 12: Observation_12 1 11 Factor2
#>
#>
#> $bb
#> $bb$`train_set`
#> Sample x y z
#> 1: Observation_1 2 0 Factor1
#> 2: Observation_2 2 1 Factor1
#> 3: Observation_3 2 2 Factor1
#> 4: Observation_4 2 3 Factor1
#> 5: Observation_5 2 4 Factor1
#> 6: Observation_6 2 5 Factor1
#> 7: Observation_7 2 6 Factor1
#> 8: Observation_8 2 7 Factor1
#> 9: Observation_9 2 8 Factor1
#> 10: Observation_10 2 9 Factor1
#>
#> $bb$test_set
#> Sample x y z
#> 1: Observation_1 1 0 Factor2
#> 2: Observation_2 1 1 Factor2
#> 3: Observation_3 1 2 Factor2
#> 4: Observation_4 1 3 Factor2
#> 5: Observation_5 1 4 Factor2
#> 6: Observation_6 1 5 Factor2
#> 7: Observation_7 1 6 Factor2
#> 8: Observation_8 1 7 Factor2
#> 9: Observation_9 1 8 Factor2
#> 10: Observation_10 1 9 Factor2
#> 11: Observation_11 1 10 Factor2
#> 12: Observation_12 1 11 Factor2
#>
#>
#> $cc
#> $cc$`train_set`
#> Sample x y z
#> 1: Observation_1 2 0 Factor1
#> 2: Observation_2 2 1 Factor1
#> 3: Observation_3 2 2 Factor1
#> 4: Observation_4 2 3 Factor1
#> 5: Observation_5 2 4 Factor1
#> 6: Observation_6 2 5 Factor1
#> 7: Observation_7 2 6 Factor1
#> 8: Observation_8 2 7 Factor1
#> 9: Observation_9 2 8 Factor1
#> 10: Observation_10 2 9 Factor1
#>
#> $cc$test_set
#> Sample x y z
#> 1: Observation_1 1 0 Factor2
#> 2: Observation_2 1 1 Factor2
#> 3: Observation_3 1 2 Factor2
#> 4: Observation_4 1 3 Factor2
#> 5: Observation_5 1 4 Factor2
#> 6: Observation_6 1 5 Factor2
#> 7: Observation_7 1 6 Factor2
#> 8: Observation_8 1 7 Factor2
#> 9: Observation_9 1 8 Factor2
#> 10: Observation_10 1 9 Factor2
#> 11: Observation_11 1 10 Factor2
#> 12: Observation_12 1 11 Factor2
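For comparison, a nested lapply can do the same conversion. This is only a sketch: the top_list built here is a small hypothetical stand-in for the list in the question, and like the loop it makes copies rather than converting in place.

```r
library(data.table)

# Hypothetical stand-in for the nested list from the question
make_df <- function(n, x, z) {
  data.frame(
    x = rep(x, n),
    y = seq_len(n) - 1L,
    z = rep(z, n),
    row.names = paste0("Observation_", seq_len(n))
  )
}
inner <- list(train_set = make_df(3, 2, "Factor1"),
              test_set  = make_df(4, 1, "Factor2"))
top_list <- list(aa = inner, bb = inner, cc = inner)

# Convert every data.frame, keeping row names in a column called Sample
top_list <- lapply(top_list, function(sets) {
  lapply(sets, as.data.table, keep.rownames = "Sample")
})

top_list$aa$train_set
```

The result has to be assigned back to top_list, since lapply returns a new list rather than modifying the original.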
Can we setDT on multiple object all at once?
You can Filter the data.frames in an environment and apply setDT to those:
all_data_tables = Filter(function(x) is.data.frame(eval(as.name(x))), ls())
lapply(all_data_tables, function(x) setDT(eval(as.name(x))))
You can also potentially replace is.data.frame with is.list or something more complicated, but I think is.data.frame covers your use case.
You can also use get, and you could be more careful about specifying envir in ls/eval/get.
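A sketch of the more careful variant with get and an explicit envir might look like the following. The helper name set_all_dt and the environment e are made up for illustration; each object is assigned back after conversion so that the over-allocated table is what ends up bound to the name:

```r
library(data.table)

# Hypothetical helper: convert every data.frame in an environment
set_all_dt <- function(envir = parent.frame()) {
  nms <- ls(envir)
  is_df <- vapply(
    nms,
    function(nm) is.data.frame(get(nm, envir = envir)),
    logical(1)
  )
  for (nm in nms[is_df]) {
    obj <- get(nm, envir = envir)
    setDT(obj)                       # converts the local binding
    assign(nm, obj, envir = envir)   # put the converted table back
  }
  invisible(nms[is_df])
}

# Usage on a throwaway environment
e <- new.env()
e$a <- data.frame(x = 1:2)
e$b <- 1:3
set_all_dt(e)
class(e$a)
```

Non-data.frame objects such as e$b are left untouched.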