Using setDT inside a function

Great question! The warning message should say: ... and fixed by taking a shallow copy of the whole table .... Will fix this.

setDT does two things:

  • set the class to data.table from data.frame/list
  • use alloc.col to over-allocate columns (so that := can be used directly)

The second step requires a shallow copy if the input is not already a data.table. This is why we assign the value back to the symbol in its environment (setDT's parent frame). But the parent frame for setDT is your function f(). Therefore the setDT(df) within your function has gone through smoothly, but the df that resides in the global environment will only have its class changed, not the over-allocation (as the shallow copy severed the link).

And in the next step, := detects that and shallow copies once again to over-allocate.

The advice for now is to use setDT to convert objects to data.tables before passing them to a function. But I'd like these cases to be resolved (will take a look).
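A minimal sketch of that advice, assuming a hypothetical helper add_total() and toy columns x and y (none of these names come from the original question): convert at the top level first, so that setDT assigns the over-allocated table back to the caller's own symbol, and only then pass the table to the function.

```r
library(data.table)

# Hypothetical helper: adds a column by reference to a data.table it is given
add_total <- function(dt) {
  dt[, total := x + y]
  invisible(dt)
}

df <- data.frame(x = 1:3, y = 4:6)  # toy data for illustration

# Convert before calling the function, in the frame that owns `df`
setDT(df)

# The function can now use := directly on the caller's table
add_total(df)
names(df)
```

Because the conversion happens in the frame that owns df, the function receives a table that is already over-allocated, and := does not need to take another shallow copy.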

Thanks a bunch!

When should I use setDT() instead of data.table() to create a data.table?

Update:

@Roland makes some good points in the comments section, and the post is better for them. While I originally focused on running out of memory, he pointed out that even when that doesn't happen, managing the various copies takes substantial time, which is a more common everyday concern. Examples of both issues have now been added.

I like this question because I think it is really about avoiding out-of-memory errors in R when dealing with larger data sets. Those who are unfamiliar with the data.table family of set operations may benefit from this discussion!

One should use setDT() when working with larger data sets that take up a considerable amount of RAM, because it converts the object in place rather than copying it, which conserves memory. For data that occupies only a very small percentage of RAM, letting data.table() make a copy is fine.

The creation of the setDT function was actually inspired by the following thread on Stack Overflow, which is about working with a large data set (several GB). You will see Matt Dowle chime in and suggest the 'setDT' name.

Convert a data frame to a data.table without copy

A bit more depth:

With R, data is stored in memory. This speeds things up considerably because RAM is much faster to access than storage devices. However, a problem can arise when a data set occupies a large portion of RAM. Why? Because base R tends to copy a data.frame when certain operations are applied to it. This has improved since R 3.1, but addressing that is beyond the scope of this post. If you pull multiple data.frames or lists into one data.frame or data.table, memory usage expands quickly because, at some point during the operation, multiple copies of your data exist in RAM. If the data set is big enough, you may run out of memory while the copies are being produced, and the allocation fails with an error. See the example below: we get an error, and the original memory address and class of the object do not change.

> N <- 1e8
> P <- 1e2
> data <- as.data.frame(rep(data.frame(rnorm(N)), P))
>
> pryr::object_size(data)
800 MB
>
> tracemem(data)
[1] "<0000000006D2DF18>"
>
> data <- data.table(data)
Error: cannot allocate vector of size 762.9 Mb
>
> tracemem(data)
[1] "<0000000006D2DF18>"
> class(data)
[1] "data.frame"
>

The ability to modify the object in place without copying is a big deal. That is what setDT does when it takes a list or data.frame and returns a data.table. The same example as above, now using setDT, works without error. Both the class and the memory address change, but the column data are not copied (the new address reflects only a shallow copy that over-allocates the column pointers).

> tracemem(data)
[1] "<0000000006D2DF18>"
> class(data)
[1] "data.frame"
>
> setDT(data)
>
> tracemem(data)
[1] "<0000000006A8C758>"
> class(data)
[1] "data.table" "data.frame"
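One way to check that the columns themselves are not copied is to compare the address of a column before and after conversion. The sketch below uses data.table's address() on a small, made-up data.frame (the names df, copied and x are illustrative only); it is an illustration under those assumptions, not a benchmark.

```r
library(data.table)

df <- data.frame(x = rnorm(1e4), y = rnorm(1e4))  # toy data, not the data from the post

# Address of the vector holding column x before any conversion
addr_before <- address(df$x)

# data.table() builds a new object; record where its column x lives
copied <- data.table(df)
addr_copied <- address(copied$x)

# setDT() converts df itself; record where column x lives afterwards
setDT(df)
addr_after <- address(df$x)

identical(addr_before, addr_after)   # is it still the same column vector after setDT()?
identical(addr_before, addr_copied)  # same question for the data.table() result
```

Comparing column addresses rather than the address of the table separates the shallow copy of the column pointers from a deep copy of the data.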

@Roland points out that for most people the bigger concern is speed, which suffers as a side effect of all this copying. Here is an example with smaller data that does not exhaust memory and illustrates how much faster setDT is for this job. Notice the tracemem output that follows data <- data.table(data), which reports copies of data being made. Contrast that with setDT(data), which does not report a single copy. We then have to call tracemem(data) again to see the new memory address.

> N <- 1e5
> P <- 1e2
> data <- as.data.frame(rep(data.frame(rnorm(N)), P))
> pryr::object_size(data)
808 kB

> # data.table method
> tracemem(data)
[1] "<0000000019098438>"
> data <- data.table(data)
tracemem[0x0000000019098438 -> 0x0000000007aad7d8]: data.table
tracemem[0x0000000007aad7d8 -> 0x0000000007c518b8]: copy as.data.table.data.frame as.data.table data.table
tracemem[0x0000000007aad7d8 -> 0x0000000018e454c8]: as.list.data.frame as.list vapply copy as.data.table.data.frame as.data.table data.table
> class(data)
[1] "data.table" "data.frame"
>
> # setDT method
> # back to data.frame
> data <- as.data.frame(data)
> class(data)
[1] "data.frame"
> tracemem(data)
[1] "<00000000125BE1A0>"
> setDT(data)
> tracemem(data)
[1] "<00000000125C2840>"
> class(data)
[1] "data.table" "data.frame"
>

How does this impact timing? In the benchmark below, the median time for setDT is close to three orders of magnitude lower.

> # timing example
> data <- as.data.frame(rep(data.frame(rnorm(N)), P))
> microbenchmark(setDT(data), data <- data.table(data))
Unit: microseconds
                     expr       min         lq        mean    median        uq        max neval
              setDT(data)    49.948    55.7635    69.66017    73.553    79.198    100.238   100
 data <- data.table(data) 54594.289 61238.8830 81545.64432 64179.131 68647.917 611632.427   100

Set functions can be used in many areas, not just when converting objects to data.tables. You can find more information on reference semantics and how to apply them elsewhere by opening the vignette on the subject.

library(data.table)    
vignette("datatable-reference-semantics")
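As a brief illustration of set functions beyond conversion, the sketch below applies setnames(), :=, set() and setcolorder() to a small made-up table (the table dt and its column names are assumptions for the example). Each call modifies dt by reference, so nothing is assigned back.

```r
library(data.table)

dt <- data.table(id = 1:3, value = c(10, 20, 30))  # toy table for illustration

# Rename a column by reference
setnames(dt, "value", "amount")

# Add a column by reference
dt[, doubled := amount * 2]

# Overwrite a single cell by reference (row 2 of column "amount")
set(dt, i = 2L, j = "amount", value = 99)

# Reorder columns by reference
setcolorder(dt, c("doubled", "id", "amount"))

dt
```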

This is a great question, and those thinking of using R with larger data sets, or who just want to speed up data manipulation activities, can benefit from being familiar with the significant performance improvements of data.table reference semantics.

Error in setDT(dat): while creating a function

You can use the get function to fix your problem, as shown below. The problem is that you are passing var1 and var2 as strings, which are not translated into column names inside your function. You can use parse with eval (non-standard evaluation) to fix this, or you can use get.

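library(data.table)  # dcast, rbindlist
library(dplyr)       # %>%, mutate

Single_chile <- function(data, var1, var2) {
  # Count responses per var1 x var2 combination; get() resolves the names passed as strings
  Tab <- dcast(data, get(var1) ~ get(var2), fun.aggregate = length)

  # Add a row-total column
  Tab1 <- Tab %>% mutate("Todo el Mercado" = rowSums(Tab[, 2:ncol(Tab)]))

  # Column totals, to be appended as the last row
  ALL <- as.list(c(var1 = "Número_de_Respuestas", colSums(Tab1[, 2:ncol(Tab1)])))

  # Convert counts to column proportions, then format them as percentages
  Tab1[, 2:ncol(Tab1)] <- sapply(Tab1[, 2:ncol(Tab1)], prop.table)
  Tab1[, 2:ncol(Tab1)] <- sapply(Tab1[, 2:ncol(Tab1)], function(x) paste0(round(x * 100, 0), "%"))

  Tab2 <- rbindlist(l = list(Tab1, ALL))
  Tab2
}

Single_chile(test, "Q27", "Q12_1_TEXT")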
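The same idea in isolation, as a small sketch: a hypothetical helper col_mean() receives a column name as a string and uses get() inside the data.table call to look the column up. The helper and the toy table are assumptions for illustration and are not part of the original question.

```r
library(data.table)

# Hypothetical helper: `col` is a column name supplied as a string
col_mean <- function(dt, col) {
  dt[, mean(get(col))]
}

dt <- data.table(a = c(1, 2, 3), b = c(10, 20, 30))  # toy data

col_mean(dt, "a")
col_mean(dt, "b")
```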

I hope this solves your problem.

Thanks

lapply data.table setDT in nested lists does not work or is not idempotent?

I struggled a bit getting lapply to work too. I could get it to turn the data frames into data tables, but it refused to keep the row names.

I found a simple double loop works. It's probably making copies of the data frames before overwriting them, so I don't know if this will be fast enough for your needs. It seems to take about 6 milliseconds on your data using my machine.

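# Convert each nested data frame, keeping its row names in a "Sample" column
for (i in seq_along(top_list)) {
  for (j in seq_along(top_list[[i]])) {
    top_list[[i]][[j]] <- as.data.table(top_list[[i]][[j]], keep.rownames = "Sample")
  }
}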

This gives

top_list
#> $`aa`
#> $`aa`$`train_set`
#> Sample x y z
#> 1: Observation_1 2 0 Factor1
#> 2: Observation_2 2 1 Factor1
#> 3: Observation_3 2 2 Factor1
#> 4: Observation_4 2 3 Factor1
#> 5: Observation_5 2 4 Factor1
#> 6: Observation_6 2 5 Factor1
#> 7: Observation_7 2 6 Factor1
#> 8: Observation_8 2 7 Factor1
#> 9: Observation_9 2 8 Factor1
#> 10: Observation_10 2 9 Factor1
#>
#> $`aa`$test_set
#> Sample x y z
#> 1: Observation_1 1 0 Factor2
#> 2: Observation_2 1 1 Factor2
#> 3: Observation_3 1 2 Factor2
#> 4: Observation_4 1 3 Factor2
#> 5: Observation_5 1 4 Factor2
#> 6: Observation_6 1 5 Factor2
#> 7: Observation_7 1 6 Factor2
#> 8: Observation_8 1 7 Factor2
#> 9: Observation_9 1 8 Factor2
#> 10: Observation_10 1 9 Factor2
#> 11: Observation_11 1 10 Factor2
#> 12: Observation_12 1 11 Factor2
#>
#>
#> $bb
#> $bb$`train_set`
#> Sample x y z
#> 1: Observation_1 2 0 Factor1
#> 2: Observation_2 2 1 Factor1
#> 3: Observation_3 2 2 Factor1
#> 4: Observation_4 2 3 Factor1
#> 5: Observation_5 2 4 Factor1
#> 6: Observation_6 2 5 Factor1
#> 7: Observation_7 2 6 Factor1
#> 8: Observation_8 2 7 Factor1
#> 9: Observation_9 2 8 Factor1
#> 10: Observation_10 2 9 Factor1
#>
#> $bb$test_set
#> Sample x y z
#> 1: Observation_1 1 0 Factor2
#> 2: Observation_2 1 1 Factor2
#> 3: Observation_3 1 2 Factor2
#> 4: Observation_4 1 3 Factor2
#> 5: Observation_5 1 4 Factor2
#> 6: Observation_6 1 5 Factor2
#> 7: Observation_7 1 6 Factor2
#> 8: Observation_8 1 7 Factor2
#> 9: Observation_9 1 8 Factor2
#> 10: Observation_10 1 9 Factor2
#> 11: Observation_11 1 10 Factor2
#> 12: Observation_12 1 11 Factor2
#>
#>
#> $cc
#> $cc$`train_set`
#> Sample x y z
#> 1: Observation_1 2 0 Factor1
#> 2: Observation_2 2 1 Factor1
#> 3: Observation_3 2 2 Factor1
#> 4: Observation_4 2 3 Factor1
#> 5: Observation_5 2 4 Factor1
#> 6: Observation_6 2 5 Factor1
#> 7: Observation_7 2 6 Factor1
#> 8: Observation_8 2 7 Factor1
#> 9: Observation_9 2 8 Factor1
#> 10: Observation_10 2 9 Factor1
#>
#> $cc$test_set
#> Sample x y z
#> 1: Observation_1 1 0 Factor2
#> 2: Observation_2 1 1 Factor2
#> 3: Observation_3 1 2 Factor2
#> 4: Observation_4 1 3 Factor2
#> 5: Observation_5 1 4 Factor2
#> 6: Observation_6 1 5 Factor2
#> 7: Observation_7 1 6 Factor2
#> 8: Observation_8 1 7 Factor2
#> 9: Observation_9 1 8 Factor2
#> 10: Observation_10 1 9 Factor2
#> 11: Observation_11 1 10 Factor2
#> 12: Observation_12 1 11 Factor2
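For comparison, the same conversion can be sketched with nested lapply calls that return a new list instead of overwriting elements in a loop. The make_df() helper and the reduced top_list below are made-up stand-ins for the question's data, which is not reproduced here; like the loop, this version uses as.data.table and therefore copies.

```r
library(data.table)

# Hypothetical stand-in for the question's data frames, with row names set
make_df <- function(n, x) {
  data.frame(x = x, y = seq_len(n) - 1L,
             row.names = paste0("Observation_", seq_len(n)))
}

top_list <- list(
  aa = list(train_set = make_df(10, 2), test_set = make_df(12, 1)),
  bb = list(train_set = make_df(10, 2), test_set = make_df(12, 1))
)

# Outer lapply walks the top-level list, inner lapply converts each data frame;
# keep.rownames = "Sample" stores the row names in a column named Sample
top_list <- lapply(top_list, function(inner) {
  lapply(inner, as.data.table, keep.rownames = "Sample")
})

names(top_list$aa$train_set)
```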

Can we setDT on multiple object all at once?

You can Filter data.frames in an environment and apply setDT to those:

all_data_tables = Filter(function(x) is.data.frame(eval(as.name(x))), ls())
lapply(all_data_tables, function(x) setDT(eval(as.name(x))))

You can also potentially replace is.data.frame with is.list or something more complicated, but I think is.data.frame covers your use case.

You can also use get and could also be more careful about specifying envir in ls/eval/get.
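A sketch of the get-based variant with an explicit environment, under the assumption that the objects live in an environment called env (created here only for the example, along with the toy objects a, b and n). Each data frame is fetched with get, converted with setDT, and the result is stored back under the same name with assign.

```r
library(data.table)

# Example environment with two data frames and one non-data-frame object
env <- new.env()
env$a <- data.frame(x = 1:3)
env$b <- data.frame(y = c("u", "v"))
env$n <- 42

for (nm in ls(env)) {
  obj <- get(nm, envir = env)
  # Convert only plain data frames; leave everything else untouched
  if (is.data.frame(obj) && !is.data.table(obj)) {
    assign(nm, setDT(obj), envir = env)
  }
}

class(env$a)
```

Storing the value returned by setDT back under the original name means the binding in env refers to the converted table rather than relying on a side effect of the call.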


