Using Lists Inside Data.Table Columns


Using :=:

dt = data.table(id = 1:2, comment = vector("list", 2L))

# assign the value 1 to just the first row of 'comment'
dt[1L, comment := 1L]

# assign value of 1 and "a" to rows 1 and 2
dt[, comment := list(1, "a")]

# assign value of "a","b" to row 1, and 1 to row 2 for 'comment'
dt[, comment := list(c("a", "b"), 1)]

# assign list(1, "a") to just 1 row of 'comment'
dt[1L, comment := list(list(list(1, "a")))]

For the last case, you'll need one extra level of list() because data.table interprets the outer list(.) as the set of values to assign to columns by reference.
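As a quick sanity check, re-running the vector-per-row assignment from above and inspecting the cells directly shows each cell holding its own object:

```r
library(data.table)

dt <- data.table(id = 1:2, comment = vector("list", 2L))
dt[, comment := list(c("a", "b"), 1)]

# each cell of the list column holds its own object
dt$comment[[1]]  # "a" "b"
dt$comment[[2]]  # 1
```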

Using set:

dt = data.table(id = 1:2, comment = vector("list", 2L))

# assign the value 1 to just the first row of 'comment'
set(dt, i=1L, j="comment", value=1L)

# assign value of 1 and "a" to rows 1 and 2
set(dt, j="comment", value=list(1, "a"))

# assign value of "a","b" to row 1, and 1 to row 2 for 'comment'
set(dt, j="comment", value=list(c("a", "b"), 1))

# assign list(1, "a") to just 1 row of 'comment'
set(dt, i=1L, j="comment", value=list(list(list(1, "a"))))
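A typical reason to prefer set() is assigning inside a loop, where it avoids the overhead of [.data.table on every iteration. A minimal sketch:

```r
library(data.table)

dt <- data.table(id = 1:3, comment = vector("list", 3L))

# fill one list cell per iteration, with minimal per-call overhead
for (i in 1:3) {
  set(dt, i = i, j = "comment", value = i * 10L)
}

dt$comment[[2]]  # 20
```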

HTH


I'm using the current development version 1.9.3, but this should work fine on any other version.

> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] data.table_1.9.3

loaded via a namespace (and not attached):
[1] plyr_1.8.0.99 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3

How to do operations on list columns in an R data.table to output another list column?

Another solution using mapply:

dt[, absvals := mapply(listcol, numericcol, FUN = function(x, y) abs(x-y))]

#output
dt
   numericcol    listcol     absvals
1:         42    1,22, 3    41,20,39
2:         42          6          36
3:         42          1          41
4:         42         12          30
5:         42  5, 6,1123 37, 36,1081
6:         42          3          39
7:         42         42           0
8:         42          1          41
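The input table isn't shown in the answer; a minimal reconstruction (values inferred from the printed output, with SIMPLIFY = FALSE added to make explicit that a list comes back) behaves the same way:

```r
library(data.table)

# reconstructed input: numericcol recycles to all 8 rows
dt <- data.table(
  numericcol = 42,
  listcol = list(c(1, 22, 3), 6, 1, 12, c(5, 6, 1123), 3, 42, 1)
)
dt[, absvals := mapply(listcol, numericcol, SIMPLIFY = FALSE,
                       FUN = function(x, y) abs(x - y))]

dt$absvals[[1]]  # 41 20 39
```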

Transform list inside a column to data.table columns in r

The following code attempts to achieve this (considering the "tickets" object from the picture):

# Collecting basic data
basic_data <- tickets[, .(done_ratio,created_on,updated_on,closed_on)]

status_data <- data.frame("status_id" = integer(), "status_name" = character())
# Fetching status (each list)
for (status in tickets[, status]) {
  status <- as.data.frame(rbind(status))
  names(status) <- c("status_id", "status_name")
  status_data <- rbind(status_data, status)
}

status_data <- cbind(status_data,basic_data)
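Since the original tickets object only exists in a picture, here is a hypothetical stand-in showing a more idiomatic alternative: rbindlist() expands the list entries into columns in one step, replacing the row-by-row rbind() loop (each status entry is assumed to be a two-element list of id and name).

```r
library(data.table)

# hypothetical stand-in for 'tickets'
tickets <- data.table(done_ratio = c(0, 50),
                      status = list(list(1L, "New"), list(2L, "Closed")))

# name each inner list, then bind all rows at once
status_data <- rbindlist(
  lapply(tickets$status, setNames, c("status_id", "status_name"))
)
```

status_data now has the columns status_id and status_name, and can be cbind()-ed with basic_data as before.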

R: access element in a column of lists with data.table

I think yours is the only way, or similarly I would do dt[i == 1 & j == 1, a[[1]] ].

You could write a helper function like

get_a = function(ii,jj) dt[.(ii,jj), on=.(i,j), mult="first", a[[1]]]

but you might regret it if you ever write a subset that matches 0 or 2+ rows of the table:

get_a(1,1)   # works as expected
get_a(1,4) # returns NULL
get_a(1,1:2) # returns only (1,1)
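For reference, here is a self-contained reconstruction (the columns i, j and the list column a are assumptions matching the answer's usage):

```r
library(data.table)

dt <- data.table(i = c(1, 1, 2), j = c(1, 2, 1),
                 a = list(1:3, letters[1:2], "x"))
get_a <- function(ii, jj) dt[.(ii, jj), on = .(i, j), mult = "first", a[[1]]]

get_a(1, 1)  # 1 2 3
```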

If you want to avoid that, you could add checks based on .N...

get_listcol = function(..., d, list_col, join_cols = names(list(...)), mult = FALSE){
  d[list(...), on=join_cols, nomatch=0, {
    if (.N == 0L){
      stop("No matches found.")
    } else if (.N == 1L){
      .SD[[1]][[1]]
    } else {
      if (mult){
        .SD[[1]]
      } else {
        stop("Multiple matches found.")
      }
    }
  }, .SDcols=list_col]
}

# usage
get_a2 = function(ii, jj) get_listcol(i = ii, j = jj, d = dt, list_col = "a")
get_a2(1,1) # works as expected
get_a2(1,4) # error
get_a2(1,1:2) # error

How to do faster list-column operations inside data.table

This is really a question about memory and data storage types. All of my discussion will be for 100,000 data elements so that everything doesn't bog down.

Let's examine a vector of length 100,000 vs. a list containing 100,000 separate elements.

object.size(rep(1L, 1E5))
#400048 bytes
object.size(replicate(1E5, 1, simplify = F))
#6400048 bytes

We went from 0.4 MB to 6.4 MB just by having the data stored differently! Applying this to your function Map(veryfing_function, ...) with only 1E5 elements:

dt <- data.table(letters = replicate(1e5, sample(letters[1:5], 3, TRUE), simplify = FALSE),
numbers = replicate(1e5, sample(letters[6:10], 3, TRUE), simplify = FALSE))

tic()
result2 <- Map(veryfing_function, dt[['letters']], dt[['numbers']])
toc()
# 11.93 sec elapsed
object.size(result2)
# 109,769,872 bytes
#example return:
[[1000]]
[[1000]]$`1`
[1] "cg" "bg" "cg"

[[1000]]$`2`
[1] "ch" "bh" "ch"

[[1000]]$`3`
[1] "ch" "bh" "ch"

We could make a simple modification to your function to return unnamed lists instead of splitting, which saves a little memory, since split() gives named lists and we don't need the names:

verifying_function2 <- function(vec1, vec2) {
vector <- outer(vec1, vec2, paste0) #not as.vector
lapply(seq_len(ncol(vector)), function(i) vector[, i]) #no need to split, just return a list
}

tic()
result2_mod <- Map(verifying_function2, dt[['letters']], dt[['numbers']])
toc()
# 2.86 sec elapsed
object.size(result2_mod)
# 73,769,872 bytes

#example_output
[[1000]]
[[1000]][[1]]
[1] "cg" "bg" "cg"

[[1000]][[2]]
[1] "ch" "bh" "ch"

[[1000]][[3]]
[1] "ch" "bh" "ch"

The next step is to ask why return a list of lists at all. I am using lapply() in the modified function just to match your output. Dropping the lapply() would instead give a list of matrices, which I think would be just as helpful:

tic()
result2_mod2 <- Map(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
toc()
# 1.66 sec elapsed
object.size(result2_mod2)
# 68,570,336 bytes

#example output:
[[1000]]
[,1] [,2] [,3]
[1,] "cg" "ch" "ch"
[2,] "bg" "bh" "bh"
[3,] "cg" "ch" "ch"

The last logical step is to just return a matrix. Note that this whole time we've been fighting against simplification with mapply(..., SIMPLIFY = FALSE), which is equivalent to Map().

tic()
result2_mod3 <- mapply(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
toc()
# 1.3 sec elapsed
object.size(result2_mod3)
# 7,201,616 bytes

If you want some dimensionality, you can convert the large matrix into a 3D array:

tic()
result2_mod3_arr <- array(as.vector(result2_mod3), dim = c(3,3,1E5))
toc()
# 0.02 sec elapsed
result2_mod3_arr[,,1000]
[,1] [,2] [,3]
[1,] "cg" "ch" "ch"
[2,] "bg" "bh" "bh"
[3,] "cg" "ch" "ch"
object.size(result2_mod3_arr)
# 7,201,624 bytes

I also looked at @marbel's answer - it is faster and takes up only slightly more memory. My approach would likely benefit by converting the initial dt list to something else sooner.

tic()
dt1 = as.data.table(do.call(rbind, dt[['letters']]))
dt2 = as.data.table(do.call(rbind, dt[['numbers']]))

res = data.table()

combs = expand.grid(names(dt1), names(dt2), stringsAsFactors=FALSE)

set(res, j=paste0(combs[,1], combs[,2]), value=paste0( dt1[, get(combs[,1])], dt2[, get(combs[,2])] ) )
toc()
# 0.14 sec elapsed
object.size(res)
# 7,215,384 bytes

tl;dr - convert your object to a matrix or data.frame to make it easier on your memory. It also makes sense that the data.table versions of your function take longer - there's likely more overhead than just directly applying mapply().

metaprogramming map on data.table list-columns

First Variation: using variables from the nested object

lapply is enough. See @diaggy's answer.

Second Variation: using variables from outside the nested object

If you have to use a parameter from another column, it is necessary to move from lapply to mapply.

for(i in 1:length(calc_name)){
  set(DT, NULL, j = calc_name[i],
      value = mapply(function(x, m){
        calc_metric_mean(a = x[[choose_a[i]]], b = x[[choose_b[i]]], metric = m)
      }, x = DT$data, m = DT$metric, SIMPLIFY = FALSE)
  )
}

> DT
   gp              data metric d1d3 d2d1 d2d2
1:  A <data.table[3x3]>    max  1.7  3.5    4
2:  B <data.table[3x3]>    min  5.1   15   20

SIMPLIFY = FALSE is required so that it returns a list instead of a vector.
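The difference is easy to see in isolation (a toy example, not the answer's calc_metric_mean):

```r
# with the default SIMPLIFY = TRUE, mapply() collapses to a vector;
# with SIMPLIFY = FALSE it stays a list, which a list column requires
res_vec  <- mapply(function(x, y) x + y, 1:3, 4:6)
res_list <- mapply(function(x, y) x + y, 1:3, 4:6, SIMPLIFY = FALSE)

is.list(res_vec)   # FALSE
is.list(res_list)  # TRUE
```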

Update data.tables inside a list

As mentioned in the comments by @r2evans, by replicating the data.table you're just replicating the reference. Instead, replicate the data.frame and set it as a data.table as part of your function.

library(data.table)
data("mtcars")



# Replicate dts in a list
dt_list <- rep(list(mtcars), 4)


# My Function
update_dt <- function(i){
  setDT(dt_list[[i]]) # Set as DT inside function
  # new column
  dt_list[[i]][, newcol := i]

  # Updating column
  dt_list[[i]][, mpg := mpg + ((i - 1) * 100)]
}

# Apply function
lapply(X= 1:length(dt_list), FUN= update_dt)
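An alternative sketch that sidesteps the reference semantics entirely: build each element as a fresh data.table inside lapply(), so no element ever shares memory with another.

```r
library(data.table)

dt_list2 <- lapply(1:4, function(i) {
  d <- as.data.table(mtcars)        # fresh copy each iteration
  d[, newcol := i]
  d[, mpg := mpg + ((i - 1) * 100)]
  d[]
})

dt_list2[[3]]$mpg[1] - mtcars$mpg[1]  # 200
```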

Find if list in data.table contains word in other column

This is essentially just a hidden loop, but it will work:

words_data[mapply(`%in%`, word, names)]
# word names
#1: Tom Tom,Connolly,Pete,Dawson

I thought it might scale terribly, but it is okay:

words_data <- words_data[rep(1:13,1e5),]
nrow(words_data)
#[1] 1300000
system.time(words_data[mapply(`%in%`, word, names)])
# user system elapsed
# 1.329 0.016 1.345

The issue with most of the attempts in the question is that they are not considering word and names piece-by-piece in a vectorised comparison across multiple vectors. Map or mapply will take care of this:

mapply(paste, 1:3, letters[1:3])
#[1] "1 a" "2 b" "3 c"
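Applied to a two-row stand-in for words_data (the full table isn't reproduced here), each word is checked against its own names vector:

```r
library(data.table)

# hypothetical two-row stand-in for words_data
wd <- data.table(word = c("Tom", "Jane"),
                 names = list(c("Tom", "Connolly", "Pete", "Dawson"),
                              c("Bob", "Alice")))
wd[mapply(`%in%`, word, names)]  # only the Tom row survives
```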

The reasons why the other results didn't work are varied. E.g.:

%in%

This compares each value of word in turn to see if it exists in names as an exact match:

words_data$word %in% words_data$names
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[8] FALSE FALSE FALSE FALSE FALSE TRUE

"Dawson" in row 13 of word matches "Dawson" in row 12 of names. It won't match anything else that is a list containing "Dawson" along with other values though:

"Dawson" %in% list(list("Dawson","Tom"))
#[1] FALSE

unlist

"...basically suggests that the entire names column is unlisted and all of the names word checked against"

Yep, that's it.

sapply + unlist

The sapply here didn't do anything to the names object, because the unlist is only run inside every list item anyway:

identical(words_data$names, sapply(words_data$names, unlist))
#[1] TRUE

Then you can reference the %in% logic above for a reason as to why it didn't work as intended.

Modify list-column by reference in nested data.table

This can be done in the following two steps, or in a single step:

The given table is:

dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]

Step 1 - Add a list of the hp vectors to each row of dt

dt[, hp_vector := .(list(dt.mtcars[[1]][, hp])), by = list(gear)]

Step 2 - Now calculate the max of hp

dt[, max_hp := max(hp_vector[[1]]), by = list(gear)]
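Putting the two steps together and checking against a plain-R computation of the same group maxima:

```r
library(data.table)

dt <- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
dt[, hp_vector := .(list(dt.mtcars[[1]][, hp])), by = list(gear)]
dt[, max_hp := max(hp_vector[[1]]), by = list(gear)]

# max_hp should match tapply() on the raw data
expected <- tapply(mtcars$hp, mtcars$gear, max)
all(dt$max_hp == expected[as.character(dt$gear)])  # TRUE
```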

Starting again from the given table:

dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]

Single Step - The single step is actually the combination of both of the above steps:

dt[, max_hp := .(list(max(dt.mtcars[[1]][, hp])[[1]])), by = list(gear)]

If we wish to populate values within the nested table by reference, then the following link discusses how to do it; we just need to ignore a warning message. I will be happy if anyone can point out how to fix the warning message, or whether there is any pitfall. For more detail, please refer to the link:

https://stackoverflow.com/questions/48306010/how-can-i-do-fast-advance-data-manipulation-in-nested-data-table-data-table-wi/48412406#48412406

Taking inspiration from it, I am going to show how to do the same here for the given data set.

Let's first clean everything:

rm(list = ls())

Let's re-define the given table in a different way:

dt<- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = list(gear)]

Note that I have defined the table slightly differently: I have wrapped .SD in data.table() in addition to list() in the above definition.

Next, populate the max by reference within the nested table:

dt[, dt.mtcars := .(list(dt.mtcars[[1]][, max_hp := max(hp)])), by = list(gear)]

And, better still, we can perform further manipulation within the nested table:

dt[, dt.mtcars := .(list(dt.mtcars[[1]][, weighted_hp_carb := max_hp*carb])), by = list(gear)]
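Running the whole sequence end-to-end confirms that the new columns really land inside the nested tables:

```r
library(data.table)

dt <- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = list(gear)]
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, max_hp := max(hp)])), by = list(gear)]
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, weighted_hp_carb := max_hp * carb])), by = list(gear)]

names(dt$dt.mtcars[[1]])  # now includes max_hp and weighted_hp_carb
```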

