Using lists inside data.table columns
Using :=
:
dt = data.table(id = 1:2, comment = vector("list", 2L))
# assign value 1 to just the first column of 'comment'
dt[1L, comment := 1L]
# assign value of 1 and "a" to rows 1 and 2
dt[, comment := list(1, "a")]
# assign value of "a","b" to row 1, and 1 to row 2 for 'comment'
dt[, comment := list(c("a", "b"), 1)]
# assign list(1, "a") to just 1 row of 'comment'
dt[1L, comment := list(list(list(1, "a")))]
For the last case, you'll need one more list
because data.table
uses list(.)
to look for values to assign to columns by reference.
Using set
:
dt = data.table(id = 1:2, comment = vector("list", 2L))
# assign value 1 to just the first column of 'comment'
set(dt, i=1L, j="comment", value=1L)
# assign value of 1 and "a" to rows 1 and 2
set(dt, j="comment", value=list(1, "a"))
# assign value of "a","b" to row 1, and 1 to row 2 for 'comment'
set(dt, j="comment", value=list(c("a", "b"), 1))
# assign list(1, "a") to just 1 row of 'comment'
set(dt, i=1L, j="comment", value=list(list(list(1, "a"))))
HTH
I'm using the current development version 1.9.3, but should just work fine on any other version.
> sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.9.3
loaded via a namespace (and not attached):
[1] plyr_1.8.0.99 reshape2_1.2.2 stringr_0.6.2 tools_3.0.3
How to do operations on list columns in an R data.table to output another list column?
Another solution using mapply
:
dt[, absvals := mapply(listcol, numericcol, FUN = function(x, y) abs(x-y))]
#output
dt
numericcol listcol absvals
1: 42 1,22, 3 41,20,39
2: 42 6 36
3: 42 1 41
4: 42 12 30
5: 42 5, 6,1123 37, 36,1081
6: 42 3 39
7: 42 42 0
8: 42 1 41
Transform list inside a column to data.table columns in r
The following code attempts to the purpose (considering "tickets", object from picture):
# Collecting basic data
basic_data <- tickets[, .(done_ratio,created_on,updated_on,closed_on)]
status_data <- data.frame("status_id" = integer(),"status_name" = character())
# Fetching Status (each list)
for(status in tickets[,status]){
status <- as.data.frame(rbind(status))
names(status) <- c("status_id","status_name")
status_data <- rbind(status_data,status)
}
status_data <- cbind(status_data,basic_data)
R: access element in a column of lists with data.table
I think yours is the only way, or similarly I would do dt[i == 1 & j == 1, a[[1]] ]
.
You could write a helper function like
get_a = function(ii,jj) dt[.(ii,jj), on=.(i,j), mult="first", a[[1]]]
but might regret it if you ever write a subset that matches 0 or 2+ rows of the table:
get_a(1,1) # works as expected
get_a(1,4) # returns NULL
get_a(1,1:2) # returns only (1,1)
If you want to avoid that, could add checks based on .N
...
get_listcol = function(..., d, list_col, join_cols = names(list(...)), mult = FALSE){
d[list(...), on=join_cols, nomatch=0, {
if (.N == 0L){
stop("No matches found.")
} else if (.N == 1L){
.SD[[1]][[1]]
} else {
if (mult){
.SD[[1]]
} else {
stop("Multiple matches found.")
}
}
}, .SDcols=list_col]
}
# usage
get_a2 = function(ii, jj) get_listcol(i = ii, j = jj, d = dt, list_col = "a")
get_a2(1,1) # works as expected
get_a2(1,4) # error
get_a2(1,1:2) # error
How to do faster list-column operations inside data.table
This is really a question about memory and data storage types. All of my discussion will be for 100,000 data elements so that everything doesn't bog down.
Let's examine a vector of length 100,000 vs. a list containing 100,000 separate elements.
object.size(rep(1L, 1E5))
#400048 bytes
object.size(replicate(1E5, 1, simplify = F))
#6400048 bytes
We went from 0.4 MB to 6.4 MB just by having the data stored differently!! When applying this to your function Map(veryfing_function, ...)
and only 1E5 elements:
dt <- data.table(letters = replicate(1e5, sample(letters[1:5], 3, TRUE), simplify = FALSE),
numbers = replicate(1e5, sample(letters[6:10], 3, TRUE), simplify = FALSE))
tic()
result2 <- Map(veryfing_function, dt[['letters']], dt[['numbers']])
toc()
# 11.93 sec elapsed
object.size(result2)
# 109,769,872 bytes
#example return:
[[1000]]
[[1000]]$`1`
[1] "cg" "bg" "cg"
[[1000]]$`2`
[1] "ch" "bh" "ch"
[[1000]]$`3`
[1] "ch" "bh" "ch"
We could do a simple modification to your function to return unnamed lists instead of splitting and we save a little bit of memory as the split()
appears to give named lists and I don't think we need the name:
verifying_function2 <- function(vec1, vec2) {
vector <- outer(vec1, vec2, paste0) #not as.vector
lapply(seq_len(ncol(vector)), function(i) vector[, i]) #no need to split, just return a list
}
tic()
result2_mod <- Map(verifying_function2, dt[['letters']], dt[['numbers']])
toc()
# 2.86 sec elapsed
object.size(result2_mod)
# 73,769,872 bytes
#example_output
[[1000]]
[[1000]][[1]]
[1] "cg" "bg" "cg"
[[1000]][[2]]
[1] "ch" "bh" "ch"
[[1000]][[3]]
[1] "ch" "bh" "ch"
The next step is why return a list of list at all. I am using lapply()
in the modified function just get to your output. Loosing the lapply()
would instead a list of matrices which I think would be as helpful:
tic()
result2_mod2 <- Map(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
toc()
# 1.66 sec elapsed
object.size(result2_mod2)
# 68,570,336 bytes
#example output:
[[1000]]
[,1] [,2] [,3]
[1,] "cg" "ch" "ch"
[2,] "bg" "bh" "bh"
[3,] "cg" "ch" "ch"
The last logical step is to just return a matrix. Note this whole time we've been fighting against simplification with mapply(..., simplify = F)
which is equivalent to Map()
.
tic()
result2_mod3 <- mapply(function(x,y) outer(x, y, paste0), dt[['letters']], dt[['numbers']])
toc()
# 1.3 sec elapsed
object.size(result2_mod3)
# 7,201,616 bytes
If you want some dimensionality, you can convert the large matrix into a 3D array:
tic()
result2_mod3_arr <- array(as.vector(result2_mod3), dim = c(3,3,1E5))
toc()
# 0.02 sec elapsed
result2_mod3_arr[,,1000]
[,1] [,2] [,3]
[1,] "cg" "ch" "ch"
[2,] "bg" "bh" "bh"
[3,] "cg" "ch" "ch"
object.size(result2_mod3_arr)
# 7,201,624 bytes
I also looked at @marbel's answer - it is faster and takes up only slightly more memory. My approach would likely benefit by converting the initial dt
list to something else sooner.
tic()
dt1 = as.data.table(do.call(rbind, dt[['letters']]))
dt2 = as.data.table(do.call(rbind, dt[['numbers']]))
res = data.table()
combs = expand.grid(names(dt1), names(dt2), stringsAsFactors=FALSE)
set(res, j=paste0(combs[,1], combs[,2]), value=paste0( dt1[, get(combs[,1])], dt2[, get(combs[,2])] ) )
toc()
# 0.14 sec elapsed
object.size(res)
# 7,215,384 bytes
tl;dr - convert your object to a matrix or data.frame to make it easier on your memory. It also makes sense that the data.table
versions of your function takes longer - there's likely more overhead than just directly applying mapply()
.
metaprogramming map on data.table list-columns
First Variation: using variables from the nested objetive
lapply
is enough. See the @diaggy's Answer.
Second Variation: using variables from and Outside the nested objetive
If you have to load a parameter from other column, it is neccesary pass from lapply
, to mapply
.
for(i in 1:length(calc_name)){
set(DT, NULL, j = calc_name[i],
value = mapply(function(x, m){
calc_metric_mean(a = x[[choose_a[i]]], b = x[[choose_b[i]]], metric = m)
}, x = DT$data, m = DT$metric, SIMPLIFY = FALSE
)
)
}
> DT
gp data metric d1d3 d2d1 d2d2
1: A <data.table[3x3]> max 1.7 3.5 4
2: B <data.table[3x3]> min 5.1 15 20
SIMPLIFY = FALSE
is required if it will return a list instead a vector.
Update data.tables inside a list
As mentioned in the comments by @r2evans , By replicating the data.table, you're just replicating the reference. Instead, replicate the data.frame and set it as data.table as part of your function.
library(data.table)
data("mtcars")
# Replicate dts in a list
dt_list <- rep(list(mtcars), 4)
# My Function
update_dt <- function(i){
setDT(dt_list[[i]]) # Set as DT inside function
# new column
dt_list[[i]][, newcol := i]
# Updating column
dt_list[[i]][, mpg:= mpg + ((i-1)* 100)]
}
# Apply function
lapply(X= 1:length(dt_list), FUN= update_dt)
Find if list in data.table contains word in other column
This is essentially just a hidden loop, but it will work:
words_data[mapply(`%in%`, word, names)]
# word names
#1: Tom Tom,Connolly,Pete,Dawson
I thought it might scale terribly, but it is okay:
words_data <- words_data[rep(1:13,1e5),]
nrow(words_data)
#[1] 1300000
system.time(words_data[mapply(`%in%`, word, names)])
# user system elapsed
# 1.329 0.016 1.345
The issue with most of the attempts in the question is that they are not considering the word
and names
piece-by-piece in vectorised comparison across multiple vectors. Map
or mapply
will take care of this:
mapply(paste, 1:3, letters[1:3])
#[1] "1 a" "2 b" "3 c"
The reasons why the other results didn't work are varied. E.g.:
%in%
This compares each value of word
in turn to see if it exists in names
exactly
words_data$word %in% words_data$names
#[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#[8] FALSE FALSE FALSE FALSE FALSE TRUE
"Dawson"
in row 13 of word
matches "Dawson"
in row 12 of names
. It won't match anything else that is a list
containing "Dawson"
along with other values though:
"Dawson" %in% list(list("Dawson","Tom"))
#[1] FALSE
unlist
"...basically suggests that the entire names
column is unlisted and all of the names word
checked against"
Yep, that's it.
sapply + unlist
The sapply
here didn't do anything to the names
object, because the unlist
is only run inside every list item anyway:
identical(words_data$names, sapply(words_data$names, unlist))
#[1] TRUE
Then you can reference the %in%
logic above for a reason as to why it didn't work as intended.
Modify list-column by reference in nested data.table
This can be done in following two steps or in Single Step:
The given table is:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Step 1 - Let's add list of column hp
vectors in each row of dt
dt[, hp_vector := .(list(dt.mtcars[[1]][, hp])), by = list(gear)]
Step 2 - Now calculate the max of hp
dt[, max_hp := max(hp_vector[[1]]), by = list(gear)]
The given table is:
dt<- data.table(mtcars)[, list(dt.mtcars = list(.SD)), by = gear]
Single Step - Single step is actually the combination of both of the above steps:
dt[, max_hp := .(list(max(dt.mtcars[[1]][, hp])[[1]])), by = list(gear)]
If we wish to populate values within nested table by Reference then the following link talks about how to do it, just that we need to ignore a warning message. I will be happy if anyone can point me how to fix the warning message or is there any pitfall. For more detail please refer the link:
https://stackoverflow.com/questions/48306010/how-can-i-do-fast-advance-data-manipulation-in-nested-data-table-data-table-wi/48412406#48412406
Taking inspiration from the same i am going to show how to do it here for the given data set.
Let's first clean everything:
rm(list = ls())
Let's re-define the given table in different way:
dt<- data.table(mtcars)[, list(dt.mtcars = list(data.table(.SD))), by = list(gear)]
Note that i have defined the table slightly different. I have used data.table
in addition to list in the above definition.
Next, populate the max by reference within nested table:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, max_hp := max(hp)])), by = list(gear)]
And, what good one can expect, we can perform manipulation within nested table:
dt[, dt.mtcars := .(list(dt.mtcars[[1]][, weighted_hp_carb := max_hp*carb])), by = list(gear)]
Related Topics
How to Use R with Google Colaboratory
Dplyr Mutate Rowwise Max of Range of Columns
R Knitr Chunk Options for Figure Height/Width Are Not Working
How to Obtain an 'Unbalanced' Grid of Ggplots
Convert Named Character Vector to Data.Frame
Combining Bar and Line Chart (Double Axis) in Ggplot2
How to Create Grouped Barplot with R
How to Source R Markdown File Like 'Source('Myfile.R')'
How to Paste a String on Each Element of a Vector of Strings Using Apply in R
Multiplying All Elements of a Vector in R
Determine the Data Types of a Data Frame's Columns
Update a Value in One Column Based on Criteria in Other Columns
Concatenate Unique Strings After Groupby in R
Changing Font Size and Direction of Axes Text in Ggplot2
Rolling Join on Data.Table with Duplicate Keys
Split a Column of Concatenated Comma-Delimited Data and Recode Output as Factors