Fast Melted Data.Table Operations

Quick untested answer: it seems you're looking for by-without-by, a.k.a. grouping-by-i :

setkey(input,variable)
input[c("x","y"),sum(value)]

This is like a fast HAVING in SQL. j gets evaluated for each row of i. In other words, the above gives the same result as, but is much faster than :

input[,sum(value),keyby=variable][c("x","y")]

The latter subsets and evals for all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups only.

The group results will be returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be relatively instant. That's the thinking anyway.
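A toy sketch of that workflow (data invented for illustration; the id/variable/value shape is assumed from the question) — aggregate just the groups of interest in long form, then cast the small result to wide:

```r
library(data.table)
# Toy melted input (shape assumed: id / variable / value)
input <- data.table(id = rep(1:3, times = 2),
                    variable = rep(c("x", "y"), each = 3),
                    value = 1:6)
setkey(input, variable)
# by-without-by over just the groups of interest, still in long format
agg <- input[c("x", "y"), .(value = sum(value)), by = 'variable,id']
# the aggregated table is small, so casting to wide is near-instant
dcast(agg, id ~ variable, value.var = "value")
```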

The first setkey(input,variable) might bite if input has a lot of columns not of interest. If so, it might be worth subsetting the columns needed :

DT = setkey(input[, c("variable","value")], variable)
DT[c("x","y"),sum(value)]

In future when secondary keys are implemented that would be easier :

set2key(input,variable)              # add a secondary key 
input[c("x","y"),sum(value),key=2] # syntax speculative
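In current data.table versions, secondary indices have landed via setindex() and on= joins, so something close to the speculative syntax above now works without physically reordering the table (toy data for illustration):

```r
library(data.table)
# Toy data; setindex() builds a secondary index without sorting the table
input <- data.table(id = 1:4,
                    variable = c("x", "y", "x", "z"),
                    value = c(1, 2, 3, 4))
setindex(input, variable)                         # no physical reorder, unlike setkey()
input[c("x", "y"), sum(value), on = "variable"]   # 6
```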

To group by id as well :

setkey(input,variable)
input[c("x","y"),sum(value),by='variable,id']

and including id in the key might be worth setkey's cost depending on your data :

setkey(input,variable,id)
input[c("x","y"),sum(value),by='variable,id']

If you combine a by-without-by with by, as above, the by-without-by operates just like a subset; i.e., j is run once per row of i only when by is missing (hence the name by-without-by). So you need to include variable again in the by, as shown above.

Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, iiuc) :

input[c("x","y"),sum(value),by=id]
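A toy sketch (data invented for illustration) showing the difference between the two by forms:

```r
library(data.table)
# Invented toy data in long (melted) form
input <- data.table(id = rep(1:2, each = 3),
                    variable = rep(c("x", "y", "z"), 2),
                    value = 1:6)
setkey(input, variable)
# one row per (variable, id) pair -- what the question asks for
input[c("x", "y"), sum(value), by = 'variable,id']
# one row per id, summed over the union of "x" and "y"
input[c("x", "y"), sum(value), by = id]
```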

fast melt large 2d matrix to 3 column data.table

2 Years Later...

Since reshape2 is retired and I didn't want to use tidyr (or anything other than data.table and base functions), I ended up with this solution:

# binds row and column names from the matrix with their values
cbind(
  # we want a data.table as the result
  as.data.table(
    # Cartesian product of matrix row and column names;
    # returns a data.frame with 2 columns
    expand.grid(rownames(m), colnames(m))
  ),
  # store matrix values as a vector:
  # first all values from the 1st column, then the 2nd, and so on
  value = as.vector(m)
)

NOTE: if your matrix columns or rows are not named, or if you just don't care about the names, replace expand.grid(...) with this line:

expand.grid(seq_len(nrow(m)), seq_len(ncol(m)))

Names are relevant when you have, for example, a correlation matrix.
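A minimal runnable version of the snippet above, on a small named matrix (values 1..4 chosen for illustration):

```r
library(data.table)
# Small named matrix (e.g. correlation-style)
m <- matrix(1:4, nrow = 2, dimnames = list(c("r1", "r2"), c("c1", "c2")))
out <- cbind(
  as.data.table(expand.grid(rownames(m), colnames(m))),
  value = as.vector(m)
)
out   # columns: Var1 (row name), Var2 (column name), value
```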

Proper/fastest way to reshape a data.table

The data.table package implements faster melt/dcast functions (in C). It also adds features, such as melting and casting multiple columns at once. Please see the new vignette Efficient reshaping using data.tables on GitHub.

melt/dcast functions for data.table have been available since v1.9.0 and the features include:

  • There is no need to load reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.

  • dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().

  • melt:

    • is capable of melting on columns of type 'list'.

    • gains variable.factor and value.factor which by default are TRUE and FALSE respectively for compatibility with reshape2. This allows for directly controlling the output type of variable and value columns (as factors or not).

    • melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.

    • NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt.

  • dcast:

    • accepts multiple fun.aggregate and multiple value.var. See vignette or ?dcast.

    • the rowid() function can be used directly in the formula to generate an id column, which is sometimes required to identify rows uniquely. See ?dcast.

  • Old benchmarks:

    • melt : 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
    • dcast : 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
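A short sketch (toy data) of the two headline features above — melting multiple column groups at once with patterns() and casting them back with multiple value.var:

```r
library(data.table)
# Toy wide table with two column groups, a_* and b_*
DT <- data.table(id = 1:2,
                 a_1 = c(1, 2), a_2 = c(3, 4),
                 b_1 = c(5, 6), b_2 = c(7, 8))
# melt both groups together via patterns()
long <- melt(DT, id.vars = "id",
             measure.vars = patterns(a = "^a_", b = "^b_"))
# cast back using multiple value.var
dcast(long, id ~ variable, value.var = c("a", "b"))
```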

Reminder of Cologne (Dec 2013) presentation slide 32 : Why not submit a dcast pull request to reshape2?

melt dataframe - multiple columns - Enhanced (new) functionality from data.tables

I'm not sure about using melt, but here's a way using tidyr.

Note that I changed the variable name to use a . instead of _ to separate the name for the old/new. This makes it easier to separate the name into two variables since there are already many underscores.

library(tidyr)

df <- dplyr::data_frame(
  id = 1:100,
  Credit_risk_Capital.old = rnorm(100, mean = 400, sd = 60),
  NameConcentration.old = rnorm(100, mean = 100, sd = 10),
  Credit_risk_Capital.new = rnorm(100, mean = 200, sd = 10),
  NameConcentration.new = rnorm(100, mean = 40, sd = 10)
)

df %>%
  gather("key", "value", -id) %>%
  separate(key, c("CapitalChargeType", "new_old"), sep = "\\.") %>%
  spread(new_old, value)

#> # A tibble: 200 x 4
#>       id CapitalChargeType         new       old
#>  * <int> <chr>                   <dbl>     <dbl>
#>  1     1 Credit_risk_Capital 182.10955 405.78530
#>  2     1 NameConcentration    42.21037  99.44172
#>  3     2 Credit_risk_Capital 184.28810 370.14308
#>  4     2 NameConcentration    60.92340 120.13933
#>  5     3 Credit_risk_Capital 191.07982 389.50818
#>  6     3 NameConcentration    25.81776  90.91502
#>  7     4 Credit_risk_Capital 193.64247 327.56853
#>  8     4 NameConcentration    32.71050  94.95743
#>  9     5 Credit_risk_Capital 208.63547 286.59351
#> 10     5 NameConcentration    40.76064 116.52747
#> # ... with 190 more rows
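gather()/spread() have since been superseded in tidyr by pivot_longer()/pivot_wider(); the same reshape can be sketched as follows (toy frame with hypothetical column names, keeping the "." separator used above):

```r
library(tidyr)
library(dplyr)

# Toy frame: "." separates the measure name from the old/new suffix
df <- tibble(id = 1:2,
             Credit.old = c(400, 410), Credit.new = c(200, 210))
df %>%
  pivot_longer(-id,
               names_to = c("CapitalChargeType", "new_old"),
               names_sep = "\\.") %>%
  pivot_wider(names_from = new_old, values_from = value)
```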

data.table: How to fast stack(DT) operation, and return data.table instead of returned data.frame

You can use gather from the package tidyr

library(tidyr)
library(data.table)
DT <- data.table(a = c(1, 3), b = c(5, 2))
DT1 <- gather(DT, ind, values, a:b)

The second argument is the name of the new "key" variable, the third argument is the name of the new "value" variable, the last argument is the columns you want to gather (all in this case). Also, gather automatically calls a faster version of melt for data.tables.
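If the goal is specifically to stay within data.table and get a data.table back directly, melt() itself does the stack; a sketch mirroring stack()'s ind/values column names:

```r
library(data.table)
DT <- data.table(a = c(1, 3), b = c(5, 2))
# melt() returns a data.table, no conversion needed
stacked <- melt(DT, measure.vars = c("a", "b"),
                variable.name = "ind", value.name = "values")
```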

Melt and cast data table using pattern

We can do this with splitstackshape. It gives the '.time_1' column automatically

library(splitstackshape)
merged.stack(dt, var.stubs = c("a", "b"), sep = "_")
#    id .time_1 a  b
# 1:  1       3 1  7
# 2:  1       4 4 10
# 3:  2       3 2  8
# 4:  2       4 5 11
# 5:  3       3 3  9
# 6:  3       4 6 12
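data.table's own melt() with patterns() handles the same stubs, though unlike merged.stack() it numbers the measure groups (1, 2) rather than recovering the _3/_4 suffixes. A sketch, with dt reconstructed to be consistent with the output above (an assumption):

```r
library(data.table)
# Reconstructed input, consistent with the merged.stack output shown (assumption)
dt <- data.table(id = 1:3, a_3 = 1:3, a_4 = 4:6, b_3 = 7:9, b_4 = 10:12)
melt(dt, id.vars = "id",
     measure.vars = patterns(a = "^a_", b = "^b_"),
     variable.name = "time")
```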

Merge two large data.tables based on column name of one table and column value of the other without melting

Using set():

setkey(DT1, "ID")
setkey(DT2, "ID")
for (k in names(DT1)[-1]) {
  rows <- which(DT2[["col"]] == k)
  set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
}

   ID  col col_value
1:  A col1         1
2:  A col4        13
3:  B col2         6
4:  B col3        10
5:  C col1         3
Note: Setting the key up front speeds up the process but reorders the rows.
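For reference, a self-contained run of the loop with hypothetical inputs consistent with the result shown:

```r
library(data.table)
# Hypothetical inputs, consistent with the output table above
DT1 <- data.table(ID = c("A", "B", "C"),
                  col1 = 1:3, col2 = 5:7, col3 = 9:11, col4 = 13:15)
DT2 <- data.table(ID = c("A", "A", "B", "B", "C"),
                  col = c("col1", "col4", "col2", "col3", "col1"))
setkey(DT1, "ID")
setkey(DT2, "ID")
for (k in names(DT1)[-1]) {
  rows <- which(DT2[["col"]] == k)
  # look up column k of DT1 for the matching IDs, write into col_value
  set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
}
DT2
```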

Pivot on data.table similar to rehape melt function

For a data.table reshape, try the following:

dt[, list(variable = names(.SD), value = unlist(.SD, use.names = FALSE)), by = uid]

The cost of the syntax is worth it; the function runs very quickly!
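A toy run of the idiom (invented uid data):

```r
library(data.table)
# Invented example: one row per uid, columns a and b to be melted
dt <- data.table(uid = c("u1", "u2"), a = c(1, 2), b = c(3, 4))
# one (variable, value) row per original column, per uid
dt[, list(variable = names(.SD), value = unlist(.SD, use.names = FALSE)), by = uid]
```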


