Fast melted data.table operations
Quick untested answer: it seems you're looking for by-without-by, a.k.a. grouping-by-i:
setkey(input, variable)
input[c("x","y"), sum(value)]
This is like a fast HAVING in SQL. j gets evaluated for each row of i. In other words, the above gives the same result as, but is much faster than:
input[, sum(value), keyby=variable][c("x","y")]
The latter subsets and evaluates all the groups (wastefully) before selecting only the groups of interest. The former (by-without-by) goes straight to the subset of groups only.
The group results will be returned in long format, as always. But reshaping to wide afterwards on the (relatively small) aggregated data should be relatively instant. That's the thinking anyway.
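For completeness, here's a self-contained sketch of both approaches on toy data (the variable/value column names are assumed from the question). Note that in current data.table (1.9.4 and later), grouping-by-i must be requested explicitly with by = .EACHI; a bare i-subset no longer aggregates per group:

```r
library(data.table)

# Toy melted input, assuming the variable/value columns from the question
input <- data.table(variable = c("x", "x", "y", "y", "z"),
                    value    = c(1, 2, 3, 4, 5))
setkey(input, variable)

# by-without-by / grouping-by-i: j runs once per row of i
# (by = .EACHI is required in data.table >= 1.9.4)
res1 <- input[c("x", "y"), sum(value), by = .EACHI]
res1
#    variable V1
# 1:        x  3
# 2:        y  7

# Same result, but computed for every group first (wasteful)
res2 <- input[, sum(value), keyby = variable][c("x", "y")]
```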
The first setkey(input, variable) might bite if input has a lot of columns not of interest. If so, it might be worth subsetting to just the columns needed:
DT = setkey(input[, c("variable","value")], variable)
DT[c("x","y"), sum(value)]
In future, when secondary keys are implemented, this will be easier:
set2key(input, variable)              # add a secondary key
input[c("x","y"), sum(value), key=2]  # syntax speculative
To group by id as well:
setkey(input, variable)
input[c("x","y"), sum(value), by='variable,id']
Including id in the key might be worth setkey's cost, depending on your data:
setkey(input, variable, id)
input[c("x","y"), sum(value), by='variable,id']
If you combine a by-without-by with by, as above, then the by-without-by operates just like a subset; i.e., j is only run for each row of i when by is missing (hence the name by-without-by). So you need to include variable again in the by, as shown above.
Alternatively, the following should group by id over the union of "x" and "y" instead (but the above is what you asked for in the question, if I understand correctly):
input[c("x","y"), sum(value), by=id]
fast melt large 2d matrix to 3 column data.table
2 Years Later...
Since reshape2 is retired and I didn't want to use tidyr (or anything other than data.table and base functions), I ended up with this solution:
# binds row and column names from the matrix with their values
cbind(
  # We want a data.table as the result
  as.data.table(
    # Cartesian product of matrix row and column names
    # returns a data.frame with 2 columns
    expand.grid(rownames(m), colnames(m))
  ),
  # Store matrix values in a vector:
  # first all values from the 1st column, then the 2nd, and so on
  value = as.vector(m)
)
NOTE: if your matrix rows or columns are not named, or if you just don't care about the names, replace expand.grid(...) with this line:
expand.grid(c(1:nrow(m)), c(1:ncol(m)))
Names are relevant when you have, for example, a correlation matrix.
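To make the approach concrete, here's a tiny worked example (the matrix m is made up for illustration):

```r
library(data.table)

# Small named matrix; matrix values fill column by column
m <- matrix(1:4, nrow = 2,
            dimnames = list(c("r1", "r2"), c("c1", "c2")))

res <- cbind(
  as.data.table(expand.grid(rownames(m), colnames(m))),
  value = as.vector(m)
)
res
#    Var1 Var2 value
# 1:   r1   c1     1
# 2:   r2   c1     2
# 3:   r1   c2     3
# 4:   r2   c2     4
```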
Proper/fastest way to reshape a data.table
The data.table package implements faster melt/dcast functions (in C). It also adds features, such as melting and casting multiple columns. Please see the new Efficient reshaping using data.tables vignette on GitHub.
melt/dcast functions for data.table have been available since v1.9.0, and the features include:
- There is no need to load the reshape2 package prior to casting. But if you want it loaded for other operations, please load it before loading data.table.
- dcast is also an S3 generic. No more dcast.data.table(). Just use dcast().
- melt:
  - is capable of melting on columns of type 'list'.
  - gains variable.factor and value.factor, which default to TRUE and FALSE respectively for compatibility with reshape2. This allows directly controlling the output type of the variable and value columns (as factors or not).
  - melt.data.table's na.rm = TRUE parameter is internally optimised to remove NAs directly during melting and is therefore much more efficient.
  - NEW: melt can accept a list for measure.vars, and the columns specified in each element of the list will be combined together. This is facilitated further through the use of patterns(). See the vignette or ?melt.
- dcast:
  - accepts multiple fun.aggregate and multiple value.var. See the vignette or ?dcast.
  - the rowid() function can be used directly in the formula to generate an id column, which is sometimes required to identify the rows uniquely. See ?dcast.
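As a quick illustration of the list form of measure.vars with patterns() (the table and column names here are made up):

```r
library(data.table)

# Hypothetical wide table with two families of measure columns
DT <- data.table(id  = 1:2,
                 a_1 = c(1, 2), a_2 = c(3, 4),
                 b_1 = c(5, 6), b_2 = c(7, 8))

# The a_* and b_* columns are melted together into paired
# value columns (value1 from a_*, value2 from b_*)
long <- melt(DT, id.vars = "id",
             measure.vars = patterns("^a_", "^b_"))
long
#    id variable value1 value2
# 1:  1        1      1      5
# 2:  2        1      2      6
# 3:  1        2      3      7
# 4:  2        2      4      8

# dcast can reverse it, accepting multiple value.var columns
wide <- dcast(long, id ~ variable, value.var = c("value1", "value2"))
```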
Old benchmarks:
- melt: 10 million rows and 5 columns, 61.3 seconds reduced to 1.2 seconds.
- dcast: 1 million rows and 4 columns, 192 seconds reduced to 3.6 seconds.
Reminder of Cologne (Dec 2013) presentation slide 32: Why not submit a dcast pull request to reshape2?
melt dataframe - multiple columns - Enhanced (new) functionality from data.tables
I'm not sure about using melt, but here's a way using tidyr.
Note that I changed the variable names to use a . instead of _ to separate old/new from the rest of the name. This makes it easier to separate the name into two variables, since there are already many underscores.
library(tidyr)
df <- dplyr::data_frame(
  id = c(1:100),
  Credit_risk_Capital.old = rnorm(100, mean = 400, sd = 60),
  NameConcentration.old = rnorm(100, mean = 100, sd = 10),
  Credit_risk_Capital.new = rnorm(100, mean = 200, sd = 10),
  NameConcentration.new = rnorm(100, mean = 40, sd = 10)
)
df %>%
gather("key", "value", -id) %>%
separate(key, c("CapitalChargeType", "new_old"), sep = "\\.") %>%
spread(new_old, value)
#> # A tibble: 200 x 4
#> id CapitalChargeType new old
#> * <int> <chr> <dbl> <dbl>
#> 1 1 Credit_risk_Capital 182.10955 405.78530
#> 2 1 NameConcentration 42.21037 99.44172
#> 3 2 Credit_risk_Capital 184.28810 370.14308
#> 4 2 NameConcentration 60.92340 120.13933
#> 5 3 Credit_risk_Capital 191.07982 389.50818
#> 6 3 NameConcentration 25.81776 90.91502
#> 7 4 Credit_risk_Capital 193.64247 327.56853
#> 8 4 NameConcentration 32.71050 94.95743
#> 9 5 Credit_risk_Capital 208.63547 286.59351
#> 10 5 NameConcentration 40.76064 116.52747
#> # ... with 190 more rows
data.table: How to do a fast stack(DT) operation that returns a data.table instead of a data.frame
You can use gather from the tidyr package:
library(tidyr)
DT <- data.table(a = c(1, 3), b = c(5, 2))
DT1 <- gather(DT, ind, values, a:b)
The second argument is the name of the new "key" variable, the third argument is the name of the new "value" variable, and the last argument is the columns you want to gather (all of them in this case). Also, gather automatically calls a faster version of melt for data.tables.
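If you'd rather stay entirely within data.table, the same reshape can be written with its own melt; this is a sketch of the equivalent call, where variable.name/value.name play the role of gather's key/value arguments:

```r
library(data.table)

DT <- data.table(a = c(1, 3), b = c(5, 2))

# measure.vars lists the columns to stack, as in gather(DT, ind, values, a:b)
DT1 <- melt(DT, measure.vars = c("a", "b"),
            variable.name = "ind", value.name = "values")
DT1
#    ind values
# 1:   a      1
# 2:   a      3
# 3:   b      5
# 4:   b      2
```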
Melt and cast data table using pattern
We can do this with splitstackshape. It creates the '.time_1' column automatically:
library(splitstackshape)
merged.stack(dt, var.stubs=c("a", "b"), sep="_")
# id .time_1 a b
#1: 1 3 1 7
#2: 1 4 4 10
#3: 2 3 2 8
#4: 2 4 5 11
#5: 3 3 3 9
#6: 3 4 6 12
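The dt above isn't shown in the answer; a hypothetical input, reconstructed to be consistent with the printed output, would be:

```r
library(splitstackshape)  # loads data.table as well

# Hypothetical input: stub columns a_3, a_4, b_3, b_4 plus an id column
dt <- data.table(id  = 1:3,
                 a_3 = 1:3, a_4 = 4:6,
                 b_3 = 7:9, b_4 = 10:12)

# Stacks the a_* and b_* columns; the suffix after sep becomes .time_1
res <- merged.stack(dt, var.stubs = c("a", "b"), sep = "_")
```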
Merge two large data.tables based on column name of one table and column value of the other without melting
Using set():
setkey(DT1, "ID")
setkey(DT2, "ID")
for (k in names(DT1)[-1]) {
rows <- which(DT2[["col"]] == k)
set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
}
ID col col_value
1: A col1 1
2: A col4 13
3: B col2 6
4: B col3 10
5: C col1 3
Note: Setting the key up front speeds up the process but reorders the rows.
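For reference, here is a self-contained version with hypothetical DT1/DT2 inputs chosen to match the result shown above (the original question's data is not reproduced here):

```r
library(data.table)

# Hypothetical inputs consistent with the printed result:
# DT1 is wide (one row per ID), DT2 names the column to look up per row
DT1 <- data.table(ID   = c("A", "B", "C"),
                  col1 = 1:3, col2 = 5:7,
                  col3 = 9:11, col4 = 13:15)
DT2 <- data.table(ID  = c("A", "A", "B", "B", "C"),
                  col = c("col1", "col4", "col2", "col3", "col1"))

setkey(DT1, "ID")
setkey(DT2, "ID")
for (k in names(DT1)[-1]) {
  # rows of DT2 that ask for column k, filled via a keyed join on ID
  rows <- which(DT2[["col"]] == k)
  set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
}
DT2
#    ID  col col_value
# 1:  A col1         1
# 2:  A col4        13
# 3:  B col2         6
# 4:  B col3        10
# 5:  C col1         3
```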
Pivot on data.table similar to reshape melt function
For a data.table reshape, try the following:
dt[, list(variable = names(.SD), value = unlist(.SD, use.names = FALSE)), by = uid]
The cost of the syntax is worth it; the function runs very quickly!
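dt and uid aren't defined in the answer; a minimal sketch with made-up data shows what the one-liner does:

```r
library(data.table)

# Hypothetical wide table: a uid column plus two measure columns
dt <- data.table(uid = c("u1", "u2"), x = c(1, 2), y = c(3, 4))

# Per uid, .SD holds the non-grouping columns; its names become
# the variable column and its unlisted values the value column
long <- dt[, list(variable = names(.SD),
                  value = unlist(.SD, use.names = FALSE)),
           by = uid]
long
#    uid variable value
# 1:  u1        x     1
# 2:  u1        y     3
# 3:  u2        x     2
# 4:  u2        y     4
```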