How to expand a large dataframe in R
expand.grid is a useful function here:
mergedData <- merge(
expand.grid(id = unique(df$id), spp = unique(df$spp)),
df, by = c("id", "spp"), all = TRUE)
mergedData$y[is.na(mergedData$y)] <- 0
mergedData$date <- rep(levels(df$date),
each = length(levels(df$spp)))
Since you're not actually doing anything to subsets of the data, I don't think plyr will help; there may be more efficient ways with data.table.
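A self-contained sketch of the recipe above, on invented sample data (df and its columns id, spp, and y are assumptions for illustration):

```r
# invented sample data: counts y per site (id) and species (spp); not every pair observed
df <- data.frame(
  id  = c(1, 1, 2),
  spp = factor(c("a", "b", "a")),
  y   = c(3, 1, 2)
)

# full id x spp grid, merged back so missing pairs appear as NA
mergedData <- merge(
  expand.grid(id = unique(df$id), spp = unique(df$spp)),
  df, by = c("id", "spp"), all = TRUE)

# missing counts become zeros
mergedData$y[is.na(mergedData$y)] <- 0
```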
expand large data frame in R efficiently
Solution with base functions:
# split column by all available separators
a <- strsplit(example.df$more.info, "; |#|;")
# represent each result as a matrix with 3 columns
a <- lapply(a, function(v) matrix(v, ncol=3, byrow=TRUE))
# combine all matrices into one big matrix
aa <- do.call(rbind, a)
# indices of the rows of the initial data.frame that correspond to the rows of the big matrix
b <- rep(seq_along(a), sapply(a, nrow))
# combine initial data.frame and created big matrix
df <- cbind(example.df[b,], aa)
# remove unnecessary columns and rename remaining ones
df <- df[,-3]
colnames(df)[3:5] <- c("class", "topic", "grade")
To increase the speed you may replace the apply-family functions in my code with mclapply from the parallel package.
I cannot compare the speed since your dataset is very small.
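Since example.df is not shown in the question, here is the whole pipeline as a runnable sketch on invented data (the column names other than more.info are assumptions):

```r
# invented sample data: each more.info cell packs triples separated by "; ", "#", or ";"
example.df <- data.frame(
  id = 1:2,
  name = c("x", "y"),
  more.info = c("math; A; 90#science; B; 80", "art; C; 70"),
  stringsAsFactors = FALSE
)

a <- strsplit(example.df$more.info, "; |#|;")          # split on all separators
a <- lapply(a, function(v) matrix(v, ncol = 3, byrow = TRUE))
aa <- do.call(rbind, a)                                # one big 3-column matrix
b <- rep(seq_along(a), sapply(a, nrow))                # source row of each matrix row
df <- cbind(example.df[b, ], aa)
df <- df[, -3]                                         # drop more.info
colnames(df)[3:5] <- c("class", "topic", "grade")
df
```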
Expand a large dataframe keeping two variables the same and every possible combination with a third
Here is one solution. First, let's read your data:
df <- read.table(text="date kid kid2 sums
01/01/2012 A 12 123
01/10/2012 A 15 100
01/03/2012 B 10 900
01/01/2012 C 10 100", header=TRUE)
Then convert the date into Date format:
df$date <- as.Date(df$date, format="%m/%d/%Y")
Now I will create a vector with all the dates you need, from January 1 to 31:
dates <- seq(as.Date("01/01/2012", format="%m/%d/%Y"),as.Date("01/31/2012", format="%m/%d/%Y"), by="day")
With that we can create a new data.frame with all combinations of the dates and kids:
df2<-merge(dates, df[,c(-1, -4)], by=NULL)
names(df2)[1] <- "date"
To get the original sums back, we merge the two, keeping all rows, and reorder to get the order you want:
df3<-merge(df, df2, all=TRUE)
df3<-df3[order(df3$kid,df3$kid2, df3$date), ]
And finally, if you want, you can replace NA's with 0's:
df3<-replace(df3, is.na(df3), 0)
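The same expansion can be sketched in one step with tidyr::complete (an alternative to the merge approach above; nesting() keeps only the observed kid/kid2 pairs, and fill = list(sums = 0) handles the NA replacement):

```r
library(tidyr)

df <- read.table(text = "date kid kid2 sums
01/01/2012 A 12 123
01/10/2012 A 15 100
01/03/2012 B 10 900
01/01/2012 C 10 100", header = TRUE)
df$date <- as.Date(df$date, format = "%m/%d/%Y")

dates <- seq(as.Date("2012-01-01"), as.Date("2012-01-31"), by = "day")

# all 31 dates crossed with each observed (kid, kid2) pair; missing sums become 0
df3 <- complete(df, date = dates, nesting(kid, kid2), fill = list(sums = 0))
```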
Expand a data frame by group
Use tidyr::pivot_wider with the names_glue argument as follows:
- Store the names of all the variables to be pivoted (even 500) in a vector, say cols.
- Use values_from = all_of(cols) as an argument in pivot_wider.
cols <- c('X1', 'X2', 'X5')
df %>% pivot_wider(id_cols = grp, names_from = X, values_from = all_of(cols),
names_glue = '{X}-{.value}')
# A tibble: 2 x 10
grp `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2020_01_19 23 13 23 47 45 41 3 54 21
2 2020_01_20 65 39 43 32 52 76 19 12 90
If you want to use all columns except the first two, use this:
df %>% pivot_wider(id_cols = grp, names_from = X, values_from = !c(grp, X),
names_glue = '{X}-{.value}')
# A tibble: 2 x 10
grp `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 2020_01_19 23 13 23 47 45 41 3 54 21
2 2020_01_20 65 39 43 32 52 76 19 12 90
However, if you want to rearrange the columns as shown in the expected outcome, you may use names_vary = 'slowest' in the pivot_wider function of tidyr 1.2.0.
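For reference, a df consistent with the output above can be reconstructed like this (the values are copied from the printed tibble; the construction itself is a guess):

```r
library(dplyr)
library(tidyr)

# reconstructed sample data (values taken from the printed result)
df <- tibble(
  grp = rep(c("2020_01_19", "2020_01_20"), each = 3),
  X   = rep(c(1L, 2L, 5L), times = 2),
  X1  = c(23L, 13L, 23L, 65L, 39L, 43L),
  X2  = c(47L, 45L, 41L, 32L, 52L, 76L),
  X5  = c(3L, 54L, 21L, 19L, 12L, 90L)
)

cols <- c('X1', 'X2', 'X5')
out <- df %>%
  pivot_wider(id_cols = grp, names_from = X, values_from = all_of(cols),
              names_glue = '{X}-{.value}')
```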
Expand nested dataframe cell in long format
You can use unnest() in tidyr to expand a nested column.
tidyr::unnest(df, part_list)
# # A tibble: 3 x 2
# chapterid part_list
# <chr> <chr>
# 1 a c
# 2 a d
# 3 b e
Data
df <- data.frame(chapterid = c("a", "b"))
df$part_list <- list(c("c", "d"), "e")
# chapterid part_list
# 1 a c, d
# 2 b e
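The same expansion also works without tidyr; a base-R sketch using lengths() and unlist():

```r
df <- data.frame(chapterid = c("a", "b"))
df$part_list <- list(c("c", "d"), "e")

# base-R equivalent of unnest(): repeat each chapterid by its list length, then flatten
out <- data.frame(
  chapterid = rep(df$chapterid, lengths(df$part_list)),
  part_list = unlist(df$part_list)
)
```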
expand data frames inside data frame
We can use unnest from library(tidyr):
library(tidyr)
unnest(df1, rdm)
# Source: local data frame [6 x 4]
# language sessionID V1 V2
# (chr) (dbl) (int) (int)
#1 Dutch 13257 1 2
#2 Dutch 13257 2 3
#3 Dutch 13257 3 4
#4 Dutch 125354 4 5
#5 Dutch 125354 5 6
#6 Dutch 125354 6 7
data
library(dplyr)
df1 <- data_frame(language=c('Dutch', 'Dutch'), sessionID=c(13257, 125354),
rdm= list(data.frame(V1=1:3, V2=2:4), data.frame(V1=4:6, V2=5:7)))
R: how to expand a row containing a list to several rows...one for each list member?
I've grown to really love data.table for this kind of task. It is so very simple. But first, let's make some sample data (which you should ideally provide!):
# Sample data
set.seed(1)
df = data.frame( pep = replicate( 3 , paste( sample(999,3) , collapse=";") ) , pro = sample(3) , stringsAsFactors = FALSE )
Now we use the data.table package to do the reshaping in a couple of lines...
# Load data.table package
require(data.table)
# Turn data.frame into data.table, which looks like..
dt <- data.table(df)
# pep pro
#1: 266;372;572 1
#2: 908;202;896 3
#3: 944;660;628 2
# Transform it in one line like this...
dt[ , list( pep = unlist( strsplit( pep , ";" ) ) ) , by = pro ]
# pro pep
#1: 1 266
#2: 1 372
#3: 1 572
#4: 3 908
#5: 3 202
#6: 3 896
#7: 2 944
#8: 2 660
#9: 2 628
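On the same sample data, tidyr::separate_rows gives an equivalent long result (a sketch; the exact pep values depend on your R version's RNG):

```r
library(tidyr)

# same sample data as above
set.seed(1)
df <- data.frame(
  pep = replicate(3, paste(sample(999, 3), collapse = ";")),
  pro = sample(3),
  stringsAsFactors = FALSE
)

# split each semicolon-separated pep cell into its own row
out <- separate_rows(df, pep, sep = ";")
```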
How to expand a data.frame according to one of its columns?
We can do this without using any library, i.e. using only base R:
data.frame(value = with(df, match(more.strings, strings)),
strings = more.strings)
# value strings
#1 5 c
#2 1 e
#3 2 g
#4 NA a
#5 NA d
#6 3 h
#7 NA f
Or we can use complete from tidyr:
library(tidyverse)
complete(df, strings = more.strings) %>%
arrange(match(strings, more.strings)) %>%
select(names(df))
# A tibble: 7 x 2
# values strings
# <int> <chr>
#1 5 c
#2 1 e
#3 2 g
#4 NA a
#5 NA d
#6 3 h
#7 NA f
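The question's df and more.strings are not shown; here is a guess consistent with the printed output (df$strings[4] is arbitrary, since it never matches):

```r
# assumed data: value = 1:5 so that match() reproduces the result shown above
df <- data.frame(value = 1:5,
                 strings = c("e", "g", "h", "b", "c"),
                 stringsAsFactors = FALSE)
more.strings <- c("c", "e", "g", "a", "d", "h", "f")

out <- data.frame(value = with(df, match(more.strings, strings)),
                  strings = more.strings)
```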
Fast and efficient way to expand a dataset in R
The core of the problem is the expansion of the values in the Key columns into i.
Here is another data.table solution employing melt() but differing in implementation details from David's comment:
library(data.table)
DT <- data.table(dataset1)
expanded <- melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
, .(i = seq_len(value)), by = .(Year, col)]
expanded
Year col i
1: 2001 1 1
2: 2001 1 2
3: 2001 1 3
4: 2001 1 4
5: 2001 1 5
---
2571: 2003 4 381
2572: 2003 4 382
2573: 2003 4 383
2574: 2003 4 384
2575: 2003 4 385
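For reference, the same count-to-sequence expansion can be sketched in base R with rep() and sequence(), here on a tiny invented dataset1 (2 years, 2 Key columns):

```r
# invented toy data: the Key columns hold counts to expand
dataset1 <- data.frame(Year = 2001:2002, Key1 = c(3L, 2L), Key2 = c(2L, 1L))

# stack the Key columns (what melt() does), numbering them 1, 2, ...
m <- data.frame(
  Year  = rep(dataset1$Year, times = ncol(dataset1) - 1L),
  col   = rep(seq_len(ncol(dataset1) - 1L), each = nrow(dataset1)),
  value = unlist(dataset1[-1], use.names = FALSE)
)

# repeat each row 'value' times and generate i = 1..value per source row
expanded <- m[rep(seq_len(nrow(m)), m$value), c("Year", "col")]
expanded$i <- sequence(m$value)
```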
The remaining computations can be done like this (if I've understood the OP's intention right):
set.seed(123L) # make results reproducible
res.df <- expanded[, p := runif(.N)][, value := 5 * (col - 1L + p)][]
res.df
Year col i p value
1: 2001 1 1 0.2875775 1.437888
2: 2001 1 2 0.7883051 3.941526
3: 2001 1 3 0.4089769 2.044885
4: 2001 1 4 0.8830174 4.415087
5: 2001 1 5 0.9404673 4.702336
---
2571: 2003 4 381 0.4711072 17.355536
2572: 2003 4 382 0.5323359 17.661680
2573: 2003 4 383 0.3953954 16.976977
2574: 2003 4 384 0.4544372 17.272186
2575: 2003 4 385 0.1149009 15.574505
Benchmarking the different approaches
As the OP is asking for a faster / more efficient way, the three different approaches proposed so far are benchmarked:
- David's data.table solution, plus a modification which ensures the result is identical to the expected result
- ycw's tidyverse solution
- my data.table solution
Benchmark code
For benchmarking, the microbenchmark package is used.
library(magrittr)
bm <- microbenchmark::microbenchmark(
david1 = {
expanded_david1 <-
setorder(
melt(DT, id = "Year", value = "i", variable = "col")[rep(1:.N, i)], Year, col
)[, i := seq_len(.N), by = .(Year, col)]
},
david2 = {
expanded_david2 <-
setorder(
melt(DT, id = "Year", value = "i", variable = "col")[, col := as.integer(col)][
rep(1:.N, i)], Year, col)[, i := seq_len(.N), by = .(Year, col)]
},
uwe = {
expanded_uwe <-
melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
, .(i = seq_len(value)), by = .(Year, col)]
},
ycw = {
expanded_ycw <- DT %>%
tidyr::gather(col, i, - Year) %>%
dplyr::mutate(col = as.integer(sub("Key", "", col)) - 1L) %>%
dplyr::rowwise() %>%
dplyr::do(tibble::data_frame(Year = .$Year, col = .$col, i = seq(1L, .$i, 1L))) %>%
dplyr::select(Year, i, col) %>%
dplyr::arrange(Year, col, i)
},
times = 100L
)
bm
Note that references to tidyverse functions are made explicit in order to avoid name conflicts due to a cluttered namespace. The modified david2 variant converts the factor column col to integer.
Timing the small sample data set
With the small sample data set with 3 years and 4 Key columns provided by the OP, the timings are as follows:
Unit: microseconds
expr min lq mean median uq max neval
david1 993.418 1161.4415 1260.4053 1244.320 1350.987 2000.805 100
david2 1261.500 1393.2760 1624.5298 1568.097 1703.837 5233.280 100
uwe 825.772 865.4175 979.2129 911.860 1084.226 1409.890 100
ycw 93063.262 97798.7005 100423.5148 99226.525 100599.600 205695.817 100
Even for this small problem size, the data.table solutions are magnitudes faster than the tidyverse approach, with a slight advantage for solution uwe.
The results are checked to be equal:
all.equal(expanded_david1[, col := as.integer(col)][order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_david2[order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_ycw, expanded_uwe)
#[1] TRUE
Except for david1, which returns factors instead of integers and uses a different ordering, all four results are identical.
Larger benchmark case
From the OP's code it can be concluded that his production data set consists of 10 years and 24 Key columns. In the sample data set, the overall mean of the Key values is 215. With these parameters, a larger data set is created:
n_yr <- 10L
n_col <- 24L
avg_key <- 215L
col_names <- sprintf("Key%02i", 1L + seq_len(n_col))
DT <- data.table(Year = seq(2001L, by = 1L, length.out = n_yr))
DT[, (col_names) := avg_key]
The larger data set returns 51600 rows, which is still of rather moderate size but 20 times larger than the small sample. Timings are as follows:
Unit: milliseconds
expr min lq mean median uq max neval
david1 2.512805 2.648735 2.726743 2.697065 2.698576 3.076535 5
david2 2.791838 2.816758 2.998828 3.068605 3.075780 3.241160 5
uwe 1.329088 1.453312 1.585390 1.514857 1.634551 1.995142 5
ycw 1641.527166 1643.979936 1646.004905 1645.091158 1646.599219 1652.827047 5
For this problem size, uwe is nearly twice as fast as the other data.table implementations. The tidyverse approach is still magnitudes slower.
split and expand.grid by group on large data set
One possible solution which avoids repetitions of the same pair as well as different orders is to use the data.table and combinat packages:
library(data.table)
setDT(df)[order(id), data.table(combinat::combn2(unique(id))), by = group]
group V1 V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4: 387969 209044061 209044062
5: 388978 209044061 209044062
6: 2278460 209044182 209044183
order(id) is used here just for convenience to better check the results, but it can be skipped in production code.
Replace combn2() by a non-equi join
There is another approach where the call to combn2() is replaced by a non-equi join:
mdf <- setDT(df)[order(id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
allow.cartesian = TRUE]
group V1 V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4: 387969 209044061 209044062
5: 388978 209044061 209044062
6: 2278460 209044182 209044183
Note that the non-equi join requires the data to be ordered.
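For a self-contained check, here is a minimal df consistent with the output shown (the id/group values are copied from it), run through the non-equi join variant:

```r
library(data.table)

# assumed data reconstructed from the output above
df <- data.frame(
  id    = c(209044052, 209044061, 209044062, 209044061, 209044062,
            209044061, 209044062, 209044182, 209044183),
  group = c(2365686, 2365686, 2365686, 387969, 387969,
            388978, 388978, 2278460, 2278460)
)

# unique ids per group, ordered, then self non-equi join to form pairs
mdf <- setDT(df)[order(id), unique(id), by = group]
res <- mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
           allow.cartesian = TRUE]
```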
Benchmark
The second method seems to be much faster:
# create benchmark data
nr <- 1.2e5L # number of rows
rg <- 8L # number of ids within each group
ng <- nr / rg # number of groups
set.seed(1L)
df2 <- data.table(
id = sample.int(rg, nr, TRUE),
group = sample.int(ng, nr, TRUE)
)
#benchmark code
microbenchmark::microbenchmark(
combn2 = df2[order(group, id), data.table((combinat::combn2(unique(id)))), by = group],
nej = {
mdf <- df2[order(group, id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
allow.cartesian = TRUE]},
times = 1L)
For 120000 rows and 14994 groups the timings are:
Unit: milliseconds
expr min lq mean median uq max neval
combn2 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115 1
nej 137.3228 137.3228 137.3228 137.3228 137.3228 137.3228 1
Caveat
As pointed out by the OP, the number of id values per group is crucial in terms of memory consumption and speed. The number of combinations is O(n²), exactly n * (n - 1) / 2 or choose(n, 2L), where n is the number of ids.
The size of the largest group can be found by
df2[, uniqueN(id), by = group][, max(V1)]
The total number of rows in the final result can be computed in advance by
df2[, uniqueN(id), by = group][, sum(choose(V1, 2L))]
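Both checks can be sketched on a tiny invented df2 (3 groups with 2, 3, and 4 distinct ids):

```r
library(data.table)

# invented toy data: group sizes 2, 3, 4 (distinct ids)
df2 <- data.table(
  id    = c(1L, 2L,   1L, 2L, 3L,   1L, 2L, 3L, 4L),
  group = rep(1:3, times = c(2L, 3L, 4L))
)

largest <- df2[, uniqueN(id), by = group][, max(V1)]              # size of largest group
total   <- df2[, uniqueN(id), by = group][, sum(choose(V1, 2L))]  # rows in the pair result
```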