How to Expand a Large Dataframe in R

How to expand a large dataframe in R

expand.grid is a useful function here:

mergedData <- merge(
expand.grid(id = unique(df$id), spp = unique(df$spp)),
df, by = c("id", "spp"), all = TRUE)

mergedData[is.na(mergedData$y), ]$y <- 0

mergedData$date <- rep(levels(df$date),
each = length(levels(df$spp)))

Since you're not actually doing anything to subsets of the data, I don't think plyr will help; there may be more efficient ways to do this with data.table (see the sketch below).
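For reference, here is a minimal data.table sketch of the same idea (my own addition, not from the original answer), assuming df has the id, spp and y columns used in the merge() call above:

library(data.table)
dt <- as.data.table(df)
# CJ() builds every id/spp combination; joining it back to dt keeps all of them
full <- dt[CJ(id = unique(id), spp = unique(spp)), on = c("id", "spp")]
# combinations that had no matching row come back with NA in y, so fill them with 0
full[is.na(y), y := 0]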

expand large data frame in R efficiently

Solution with base functions:

# split column by all available separators 
a <- strsplit(example.df$more.info, "; |#|;")
# represent each result as a matrix with 3 columns
a <- lapply(a, function(v) matrix(v, ncol=3, byrow=TRUE))
# combine all matrices into one big matrix
aa <- do.call(rbind, a)
# create indices of the rows of the initial data.frame that correspond to the rows of the big matrix
b <- unlist(sapply(seq_along(a), function(i) rep(i, nrow(a[[i]]))))
# combine initial data.frame and created big matrix
df <- cbind(example.df[b,], aa)
# remove unnecessary columns and rename remaining ones
df <- df[,-3]
colnames(df)[3:5] <- c("class", "topic", "grade")

To increase the speed you may replace the lapply()/sapply() calls in my code with parallel::mclapply() (see the sketch below).

I cannot compare the speed since your dataset is very small.
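For example, here is a minimal sketch of that swap, assuming a Unix-alike system where parallel::mclapply() can fork (it does not parallelise on Windows):

library(parallel)
a <- strsplit(example.df$more.info, "; |#|;")
# same per-element reshaping as above, but spread over two cores
a <- mclapply(a, function(v) matrix(v, ncol = 3, byrow = TRUE), mc.cores = 2L)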

Expand a large dataframe keeping two variables the same and every possible combination with a third

Here is one solution. First, let's read your data:

df <- read.table(text="date          kid  kid2  sums  
01/01/2012 A 12 123
01/10/2012 A 15 100
01/03/2012 B 10 900
01/01/2012 C 10 100", header=TRUE)

Then convert the date into Date format:

df$date <- as.Date(df$date, format="%m/%d/%Y")

Now I will create a vector with all the dates that you need, from January 1 to 31.

dates <- seq(as.Date("01/01/2012", format="%m/%d/%Y"),as.Date("01/31/2012", format="%m/%d/%Y"), by="day") 

With that we can create a new data.frame with all combinations of the dates and kids:

df2 <- merge(dates, df[, c(-1, -4)], by = NULL)
names(df2)[1] <- "date"

To get the original sums back, we merge the two data frames, keeping all rows, and reorder to get the output in the order you want:

df3 <- merge(df, df2, all = TRUE)
df3 <- df3[order(df3$kid, df3$kid2, df3$date), ]

And finally, if you want, you can replace the NAs with 0s:

df3 <- replace(df3, is.na(df3), 0)

Expand a data frame by group

Use tidyr::pivot_wider with the names_glue argument as follows:

  • Store the names of all the variables to be pivoted (even 500 of them) in a vector, say cols
  • Pass values_from = all_of(cols) as an argument to pivot_wider

cols <- c('X1', 'X2', 'X5')
df %>% pivot_wider(id_cols = grp, names_from = X, values_from = all_of(cols),
names_glue = '{X}-{.value}')

# A tibble: 2 x 10
  grp        `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
  <chr>       <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
1 2020_01_19     23     13     23     47     45     41      3     54     21
2 2020_01_20     65     39     43     32     52     76     19     12     90

If you want to use all columns except the first two, use this:

df %>% pivot_wider(id_cols = grp, names_from = X, values_from = !c(grp, X), 
names_glue = '{X}-{.value}')

# A tibble: 2 x 10
  grp        `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
  <chr>       <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
1 2020_01_19     23     13     23     47     45     41      3     54     21
2 2020_01_20     65     39     43     32     52     76     19     12     90

However, if you want to rearrange the columns as shown in the expected outcome, you may use names_vary = 'slowest' in pivot_wider (available since tidyr 1.2.0), as sketched below.
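Here is a minimal sketch of that call, reusing the data from above. With names_vary = 'slowest' the names_from value varies slowest, so the columns come out grouped as 1-X1, 1-X2, 1-X5, 2-X1, ... instead of 1-X1, 2-X1, 5-X1, ...

df %>% pivot_wider(id_cols = grp, names_from = X, values_from = !c(grp, X),
names_glue = '{X}-{.value}', names_vary = 'slowest')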

Expand nested dataframe cell in long format

You can use unnest() in tidyr to expand a nested column.

tidyr::unnest(df, part_list)

# # A tibble: 3 x 2
# chapterid part_list
# <chr> <chr>
# 1 a c
# 2 a d
# 3 b e

Data

df <- data.frame(chapterid = c("a", "b"))
df$part_list <- list(c("c", "d"), "e")

# chapterid part_list
# 1 a c, d
# 2 b e

expand data frames inside data frame

We can use unnest from library(tidyr)

library(tidyr)
unnest(df1, rdm)
# Source: local data frame [6 x 4]

# language sessionID V1 V2
# (chr) (dbl) (int) (int)
#1 Dutch 13257 1 2
#2 Dutch 13257 2 3
#3 Dutch 13257 3 4
#4 Dutch 125354 4 5
#5 Dutch 125354 5 6
#6 Dutch 125354 6 7

data

library(dplyr)
df1 <- data_frame(language=c('Dutch', 'Dutch'), sessionID=c(13257, 125354),
rdm= list(data.frame(V1=1:3, V2=2:4), data.frame(V1=4:6, V2=5:7)))

R: how to expand a row containing a list to several rows...one for each list member?

I've grown to really love data.table for this kind of task. It is so very simple. But first, let's make some sample data (which you should ideally provide!)

#  Sample data
set.seed(1)
df = data.frame( pep = replicate( 3 , paste( sample(999,3) , collapse=";") ) , pro = sample(3) , stringsAsFactors = FALSE )

Now we use the data.table package to do the reshaping in a couple of lines...

#  Load data.table package
require(data.table)

# Turn data.frame into data.table, which looks like..
dt <- data.table(df)
# pep pro
#1: 266;372;572 1
#2: 908;202;896 3
#3: 944;660;628 2

# Transform it in one line like this...
dt[ , list( pep = unlist( strsplit( pep , ";" ) ) ) , by = pro ]
# pro pep
#1: 1 266
#2: 1 372
#3: 1 572
#4: 3 908
#5: 3 202
#6: 3 896
#7: 2 944
#8: 2 660
#9: 2 628

How to expand a data.frame according to one of its columns?

We can do this without using any library i.e. using only base R

data.frame(value = with(df, match(more.strings, strings)), 
strings = more.strings)
# value strings
#1 5 c
#2 1 e
#3 2 g
#4 NA a
#5 NA d
#6 3 h
#7 NA f

Or we can use complete from tidyr:

library(tidyverse)
complete(df, strings = more.strings) %>%
arrange(match(strings, more.strings)) %>%
select(names(df))
# A tibble: 7 x 2
# values strings
# <int> <chr>
#1 5 c
#2 1 e
#3 2 g
#4 NA a
#5 NA d
#6 3 h
#7 NA f

Fast and efficient way to expand a dataset in R

The core of the problem is the expansion of the values in the Key columns into i.

Here is another data.table solution employing melt() but differing in implementation details from David's comment:

library(data.table)
DT <- data.table(dataset1)
expanded <- melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
, .(i = seq_len(value)), by = .(Year, col)]
expanded
      Year col   i
   1: 2001   1   1
   2: 2001   1   2
   3: 2001   1   3
   4: 2001   1   4
   5: 2001   1   5
  ---
2571: 2003   4 381
2572: 2003   4 382
2573: 2003   4 383
2574: 2003   4 384
2575: 2003   4 385

The remaining computations can be done like this (if I've understood the OP's intention right):

set.seed(123L) # make results reproducible
res.df <- expanded[, p := runif(.N)][, value := 5 * (col - 1L + p)][]
res.df
      Year col   i         p     value
   1: 2001   1   1 0.2875775  1.437888
   2: 2001   1   2 0.7883051  3.941526
   3: 2001   1   3 0.4089769  2.044885
   4: 2001   1   4 0.8830174  4.415087
   5: 2001   1   5 0.9404673  4.702336
  ---
2571: 2003   4 381 0.4711072 17.355536
2572: 2003   4 382 0.5323359 17.661680
2573: 2003   4 383 0.3953954 16.976977
2574: 2003   4 384 0.4544372 17.272186
2575: 2003   4 385 0.1149009 15.574505


Benchmarking the different approaches

As the OP is asking for a faster / more efficient way, the three different approaches proposed so far are being benchmarked:

  • David's data.table solution plus a modification which ensures the result is identical to the expected result
  • ycw's tidyverse solution
  • my data.table solution

Benchmark code

For benchmarking, the microbenchmark package is used.

library(magrittr)
bm <- microbenchmark::microbenchmark(
david1 = {
expanded_david1 <-
setorder(
melt(DT, id = "Year", value = "i", variable = "col")[rep(1:.N, i)], Year, col
)[, i := seq_len(.N), by = .(Year, col)]
},
david2 = {
expanded_david2 <-
setorder(
melt(DT, id = "Year", value = "i", variable = "col")[, col := as.integer(col)][
rep(1:.N, i)], Year, col)[, i := seq_len(.N), by = .(Year, col)]
},
uwe = {
expanded_uwe <-
melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
, .(i = seq_len(value)), by = .(Year, col)]
},
ycw = {
expanded_ycw <- DT %>%
tidyr::gather(col, i, - Year) %>%
dplyr::mutate(col = as.integer(sub("Key", "", col)) - 1L) %>%
dplyr::rowwise() %>%
dplyr::do(tibble::data_frame(Year = .$Year, col = .$col, i = seq(1L, .$i, 1L))) %>%
dplyr::select(Year, i, col) %>%
dplyr::arrange(Year, col, i)
},
times = 100L
)
bm

Note that references to tidyverse functions are made explicit in order to avoid name conflicts from a cluttered namespace. The modified david2 variant converts the factor column col created by melt() into its integer level codes.
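As a quick aside illustrating that conversion (not part of the benchmark itself): as.integer() applied to a factor returns the underlying level codes, which is what the david2 variant relies on.

f <- factor(c("Key2", "Key3", "Key5"))
as.integer(f)
# [1] 1 2 3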

Timing the small sample data set

With the small sample data set provided by the OP (3 years and 4 Key columns), the timings are as follows:

Unit: microseconds
   expr       min         lq        mean    median         uq        max neval
 david1   993.418  1161.4415   1260.4053  1244.320   1350.987   2000.805   100
 david2  1261.500  1393.2760   1624.5298  1568.097   1703.837   5233.280   100
    uwe   825.772   865.4175    979.2129   911.860   1084.226   1409.890   100
    ycw 93063.262 97798.7005 100423.5148 99226.525 100599.600 205695.817   100

Even for this small problem size, the data.table solutions are orders of magnitude faster than the tidyverse approach, with a slight advantage for solution uwe.

The results are checked to be equal:

all.equal(expanded_david1[, col := as.integer(col)][order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_david2[order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_ycw, expanded_uwe)
#[1] TRUE

Apart from david1, which returns factors instead of integers and a different row order (hence the conversion and reordering in the checks above), all four results are identical.

Larger benchmark case

From the OP's code it can be concluded that the production data set consists of 10 years and 24 Key columns. In the sample data set, the overall mean of the Key values is 215. With these parameters, a larger data set is created:

n_yr <- 10L
n_col <- 24L
avg_key <- 215L
col_names <- sprintf("Key%02i", 1L + seq_len(n_col))
DT <- data.table(Year = seq(2001L, by = 1L, length.out = n_yr))
DT[, (col_names) := avg_key]

The larger data set expands to 51600 rows, which is still of rather moderate size but about 20 times larger than the small sample. Timings are as follows:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 david1    2.512805    2.648735    2.726743    2.697065    2.698576    3.076535     5
 david2    2.791838    2.816758    2.998828    3.068605    3.075780    3.241160     5
    uwe    1.329088    1.453312    1.585390    1.514857    1.634551    1.995142     5
    ycw 1641.527166 1643.979936 1646.004905 1645.091158 1646.599219 1652.827047     5

For this problem size, uwe is nearly twice as fast as the other data.table implementations. The tidyverse approach is still orders of magnitude slower.

split and expand.grid by group on large data set

One possible solution which avoids repetitions of the same pair as well as different orders is using the data.table and combinat packages:

library(data.table)
setDT(df)[order(id), data.table(combinat::combn2(unique(id))), by = group]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

order(id) is used here just for convenience to better check the results but can be skipped in production code.

Replace combn2() by a non-equi join

There is another approach where the call to combn2() is replaced by a non-equi join:

mdf <- setDT(df)[order(id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
allow.cartesian = TRUE]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

Note that the non-equi join requires the data to be ordered.

Benchmark

The second method seems to be much faster:

# create benchmark data
nr <- 1.2e5L # number of rows
rg <- 8L # number of ids within each group
ng <- nr / rg # number of groups
set.seed(1L)
df2 <- data.table(
id = sample.int(rg, nr, TRUE),
group = sample.int(ng, nr, TRUE)
)

#benchmark code
microbenchmark::microbenchmark(
combn2 = df2[order(group, id), data.table((combinat::combn2(unique(id)))), by = group],
nej = {
mdf <- df2[order(group, id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
allow.cartesian = TRUE]},
times = 1L)

For 120000 rows and 14994 groups the timings are:

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 combn2 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115     1
    nej   137.3228   137.3228   137.3228   137.3228   137.3228   137.3228     1

Caveat

As pointed out by the OP, the number of ids per group is crucial in terms of memory consumption and speed. The number of combinations is O(n^2), more precisely n * (n - 1) / 2 or choose(n, 2L), where n is the number of unique ids in a group.

The size of the largest group can be found by

df2[, uniqueN(id), by = group][, max(V1)]

The total number of rows in the final result can be computed in advance by

df2[, uniqueN(id), by = group][, sum(choose(V1, 2L))]

