How to Expand a Large Dataframe in R

How to expand a large dataframe in R

expand.grid is a useful function here:

mergedData <- merge(
expand.grid(id = unique(df$id), spp = unique(df$spp)),
df, by = c("id", "spp"), all = TRUE)

mergedData[is.na(mergedData$y), ]$y <- 0

mergedData$date <- rep(levels(df$date),
each = length(levels(df$spp)))

Since you're not actually doing anything to subsets of the data, I don't think plyr will help; there may be more efficient ways to do this with data.table (see the sketch below).
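For reference, here is a minimal data.table sketch of the same idea (my own addition, not from the original answer), assuming df has the id, spp and y columns used in the merge() call above:

library(data.table)
dt <- as.data.table(df)
# CJ() builds every id/spp combination; joining it back to dt keeps all of them
full <- dt[CJ(id = unique(id), spp = unique(spp)), on = c("id", "spp")]
# combinations that had no matching row come back with NA in y, so fill them with 0
full[is.na(y), y := 0]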

expand large data frame in R efficiently

Solution with base functions:

# split column by all available separators 
a <- strsplit(example.df$more.info, "; |#|;")
# represent each result as a matrix with 3 columns
a <- lapply(a, function(v) matrix(v, ncol=3, byrow=TRUE))
# combine all matrices into one big matrix
aa <- do.call(rbind, a)
# create indices of the rows of the initial data.frame that correspond to the rows of the big matrix
b <- unlist(sapply(seq_along(a), function(i) rep(i, nrow(a[[i]]))))
# combine initial data.frame and created big matrix
df <- cbind(example.df[b,], aa)
# remove unnecessary columns and rename remaining ones
df <- df[,-3]
colnames(df)[3:5] <- c("class", "topic", "grade")

To increase the speed you may replace the lapply()/sapply() calls in my code with parallel::mclapply() (see the sketch below).

I cannot compare the speed since your dataset is very small.
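For example, here is a minimal sketch of that swap, assuming a Unix-alike system where parallel::mclapply() can fork (it does not parallelise on Windows):

library(parallel)
a <- strsplit(example.df$more.info, "; |#|;")
# same per-element reshaping as above, but spread over two cores
a <- mclapply(a, function(v) matrix(v, ncol = 3, byrow = TRUE), mc.cores = 2L)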

Expand a large dataframe keeping two variables the same and every possible combination with a third

Here is one solution. First, let's read your data:

df <- read.table(text="date          kid  kid2  sums  
01/01/2012 A 12 123
01/10/2012 A 15 100
01/03/2012 B 10 900
01/01/2012 C 10 100", header=TRUE)

Then convert the date into Date format:

df$date <- as.Date(df$date, format="%m/%d/%Y")

Now I will create a vector with all the dates that you need, from January 1 to 31.

dates <- seq(as.Date("01/01/2012", format="%m/%d/%Y"),as.Date("01/31/2012", format="%m/%d/%Y"), by="day") 

With that we can create a new data.frame with all combinations of the dates and kids:

df2 <- merge(dates, df[, c(-1, -4)], by = NULL)
names(df2)[1] <- "date"

To get the original sums back, we merge the two data frames, keeping all rows, and reorder to get the output in the order you want:

df3 <- merge(df, df2, all = TRUE)
df3 <- df3[order(df3$kid, df3$kid2, df3$date), ]

And finally, if you want, you can replace the NAs with 0s:

df3 <- replace(df3, is.na(df3), 0)

Expand a data frame by group

Use tidyr::pivot_wider with the names_glue argument as follows:

  • Store the names of all the variables to be pivoted (even 500 of them) in a vector, say cols
  • Pass values_from = all_of(cols) as an argument to pivot_wider

cols <- c('X1', 'X2', 'X5')
df %>% pivot_wider(id_cols = grp, names_from = X, values_from = all_of(cols),
names_glue = '{X}-{.value}')

# A tibble: 2 x 10
  grp        `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
  <chr>       <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
1 2020_01_19     23     13     23     47     45     41      3     54     21
2 2020_01_20     65     39     43     32     52     76     19     12     90

If you want to use all columns except the first two, use this:

df %>% pivot_wider(id_cols = grp, names_from = X, values_from = !c(grp, X), 
names_glue = '{X}-{.value}')

# A tibble: 2 x 10
  grp        `1-X1` `2-X1` `5-X1` `1-X2` `2-X2` `5-X2` `1-X5` `2-X5` `5-X5`
  <chr>       <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
1 2020_01_19     23     13     23     47     45     41      3     54     21
2 2020_01_20     65     39     43     32     52     76     19     12     90

However, if you want to rearrange the columns as shown in the expected outcome, you may use names_vary = 'slowest' in pivot_wider (available since tidyr 1.2.0), as sketched below.
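Here is a minimal sketch of that call, reusing the data from above. With names_vary = 'slowest' the names_from value varies slowest, so the columns come out grouped as 1-X1, 1-X2, 1-X5, 2-X1, ... instead of 1-X1, 2-X1, 5-X1, ...

df %>% pivot_wider(id_cols = grp, names_from = X, values_from = !c(grp, X),
names_glue = '{X}-{.value}', names_vary = 'slowest')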

Expand nested dataframe cell in long format

You can use unnest() in tidyr to expand a nested column.

tidyr::unnest(df, part_list)

# # A tibble: 3 x 2
# chapterid part_list
# <chr> <chr>
# 1 a c
# 2 a d
# 3 b e

Data

df <- data.frame(chapterid = c("a", "b"))
df$part_list <- list(c("c", "d"), "e")

# chapterid part_list
# 1 a c, d
# 2 b e

expand data frames inside data frame

We can use unnest from library(tidyr)

library(tidyr)
unnest(df1, rdm)
# Source: local data frame [6 x 4]

# language sessionID V1 V2
# (chr) (dbl) (int) (int)
#1 Dutch 13257 1 2
#2 Dutch 13257 2 3
#3 Dutch 13257 3 4
#4 Dutch 125354 4 5
#5 Dutch 125354 5 6
#6 Dutch 125354 6 7

data

library(dplyr)
df1 <- data_frame(language=c('Dutch', 'Dutch'), sessionID=c(13257, 125354),
rdm= list(data.frame(V1=1:3, V2=2:4), data.frame(V1=4:6, V2=5:7)))

R: how to expand a row containing a list to several rows...one for each list member?

I've grown to really love data.table for this kind of task. It is so very simple. But first, let's make some sample data (which you should ideally provide!)

#  Sample data
set.seed(1)
df = data.frame( pep = replicate( 3 , paste( sample(999,3) , collapse=";") ) , pro = sample(3) , stringsAsFactors = FALSE )

Now we use the data.table package to do the reshaping in a couple of lines...

#  Load data.table package
require(data.table)

# Turn data.frame into data.table, which looks like..
dt <- data.table(df)
# pep pro
#1: 266;372;572 1
#2: 908;202;896 3
#3: 944;660;628 2

# Transform it in one line like this...
dt[ , list( pep = unlist( strsplit( pep , ";" ) ) ) , by = pro ]
# pro pep
#1: 1 266
#2: 1 372
#3: 1 572
#4: 3 908
#5: 3 202
#6: 3 896
#7: 2 944
#8: 2 660
#9: 2 628

How to expand a data.frame according to one of its columns?

We can do this without using any library i.e. using only base R

data.frame(value = with(df, match(more.strings, strings)), 
strings = more.strings)
# value strings
#1 5 c
#2 1 e
#3 2 g
#4 NA a
#5 NA d
#6 3 h
#7 NA f

Or we can use complete from tidyr:

library(tidyverse)
complete(df, strings = more.strings) %>%
arrange(match(strings, more.strings)) %>%
select(names(df))
# A tibble: 7 x 2
# values strings
# <int> <chr>
#1 5 c
#2 1 e
#3 2 g
#4 NA a
#5 NA d
#6 3 h
#7 NA f

Fast and efficient way to expand a dataset in R

The core of the problem is the expansion of the values in the Key columns into i.

Here is another data.table solution employing melt() but differing in implementation details from David's comment:

library(data.table)
DT <- data.table(dataset1)
expanded <- melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
, .(i = seq_len(value)), by = .(Year, col)]
expanded
      Year col   i
   1: 2001   1   1
   2: 2001   1   2
   3: 2001   1   3
   4: 2001   1   4
   5: 2001   1   5
  ---
2571: 2003   4 381
2572: 2003   4 382
2573: 2003   4 383
2574: 2003   4 384
2575: 2003   4 385

The remaining computations can be done like this (if I've understood the OP's intention right):

set.seed(123L) # make results reproducible
res.df <- expanded[, p := runif(.N)][, value := 5 * (col - 1L + p)][]
res.df
      Year col   i         p     value
   1: 2001   1   1 0.2875775  1.437888
   2: 2001   1   2 0.7883051  3.941526
   3: 2001   1   3 0.4089769  2.044885
   4: 2001   1   4 0.8830174  4.415087
   5: 2001   1   5 0.9404673  4.702336
  ---
2571: 2003   4 381 0.4711072 17.355536
2572: 2003   4 382 0.5323359 17.661680
2573: 2003   4 383 0.3953954 16.976977
2574: 2003   4 384 0.4544372 17.272186
2575: 2003   4 385 0.1149009 15.574505


Benchmarking the different approaches

As the OP is asking for a faster / more efficient way, the three different approaches proposed so far are being benchmarked:

  • David's data.table solution plus a modification which ensures the result is identical to the expected result
  • ycw's tidyverse solution
  • my data.table solution

Benchmark code

For benchmarking, the microbenchmark package is used.

library(magrittr)
bm <- microbenchmark::microbenchmark(
david1 = {
expanded_david1 <-
setorder(
melt(DT, id = "Year", value = "i", variable = "col")[rep(1:.N, i)], Year, col
)[, i := seq_len(.N), by = .(Year, col)]
},
david2 = {
expanded_david2 <-
setorder(
melt(DT, id = "Year", value = "i", variable = "col")[, col := as.integer(col)][
rep(1:.N, i)], Year, col)[, i := seq_len(.N), by = .(Year, col)]
},
uwe = {
expanded_uwe <-
melt(DT, id.vars = "Year", variable = "col")[, col := rleid(col)][
, .(i = seq_len(value)), by = .(Year, col)]
},
ycw = {
expanded_ycw <- DT %>%
tidyr::gather(col, i, - Year) %>%
dplyr::mutate(col = as.integer(sub("Key", "", col)) - 1L) %>%
dplyr::rowwise() %>%
dplyr::do(tibble::data_frame(Year = .$Year, col = .$col, i = seq(1L, .$i, 1L))) %>%
dplyr::select(Year, i, col) %>%
dplyr::arrange(Year, col, i)
},
times = 100L
)
bm

Note that references to tidyverse functions are made explicit in order to avoid name conflicts from a cluttered namespace. The modified david2 variant converts the factor column col created by melt() into its integer level codes.
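As a quick aside illustrating that conversion (not part of the benchmark itself): as.integer() applied to a factor returns the underlying level codes, which is what the david2 variant relies on.

f <- factor(c("Key2", "Key3", "Key5"))
as.integer(f)
# [1] 1 2 3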

Timing the small sample data set

With the small sample data set provided by the OP (3 years and 4 Key columns), the timings are as follows:

Unit: microseconds
   expr       min         lq        mean    median         uq        max neval
 david1   993.418  1161.4415   1260.4053  1244.320   1350.987   2000.805   100
 david2  1261.500  1393.2760   1624.5298  1568.097   1703.837   5233.280   100
    uwe   825.772   865.4175    979.2129   911.860   1084.226   1409.890   100
    ycw 93063.262 97798.7005 100423.5148 99226.525 100599.600 205695.817   100

Even for this small problem size, the data.table solutions are orders of magnitude faster than the tidyverse approach, with a slight advantage for solution uwe.

The results are checked to be equal:

all.equal(expanded_david1[, col := as.integer(col)][order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_david2[order(col, Year)], expanded_uwe)
#[1] TRUE
all.equal(expanded_ycw, expanded_uwe)
#[1] TRUE

Apart from david1, which returns factors instead of integers and a different row order (hence the conversion and reordering in the checks above), all four results are identical.

Larger benchmark case

From the OP's code it can be concluded that the production data set consists of 10 years and 24 Key columns. In the sample data set, the overall mean of the Key values is 215. With these parameters, a larger data set is created:

n_yr <- 10L
n_col <- 24L
avg_key <- 215L
col_names <- sprintf("Key%02i", 1L + seq_len(n_col))
DT <- data.table(Year = seq(2001L, by = 1L, length.out = n_yr))
DT[, (col_names) := avg_key]

The larger data set expands to 51600 rows, which is still of rather moderate size but about 20 times larger than the small sample. Timings are as follows:

Unit: milliseconds
   expr         min          lq        mean      median          uq         max neval
 david1    2.512805    2.648735    2.726743    2.697065    2.698576    3.076535     5
 david2    2.791838    2.816758    2.998828    3.068605    3.075780    3.241160     5
    uwe    1.329088    1.453312    1.585390    1.514857    1.634551    1.995142     5
    ycw 1641.527166 1643.979936 1646.004905 1645.091158 1646.599219 1652.827047     5

For this problem size, uwe is nearly twice as fast as the other data.table implementations. The tidyverse approach is still orders of magnitude slower.

split and expand.grid by group on large data set

One possible solution which avoids repetitions of the same pair as well as different orders is using the data.table and combinat packages:

library(data.table)
setDT(df)[order(id), data.table(combinat::combn2(unique(id))), by = group]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

order(id) is used here just for convenience to better check the results but can be skipped in production code.

Replace combn2() by a non-equi join

There is another approach where the call to combn2() is replaced by a non-equi join:

mdf <- setDT(df)[order(id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
allow.cartesian = TRUE]
     group        V1        V2
1: 2365686 209044052 209044061
2: 2365686 209044052 209044062
3: 2365686 209044061 209044062
4:  387969 209044061 209044062
5:  388978 209044061 209044062
6: 2278460 209044182 209044183

Note that the non-equi join requires the data to be ordered.

Benchmark

The second method seems to be much faster:

# create benchmark data
nr <- 1.2e5L # number of rows
rg <- 8L # number of ids within each group
ng <- nr / rg # number of groups
set.seed(1L)
df2 <- data.table(
id = sample.int(rg, nr, TRUE),
group = sample.int(ng, nr, TRUE)
)

#benchmark code
microbenchmark::microbenchmark(
combn2 = df2[order(group, id), data.table((combinat::combn2(unique(id)))), by = group],
nej = {
mdf <- df2[order(group, id), unique(id), by = group]
mdf[mdf, on = .(group, V1 < V1), .(group, x.V1, i.V1), nomatch = 0L,
allow.cartesian = TRUE]},
times = 1L)

For 120000 rows and 14994 groups the timings are:

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 combn2 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115 10259.1115     1
    nej   137.3228   137.3228   137.3228   137.3228   137.3228   137.3228     1

Caveat

As pointed out by the OP, the number of ids per group is crucial in terms of memory consumption and speed. The number of combinations is O(n^2), more precisely n * (n - 1) / 2 or choose(n, 2L), where n is the number of unique ids in a group.

The size of the largest group can be found by

df2[, uniqueN(id), by = group][, max(V1)]

The total number of rows in the final result can be computed in advance by

df2[, uniqueN(id), by = group][, sum(choose(V1, 2L))]

