Alternative to expand.grid for data.frames

Why not just something like df[rep(1:nrow(df), times = 3), ] to extend the data frame, and then add the extra column just as you have above, with df$Time <- rep(1:lengthTime, each = nrRow)?
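A minimal sketch of that approach, assuming a toy df and the lengthTime / nrRow names from the question:

df <- data.frame(x = 1:2, y = c("a", "b"))
lengthTime <- 3                                # number of time points to cross with
nrRow <- nrow(df)

out <- df[rep(1:nrRow, times = lengthTime), ]  # stack df lengthTime times
out$Time <- rep(1:lengthTime, each = nrRow)    # each time point repeated nrRow times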

Replicate `expand.grid()` behavior with data.frames using tidyr/data.table

Try with do.call, which splices the data.frame's columns into the called function as named vector arguments. Note that tidyr::crossing() sorts each column's values, while tidyr::expand_grid() preserves the input order.
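These examples assume df is a plain two-column data.frame; for instance, the following reproduces the outputs shown:

df <- data.frame(V1 = c(0.3, 0.7), V2 = c(0.6, 0.4))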

> do.call(tidyr::expand_grid, df)
# A tibble: 4 x 2
V1 V2
<dbl> <dbl>
1 0.3 0.6
2 0.3 0.4
3 0.7 0.6
4 0.7 0.4

> do.call(tidyr::crossing, df)
# A tibble: 4 x 2
V1 V2
<dbl> <dbl>
1 0.3 0.4
2 0.3 0.6
3 0.7 0.4
4 0.7 0.6

> do.call(data.table::CJ, df)
V1 V2
1: 0.3 0.4
2: 0.3 0.6
3: 0.7 0.4
4: 0.7 0.6

expand.grid equivalent to get pandas data frame for prediction in Python

In pandas we can use MultiIndex.from_product:

import pandas as pd

d = {'a': [1, 2, 3], 'b': [4, 5]}
out = pd.MultiIndex.from_product(d.values(), names=d.keys()).to_frame().reset_index(drop=True)
Out[58]:
a b
0 1 4
1 1 5
2 2 4
3 2 5
4 3 4
5 3 5

or simply with itertools

import itertools
out = pd.DataFrame(itertools.product(*d.values()),columns=d.keys())
Out[62]:
a b
0 1 4
1 1 5
2 2 4
3 2 5
4 3 4
5 3 5

expand.grid for data.frame using dplyr full_join

I can reproduce the corrupted result for your dfList on my machine, and I think I've found out why it's happening.

require(dplyr)

adf <- data.frame(c1 = 7, c1 = 8, jv = 1, check.names = F)
bdf <- data.frame(d1 = 1:3, d2 = letters[1:3], jv = 1)
cdf <- data.frame(v1.x = 1:3, v2 = letters[1:3], jv = 1)
ddf <- data.frame(v2 = 4:5, v2.x = letters[4:5], jv = 1)

full_join(adf, bdf, by = "jv")

c1 c1 jv d1 d2
1 7 7 1 1 a
2 7 7 1 2 b
3 7 7 1 3 c

We can see that duplicated column names in adf lead to a wrong join result. And when we apply a chain of several joins with the help of Reduce, duplicated column names get renamed automatically (with .x and .y suffixes by default). This renaming can itself produce new duplicated names - the opposite of what it's intended to avoid.

full_join(cdf, ddf, by = "jv")

v1.x v2.x jv v2.y v2.x
1 1 a 1 4 d
2 1 a 1 5 e
3 2 b 1 4 d
4 2 b 1 5 e
5 3 c 1 4 d
6 3 c 1 5 e

Here we had one name duplicated across the two data.frames - column v2 - which after the suffixes were applied turned into a new duplicate, v2.x.

So, to make things work well, we should make sure the column names across the data.frames we're joining are unique.

I've tried several approaches to get the desired result and present them below:

  • a base R solution using merge, included for speed comparison
  • an approach using full_join from the dplyr package
  • a data.table solution using sequential merges of data.tables
  • a function based on tidyr's unnest
  • another data.table solution, which first generates a key table of the target resulting length (with the help of CJ) and then performs several left joins
  • the same as the previous one, but using the on parameter for joining instead of setting keys

require(data.table)
require(dplyr)
require(tidyr)
require(stringi)
require(microbenchmark)

expand.grid.df_base <- function(...) {
  dfList <- list(...)
  if (length(dfList) == 1) dfList <- dfList[[1]]

  if (is.null(names(dfList))) names(dfList) <- paste0("df", 1:length(dfList))

  lapply(1:length(dfList), function(i)
    data.frame(dfN = i, colN = 1:length(dfList[[i]]),
               dfname = names(dfList)[i], colname = names(dfList[[i]]),
               stringsAsFactors = F)) %>% bind_rows %>%
    mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>%
    mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names

  for (i in 1:nrow(manage_names)) names(dfList[[manage_names$dfN[i]]])[manage_names$colN[i]] <- manage_names$dum_names[i]

  Reduce(function(x, y) merge(x, y, by = NULL), dfList) %>% setNames(manage_names$out_names)
}

expand.grid.df_dplyr <- function(...) {
  dfList <- list(...)
  if (length(dfList) == 1) dfList <- dfList[[1]]

  if (is.null(names(dfList))) names(dfList) <- paste0("df", 1:length(dfList))

  lapply(1:length(dfList), function(i)
    data.frame(dfN = i, colN = 1:length(dfList[[i]]),
               dfname = names(dfList)[i], colname = names(dfList[[i]]),
               stringsAsFactors = F)) %>% bind_rows %>%
    mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>%
    mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names

  for (i in 1:nrow(manage_names)) names(dfList[[manage_names$dfN[i]]])[manage_names$colN[i]] <- manage_names$dum_names[i]

  joinvar <- stri_rand_strings(1, 12)

  Reduce(function(x, y) {
    mutate_def <- list(1L)
    names(mutate_def) <- joinvar

    full_join(x %>% mutate_(.dots = mutate_def), y %>% mutate_(.dots = mutate_def), by = joinvar)
  }, dfList) %>% select(-contains(joinvar)) %>% setNames(manage_names$out_names) %>% tbl_df
}

expand.grid.dt <- function(...) {
  dtList <- list(...)
  if (length(dtList) == 1) dtList <- dtList[[1]]

  if (!all(sapply(dtList, is.data.table))) dtList <- lapply(dtList, as.data.table)

  if (is.null(names(dtList))) setnames(dtList, paste0("dt", 1:length(dtList)))

  lapply(1:length(dtList), function(i)
    data.frame(dfN = i, colN = 1:length(dtList[[i]]),
               dfname = names(dtList)[i], colname = names(dtList[[i]]),
               stringsAsFactors = F)) %>% bind_rows %>%
    mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>%
    mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names

  for (i in 1:nrow(manage_names)) setnames(dtList[[manage_names$dfN[i]]], old = manage_names$colN[i], new = manage_names$dum_names[i])

  joinvar <- stri_rand_strings(1, 12)

  setnames(Reduce(function(x, y) merge(copy(x)[, (joinvar) := 1], copy(y)[, (joinvar) := 1],
                                       by = joinvar, all = T, allow.cartesian = T), dtList)[, (joinvar) := NULL],
           manage_names$out_names)[]
}

expand.grid.df_tidyr <- function(...) {
  dfList <- list(...)
  if (length(dfList) == 1) dfList <- dfList[[1]]

  if (is.null(names(dfList))) names(dfList) <- paste0("df", 1:length(dfList))

  lapply(1:length(dfList), function(i)
    data.frame(dfN = i, colN = 1:length(dfList[[i]]),
               dfname = names(dfList)[i], colname = names(dfList[[i]]),
               stringsAsFactors = F)) %>% bind_rows %>%
    mutate(dum_names = stri_rand_strings(nrow(.), 12)) %>% rowwise %>%
    mutate(out_names = paste(dfname, colname, sep = ".")) %>% ungroup -> manage_names

  for (i in 1:nrow(manage_names)) names(dfList[[manage_names$dfN[i]]])[manage_names$colN[i]] <- manage_names$dum_names[i]

  Reduce(function(x, y) x %>% rowwise %>% mutate(dfcol = list(y)) %>% ungroup %>% unnest(dfcol), dfList) %>%
    setNames(manage_names$out_names) %>% tbl_df
}

expand.grid.dt2 <- function(...) {
  dtList <- list(...)
  if (length(dtList) == 1) dtList <- dtList[[1]]

  dum_names <- stri_rand_strings(length(dtList), 12)

  dtList <- lapply(1:length(dtList), function(i)
    setkeyv(as.data.table(dtList[[i]])[, (dum_names[i]) := .I], dum_names[i]))

  # lapply (not sapply) below, so equal row counts don't get simplified to a matrix
  Reduce(function(result, dt) setkeyv(result, names(result)[1])[dt][, (names(result)[1]) := NULL],
         dtList,
         setnames(do.call(CJ, c(lapply(dtList, function(df) seq_len(nrow(df))), list(sorted = F))), dum_names))[]
}

expand.grid.dt3 <- function(...) {

  dtList <- list(...)
  if (length(dtList) == 1) dtList <- dtList[[1]]

  dum_names <- stri_rand_strings(length(dtList), 12)

  dtList <- lapply(1:length(dtList), function(i) as.data.table(dtList[[i]])[, (dum_names[i]) := .I])

  # lapply (not sapply) below, so equal row counts don't get simplified to a matrix
  Reduce(function(result, dt) result[dt, on = names(result)[1]][, (names(result)[1]) := NULL],
         dtList,
         setnames(do.call(CJ, c(lapply(dtList, function(df) seq_len(nrow(df))), list(sorted = F))), dum_names))[]
}
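All of the functions accept either several data.frames as separate arguments or a single list of data.frames (that is what the dfList handling at the top of each function is for). A hypothetical quick check:

expand.grid.dt3(data.frame(a = 1:2), data.frame(b = c("x", "y")))
expand.grid.dt3(list(data.frame(a = 1:2), data.frame(b = c("x", "y"))))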

Now let's create lists of data.frames for testing these functions.

set.seed(1)
bigdfList <- data.frame(type = sample(letters[1:10], 50, T),
                        categ = sample(LETTERS[1:10], 50, T),
                        num = sample(100L:500L, 50, T),
                        val = rnorm(50)) %>% split(., .$type)

smalldfList <- data.frame(type = sample(letters[1:5], 50, T),
                          categ = sample(LETTERS[1:5], 50, T),
                          num = sample(100L:500L, 50, T),
                          val = rnorm(50)) %>% split(., .$type)

The expand joining of smalldfList produces a table of dimension [60,480 x 20], and that of bigdfList a table of dimension [6,451,200 x 40], which occupies 1230.5 MB of RAM.

Start with smalldfList.

microbenchmark(expand.grid.df_base(smalldfList), expand.grid.df_dplyr(smalldfList),
               expand.grid.dt(smalldfList), expand.grid.df_tidyr(smalldfList),
               expand.grid.dt2(smalldfList), expand.grid.dt3(smalldfList), times = 10)

Unit: milliseconds
expr min lq mean median uq max neval cld
expand.grid.df_base(smalldfList) 178.36192 188.54955 201.28729 198.79644 209.86934 229.85360 10 b
expand.grid.df_dplyr(smalldfList) 16.04555 16.91327 18.91094 17.64907 18.45307 29.58192 10 a
expand.grid.dt(smalldfList) 20.33188 21.42275 26.30034 23.22873 31.66666 39.37922 10 a
expand.grid.df_tidyr(smalldfList) 722.06572 738.02188 801.41820 792.23725 859.96186 905.99190 10 c
expand.grid.dt2(smalldfList) 32.22650 33.68353 36.89386 36.39713 37.39182 48.93550 10 a
expand.grid.dt3(smalldfList) 29.13399 30.69299 34.51265 34.03198 37.48651 41.73543 10 a

So the tidyr solution is not an option here at all, and base merge is also quite slow. The other four functions show the following performance on bigdfList.

microbenchmark(expand.grid.df_dplyr(bigdfList), expand.grid.dt(bigdfList),
               expand.grid.dt2(bigdfList), expand.grid.dt3(bigdfList), times = 10)

Unit: seconds
expr min lq mean median uq max neval cld
expand.grid.df_dplyr(bigdfList) 1.326336 1.354706 1.456805 1.449781 1.481836 1.703158 10 a
expand.grid.dt(bigdfList) 1.763174 1.820004 1.894813 1.893910 1.939879 2.127097 10 b
expand.grid.dt2(bigdfList) 14.164731 14.332872 14.452933 14.452221 14.551982 14.740852 10 d
expand.grid.dt3(bigdfList) 10.589517 10.828548 11.104010 11.021519 11.368172 11.976976 10 c

And the dplyr::full_join solution has the best result!

Maybe it's one of the cases where dplyr is really better than data.table, or maybe it's my lack of data.table knowledge that has prevented me from making a really fast function :-)

R: extract inner higher level combinations (groups of 1, 2, 3, and 4 elements) out of a data frame of combinations of 5 elements

I'm not completely certain I've understood what you're looking for, but from the second question it seems you want all cross-combinations of column values within a data.frame.

Disclaimer: the two answers already provided are more readable; I focus on speed.

As you are performing what is often known as a cross-join (or outer-full-join), one aspect that quickly becomes a concern as n increases is efficiency. For efficiency it helps to split the problem into smaller sub-problems and find a solution for each. As we need to find all unique combinations within the set of columns, including the null set (value = NA), we can reduce this problem to 2 sub-problems.

  1. Find unique values for each column including the null set
  2. Expand this set to include all combinations of each column.

Using this idea we can quickly concoct a simple solution using expand.grid, unique and lapply. The only tricky part is to include the null set, but we can do this by selecting an NA row from the data.frame along with all the real rows - indexing a data.frame with NA yields a row of NA values.

# Create null-set-included data.frame
nullset_df <- plusminus_df[c(NA, seq_len(nrow(plusminus_df))), ]
# Find all unique elements, including null set
unique_df <- lapply(nullset_df, unique)
# Combine all unique sets
expand.grid(unique_df)

or as a function

nullgrid.expand <- function(df, ...)
  expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)

This is fairly fast (benchmarks and performance graphs later), but I wanted to go one step further. The data.table package is known for its high-performance functions, and one of those is the CJ function, for performing cross-joins. Below is one implementation using CJ:

library(data.table)
nullgrid.expand.dt <- function(df, ...)
  do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                       sorted = FALSE,
                       unique = TRUE))

The function requires vector input, forcing one to use do.call (or similar), which makes the function slightly less readable. But is there any performance gain? To test this, I ran a microbenchmark on the two functions and the ones provided by the existing answers (code below); the result is visualized in the violin plot below:

[Figure: microbenchmark violin plot comparing the four cross-join implementations]

From the plot it is quite clear that @paul's answer outperforms @ekoam's, but the two functions above outperform both of the provided answers in terms of speed. The question, however, said that the input might have any number of dimensions, so there is also the question of how well our functions scale with the number of columns and the number of unique values (here we only have "+" and "-", but what if we had more?). For this I reran the benchmark for n_columns = 3, 4, ..., 10 and n_values = 2, 4, ..., 10. The results are visualized with smoothed curves below.

First we'll visualize the time as a function of the number of columns. Note that the y axis is on a logarithmic scale (base 10) for easier comparison.

[Figure: benchmark timings as a function of the number of columns]

From the visualization it is quite clear that, with an increasing number of columns, the choice of method becomes very important. The suggestion by @ekoam becomes very slow, primarily because it delays the call to unique till the very end. The remaining 3 methods are all much faster, and nullgrid.expand.dt becomes more than 10 times faster than the remaining methods once we get past 8 columns of data.

Next let's look at the timing as a function of the number of unique values in each column (number of columns fixed at 5).

[Figure: benchmark timings as a function of the number of unique values per column]

Again we see a similar picture. Except for a single possible outlier for nullgrid.expand, which seems to become slower than @paul's answer as the number of unique values increases, we see that nullgrid.expand.dt remains the fastest, although by the time we reach 10 unique values it is only saving (exp(4) - exp(3.6)) / exp(3.6) ~ 50 % (i.e. it is roughly 1.5 times as fast) compared to @paul's answer.

Please note that I did not have enough RAM to run the benchmark for number of unique values or columns greater than the ones shown.

Conclusion

We've seen that there are many ways to reach the answer sought by the question, but as the number of columns and unique values increases, the choice of method becomes more and more important. By utilizing optimized libraries, we can drastically reduce the time required to get the cross-join of all column values with only minimal effort. With extended effort using Rcpp we could likely reduce the time even further, though that is outside the scope of my answer.

Benchmark code

# Setup:
set.seed(1234)
library(tidyverse)
library(data.table)
library(microbenchmark)
nullgrid.expand <- function(df, ...)
  expand.grid(lapply(df[c(NA, seq_len(nrow(df))), ], unique), ...)
nullgrid.expand.dt <- function(df, ...)
  do.call(CJ, args = c(as.list(df[c(NA, seq_len(nrow(df))), ]),
                       sorted = FALSE,
                       unique = TRUE))
markers <- LETTERS[1:5]
plusminus_df <- expand.grid(lapply(seq(markers), function(x) c("+", "-")), stringsAsFactors = FALSE)
names(plusminus_df) <- LETTERS[1:5]

bm <- microbenchmark(
  nullgrid.expand = nullgrid.expand(plusminus_df),
  nullgrid.expand.dt = nullgrid.expand.dt(plusminus_df),
  ekoam = unique(bind_rows(apply(
    plusminus_df, 1L,
    function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
  ))),
  paul = {
    plusminus_df %>%
      add_row() %>%
      map(unique) %>%
      expand.grid()
  },
  control = list(warmup = 5)
)
library(ggplot2)
autoplot(bm) + ggtitle('comparison between cross-join')

Timing function

time_function <- function(n = 5, j = 2){
  idx <- seq_len(n)
  df <- do.call(CJ, args = c(lapply(idx, function(x) as.character(seq_len(j))),
                             sorted = FALSE,
                             unique = TRUE))
  names(df) <- as.character(idx)
  microbenchmark(
    nullgrid.expand = nullgrid.expand(df),
    nullgrid.expand.dt = nullgrid.expand.dt(df),
    ekoam = unique(bind_rows(apply(
      df, 1L,
      function(r) head(expand.grid(lapply(r, c, NA_character_), stringsAsFactors = FALSE), -1L)
    ))),
    paul = {
      df %>%
        add_row() %>%
        map(unique) %>%
        expand.grid()
    },
    times = 10,
    control = list(warmup = 5)
  )
}
res <- lapply(seq(3, 10), time_function)
for(i in seq_along(res)){
  res[[i]]$n <- seq(3, 10)[i]
}
ggplot(rbindlist(res), aes(x = n, y = log(time / 10^4, base = 10), col = expr)) +
  geom_smooth(se = FALSE) +
  ggtitle('time-comparison given number of columns') +
  labs(y = 'log(ms)', x = 'n')
ggsave('so_2.png')

res <- lapply(c(seq(2, 10, 2)), time_function, n = 5)
for(i in seq_along(res)){
  res[[i]]$n <- seq(2, 10, 2)[i]
}
ggplot(rbindlist(res), aes(x = n, y = log(time / 10^4, base = 10), col = expr)) +
  geom_smooth(se = FALSE) +
  ggtitle('time-comparison given number of unique values') +
  labs(y = 'log(ms)', x = 'n unique values per column')
ggsave('so_3.png')

Generate combination of data frame and vector

This may not scale when your data frame has more than two columns, but you can just use expand.grid on the first column and then merge the second column in.

df <- data.frame(a = 1:3, b = 5:7)
c <- 9:10
combined <- expand.grid(a = df$a, c = c)  # all combinations of a and c
combined <- merge(combined, df)           # joins b back in via the shared column a
> combined[order(combined$c), ]
a c b
1 1 9 5
3 2 9 6
5 3 9 7
2 1 10 5
4 2 10 6
6 3 10 7

Combine all possible rows of data frame in R

tidyr::complete() is designed for this. I'm surprised I don't see a vanilla example on SO.
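The question's x isn't shown; an input along these lines reproduces the result below:

x <- data.frame(Col1 = factor(c("A", "B", "C", "D")),
                Col2 = factor(c("W", "X", "Y", "Z")))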

library(magrittr)
x %>%
  tidyr::complete(Col1, Col2)

Result:

# A tibble: 16 x 2
Col1 Col2
<fct> <fct>
1 A W
2 A X
3 A Y
4 A Z
5 B W
6 B X
7 B Y
8 B Z
9 C W
10 C X
11 C Y
12 C Z
13 D W
14 D X
15 D Y
16 D Z

If your real-world scenario is as simple as the OP's, @bouncyball's suggestion of expand.grid(x) is the cleanest. If your real-world scenario has more complexity, then tidyr::complete() might allow you to grow more easily. I commonly have more than the two ID variables to expand/complete, plus the analyses' dependent/outcome variables; the fill parameter allows you to specify a default value for those variables for combinations that don't appear in the observed dataset. Here's an SO example.
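As a minimal sketch of fill (with made-up data, not from the question):

library(tidyr)
d <- data.frame(g = c("a", "a", "b"), h = c("x", "y", "x"), count = c(1, 2, 3))
complete(d, g, h, fill = list(count = 0))
# the unobserved (b, y) combination appears with count = 0 instead of NA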

edited to reflect advice of @bouncyball and @ADuv.

How do you create a data frame using combinations of character strings?

expand.grid() will give you all the possible combinations of its vector arguments:

list <- c("liz", "doug", "stacy")  # the character vector of names (inferred from the output below)
number_of_variables <- 2

output <- expand.grid(data.frame(replicate(number_of_variables, list)))

>output
X1 X2
1 liz liz
2 doug liz
3 stacy liz
4 liz doug
5 doug doug
6 stacy doug
7 liz stacy
8 doug stacy
9 stacy stacy

To have every combination as a column, transpose the result with t(output).

If you want only the unique combinations (discarding those with the same elements regardless of order), use combn():

> data.frame(combn(list, number_of_variables))

X1 X2 X3
1 liz liz doug
2 doug stacy stacy

expand each unique combination with vector of dates

We can use crossing from tidyr
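df and dates aren't shown in the question; inputs along these lines reproduce the output:

df <- data.frame(d = factor(c("d1", "d2")), t = factor(c("t1", "t2")))
dates <- as.Date(c("2016-05-01", "2016-06-01", "2016-07-01"))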

tidyr::crossing(df, dates)

# A tibble: 6 x 3
# d t dates
# <fct> <fct> <date>
#1 d1 t1 2016-05-01
#2 d1 t1 2016-06-01
#3 d1 t1 2016-07-01
#4 d2 t2 2016-05-01
#5 d2 t2 2016-06-01
#6 d2 t2 2016-07-01

