De-Aggregate/Reverse-Summarise/Expand a Dataset in R


Without packages we can repeat each row according to the frequencies given:

df2 <- df[rep(seq_len(nrow(df)), df[, 5]), -5]
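
As a minimal sketch of this indexing trick, assume a made-up data frame whose fifth column holds the counts. The column names and values below are hypothetical and not taken from the question; the point is only to show how rep() builds the row index and what happens to a count of zero.

```r
# Hypothetical aggregated data: column 5 (n) says how often each row occurred
df <- data.frame(
  id    = 1:3,
  sex   = c("f", "m", "f"),
  site  = c("x", "x", "y"),
  score = c(10, 20, 30),
  n     = c(2, 0, 3)
)

# rep() repeats row index i exactly n[i] times; a count of 0 drops that row
idx <- rep(seq_len(nrow(df)), df[, 5])

# Select the repeated rows and drop the count column
df2 <- df[idx, -5]

print(df2)
```

The result has sum(n) rows, so rows whose count is zero vanish from the expanded data.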

De-aggregate a data frame

Here's a tidyverse solution.

As you say, it's easy to repeat a row an arbitrary number of times. If you know that row_number() counts rows within groups when a data frame is grouped, then it's easy to convert grouped counts to presence/absence flags. across gives you a way to succinctly convert multiple count columns.

library(tidyverse)

tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3)) %>% 
  uncount(total_N) %>% 
  group_by(group) %>% 
  mutate(
    across(
      starts_with("measure"), 
      function(x) as.numeric(row_number() <= x)
    )
  ) %>%
  ungroup()
# A tibble: 9 × 3
  group measure_A measure_B
  <chr>     <dbl>     <dbl>
1 A             1         1
2 A             0         1
3 A             0         0
4 A             0         0
5 B             1         1
6 B             1         1
7 B             1         1
8 B             1         0
9 B             0         0

As you say, this approach takes no account of correlations between the outcome columns, because these cannot be deduced from the grouped data.
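
One way to sanity-check an expansion like this is to re-aggregate it and compare the sums with the original counts. The following is a base R sketch of the same idea rather than the tidyverse code above; the object names agg, expanded and check are illustrative only.

```r
# Same aggregated input as in the tidyverse example
agg <- data.frame(
  group     = c("A", "B"),
  total_N   = c(4, 5),
  measure_A = c(1, 4),
  measure_B = c(2, 3)
)

# Row index repeated total_N times, and the position of each row within its group
idx <- rep(seq_len(nrow(agg)), agg$total_N)
pos <- sequence(agg$total_N)

# Flag the first measure_* rows of each group with 1, the rest with 0
expanded <- data.frame(
  group     = agg$group[idx],
  measure_A = as.numeric(pos <= agg$measure_A[idx]),
  measure_B = as.numeric(pos <= agg$measure_B[idx])
)

# Round trip: summing the flags per group should give back the original counts
check <- aggregate(cbind(measure_A, measure_B) ~ group, data = expanded, FUN = sum)
check$total_N <- as.vector(table(expanded$group))
print(check)
```

The round trip only confirms the marginal counts. Which rows carry the 1s is arbitrary, which is exactly the limitation described above.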

Repeat each row of data.frame the number of times specified in a column

Here's one solution:

df.expanded <- df[rep(row.names(df), df$freq), 1:2]

Result:

    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f
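
The input df is not shown above. A self-contained sketch, assuming an input reconstructed from the printed result (columns var1 and var2, plus a freq column with counts 1, 2, 3), might look like this. The freq values are an assumption, and df.clean is an illustrative name.

```r
# Assumed input, inferred from the result shown above
df <- data.frame(
  var1 = c("a", "b", "c"),
  var2 = c("d", "e", "f"),
  freq = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Repeat each row name freq times, then keep only the first two columns
df.expanded <- df[rep(row.names(df), df$freq), 1:2]
print(df.expanded)

# Optional: replace the 2.1, 3.1, ... row names with plain 1..n
df.clean <- df.expanded
rownames(df.clean) <- NULL
```

R makes the duplicated row names unique by appending .1, .2 and so on, which is where the 2.1 and 3.2 labels come from.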

Opposite of 'summarise' in dplyr: turn one row into many

This appears to be currently impossible, but it is under active discussion by the developers, with a target version of 0.5.

Note that data.table currently allows this (see @akrun's comment), and it also allows arbitrarily sized group outputs from arbitrarily sized group inputs, whereas the solution being discussed for dplyr seems to require all groups to be the same size. Here is an example:

> data.table(a=1:3)[, paste(a, seq(a), sep=":"), by=a]
   a  V1
1: 1 1:1
2: 2 2:1
3: 2 2:2
4: 3 3:1
5: 3 3:2
6: 3 3:3
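
For comparison, the same one-row-to-many expansion can be sketched in base R without data.table. This is not part of the original answer; res is an illustrative name.

```r
a <- 1:3

# Each value of a is repeated a times; sequence(a) counts 1..a within each value
res <- data.frame(
  a  = rep(a, times = a),
  V1 = paste(rep(a, times = a), sequence(a), sep = ":"),
  stringsAsFactors = FALSE
)
print(res)
```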

Additionally, based on @AlexBrown's comment, you could do:

unnest(testdf, a3)

for your specific example, but that does not seem to work with the group_by / summarize workflow for reasons described above (i.e. you can't create testdf directly with just dplyr::group_by, AFAIK).

lump factor based on another column

The key here is to decide on a rule for grouping factories together based on their total production. Which rule is appropriate depends on the actual values in your (real) dataset.

Option 1

Here's an example that groups together factories whose total production is 15 or less. If you want a different grouping, you can modify the threshold (e.g. use 18 instead of 15).

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

library(dplyr)

df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
  ungroup()

# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>      
# 1 A               15 A          
# 2 A                2 A          
# 3 B                1 Other      
# 4 B                1 Other      
# 5 B                2 Other      
# 6 B                1 Other      
# 7 B                2 Other      
# 8 C               20 C          
# 9 D                5 Other

I'm creating factory_new without removing the (original) factory column.
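
The same threshold rule can also be sketched in base R with ave(), which returns the per-group sum aligned to the original rows. This is an alternative to the dplyr code above rather than part of it; threshold and total are illustrative names.

```r
factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

# Cut-off for keeping a factory under its own name (hypothetical parameter)
threshold <- 15

# Per-factory total production, repeated on every row of that factory
total <- ave(df$production, df$factory, FUN = sum)

# Keep the factory name when its total exceeds the threshold, else lump it
df$factory_new <- ifelse(total > threshold, df$factory, "Other")
print(df)
```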

Option 2

Here's an example where you rank the factories by their total production, keep a chosen number of top factories as they are, and group the rest.

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

library(dplyr)

# get factories ranked by total production (highest first)
vec_top_factories <- df %>%
  group_by(factory) %>%
  summarise(SumProd = sum(production)) %>%
  arrange(desc(SumProd)) %>%
  pull(factory)

# input how many top factories you want to keep
# rest will be grouped together
n <- 2

# apply the grouping based on n provided
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
  ungroup()

# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>      
# 1 A               15 A          
# 2 A                2 A          
# 3 B                1 Other      
# 4 B                1 Other      
# 5 B                2 Other      
# 6 B                1 Other      
# 7 B                2 Other      
# 8 C               20 C          
# 9 D                5 Other

When object is constructed statically inside a function, would it be allocated on the heap or on the stack?

On the stack.

Memory is only allocated on the heap when you use new (or malloc and its friends, if you are doing things C-style, which you shouldn't in C++).
