De-aggregate / reverse-summarise / expand a dataset in R
Without packages, we can repeat each row according to the frequencies given (here stored in column 5):
df2 <- df[rep(seq_len(nrow(df)), df[, 5]), -5]
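As a minimal, self-contained sketch of the same idea, the example below uses a small made-up data frame whose counts sit in a column named n. Both the data and the column name are assumptions, since the original df is not shown. Indexing the count column by name rather than by position keeps the code working if the columns are reordered.

```r
# Hypothetical aggregated data: one row per combination, with a count column `n`
agg <- data.frame(
  colour = c("red", "blue", "green"),
  size   = c("S", "M", "L"),
  n      = c(2, 0, 3)
)

# Repeat each row index n times (rows with n = 0 disappear)
idx <- rep(seq_len(nrow(agg)), times = agg$n)

# Keep every column except the count, then reset the row names
expanded <- agg[idx, setdiff(names(agg), "n"), drop = FALSE]
rownames(expanded) <- NULL

expanded
```

The result has one row per counted observation, so its number of rows equals sum(agg$n).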
De-aggregate a data frame
Here's a tidyverse solution.
As you say, it's easy to repeat a row an arbitrary number of times. If you know that row_number() counts rows within groups when a data frame is grouped, then it's easy to convert grouped counts to presence/absence flags. across gives you a way to succinctly convert multiple count columns.
library(tidyverse)
tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3)) %>%
  uncount(total_N) %>%
  group_by(group) %>%
  mutate(
    across(
      starts_with("measure"),
      function(x) as.numeric(row_number() <= x)
    )
  ) %>%
  ungroup()
# A tibble: 9 × 3
  group measure_A measure_B
  <chr>     <dbl>     <dbl>
1 A             1         1
2 A             0         1
3 A             0         0
4 A             0         0
5 B             1         1
6 B             1         1
7 B             1         1
8 B             1         0
9 B             0         0
As you say, this approach takes no account of correlations between the outcome columns, because these cannot be deduced from the grouped data.
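For comparison, here is a rough base R sketch of the same conversion, using the same example counts as the tibble above. It is not the answer's method: sequence() stands in for the grouped row_number(), and a plain loop stands in for across().

```r
# Same example counts as in the tidyverse code above
agg <- data.frame(
  group     = c("A", "B"),
  total_N   = c(4, 5),
  measure_A = c(1, 4),
  measure_B = c(2, 3)
)

# One row per individual: repeat each group row total_N times
long <- agg[rep(seq_len(nrow(agg)), times = agg$total_N), ]

# Position of each row within its group: 1..4 for A, then 1..5 for B
pos <- sequence(agg$total_N)

# Within each group, the first `count` rows get 1 and the rest get 0
for (col in c("measure_A", "measure_B")) {
  long[[col]] <- as.numeric(pos <= long[[col]])
}

long$total_N <- NULL
rownames(long) <- NULL
long
```

As with the tidyverse version, the flags are always assigned to the first rows of each group, so any association between measure_A and measure_B in the expanded data is an artefact of the construction.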
Repeat each row of data.frame the number of times specified in a column
Here's one solution:
df.expanded <- df[rep(row.names(df), df$freq), 1:2]
Result:
    var1 var2
1      a    d
2      b    e
2.1    b    e
3      c    f
3.1    c    f
3.2    c    f
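The row names 2.1, 3.1 and 3.2 in the result appear because R makes duplicated row names unique. The sketch below rebuilds an input consistent with the output shown (the freq values 1, 2, 3 are inferred from that output, not given in the source) and then resets the row names to a plain sequence.

```r
# Input reconstructed from the result above; freq values are an assumption
df <- data.frame(
  var1 = c("a", "b", "c"),
  var2 = c("d", "e", "f"),
  freq = c(1, 2, 3)
)

df.expanded <- df[rep(row.names(df), df$freq), 1:2]

# Row names straight after expansion carry suffixes for the repeated rows
names_before <- rownames(df.expanded)

# Reset to 1, 2, 3, ... if the suffixes are not wanted
rownames(df.expanded) <- NULL
df.expanded
```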
Opposite of 'summarise' in dplyr: turn one row into many
This currently appears to be impossible in dplyr, but it is under active discussion by the developers, with a target version of 0.5.
Note that data.table currently allows this (see @akrun's comment), and also allows you to have arbitrary sized group outputs with arbitrary sized group inputs, whereas it seems like the solution being discussed with dplyr would require all groups to be the same size. Here is an example:
> data.table(a=1:3)[, paste(a, seq(a), sep=":"), by=a]
   a  V1
1: 1 1:1
2: 2 2:1
3: 2 2:2
4: 3 3:1
5: 3 3:2
6: 3 3:3
Additionally, based on @AlexBrown's comment, you could do:
unnest(testdf, a3)
for your specific example, but that does not seem to work with the group_by / summarize workflow for reasons described above (i.e. you can't create testdf directly with just dplyr::group_by, AFAIK).
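If neither package is available, the one-row-to-many expansion from the data.table example can be sketched in base R by building one small data frame per group and stacking them. This is an illustrative alternative, not something proposed in the original answer; the column name V1 simply mirrors the data.table output.

```r
a <- 1:3

# For each value x, build a data frame with x rows: "x:1", ..., "x:x"
pieces <- lapply(a, function(x) {
  data.frame(a = x, V1 = paste(x, seq_len(x), sep = ":"))
})

# Stack the per-group pieces; groups may have different sizes
result <- do.call(rbind, pieces)
result
```

Because each piece is built independently, the groups do not need to produce the same number of rows.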
Lump factor based on another column
The key here is to choose a specific rule for grouping factories together based on their total production. Note that the right rule depends on the actual values in your (real) dataset.
Option 1
Here's an example that groups together factories whose total production is 15 or less. If you want a different grouping, modify the threshold (e.g. use 18 instead of 15).
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)
library(dplyr)
# keep factories whose total production exceeds 15, lump the rest into "Other"
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>
# 1 A               15 A
# 2 A                2 A
# 3 B                1 Other
# 4 B                1 Other
# 5 B                2 Other
# 6 B                1 Other
# 7 B                2 Other
# 8 C               20 C
# 9 D                5 Other
I'm creating factory_new without removing the (original) factory column.
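The same threshold rule can also be sketched without dplyr. The version below is an alternative rather than the answer's own code: it uses ave() to attach each factory's total to every row, and the variable name threshold is introduced here only for illustration.

```r
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

threshold <- 15  # hypothetical cut-off; adjust to your data

# ave() returns, for every row, the sum of production within that row's factory
total <- ave(df$production, df$factory, FUN = sum)

# Keep the factory name when its total exceeds the threshold, else "Other"
df$factory_new <- ifelse(total > threshold, df$factory, "Other")
df
```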
Option 2
Here's an example where you rank the factories by their total production, then keep a chosen number of top factories as they are and group the rest:
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)
library(dplyr)
# get factories ranked by total production (highest first)
vec_top_factories <- df %>%
  group_by(factory) %>%
  summarise(SumProd = sum(production)) %>%
  arrange(desc(SumProd)) %>%
  pull(factory)
# input how many top factories you want to keep
# rest will be grouped together
n <- 2
# apply the grouping based on n provided
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>
# 1 A               15 A
# 2 A                2 A
# 3 B                1 Other
# 4 B                1 Other
# 5 B                2 Other
# 6 B                1 Other
# 7 B                2 Other
# 8 C               20 C
# 9 D                5 Other
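A base R sketch of the top-n rule follows, again as an illustrative alternative to the dplyr code. The helper names totals and top are assumptions made for this example. Note that when two factories tie at the cut-off, which of them is kept depends on the sort order.

```r
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

n <- 2  # how many top factories to keep as they are

# Total production per factory, as a named numeric vector
totals <- sapply(split(df$production, df$factory), sum)

# Names of the n factories with the highest totals
top <- names(sort(totals, decreasing = TRUE))[seq_len(min(n, length(totals)))]

# Everything outside the top n is lumped into "Other"
df$factory_new <- ifelse(df$factory %in% top, df$factory, "Other")
df
```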
When an object is constructed statically inside a function, would it be allocated on the heap or on the stack?
On the stack.
Memory is only allocated on the heap when doing new (or malloc and its friends if you are doing things C-style, which you shouldn't in C++).