How to Merge Multiple Data.Frames and Sum and Average Columns at the Same Time in R

I think your second approach is the way to go, and you can do that with data.table or dplyr.

Here are a few steps using data.table. First, if your data frames are abc, def, ..., stack them row-wise:

DF <- do.call(rbind, list(abc,def,...))

Now convert the result to a data.table:

library(data.table)
DT <- data.table(DF)

Then summarise by group, for example:

DTres <- DT[, .(A = sum(A, na.rm = TRUE), B = sum(B, na.rm = TRUE), C = mean(C, na.rm = TRUE)), by = name]

Have a look at the data.table vignettes to get a better idea of how the package works.
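
For readers without data.table installed, the same stack-then-summarise idea can be sketched in base R. This is only an illustration: the data frames abc and def below are made-up stand-ins for the ones in the question, and split()/lapply() is swapped in for the data.table grouping.

```r
# Hypothetical inputs standing in for abc, def, ... from the question
abc <- data.frame(name = c("x", "y"), A = c(1, 2), B = c(10, NA), C = c(4, 6))
def <- data.frame(name = c("x", "y"), A = c(3, 4), B = c(30, 40), C = c(8, NA))

# Stack the data frames row-wise
DF <- do.call(rbind, list(abc, def))

# One summary row per name: sum A and B, average C, ignoring NAs
res <- do.call(rbind, lapply(split(DF, DF$name), function(g) {
  data.frame(
    name = g$name[1],
    A = sum(g$A, na.rm = TRUE),
    B = sum(g$B, na.rm = TRUE),
    C = mean(g$C, na.rm = TRUE)
  )
}))
res
```

On large inputs the data.table version above is likely to be faster; the base R form is mainly useful for seeing what the grouped expression computes.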

Combine Multiple Dataframes in R by Average (Mixed datatypes)

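Add a row ID to each table:

library(readr)   # read_table
library(tibble)  # rowid_to_column
library(dplyr)   # %>%, bind_rows, group_by, summarise

df_1 <- read_table("A       B       C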
2.3 5 3
12 3 1
0.4 13 2") %>%
rowid_to_column("ROW")

df_2 <- read_table("A B C
4.3 23 1
1 7 2
0.4 10 2") %>%
rowid_to_column("ROW")

df_3 <- read_table("A B C
1.3 3 3
2.2 4 2
12.4 10 1") %>%
rowid_to_column("ROW")

Bind them together in an ensemble

ensamb <- bind_rows(df_1, df_2, df_3)

Group by ROW, then summarise each column with its own method:

ensamb %>% 
group_by(ROW) %>%
summarise(A = mean(A), B = median(B),
C = C[which.max(C)])

# A tibble: 3 x 4
ROW A B C
<int> <dbl> <dbl> <dbl>
1 1 2.63 5 3
2 2 5.07 4 2
3 3 4.4 10 2

merge multiple data.frames [r]

What about something like this:

l2 <- Reduce(function(x, n) merge(x, l1[[n]], by='nu_pregao', suffixes = c("", n)),
seq(2, length(l1)), init = l1[[1]])
l2
#> nu_pregao pcVar pcVar2 pcVar3
#> 1 2371 7.224848 4.055709 4.011461
#> 2 2372 2.797704 2.944882 3.679907
#> 3 2373 3.947368 3.507937 4.693034

Final touch for names consistency:

names(l2)[match("pcVar", names(l2))] <- "pcVar1"
l2
#> nu_pregao pcVar1 pcVar2 pcVar3
#> 1 2371 7.224848 4.055709 4.011461
#> 2 2372 2.797704 2.944882 3.679907
#> 3 2373 3.947368 3.507937 4.693034

Your data:

l1 <- list(read.table(text = "nu_pregao    pcVar
1 2371 7.224848
45 2372 2.797704
89 2373 3.947368", header = TRUE),

read.table(text = "nu_pregao pcVar
2 2371 4.055709
46 2372 2.944882
90 2373 3.507937", header = TRUE),

read.table(text = "nu_pregao pcVar
3 2371 4.011461
47 2372 3.679907
91 2373 4.693034", header = TRUE))
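
If the goal after merging is an average across the merged columns, one possible follow-up is rowMeans() over the pcVar columns. This is a sketch, not part of the original answer: the column name pcVarMean is made up for illustration, and the data is retyped from the tables above.

```r
# Same values as the question's data, typed in directly
l1 <- list(
  data.frame(nu_pregao = c(2371, 2372, 2373), pcVar = c(7.224848, 2.797704, 3.947368)),
  data.frame(nu_pregao = c(2371, 2372, 2373), pcVar = c(4.055709, 2.944882, 3.507937)),
  data.frame(nu_pregao = c(2371, 2372, 2373), pcVar = c(4.011461, 3.679907, 4.693034))
)

# Merge all data frames on nu_pregao, suffixing later columns with their index
l2 <- Reduce(function(x, n) merge(x, l1[[n]], by = "nu_pregao", suffixes = c("", n)),
             seq(2, length(l1)), l1[[1]])

# Average across every pcVar* column (pcVarMean is a hypothetical name)
pc_cols <- grep("^pcVar", names(l2))
l2$pcVarMean <- rowMeans(l2[pc_cols])
l2
```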

A set of functions over multiple data frames and merge the outputs in R

Basil. Welcome to StackOverflow.

I was wary of lapply when I first started using R, but you should stick with it. It's almost always more efficient than using a for loop. In your particular case, you can put your individual data frames in a list and put the code you run on each one into a function, say myFunc, which takes the data frame you want to process as its argument.

Then you can simply say

library(dplyr)
allData <- bind_rows(lapply(dataFrameList, myFunc))

Incidentally, your column names make me think your data isn't yet tidy. I'd suggest you spend a little time making it so before you do much else. It will save you a huge amount of effort in the long run.
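
As a rough illustration of that pattern, here is a small self-contained sketch. The list contents and the body of myFunc are invented for the example (your real function would do whatever processing you need), and do.call(rbind, ...) is used in place of bind_rows so it runs in base R.

```r
# Hypothetical list of data frames
dataFrameList <- list(
  first  = data.frame(id = 1:3, value = c(2, 4, 6)),
  second = data.frame(id = 1:2, value = c(10, 20))
)

# Hypothetical per-data-frame function: returns one summary row
myFunc <- function(df) {
  data.frame(n = nrow(df), total = sum(df$value), average = mean(df$value))
}

# Apply myFunc to every data frame and stack the results
allData <- do.call(rbind, lapply(dataFrameList, myFunc))
allData
```

Because lapply keeps the list names, each row of allData is labelled with the data frame it came from.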

Multiply and average data from two data.frames into one column using R

One dplyr option could be:

df1 %>%
rowwise() %>%
mutate(new = sum(across(df2$p) * df2$q))

a b c d new
<fct> <dbl> <dbl> <dbl> <dbl>
1 a 7.17 14.8 8.45 24.9
2 a 7.42 19.7 3.97 44.7
3 a 5.78 19.2 9.66 29.7
4 a 5.09 17.7 12.8 19.3
5 a 7.21 12.9 6.24 25.2
6 a 2.36 13.7 2.50 27.7
7 a 7.26 10.9 10.7 12.0
8 a 5.45 6.18 12.8 -4.92
9 b 5.43 18.2 9.55 27.3
10 b 4.16 12.1 4.11 23.5
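
A base R way to get the same kind of weighted row sum is a matrix product. The sketch below assumes, as the dplyr code suggests, that df2$p holds column names of df1 and df2$q holds one weight per column; the data itself is made up.

```r
# Hypothetical data: df2$p names columns of df1, df2$q holds their weights
df1 <- data.frame(a = c("a", "b"), b = c(1, 2), c = c(3, 4), d = c(5, 6))
df2 <- data.frame(p = c("b", "c", "d"), q = c(0.5, 1, -1))

# Select the named columns, multiply each by its weight, and sum per row
cols <- as.character(df2$p)
df1$new <- as.vector(as.matrix(df1[cols]) %*% df2$q)
df1
```

For the first row this computes 1 * 0.5 + 3 * 1 + 5 * (-1).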

Aggregate multiple columns at once

We can use the formula method of aggregate. The variables on the right-hand side of ~ are the grouping variables, while the . on the left-hand side stands for all other variables in df1 (from the example, we assume the mean is wanted for every column except the grouping columns). We then pass the dataset and the function (mean).

aggregate(.~id1+id2, df1, mean)

Or, in older versions of dplyr, we can use summarise_each after grouping with group_by (summarise_each and funs are now deprecated in favor of across, shown next):

library(dplyr)
df1 %>%
group_by(id1, id2) %>%
summarise_each(funs(mean))

Or using summarise with across (available since dplyr 1.0.0):

df1 %>% 
group_by(id1, id2) %>%
summarise(across(starts_with('val'), mean))

Another option is data.table. We convert the 'data.frame' to a 'data.table' with setDT(df1), group by 'id1' and 'id2', then loop over the subset of the data.table (.SD) to get the mean of each column.

library(data.table)
setDT(df1)[, lapply(.SD, mean), by = .(id1, id2)]

data

df1 <- structure(list(id1 = c("a", "a", "a", "a", "b", "b", 
"b", "b"
), id2 = c("x", "x", "y", "y", "x", "y", "x", "y"),
val1 = c(1L,
2L, 3L, 4L, 1L, 4L, 3L, 2L), val2 = c(9L, 4L, 5L, 9L, 7L, 4L,
9L, 8L)), .Names = c("id1", "id2", "val1", "val2"),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))
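
If different columns need different functions (say, a sum for val1 and a mean for val2), one base R sketch is to aggregate each column separately and merge the results on the grouping columns. This goes slightly beyond the answer above and is only one of several ways to do it.

```r
# Same data as above, built with data.frame()
df1 <- data.frame(
  id1  = c("a", "a", "a", "a", "b", "b", "b", "b"),
  id2  = c("x", "x", "y", "y", "x", "y", "x", "y"),
  val1 = c(1L, 2L, 3L, 4L, 1L, 4L, 3L, 2L),
  val2 = c(9L, 4L, 5L, 9L, 7L, 4L, 9L, 8L)
)

# One aggregate() call per summary function
sum1  <- aggregate(val1 ~ id1 + id2, data = df1, FUN = sum)
mean2 <- aggregate(val2 ~ id1 + id2, data = df1, FUN = mean)

# Join the two summaries on the grouping columns and fix the row order
mixed <- merge(sum1, mean2, by = c("id1", "id2"))
mixed <- mixed[order(mixed$id1, mixed$id2), ]
mixed
```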

R: merging columns and the values if they have the same column name

Solution 1

Using split(), lapply(), rowSums(), and do.call()/cbind():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
## B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4
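
To see why this works, it can help to look at the intermediate index list that split() produces. The sketch below uses the same small df as the benchmarking section; idx is just an illustrative name.

```r
# Data frame with duplicated column names (check.names = FALSE keeps them)
df <- data.frame(B = 1:4, C = 1:4, U = 1:4, B = 1:4, C = 1:4, check.names = FALSE)

# Group column positions by column name:
# idx$B holds positions 1 and 4, idx$C holds 2 and 5, idx$U holds 3
idx <- split(seq_len(ncol(df)), names(df))

# Sum the columns in each group row-wise, then bind the sums as columns
res <- do.call(cbind, lapply(idx, function(x) rowSums(df[x])))
res
```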

Solution 2

Replacing the rowSums() call with Reduce()/`+`():

do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
## B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Solution 3

Replacing the index vector middleman with splitting the data.frame (as an unclassed list) directly:

do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
## B C U
## [1,] 2 2 1
## [2,] 4 4 2
## [3,] 6 6 3
## [4,] 8 8 4

Benchmarking

library(microbenchmark);

bgoldst1 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) rowSums(df[x])));
bgoldst2 <- function(df) do.call(cbind,lapply(split(seq_len(ncol(df)),names(df)),function(x) Reduce(`+`,df[x])));
bgoldst3 <- function(df) do.call(cbind,lapply(split(as.list(df),names(df)),function(x) Reduce(`+`,x)));
sotos <- function(df) sapply(unique(names(df)), function(i)rowSums(df[names(df) == i]));

df <- data.frame(B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),U=c(1L,2L,3L,4L),B=c(1L,2L,3L,4L),C=c(1L,2L,3L,4L),check.names=FALSE);

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst1(df) 245.473 258.3030 278.9499 272.4155 286.742 641.052 100
## bgoldst2(df) 156.949 166.3580 184.2206 171.7030 181.539 1042.618 100
## bgoldst3(df) 82.110 92.5875 100.9138 97.2915 107.128 170.207 100
## sotos(df) 200.997 211.9030 226.7977 223.6630 235.210 328.010 100

set.seed(1L);
NR <- 1e3L; NC <- 1e3L;
df <- setNames(nm=LETTERS[sample(seq_along(LETTERS),NC,TRUE)],data.frame(replicate(NC,sample(seq_len(NR*3L),NR,TRUE))));

ex <- bgoldst1(df);
all.equal(ex,sotos(df)[,colnames(ex)]);
## [1] TRUE
all.equal(ex,bgoldst2(df));
## [1] TRUE
all.equal(ex,bgoldst3(df));
## [1] TRUE

microbenchmark(bgoldst1(df),bgoldst2(df),bgoldst3(df),sotos(df));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst1(df) 11.070218 11.586182 12.745706 12.870209 13.234997 16.15929 100
## bgoldst2(df) 4.534402 4.680446 6.161428 6.097900 6.425697 44.83254 100
## bgoldst3(df) 3.430203 3.555505 5.355128 4.919931 5.219930 41.79279 100
## sotos(df) 19.953848 21.419628 22.713282 21.829533 22.280279 60.86525 100

sum and average of rows from list of data frames

Instead of the while loop, one option in R is to sum the corresponding elements of the list with Reduce and divide by the length of 'sample_list':

Reduce(`+`, sample_list)/length(sample_list)
# x
#1 6
#2 7
#3 8
#4 9
#5 10

A more concise approach is rowMeans after column-binding the list into a single data.frame:

rowMeans(do.call(cbind, sample_list))
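
Here is a small runnable sketch of both approaches. The contents of sample_list are made up (chosen so the averages match the output above). Note that the two forms only agree when each data frame has a single column: Reduce averages element by element and keeps the data frame shape, while rowMeans averages across every column of the bound result.

```r
# Hypothetical list of single-column data frames
sample_list <- list(
  data.frame(x = 1:5),
  data.frame(x = 6:10),
  data.frame(x = 11:15)
)

# Element-wise sum of the data frames, divided by how many there are
avg_df <- Reduce(`+`, sample_list) / length(sample_list)

# Same averages as a plain vector, via one wide data frame
avg_vec <- rowMeans(do.call(cbind, sample_list))

avg_df
avg_vec
```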

