Calculate Average Over Multiple Data Frames

Combine multiple data frames and calculate average

You can do:

library(data.table)

rbindlist(list(JPL.GRACE,GFZ.GRACE,CSR.GRACE))[,lapply(.SD,mean), list(Lon, Lat)]

Explanations:

Your data.frames are put into a list and stacked vertically (row-bound) using rbindlist, which returns a data.table. This works because your data.frames have the same structure (same number and names of columns, same data types).
An alternative approach would have been to do do.call(rbind, list(JPL.GRACE,GFZ.GRACE,CSR.GRACE)).

We then loop over each distinct pair of Lon, Lat. .SD represents the data.table associated with each pair. You can see it by doing:

dt = rbindlist(list(JPL.GRACE,GFZ.GRACE,CSR.GRACE))
dt[,print(.SD), list(Lon, Lat)]

For each of these .SD, we simply loop over the columns and compute the means.
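
For comparison, the same stack-then-group idea can be sketched in pandas. The three small frames below are made-up stand-ins for JPL.GRACE, GFZ.GRACE and CSR.GRACE, and the value column name lwe is an assumption for illustration, not something taken from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the three GRACE data frames (identical columns).
jpl = pd.DataFrame({"Lon": [0.5, 1.5], "Lat": [10.0, 10.0], "lwe": [1.0, 4.0]})
gfz = pd.DataFrame({"Lon": [0.5, 1.5], "Lat": [10.0, 10.0], "lwe": [2.0, 5.0]})
csr = pd.DataFrame({"Lon": [0.5, 1.5], "Lat": [10.0, 10.0], "lwe": [3.0, 6.0]})

# Stack the rows (the analogue of rbindlist), then average every remaining
# column within each distinct (Lon, Lat) pair.
stacked = pd.concat([jpl, gfz, csr], ignore_index=True)
avg = stacked.groupby(["Lon", "Lat"], as_index=False).mean()
print(avg)
```

With as_index=False the grouping keys stay as ordinary columns, which mirrors the shape of the data.table result.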

Get the mean across multiple Pandas DataFrames

Assuming the two dataframes have the same columns, you could just concatenate them and compute your summary stats on the concatenated frames:

import numpy as np
import pandas as pd

# some random data frames
df1 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))
df2 = pd.DataFrame(dict(x=np.random.randn(100), y=np.random.randint(0, 5, 100)))

# concatenate them
df_concat = pd.concat((df1, df2))

print(df_concat.mean())
# x -0.163044
# y 2.120000
# dtype: float64

print(df_concat.median())
# x -0.192037
# y 2.000000
# dtype: float64

Update

If you want to compute stats across each set of rows with the same index in the two datasets, you can use .groupby() to group the data by row index, then apply the mean, median etc.:

by_row_index = df_concat.groupby(df_concat.index)
df_means = by_row_index.mean()

print(df_means.head())
# x y
# 0 -0.850794 1.5
# 1 0.159038 1.5
# 2 0.083278 1.0
# 3 -0.540336 0.5
# 4 0.390954 3.5

This method will work even when your dataframes have unequal numbers of rows - if a particular row index is missing in one of the two dataframes, the mean/median will be computed on the single existing row.
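
A minimal sketch of the unequal-length case, using two tiny made-up frames (a and b are illustrative names): row index 2 exists only in the first frame, so its mean is simply that single value.

```python
import pandas as pd

a = pd.DataFrame({"x": [1.0, 2.0, 3.0]})  # row indices 0, 1, 2
b = pd.DataFrame({"x": [3.0, 6.0]})       # row indices 0, 1 only

# Concatenating keeps the original row indices, so 0 and 1 appear twice.
both = pd.concat([a, b])

# Group on the row index and average whatever rows exist for each index.
means = both.groupby(both.index).mean()
print(means)
```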

Calculate average over multiple data frames

If I understand you correctly, on a given DB system, in each "iteration" (1...N) you are loading a sequence of DataSets (1,2,3) and running queries on them. It seems like at the end you want to calculate the average time across all iterations, for each DataSet. If so, you actually need to have an additional column DataSet in your all_results table that identifies the DataSet. We can add this column as follows:

all_results <- cbind( data.frame( DataSet = rep(1:3,3) ), all_results )
> all_results
DataSet iteration lines loadTime query1 query2 query3
1 1 1 100000 120.4 0.5 6.4 1.2
2 2 1 100000 110.1 0.1 5.2 2.1
3 3 1 50000 130.3 0.2 4.3 2.2
4 1 2 100000 120.4 0.1 2.4 1.2
5 2 2 100000 300.2 0.2 4.5 1.4
6 3 2 50000 235.3 0.4 4.2 0.5
7 1 3 100000 233.5 0.7 8.3 6.7
8 2 3 100000 300.1 0.9 0.5 4.4
9 3 3 50000 100.2 0.4 9.2 1.2

Now you can use the ddply function from the plyr package to easily extract the averages for the load and query times for each DataSet.

> ddply(all_results, .(DataSet), colwise(mean, .(loadTime, query1, query2)))
DataSet loadTime query1 query2
1 1 158.1000 0.4333333 5.7
2 2 236.8000 0.4000000 3.4
3 3 155.2667 0.3333333 5.9

Incidentally, I highly recommend you look at Hadley Wickham's plyr package for a rich set of data-manipulation functions.
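
If you work in pandas rather than R, a rough equivalent of the ddply call is a groupby on the DataSet column. The sketch below retypes the relevant columns of the table above by hand; the variable name per_dataset is just illustrative.

```python
import pandas as pd

# Values copied from the all_results table shown above.
all_results = pd.DataFrame({
    "DataSet":   [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "iteration": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "loadTime":  [120.4, 110.1, 130.3, 120.4, 300.2, 235.3, 233.5, 300.1, 100.2],
    "query1":    [0.5, 0.1, 0.2, 0.1, 0.2, 0.4, 0.7, 0.9, 0.4],
    "query2":    [6.4, 5.2, 4.3, 2.4, 4.5, 4.2, 8.3, 0.5, 9.2],
})

# Average the selected timing columns across iterations, per DataSet.
per_dataset = all_results.groupby("DataSet")[["loadTime", "query1", "query2"]].mean()
print(per_dataset)
```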

Average Cells of Two or More DataFrames

We can do this with Reduce and `+`, then divide by the number of datasets. Keeping the datasets in a list means the same code works for any number of them.

dfAvg <- Reduce(`+`, mget(paste0("df", 1:3)))/3

Another option is to convert to an array and then use apply, which also has the option of removing missing values (na.rm=TRUE)

apply(array(unlist(mget(paste0("df", 1:3))), c(dim(df1), 3)), 2, rowMeans, na.rm = TRUE) 

As @user20650 mentioned, rowMeans can be applied directly on the array with the dim

rowMeans(array(unlist(mget(paste0("df", 1:3))), c(dim(df1), 3)), dims=2) 
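
The same two ideas (sum and divide, or stack into a 3-D array and average while skipping missing values) can be sketched with pandas and NumPy. The frames below are made up, and the approach assumes all frames share the same shape, row order and column order.

```python
from functools import reduce
import operator

import numpy as np
import pandas as pd

# Made-up frames of identical shape; one cell is missing in df1.
df1 = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, np.nan]})
df2 = pd.DataFrame({"a": [3.0, 4.0], "b": [5.0, 6.0]})
df3 = pd.DataFrame({"a": [5.0, 6.0], "b": [7.0, 8.0]})
frames = [df1, df2, df3]

# Reduce-style average: add the frames cell by cell, divide by their count.
# A missing value in any frame makes that cell NaN in the result.
df_avg = reduce(operator.add, frames) / len(frames)

# Array-style average: stack to shape (n_frames, n_rows, n_cols) and take
# the mean over the first axis, ignoring NaN (the analogue of na.rm = TRUE).
stacked = np.stack([f.to_numpy() for f in frames])
df_avg_skipna = pd.DataFrame(
    np.nanmean(stacked, axis=0), index=df1.index, columns=df1.columns
)
print(df_avg)
print(df_avg_skipna)
```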

How to calculate mean and Sd for multiple data frames in R

Try this:

library(dplyr)
library(tidyr)
#Code
new <- df1 %>% bind_rows(df2,.id = 'id') %>%
pivot_longer(-id) %>%
mutate(Var=paste0(name,id)) %>%
group_by(Var,.drop = F) %>%
summarise(Mean=mean(value,na.rm = T),
SD=sd(value,na.rm = T))
#List
List <- split(new,gsub('[a-z]','',new$Var))
List <- lapply(1:length(List), function(x) {names(List[[x]])<-paste0(names(List[[x]]),x);List[[x]]})
#Bind
res <- do.call(bind_cols,List)

Output:

# A tibble: 3 x 6
Var1 Mean1 SD1 Var2 Mean2 SD2
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 mot1 12 0 mot2 3 1.73
2 temp1 12 0 temp2 7 0
3 time1 13 0 time2 4 0
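
In pandas, a similar per-frame mean and standard deviation can be sketched by concatenating with keys and aggregating. The data below is invented for illustration and is not the data behind the output above; the result uses a two-level column index instead of the suffixed names shown there.

```python
import pandas as pd

# Invented example frames with the same columns.
df1 = pd.DataFrame({"mot": [1.0, 3.0], "temp": [10.0, 10.0]})
df2 = pd.DataFrame({"mot": [2.0, 6.0], "temp": [7.0, 9.0]})

# keys= adds an outer index level that records which frame each row came from.
stacked = pd.concat([df1, df2], keys=["1", "2"])

# Group on that outer level and compute both statistics for every column.
# std uses the sample standard deviation (ddof=1), like R's sd().
stats = stacked.groupby(level=0).agg(["mean", "std"])
print(stats)
```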

Combine Multiple Dataframes in R by Average (Mixed datatypes)

Put a ROW ID on your tables

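library(readr)   # read_table
library(tibble)  # rowid_to_column
library(dplyr)   # %>%, bind_rows, group_by, summarise

df_1 <- read_table("A       B       C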
2.3 5 3
12 3 1
0.4 13 2") %>%
rowid_to_column("ROW")

df_2 <- read_table("A B C
4.3 23 1
1 7 2
0.4 10 2") %>%
rowid_to_column("ROW")

df_3 <- read_table("A B C
1.3 3 3
2.2 4 2
12.4 10 1") %>%
rowid_to_column("ROW")

Bind them together in an ensemble

ensamb <- bind_rows(df_1, df_2, df_3)

group_by row and then summarize each one by its own method

ensamb %>% 
group_by(ROW) %>%
summarise(A = mean(A), B = median(B),
C = C[which.max(C)])

# A tibble: 3 x 4
ROW A B C
<int> <dbl> <dbl> <dbl>
1 1 2.63 5 3
2 2 5.07 4 2
3 3 4.4 10 2
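
A pandas sketch of the same "one summary method per column" idea, using the three tables above retyped by hand. Here the default row index plays the role of the ROW id, and max stands in for C[which.max(C)], which picks the largest value.

```python
import pandas as pd

# The three tables from above.
df_1 = pd.DataFrame({"A": [2.3, 12, 0.4], "B": [5, 3, 13], "C": [3, 1, 2]})
df_2 = pd.DataFrame({"A": [4.3, 1, 0.4], "B": [23, 7, 10], "C": [1, 2, 2]})
df_3 = pd.DataFrame({"A": [1.3, 2.2, 12.4], "B": [3, 4, 10], "C": [3, 2, 1]})

# Stack the rows; each original row index (0, 1, 2) now appears three times.
ensemble = pd.concat([df_1, df_2, df_3])

# Group rows that share an index and summarise each column its own way.
combined = ensemble.groupby(level=0).agg({"A": "mean", "B": "median", "C": "max"})
print(combined)
```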

Weighted average across multiple dataframes

If you really want to disregard the string column, and you are certain the two dataframes have the same shape and row order, then you can do this:

sel = ['b', 'c']  # numeric columns
df3 = df1.copy()
df3[sel] = 2/3 * df1[sel] + 1/3 * df2[sel]

On your data, df3 is:

       a    b         c
0 hello 2.0 1.333333
1 hello 1.0 1.000000

However, in the more general case, you may have different sizes and your a column may be relevant. Here is an example:

df1 = pd.DataFrame([["hello", 2, 1], ["world", 1, 1]], columns=["a", "b", "c"])
df2 = pd.DataFrame([["world", 2, 2], ["hello", 1, 1]], columns=["a", "b", "c"])

(2/3 * df1.set_index('a').stack() +
1/3 * df2.set_index('a').stack()).groupby(level=[0,1]).mean().unstack().reset_index()

# gives:
a b c
0 hello 1.666667 1.000000
1 world 1.333333 1.333333
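
One way to extend this to any number of frames is to keep the frames and their weights in parallel lists and let pandas align rows on the key column. This is only a sketch: the lists frames and weights are illustrative names, and it assumes the weights sum to 1 and that every key appears exactly once in each frame.

```python
import pandas as pd

df1 = pd.DataFrame([["hello", 2, 1], ["world", 1, 1]], columns=["a", "b", "c"])
df2 = pd.DataFrame([["world", 2, 2], ["hello", 1, 1]], columns=["a", "b", "c"])

frames = [df1, df2]
weights = [2 / 3, 1 / 3]  # assumed to sum to 1

# Indexing by 'a' makes the addition align rows by key rather than by position,
# so the differing row order of df1 and df2 does not matter.
weighted = sum(w * f.set_index("a") for w, f in zip(weights, frames)).reset_index()
print(weighted)
```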

Calculate mean of multiple dataframes columns stored in List

I noticed two things that are causing you trouble here:

1 - When you subset your list like list[[1:3]], it is read as list[[c(1, 2, 3)]], which indexes recursively: it finds the 3rd entry (21) of the 2nd column (price) in the 1st element (df1) of the list. This is why something like list[[1:2]] returns a vector (it pulls out an entire column) and why list[[1:4]] returns an error (the list doesn't go 4 levels deep). (answer by @aaron-montgomery, from the comments)

2 - In your last line, you reference a column mean that you've never defined.

If you're trying to get one value that's the mean of all the previous elements, you can nest another loop:

#for each df in list, calculate the mean of the last 3 values of q 
for (i in 3:length(list)) {

# add another loop to calculate the mean
vals <- c()
for (j in (i - 2):i) {
vals <- c(vals, list[[j]]$q)
}

list[[i]][["q_mean"]] <- mean(vals)

}

If you want a different value for each row (where row1 is the mean of the previous 2 row1s, etc), you can just do:

for (i in 3:length(list)) {

list[[i]][["q_mean"]] <- (list[[i - 1]]$q + list[[i - 2]]$q) /2

}
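
For readers doing the same thing in pandas, the row-wise variant could look like the sketch below. The list frames and its contents are made up; note that Python lists are 0-indexed, so the loop starts at index 2 where the R loop starts at 3.

```python
import pandas as pd

# Made-up list of four frames; the q columns are [1, 11], [2, 12], [3, 13], [4, 14].
frames = [pd.DataFrame({"q": [i, i + 10]}) for i in range(1, 5)]

# From the third frame on, add the row-wise mean of q from the two previous frames.
for i in range(2, len(frames)):
    frames[i]["q_mean"] = (frames[i - 1]["q"] + frames[i - 2]["q"]) / 2

print(frames[2])
print(frames[3])
```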

Average of multiple dataframes with the same columns and indices

You can use groupby.mean on the index level after concatenating the dataframes:

pd.concat([v1, v2, v3]).groupby(level=0).mean()

c1 c2 c3
id
ind1 1.333333 2.333333 2.666667
ind2 3.666667 2.333333 3.666667

