Joining Aggregated Values Back to the Original Data Frame

One line of code does the trick:

new <- ddply(df, "group1", transform, meanValue = mean(values))
new

   group1 group2      values    meanValue
1       1      A  0.48742905 -0.121033381
2       1      A -0.04493361 -0.121033381
3       1      C -0.62124058 -0.121033381
4       1      C -0.30538839 -0.121033381
5       2      A  1.51178117  0.004803931
6       2      B  0.73832471  0.004803931
7       2      A -0.01619026  0.004803931
8       2      B -2.21469989  0.004803931
9       3      B  1.12493092  0.758597929
10      3      C  0.38984324  0.758597929
11      3      B  0.57578135  0.758597929
12      3      A  0.94383621  0.758597929

The result matches the data frame the question built up manually (that df already includes a meanValue column):

identical(df, new)
[1] TRUE

Merge and aggregate result to original data

Here is another option with data.table. We convert the data.frame to a data.table (setDT(df1)); then, grouped by 'category', we assign (:=) the sum of 'quantity' to a new column, 'sum.category'.

library(data.table)
setDT(df1)[, sum.category := sum(quantity), by = category]
df1
#    id category quantity sum.category
#1:   1       AB      235          601
#2:   2       BC      987          987
#3:   3       AB      366          601
#4:   4       CD      287          287

Or using base R

# ave() returns a vector as long as 'quantity', carrying each row's group sum
df1$sum.category <- with(df1, ave(quantity, category, FUN = sum))

Pandas - GroupBy and then Merge on original table

By default, groupby output has the grouping columns as index levels, not as columns, which is why the merge is failing.

There are a couple of different ways to handle it; probably the easiest is the as_index parameter when you define the groupby object (the other common route, reset_index(), is sketched after the output below).

# keep EID and PCODE as ordinary columns in the aggregated output
po_grouped_df = poagg_df.groupby(['EID', 'PCODE'], as_index=False)

Then, your merge should work as expected.

In [356]: pd.merge(acc_df, pol_df, on=['EID', 'PCODE'], how='inner', suffixes=('_Acc', '_Po'))
Out[356]:
   EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \
0  123    GR     236   40000  1.805222e+31         350  10000         50
1  123    GR     236   40000  1.805222e+31         350  30000        300
2  123    GU     443   12000  8.765549e+87         250  10000        100
3  123    GU     443   12000  8.765549e+87         250   2000        150

   SC_Po  SI_Po
0     23     40
1    213    140
2    230    400
3    213    140
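The other route is to keep the default groupby and flatten the aggregated result with reset_index() before merging. A minimal sketch, assuming a sum aggregation for illustration (poagg_df, acc_df and the key columns come from the question):

# reset_index() turns the EID/PCODE index levels back into ordinary columns,
# so the merge keys exist as columns again
po_flat_df = poagg_df.groupby(['EID', 'PCODE']).sum().reset_index()
merged = pd.merge(acc_df, po_flat_df, on=['EID', 'PCODE'], how='inner',
                  suffixes=('_Acc', '_Po'))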

Spark groupBy aggregation result joined back to the initial data frame

It is correct (it yields the expected results) and idiomatic. The DataFrame DSL is just a wrapper around SQL, and the standard SQL solution can be expressed as follows:

WITH means AS (SELECT category, avg(column1) AS mean FROM df GROUP BY category)
SELECT df.category, df.column1, means.mean
FROM df JOIN means ON df.category = means.category

You can easily check that this generates the same execution plan as df_with_mean.
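For example, in PySpark (a sketch; df_with_mean is the DataFrame from the question, and df has to be registered as a temporary view first):

# print both physical plans and compare them by eye
df.createOrReplaceTempView("df")
spark.sql("""
    WITH means AS (SELECT category, avg(column1) AS mean FROM df GROUP BY category)
    SELECT df.category, df.column1, means.mean
    FROM df JOIN means ON df.category = means.category
""").explain()
df_with_mean.explain()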

It is possible to express the same logic using window functions:

SELECT *, avg(column1) OVER w AS mean FROM df
WINDOW w AS (
    PARTITION BY category
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)

with DSL equivalent:

// an unbounded frame on both sides of the current row
val w = Window.partitionBy($"category").rowsBetween(Long.MinValue, Long.MaxValue)

df.select($"*", avg($"column1").over(w).alias("mean"))

but in general Spark doesn't perform particularly well with an UNBOUNDED FOLLOWING frame.
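If that frame does become a bottleneck, the aggregate-then-join variant is just as easy in the DSL; a minimal PySpark sketch, with column names taken from the examples above:

from pyspark.sql import functions as F

# aggregate once per category, then join the means back onto every row
means = df.groupBy("category").agg(F.avg("column1").alias("mean"))
df_with_mean = df.join(means, on="category", how="inner")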

rerunning agg on pandas groupby object modifies the original dataframe

The issue is that merge_probe_trial_dicts mutates the original lists stored in df4 instead of creating new ones.

Just add .copy() as below and you should be good.

def merge_probe_trial_dicts(x, only_last=None):
    out = {}
    for d in x.dropna():
        for k, v in d.items():
            if k not in out.keys():
                out[k] = v.copy()  # This is the trick
            else:
                out[k].extend(v)

    if only_last:
        for k, v in out.items():
            out[k] = v[-only_last:]
    return out

Additional trick: extra keyword arguments to agg are passed through to the aggregation function, so you can do this directly:

df4.groupby(["subject", "task_nr"]).agg(merge_probe_trial_dicts, only_last=30)

instead of:

from functools import partial

mptd_only30 = partial(merge_probe_trial_dicts, only_last=30)
df4.groupby(["subject", "task_nr"]).agg(mptd_only30)

Another one: if k not in out.keys() can be written more simply as if k not in out, and the check disappears entirely with a defaultdict, which gives simpler and more efficient code:

from collections import defaultdict

def merge_probe_trial_dicts(x, only_last=None):
    out = defaultdict(list)
    for d in x.dropna():
        for key, value in d.items():
            out[key] += value
    if only_last is not None:
        for key, value in out.items():
            out[key] = value[-only_last:]
    return out

groupby and select mode and join back onto original dataframe

If you want the mode, you can use:

from statistics import mode
df['Mode'] = df.groupby(['Name'])['Numbers'].transform(mode)
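Note that statistics.mode raises StatisticsError on ties before Python 3.8. A sketch that sidesteps this with pandas' own Series.mode, keeping the first modal value per group:

# Series.mode() returns all modal values (sorted); .iat[0] picks the first
df['Mode'] = df.groupby('Name')['Numbers'].transform(lambda s: s.mode().iat[0])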

