Joining Aggregated Values Back to the Original Data Frame

One line of code does the trick:

new <- ddply(df, "group1", transform, meanValue = mean(values))
new

   group1 group2      values    meanValue
1       1      A  0.48742905 -0.121033381
2       1      A -0.04493361 -0.121033381
3       1      C -0.62124058 -0.121033381
4       1      C -0.30538839 -0.121033381
5       2      A  1.51178117  0.004803931
6       2      B  0.73832471  0.004803931
7       2      A -0.01619026  0.004803931
8       2      B -2.21469989  0.004803931
9       3      B  1.12493092  0.758597929
10      3      C  0.38984324  0.758597929
11      3      B  0.57578135  0.758597929
12      3      A  0.94383621  0.758597929

The result matches the data frame the question built up manually (that df already includes a meanValue column):

identical(df, new)
[1] TRUE

Merge and aggregate result to original data

Here is another option with data.table. We convert the data.frame to a data.table (setDT(df1)); then, grouped by 'category', we assign (:=) the sum of 'quantity' to a new column, 'sum.category'.

library(data.table)
setDT(df1)[, sum.category := sum(quantity), by = category]
df1
#    id category quantity sum.category
#1:   1       AB      235          601
#2:   2       BC      987          987
#3:   3       AB      366          601
#4:   4       CD      287          287

Or using base R

# ave() returns a vector as long as 'quantity', carrying each row's group sum
df1$sum.category <- with(df1, ave(quantity, category, FUN = sum))

Pandas - GroupBy and then Merge on original table

By default, groupby output has the grouping columns as index levels, not as columns, which is why the merge is failing.

There are a couple of different ways to handle it; probably the easiest is the as_index parameter when you define the groupby object (the other common route, reset_index(), is sketched after the output below).

# keep EID and PCODE as ordinary columns in the aggregated output
po_grouped_df = poagg_df.groupby(['EID', 'PCODE'], as_index=False)

Then, your merge should work as expected.

In [356]: pd.merge(acc_df, pol_df, on=['EID', 'PCODE'], how='inner', suffixes=('_Acc', '_Po'))
Out[356]:
   EID PCODE  SC_Acc  EE_Acc        SI_Acc  PVALUE_Acc  EE_Po  PVALUE_Po  \
0  123    GR     236   40000  1.805222e+31         350  10000         50
1  123    GR     236   40000  1.805222e+31         350  30000        300
2  123    GU     443   12000  8.765549e+87         250  10000        100
3  123    GU     443   12000  8.765549e+87         250   2000        150

   SC_Po  SI_Po
0     23     40
1    213    140
2    230    400
3    213    140
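The other route is to keep the default groupby and flatten the aggregated result with reset_index() before merging. A minimal sketch, assuming a sum aggregation for illustration (poagg_df, acc_df and the key columns come from the question):

# reset_index() turns the EID/PCODE index levels back into ordinary columns,
# so the merge keys exist as columns again
po_flat_df = poagg_df.groupby(['EID', 'PCODE']).sum().reset_index()
merged = pd.merge(acc_df, po_flat_df, on=['EID', 'PCODE'], how='inner',
                  suffixes=('_Acc', '_Po'))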

Spark groupBy aggregation result joined back to the initial data frame

It is correct (it yields the expected results) and idiomatic. The DataFrame DSL is just a wrapper around SQL, and the standard SQL solution can be expressed as follows:

WITH means AS (SELECT category, avg(column1) AS mean FROM df GROUP BY category)
SELECT df.category, df.column1, means.mean
FROM df JOIN means ON df.category = means.category

You can easily check that this generates the same execution plan as df_with_mean.
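For example, in PySpark (a sketch; df_with_mean is the DataFrame from the question, and df has to be registered as a temporary view first):

# print both physical plans and compare them by eye
df.createOrReplaceTempView("df")
spark.sql("""
    WITH means AS (SELECT category, avg(column1) AS mean FROM df GROUP BY category)
    SELECT df.category, df.column1, means.mean
    FROM df JOIN means ON df.category = means.category
""").explain()
df_with_mean.explain()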

It is possible to express the same logic using window functions:

SELECT *, avg(column1) OVER w AS mean FROM df
WINDOW w AS (
    PARTITION BY category
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)

with DSL equivalent:

// an unbounded frame on both sides of the current row
val w = Window.partitionBy($"category").rowsBetween(Long.MinValue, Long.MaxValue)

df.select($"*", avg($"column1").over(w).alias("mean"))

but in general Spark doesn't perform particularly well with an UNBOUNDED FOLLOWING frame.
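If that frame does become a bottleneck, the aggregate-then-join variant is just as easy in the DSL; a minimal PySpark sketch, with column names taken from the examples above:

from pyspark.sql import functions as F

# aggregate once per category, then join the means back onto every row
means = df.groupBy("category").agg(F.avg("column1").alias("mean"))
df_with_mean = df.join(means, on="category", how="inner")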

rerunning agg on pandas groupby object modifies the original dataframe

The issue is that merge_probe_trial_dicts mutates the original lists stored in df4 instead of creating new ones.

Just add .copy() as below and you should be good.

def merge_probe_trial_dicts(x, only_last=None):
    out = {}
    for d in x.dropna():
        for k, v in d.items():
            if k not in out.keys():
                out[k] = v.copy()  # This is the trick
            else:
                out[k].extend(v)

    if only_last:
        for k, v in out.items():
            out[k] = v[-only_last:]
    return out

Additional trick: extra keyword arguments to agg are passed through to the aggregation function, so you can do this directly:

df4.groupby(["subject", "task_nr"]).agg(merge_probe_trial_dicts, only_last=30)

instead of:

from functools import partial

mptd_only30 = partial(merge_probe_trial_dicts, only_last=30)
df4.groupby(["subject", "task_nr"]).agg(mptd_only30)

Another one: if k not in out.keys() can be written more simply as if k not in out, and the check disappears entirely with a defaultdict, which gives simpler and more efficient code:

from collections import defaultdict

def merge_probe_trial_dicts(x, only_last=None):
    out = defaultdict(list)
    for d in x.dropna():
        for key, value in d.items():
            out[key] += value
    if only_last is not None:
        for key, value in out.items():
            out[key] = value[-only_last:]
    return out

groupby and select mode and join back onto original dataframe

If you want the mode, you can use:

from statistics import mode
df['Mode'] = df.groupby(['Name'])['Numbers'].transform(mode)
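Note that statistics.mode raises StatisticsError on ties before Python 3.8. A sketch that sidesteps this with pandas' own Series.mode, keeping the first modal value per group:

# Series.mode() returns all modal values (sorted); .iat[0] picks the first
df['Mode'] = df.groupby('Name')['Numbers'].transform(lambda s: s.mode().iat[0])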

