Joining aggregated values back to the original data frame
One line of code does the trick:
new <- ddply(df, "group1", transform, meanValue = mean(values))
new
group1 group2 values meanValue
1 1 A 0.48742905 -0.121033381
2 1 A -0.04493361 -0.121033381
3 1 C -0.62124058 -0.121033381
4 1 C -0.30538839 -0.121033381
5 2 A 1.51178117 0.004803931
6 2 B 0.73832471 0.004803931
7 2 A -0.01619026 0.004803931
8 2 B -2.21469989 0.004803931
9 3 B 1.12493092 0.758597929
10 3 C 0.38984324 0.758597929
11 3 B 0.57578135 0.758597929
12 3 A 0.94383621 0.758597929
identical(df, new[, names(df)])
[1] TRUE
Merge and aggregate result to original data
Here is another option with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'category', assign (:=) the sum of 'quantity' as the new column ('sum.category').
library(data.table)
setDT(df1)[, sum.category := sum(quantity), by = category]
df1
# id category quantity sum.category
#1: 1 AB 235 601
#2: 2 BC 987 987
#3: 3 AB 366 601
#4: 4 CD 287 287
Or using base R
df1$sum.category <- with(df1, ave(quantity, category, FUN=sum))
Pandas - GroupBy and then Merge on original table
By default, groupby output has the grouping columns as the index, not as regular columns, which is why the merge is failing. There are a couple of different ways to handle it; probably the easiest is to use the as_index parameter when you define the groupby object.
po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)
Then, your merge should work as expected.
In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]:
EID PCODE SC_Acc EE_Acc SI_Acc PVALUE_Acc EE_Po PVALUE_Po \
0 123 GR 236 40000 1.805222e+31 350 10000 50
1 123 GR 236 40000 1.805222e+31 350 30000 300
2 123 GU 443 12000 8.765549e+87 250 10000 100
3 123 GU 443 12000 8.765549e+87 250 2000 150
SC_Po SI_Po
0 23 40
1 213 140
2 230 400
3 213 140
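For a self-contained illustration of what as_index=False buys you (toy data here, not the original tables from the question):

```python
import pandas as pd

# Illustrative data, not the original tables.
df = pd.DataFrame({
    "EID": [123, 123, 456],
    "PCODE": ["GR", "GR", "GU"],
    "PVALUE": [50, 300, 100],
})

# Without as_index=False the group keys become the result's index;
# with it they stay ordinary columns, so merging back is straightforward.
agg = df.groupby(["EID", "PCODE"], as_index=False)["PVALUE"].sum()
merged = pd.merge(df, agg, on=["EID", "PCODE"], suffixes=("", "_sum"))
print(merged)
```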
Spark groupBy aggregation result joined back to the initial data frame
It is correct (yields the expected results) and idiomatic. The DataFrame DSL is just a wrapper around SQL, and the standard SQL solution can be expressed as follows:
WITH means AS (SELECT category, avg(column1) AS mean FROM df GROUP BY category)
SELECT df.category, df.column1, means.mean
FROM df JOIN means ON df.category = means.category
You can easily check that this generates the same execution plan as df_with_mean.
It is possible to express the same logic using window functions:
SELECT *, avg(column1) OVER w AS mean FROM df
WINDOW w AS (
PARTITION BY category
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
with DSL equivalent:
val w = Window.partitionBy($"category").rowsBetween(Long.MinValue, Long.MaxValue)
df.select($"*", avg($"column1").over(w).alias("mean"))
but in general Spark doesn't perform particularly well with an UNBOUNDED FOLLOWING frame.
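As a sanity check, the same window query runs verbatim on any SQL engine with window-function support; here is a sketch against SQLite (3.25+ required), with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (category TEXT, column1 REAL)")
conn.executemany("INSERT INTO df VALUES (?, ?)",
                 [("a", 1.0), ("a", 3.0), ("b", 2.0)])

# Same avg(...) OVER (PARTITION BY ...) unbounded frame as the query above.
rows = conn.execute("""
    SELECT category, column1,
           avg(column1) OVER (PARTITION BY category
                              ROWS BETWEEN UNBOUNDED PRECEDING
                                       AND UNBOUNDED FOLLOWING) AS mean
    FROM df
""").fetchall()
print(rows)
```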
rerunning agg on pandas groupby object modifies the original dataframe
The issue is that merge_probe_trial_dicts mutates the original lists stored in df4 instead of creating new ones. Just add .copy() as below and you should be good.
def merge_probe_trial_dicts(x, only_last=None):
    out = {}
    for d in x.dropna():
        for k, v in d.items():
            if k not in out.keys():
                out[k] = v.copy()  # This is the trick
            else:
                out[k].extend(v)
    if only_last:
        for k, v in out.items():
            out[k] = v[-only_last:]
    return out
Additional trick: the extra arguments to agg
are passed to the aggregation function. So you can directly do:
df4.groupby(["subject", "task_nr"]).agg(merge_probe_trial_dicts, only_last=30)
instead of:
mptd_only30 = partial(merge_probe_trial_dicts, only_last=30)
df4.groupby(["subject", "task_nr"]).agg(mptd_only30)
Another one: if k not in out.keys() is needlessly indirect, since out.keys() is looked up on every check (and in Python 2 it builds a fresh list each time); plain k not in out does the same membership test. Better still, use a defaultdict for simpler and more efficient code:
from collections import defaultdict

def merge_probe_trial_dicts(x, only_last=None):
    out = defaultdict(list)
    for d in x.dropna():
        for key, value in d.items():
            out[key] += value
    if only_last is not None:
        for key, value in out.items():
            out[key] = value[-only_last:]
    return out
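A quick check of the defaultdict version on a toy Series of dicts (the function is restated so the snippet is self-contained; the data is made up, with list values as in the question):

```python
from collections import defaultdict
import pandas as pd

def merge_probe_trial_dicts(x, only_last=None):
    out = defaultdict(list)
    for d in x.dropna():
        for key, value in d.items():
            out[key] += value  # += builds a fresh list; the input lists stay untouched
    if only_last is not None:
        for key, value in out.items():
            out[key] = value[-only_last:]
    return out

s = pd.Series([{"a": [1, 2]}, None, {"a": [3], "b": [4]}])
merged = merge_probe_trial_dicts(s, only_last=2)
print(dict(merged))  # {'a': [2, 3], 'b': [4]}
```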
groupby and select mode and join back onto original dataframe
If you want the mode, you can use:
from statistics import mode
df['Mode'] = df.groupby(['Name'])['Numbers'].transform(mode)
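A minimal runnable sketch (toy data; Name and Numbers are the column names from the question):

```python
from statistics import mode
import pandas as pd

df = pd.DataFrame({
    "Name": ["a", "a", "a", "b", "b"],
    "Numbers": [1, 1, 2, 3, 3],
})

# transform applies mode per group and broadcasts the result to every row
df["Mode"] = df.groupby(["Name"])["Numbers"].transform(mode)
print(df)
```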