How to get unique values from multiple columns in a pandas groupby
You can do it with apply:
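import numpy as np
# Select several columns with a list (double brackets); recent pandas versions reject the tuple form ['l1','l2'].
g = df.groupby('c')[['l1','l2']].apply(lambda x: list(np.unique(x)))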
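For a self-contained illustration, here is a minimal sketch of the same idea. The frame and its values are made up for the example; only the column names c, l1 and l2 come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'c' is the grouping key, 'l1' and 'l2' hold the values.
df = pd.DataFrame({
    "c": ["x", "x", "y"],
    "l1": ["a", "b", "b"],
    "l2": ["b", "c", "d"],
})

# For each group, flatten both columns into one array and keep the
# sorted distinct values as a plain list.
g = df.groupby("c")[["l1", "l2"]].apply(lambda x: list(np.unique(x.to_numpy())))
print(g)
```

Note that np.unique sorts its result, so the original order of appearance is not preserved.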
Pandas, for each unique value in one column, get unique values in another column
Here are two strategies to do it. No doubt, there are other ways.
Assuming your dataframe looks something like this (obviously with more columns):
df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})
>>> df
author subreddit
0 a sr1
1 a sr2
2 b sr2
...
Solution 1: groupby
More straightforward than solution 2, and similar to your first attempt:
group = df.groupby('author')
df2 = group.apply(lambda x: x['subreddit'].unique())
# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())
Result:
>>> df2
author
a [sr1, sr2]
b [sr2]
The author is the index, and the single column holds the array of all subreddits that author is active in (this is how I read the output you described).
If you would rather have each subreddit in its own column, which may be more usable depending on what you do next, you can follow up with:
df2 = df2.apply(pd.Series)
Result:
>>> df2
0 1
author
a sr1 sr2
b sr2 NaN
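As a variation on solution 1, the per-group lambda can probably be dropped by selecting the column first and calling unique on the grouped Series. This is a sketch using the example frame above; the names per_author and wide are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"author": ["a", "a", "b"], "subreddit": ["sr1", "sr2", "sr2"]})

# One array of distinct subreddits per author, in order of first appearance.
per_author = df.groupby("author")["subreddit"].unique()

# Optional: spread the values into columns; shorter rows are padded with
# a missing value.
wide = pd.DataFrame([list(v) for v in per_author], index=per_author.index)
print(per_author)
print(wide)
```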
Solution 2: Iterate through dataframe
You can make a new dataframe with all unique authors:
df2 = pd.DataFrame({'author':df.author.unique()})
And then just get the list of all unique subreddits they are active in, assigning it to a new column:
df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']]))
                     for _, x in df2.iterrows()]
This gives you this:
>>> df2
author subreddits
0 a [sr2, sr1]
1 b [sr2]
Aggregate unique values from multiple columns with pandas GroupBy
Use groupby and agg, and aggregate only unique values by calling Series.unique:
df.astype(str).groupby('prop1').agg(lambda x: ','.join(x.unique()))
prop2 prop3 prop4
prop1
K20 12,1,66 travis,leo 10.0,4.0
L30 3,54,11,10 bob,john 11.2,10.0
# sort=False keeps the groups in order of first appearance instead of sorting the keys
df.astype(str).groupby('prop1', sort=False).agg(lambda x: ','.join(x.unique()))
prop2 prop3 prop4
prop1
L30 3,54,11,10 bob,john 11.2,10.0
K20 12,1,66 travis,leo 10.0,4.0
If handling NaNs is important, call fillna in advance:
import re
df.fillna('').astype(str).groupby('prop1').agg(
    lambda x: re.sub(',+', ',', ','.join(x.unique()))
)
prop2 prop3 prop4
prop1
K20 12,1,66 travis,leo 10.0,4.0
L30 3,54,11,10 bob,john 11.2,10.0
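An alternative to the fillna plus regex cleanup is to drop the missing values inside the aggregation function, so no empty strings are joined in the first place. The sketch below uses made-up data, and join_unique is a hypothetical helper name, not something from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in 'prop3'.
df = pd.DataFrame({
    "prop1": ["K20", "K20", "L30"],
    "prop2": [12, 1, 3],
    "prop3": ["travis", np.nan, "bob"],
})

def join_unique(s):
    # Drop NaN first, then keep distinct values in order of appearance.
    return ",".join(s.dropna().astype(str).unique())

out = df.groupby("prop1").agg(join_unique)
print(out)
```

This also avoids the leading or trailing comma that the regex version can leave behind when the empty string is the first or last unique value.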
Unique combinations of values in selected columns in pandas data frame and count
You can groupby on cols 'A' and 'B' and call size and then reset_index and rename the generated column:
In [26]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[26]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
Update
A little explanation: grouping on the two columns collects the rows where the A and B values are the same. We then call size, which returns the number of rows in each group:
In[202]:
df1.groupby(['A','B']).size()
Out[202]:
A B
no no 1
yes 2
yes no 4
yes 3
dtype: int64
So now to restore the grouped columns, we call reset_index:
In[203]:
df1.groupby(['A','B']).size().reset_index()
Out[203]:
A B 0
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
This restores the grouped columns as regular columns, but the size aggregation ends up in a generated column named 0, so we have to rename it:
In[204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})
Out[204]:
A B count
0 no no 1
1 no yes 2
2 yes no 4
3 yes yes 3
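groupby does accept the arg as_index, which we could have set to False so it doesn't make the grouped columns the index. In the pandas version used here this still generates a Series, so you'd still have to restore the indices and so on (since pandas 1.1, size with as_index=False returns a DataFrame with a size column instead):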
In[205]:
df1.groupby(['A','B'], as_index=False).size()
Out[205]:
A B
no no 1
yes 2
yes no 4
yes 3
dtype: int64
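The reset_index and rename steps can be collapsed into one call, because Series.reset_index accepts a name argument for the values column. In this sketch, df1 is made-up data chosen to reproduce the counts shown above.

```python
import pandas as pd

# Hypothetical data: 1 (no, no), 2 (no, yes), 4 (yes, no), 3 (yes, yes).
df1 = pd.DataFrame({
    "A": ["no"] * 3 + ["yes"] * 7,
    "B": ["no", "yes", "yes"] + ["no"] * 4 + ["yes"] * 3,
})

# size() counts rows per (A, B) pair; name="count" labels the resulting
# column directly, so no separate rename is needed.
counts = df1.groupby(["A", "B"]).size().reset_index(name="count")
print(counts)
```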
How can I merge rows by same value in a column in Pandas with aggregation functions?
You are looking for
aggregation_functions = {'price': 'sum', 'amount': 'sum', 'name': 'first'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)
which gives
price name amount
id
1 130 anna 3
2 42 bob 30
3 3 charlie 110
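The same result can be written with named aggregation, which states the output column, the source column and the function together. The input rows below are hypothetical, chosen only so that the totals match the output above.

```python
import pandas as pd

# Hypothetical input: several rows per id.
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "name": ["anna", "anna", "bob", "charlie", "charlie"],
    "price": [100, 30, 42, 1, 2],
    "amount": [1, 2, 30, 10, 100],
})

# Each keyword is an output column: (source column, aggregation function).
df_new = df.groupby("id").agg(
    price=("price", "sum"),
    amount=("amount", "sum"),
    name=("name", "first"),
)
print(df_new)
```

Call df_new.reset_index() afterwards if id should be a regular column rather than the index.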
Multiple rows to single cell space delimited values in pandas with group by
Convert the 'value' column from int to string, then perform a groupby on 'id' and apply the str.join function:
# Convert 'value' column to string.
df1['value'] = df1['value'].astype(str)
# Perform a groupby and apply a string join.
df1 = df1.groupby('id')['value'].apply(' '.join).reset_index()
The resulting output:
id value
0 1 67 45
1 2 7 5 9
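If you would rather not overwrite the 'value' column in place, one option is to convert to string on the fly and group by the 'id' column. This is a sketch; the input frame is reconstructed from the output above and is an assumption.

```python
import pandas as pd

# Assumed input, reconstructed from the output shown above.
df1 = pd.DataFrame({"id": [1, 1, 2, 2, 2], "value": [67, 45, 7, 5, 9]})

# Convert a copy of the column to strings, group it by id, and join each
# group with a single space. df1 itself is left unchanged.
result = (
    df1["value"].astype(str)
    .groupby(df1["id"])
    .agg(" ".join)
    .reset_index()
)
print(result)
```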
Pandas aggregate count distinct
How about either of:
>>> df
date duration user_id
0 2013-04-01 30 0001
1 2013-04-01 15 0001
2 2013-04-01 20 0002
3 2013-04-02 15 0002
4 2013-04-02 30 0002
>>> df.groupby("date").agg({"duration": "sum", "user_id": pd.Series.nunique})
duration user_id
date
2013-04-01 65 2
2013-04-02 45 1
>>> df.groupby("date").agg({"duration": "sum", "user_id": lambda x: x.nunique()})
duration user_id
date
2013-04-01 65 2
2013-04-02 45 1
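A third spelling uses the string names of the aggregations together with named aggregation, so the result columns get descriptive names. The data is the frame shown above; total_duration and distinct_users are illustrative names, not from the original question.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2013-04-01", "2013-04-01", "2013-04-01", "2013-04-02", "2013-04-02"],
    "duration": [30, 15, 20, 15, 30],
    "user_id": ["0001", "0001", "0002", "0002", "0002"],
})

# "sum" adds up the durations; "nunique" counts distinct user ids per date.
summary = df.groupby("date").agg(
    total_duration=("duration", "sum"),
    distinct_users=("user_id", "nunique"),
)
print(summary)
```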
Aggregate pandas dataframe but collapse duplicate cell values
If I understand correctly, try groupby() + agg() and use set for unique values instead of list:
df=df.groupby('query').agg(lambda x:' | '.join(set(x)))
OR
If order is important then use pd.unique() for unique values:
df=df.groupby('query').agg(lambda x:' | '.join(pd.unique(x)))
OR
If you want to do this on selected columns only, create a list of those columns and aggregate only those:
cols=['knum','definition','A','B','C']
df=df.groupby('query')[cols].agg(lambda x:' | '.join(set(x)))
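To make the ordering difference concrete, here is a small sketch on made-up data. Only the 'query' and 'definition' column names come from the answer above. A set has no guaranteed iteration order, so the second variant sorts it to get a reproducible result.

```python
import pandas as pd

# Hypothetical data; 'query' is the grouping key.
df = pd.DataFrame({
    "query": ["q1", "q1", "q1", "q2"],
    "definition": ["zeta", "alpha", "zeta", "beta"],
})

# pd.unique keeps the order of first appearance within each group.
ordered = df.groupby("query").agg(lambda x: " | ".join(pd.unique(x)))

# sorted(set(...)) removes duplicates and gives a stable alphabetical order.
alphabetical = df.groupby("query").agg(lambda x: " | ".join(sorted(set(x))))

print(ordered)
print(alphabetical)
```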