Aggregating Unique Values in Columns to Single Dataframe "Cell"

How to get unique values from multiple columns in a pandas groupby

You can do it with apply:

import numpy as np
g = df.groupby('c')[['l1', 'l2']].apply(lambda x: list(np.unique(x)))
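For instance, with a small made-up frame whose column names follow the snippet (a sketch for illustration; note the double brackets, which newer pandas versions require for selecting multiple columns):

```python
import numpy as np
import pandas as pd

# Toy frame: grouping column 'c' and two value columns 'l1', 'l2'.
df = pd.DataFrame({'c': ['x', 'x', 'y'],
                   'l1': [1, 2, 2],
                   'l2': [2, 3, 4]})

# Collect the unique values across both columns, one list per group.
g = df.groupby('c')[['l1', 'l2']].apply(lambda x: list(np.unique(x)))
print(g)
```

Each group's two columns are flattened together before np.unique, so the result is one de-duplicated, sorted list per group.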

Pandas, for each unique value in one column, get unique values in another column

Here are two strategies to do it. No doubt, there are other ways.

Assuming your dataframe looks something like this (obviously with more columns):

df = pd.DataFrame({'author':['a', 'a', 'b'], 'subreddit':['sr1', 'sr2', 'sr2']})

>>> df
  author subreddit
0      a       sr1
1      a       sr2
2      b       sr2

SOLUTION 1: groupby

More straightforward than solution 2, and similar to your first attempt:

group = df.groupby('author')

df2 = group.apply(lambda x: x['subreddit'].unique())

# Alternatively, same thing as a one liner:
# df2 = df.groupby('author').apply(lambda x: x['subreddit'].unique())

Result:

>>> df2
author
a    [sr1, sr2]
b         [sr2]
dtype: object

The author is the index, and the single column holds the list of all subreddits they are active in (this is how I interpreted your desired output, based on your description).

If you want each subreddit in a separate column instead, which might be more usable depending on what you do with it next, just follow up with:

df2 = df2.apply(pd.Series)

Result:

>>> df2
          0    1
author
a       sr1  sr2
b       sr2  NaN
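For what it's worth, the same result can be had without the lambda by selecting the column first and calling unique through the groupby (a runnable sketch on the sample frame above):

```python
import pandas as pd

df = pd.DataFrame({'author': ['a', 'a', 'b'],
                   'subreddit': ['sr1', 'sr2', 'sr2']})

# One array of unique subreddits per author.
df2 = df.groupby('author')['subreddit'].unique()

# Optionally split the arrays into separate columns.
wide = df2.apply(pd.Series)
print(wide)
```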

SOLUTION 2: Iterate through the dataframe

You can make a new dataframe with all unique authors:

df2 = pd.DataFrame({'author':df.author.unique()})

And then just get the list of all unique subreddits they are active in, assigning it to a new column:

df2['subreddits'] = [list(set(df['subreddit'].loc[df['author'] == x['author']]))
                     for _, x in df2.iterrows()]

This gives you:

>>> df2
  author  subreddits
0      a  [sr2, sr1]
1      b       [sr2]

Aggregate unique values from multiple columns with pandas GroupBy

Use groupby and agg, and aggregate only unique values by calling Series.unique:

df.astype(str).groupby('prop1').agg(lambda x: ','.join(x.unique()))

              prop2       prop3      prop4
prop1
K20         12,1,66  travis,leo   10.0,4.0
L30      3,54,11,10    bob,john  11.2,10.0

To keep the groups in their original row order, pass sort=False:

df.astype(str).groupby('prop1', sort=False).agg(lambda x: ','.join(x.unique()))

              prop2       prop3      prop4
prop1
L30      3,54,11,10    bob,john  11.2,10.0
K20         12,1,66  travis,leo   10.0,4.0

If handling NaNs is important, call fillna in advance:

import re
df.fillna('').astype(str).groupby('prop1').agg(
    lambda x: re.sub(',+', ',', ','.join(x.unique()))
)

              prop2       prop3      prop4
prop1
K20         12,1,66  travis,leo   10.0,4.0
L30      3,54,11,10    bob,john  11.2,10.0
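An alternative sketch that avoids the regex cleanup by dropping NaNs inside the aggregation itself (sample data assumed, loosely matching the columns above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'prop1': ['K20', 'K20', 'L30'],
                   'prop2': [12, 1, 3],
                   'prop3': ['travis', np.nan, 'bob']})

# Drop NaNs per group before joining, so no empty fields appear.
out = df.groupby('prop1').agg(
    lambda x: ','.join(x.dropna().astype(str).unique())
)
print(out)
```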

Unique combinations of values in selected columns in pandas data frame and count

You can group by columns 'A' and 'B', call size, then reset_index and rename the generated column:

In [26]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})

Out[26]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

Update

A little explanation: by grouping on the 2 columns, rows where the A and B values are the same fall into the same group; calling size then returns the number of rows in each group:

In [202]:
df1.groupby(['A','B']).size()

Out[202]:
A    B
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64

So now to restore the grouped columns, we call reset_index:

In [203]:
df1.groupby(['A','B']).size().reset_index()

Out[203]:
     A    B  0
0   no   no  1
1   no  yes  2
2  yes   no  4
3  yes  yes  3

This restores the grouping columns, but the size aggregation lands in a generated column named 0, so we have to rename it:

In [204]:
df1.groupby(['A','B']).size().reset_index().rename(columns={0:'count'})

Out[204]:
     A    B  count
0   no   no      1
1   no  yes      2
2  yes   no      4
3  yes  yes      3

groupby does accept an as_index argument, which we could set to False so the grouped columns don't become the index; but with size this still produced a Series (recent pandas versions return a DataFrame with a size column instead), so you'd still have to tidy up the result:

In [205]:
df1.groupby(['A','B'], as_index=False).size()

Out[205]:
A    B
no   no     1
     yes    2
yes  no     4
     yes    3
dtype: int64
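Two shortcuts worth knowing: Series.reset_index takes a name argument, which removes the rename step, and recent pandas versions offer DataFrame.value_counts to do the whole thing in one call. A sketch with made-up data:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['yes', 'yes', 'yes', 'no'],
                    'B': ['yes', 'yes', 'no', 'no']})

# Name the size column directly instead of renaming it afterwards.
counts = df1.groupby(['A', 'B']).size().reset_index(name='count')
print(counts)

# Equivalent one-liner on pandas 1.1+ (sorted by count, descending).
counts2 = df1.value_counts(['A', 'B']).reset_index(name='count')
```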

How can I merge rows by same value in a column in Pandas with aggregation functions?

You are looking for

aggregation_functions = {'price': 'sum', 'amount': 'sum', 'name': 'first'}
df_new = df.groupby('id').aggregate(aggregation_functions)

which gives

    price  amount     name
id
1     130       3     anna
2      42      30      bob
3       3     110  charlie
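As a runnable sketch, with sample data reconstructed (assumed) to reproduce the output above:

```python
import pandas as pd

# Assumed input: duplicate ids whose prices and amounts should be summed.
df = pd.DataFrame({'id': [1, 1, 2, 3, 3],
                   'price': [100, 30, 42, 1, 2],
                   'name': ['anna', 'anna', 'bob', 'charlie', 'charlie'],
                   'amount': [1, 2, 30, 100, 10]})

aggregation_functions = {'price': 'sum', 'amount': 'sum', 'name': 'first'}
df_new = df.groupby('id').aggregate(aggregation_functions)
print(df_new)
```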

Multiple rows to single cell space delimited values in pandas with group by

Convert the 'value' column from int to string, then perform a groupby on 'id' and apply the str.join function:

# Convert 'value' column to string.
df1['value'] = df1['value'].astype(str)

# Perform a groupby and apply a string join.
df1 = df1.groupby('id')['value'].apply(' '.join).reset_index()

The resulting output:

   id  value
0   1  67 45
1   2  7 5 9
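The same thing works as a single chain, casting to string inside the groupby rather than mutating the column first (sample data assumed from the output):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'value': [67, 45, 7, 5, 9]})

# Cast to string per group and space-join in one chain.
out = (df1.groupby('id')['value']
          .apply(lambda s: ' '.join(s.astype(str)))
          .reset_index())
print(out)
```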

Pandas aggregate count distinct

How about either of:

>>> df
         date  duration user_id
0  2013-04-01        30    0001
1  2013-04-01        15    0001
2  2013-04-01        20    0002
3  2013-04-02        15    0002
4  2013-04-02        30    0002
>>> df.groupby("date").agg({"duration": np.sum, "user_id": pd.Series.nunique})
            duration  user_id
date
2013-04-01        65        2
2013-04-02        45        1
>>> df.groupby("date").agg({"duration": np.sum, "user_id": lambda x: x.nunique()})
            duration  user_id
date
2013-04-01        65        2
2013-04-02        45        1
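On recent pandas versions, which warn about passing NumPy functions to agg, the same aggregation reads more cleanly with string aggregator names — a sketch:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2013-04-01'] * 3 + ['2013-04-02'] * 2,
                   'duration': [30, 15, 20, 15, 30],
                   'user_id': ['0001', '0001', '0002', '0002', '0002']})

# String aggregator names avoid both the lambda and the NumPy reference.
out = df.groupby('date').agg({'duration': 'sum', 'user_id': 'nunique'})
print(out)
```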

Aggregate pandas dataframe but collapse duplicate cell values

If I understand correctly:

Try groupby() + agg(), using set to collect unique values instead of a list:

df=df.groupby('query').agg(lambda x:' | '.join(set(x)))

OR

If order matters, use pd.unique(), which keeps values in order of first appearance:

df=df.groupby('query').agg(lambda x:' | '.join(pd.unique(x)))

OR

If you want to perform this on selected columns only, create a list of those columns and run the aggregation on just those:

cols=['knum','definition','A','B','C']
df=df.groupby('query')[cols].agg(lambda x:' | '.join(set(x)))
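A minimal runnable sketch of the set-based variant (column names and data assumed for illustration):

```python
import pandas as pd

# Assumed sample: duplicate rows per query with repeated and varying values.
df = pd.DataFrame({'query': ['q1', 'q1', 'q2'],
                   'knum': ['k1', 'k1', 'k2'],
                   'definition': ['d1', 'd2', 'd3']})

# Aggregate only the listed columns; set de-duplicates before joining.
cols = ['knum', 'definition']
out = df.groupby('query')[cols].agg(lambda x: ' | '.join(set(x)))
print(out)
```

Note that set does not preserve order, which is exactly why the pd.unique() variant above exists.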

