Pandas Groupby: How to Get a Union of Strings

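The session below doesn't show how data was defined; a minimal setup that reproduces the frame (the imports and the exact string are assumptions reconstructed from the output shown) would be:

from io import StringIO
from pandas import read_csv, Series

data = """A B C
1 0.749065 This
2 0.301084 is
3 0.463468 a
4 0.643961 random
1 0.866521 string
2 0.120737 !"""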

In [4]: df = read_csv(StringIO(data),sep='\s+')

In [5]: df
Out[5]:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !

In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object

When you apply your own function, there is no automatic exclusion of non-numeric columns. This is slower, though, than applying .sum() directly to the groupby.

In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
A B C
A
1 2 1.615586 Thisstring
2 4 0.421821 is!
3 3 0.463468 a
4 4 0.643961 random

sum by default concatenates

In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object

You can do pretty much what you want

In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object

Doing this on a whole frame, one group at a time. The key is to return a Series:

def f(x):
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))

In [14]: df.groupby('A').apply(f)
Out[14]:
A B C
A
1 2 1.615586 {This, string}
2 4 0.421821 {is, !}
3 3 0.463468 {a}
4 4 0.643961 {random}

Data Error Using function and groupby to union strings in pandas dataframe

I think some columns are numeric and need to be converted to strings.

So use astype, and if you need to remove NaNs, add dropna:

def f(x):
    return pd.Series(dict(A = x['Entry'].sum(),
                          B = ''.join(x['Address'].dropna().astype(str)),
                          C = '; '.join(x['ShortOrdDesc'].astype(str))))

myobj = ordersToprint.groupby('Entry').apply(f)
print (myobj)
A B C
Entry
988 988 Fake Address 1 SC_M_W_3_1
989 989 Fake Address 2 SC_M_W_3_3
992 2976 Fake Address 3 nan_2; SC_M_G_1_1; SC_M_O_1_1

Another solution uses agg, but then it is necessary to rename the columns:

f = {'Entry': 'sum',
     'Address': lambda x: ''.join(x.dropna().astype(str)),
     'ShortOrdDesc': lambda x: '; '.join(x.astype(str))}
cols = {'Entry':'A','Address':'B','ShortOrdDesc':'C'}
myobj = ordersToprint.groupby('Entry').agg(f).rename(columns=cols)[['A','B','C']]
print (myobj)
A B C
Entry
988 988 Fake Address 1 SC_M_W_3_1
989 989 Fake Address 2 SC_M_W_3_3
992 2976 Fake Address 3 nan_2; SC_M_G_1_1; SC_M_O_1_1
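
If your pandas is new enough (0.25+), named aggregation lets you skip the rename step; a sketch using the same frame, which should match the table above:

myobj = ordersToprint.groupby('Entry').agg(
    A=('Entry', 'sum'),
    B=('Address', lambda x: ''.join(x.dropna().astype(str))),
    C=('ShortOrdDesc', lambda x: '; '.join(x.astype(str))))
print (myobj)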

Pandas Dataframe Groupby join string whilst preserving order of strings

Use the sort=False parameter in groupby, and drop_duplicates instead of a set:

df = df.sort_values(
    ['id', 'order_column']
).groupby('id', sort=False).agg(
    {
        'channel': lambda x: ' > '.join(x.drop_duplicates()),
        'value': np.sum
    }
)
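
A quick illustration with made-up data (the column names match the snippet above; the values are only for demonstration). drop_duplicates keeps the first occurrence, so the join preserves first-seen order within each id:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'order_column': [1, 2, 3, 1, 2],
                   'channel': ['email', 'web', 'email', 'web', 'web'],
                   'value': [10, 5, 2, 7, 3]})

out = df.sort_values(['id', 'order_column']).groupby('id', sort=False).agg(
    {'channel': lambda x: ' > '.join(x.drop_duplicates()),
     'value': np.sum})
print(out)
# id 1 -> channel 'email > web' (duplicate 'email' dropped), value 17
# id 2 -> channel 'web', value 10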

How to use groupby to concatenate strings in python pandas?

You can apply join on your column after groupby (here 'index' is the name of the grouping column; the example below uses 'id'):

df.groupby('index')['words'].apply(','.join)

Example:

In [326]:
df = pd.DataFrame({'id':['a','a','b','c','c'], 'words':['asd','rtr','s','rrtttt','dsfd']})
df

Out[326]:
id words
0 a asd
1 a rtr
2 b s
3 c rrtttt
4 c dsfd

In [327]:
df.groupby('id')['words'].apply(','.join)

Out[327]:
id
a asd,rtr
b s
c rrtttt,dsfd
Name: words, dtype: object

Python Pandas: Groupby Sum AND Concatenate Strings

Let us make it into one line:

df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))
Out[1510]:
ID Name COMMENT1 COMMENT2 NUM
0 1 dan hi you hello friend 3.0
1 2 jon dog cat 0.5
2 3 jon yeah yes nope no 3.1
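
The input frame isn't shown above; one made-up frame consistent with that output (string columns joined with spaces, NUM summed) would be:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 3, 3],
                   'Name': ['dan', 'dan', 'jon', 'jon', 'jon'],
                   'COMMENT1': ['hi', 'you', 'dog', 'yeah', 'yes'],
                   'COMMENT2': ['hello', 'friend', 'cat', 'nope', 'no'],
                   'NUM': [1.0, 2.0, 0.5, 3.0, 0.1]})

# The lambda sees one column at a time: float64 columns are summed,
# everything else is joined with spaces.
df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))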

pandas groupby and join lists

object dtype is a catch-all dtype that basically means not int, float, bool, datetime, or timedelta, so the lists are stored as plain Python objects. convert_objects tries to convert a column to one of those dtypes.

You want

In [63]: df
Out[63]:
a b c
0 1 [1, 2, 3] foo
1 1 [2, 5] bar
2 2 [5, 6] baz

In [64]: df.groupby('a').agg({'b': 'sum', 'c': lambda x: ' '.join(x)})
Out[64]:
c b
a
1 foo bar [1, 2, 3, 2, 5]
2 baz [5, 6]

This groups the data frame by the values in column a. Read more about groupby.

This is doing a regular list sum (concatenation) just like [1, 2, 3] + [2, 5] with the result [1, 2, 3, 2, 5]
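
For reference, one way to build a frame like that (this construction is an assumption; the original question created the lists differently) and run the same aggregation:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2],
                   'b': [[1, 2, 3], [2, 5], [5, 6]],
                   'c': ['foo', 'bar', 'baz']})

# 'sum' on an object column falls back to Python +, so the lists are concatenated.
print(df.groupby('a').agg({'b': 'sum', 'c': ' '.join}))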

Pandas: Union strings in dataframe

This gets real close. Not sure if getting that order correct is important to you.

Also, I made an assumption that I should group by ID. This means that if the same ID spans across another ID and is still in the same subdomain, I'll aggregate the active_seconds.

def proc_id(df):
    cond = df.subdomain != df.subdomain.shift()
    part = cond.cumsum()
    df_ = df.groupby(part).first()
    df_.active_seconds = df.groupby(part).active_seconds.sum()
    return df_

df.groupby('ID').apply(proc_id).reset_index(drop=True)
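
For concreteness, here is made-up data of the shape the answer assumes (ID, subdomain, active_seconds) and what proc_id produces for it:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'subdomain': ['a.example', 'a.example', 'b.example', 'b.example', 'b.example'],
                   'active_seconds': [10, 20, 5, 7, 3]})

# Consecutive rows with the same subdomain (within an ID) collapse into one row,
# keeping the first row's values and summing active_seconds:
#    ID  subdomain  active_seconds
# 0   1  a.example              30
# 1   1  b.example               5
# 2   2  b.example              10
print(df.groupby('ID').apply(proc_id).reset_index(drop=True))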


Conditionally concatenate strings within a groupby aggregate function

I can't find a way to do this within agg, so if anyone does, please do say.

However it's easily done outside of agg, with:

df_table_acc = df.groupby(['SYSTIME'], as_index=False).agg(    # Remove TABLE from first agg
    {'TT': 'max', 'REC': 'sum', 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc = pd.merge(df_table_acc, df[df['cat_a'] > 0].copy().groupby(['SYSTIME'], as_index=False).agg(
    {'TABLE': ';'.join}), how='left', on='SYSTIME')

This was edited for indexing issues. We are now using merge on SYSTIME to make sure the TABLE matches the SYSTIME.
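
To see what the merge buys you, here is a tiny made-up frame (column names match the snippet; the values are only illustrative). Only rows with cat_a > 0 contribute to TABLE, and SYSTIME groups with no such rows end up with NaN after the left merge:

import pandas as pd

df = pd.DataFrame({'SYSTIME': ['t1', 't1', 't2'],
                   'TT': [5, 9, 4],
                   'REC': [2, 3, 1],
                   'TABLE': ['A', 'B', 'C'],
                   'cat_a': [1, 0, 0],
                   'cat_b': [0, 1, 0],
                   'cat_c': [0, 0, 1]})

df_table_acc = df.groupby(['SYSTIME'], as_index=False).agg(
    {'TT': 'max', 'REC': 'sum', 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc = pd.merge(df_table_acc,
                        df[df['cat_a'] > 0].groupby(['SYSTIME'], as_index=False).agg({'TABLE': ';'.join}),
                        how='left', on='SYSTIME')
print(df_table_acc)
#   SYSTIME  TT  REC  cat_a  cat_b  cat_c TABLE
# 0      t1   9    5      1      1      0     A
# 1      t2   4    1      0      0      1   NaN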

Alternatively, you can do it by changing the data, with a bit of cleanup afterwards (EDIT: fixed this part and added better separation):

import re

df['TABLE'] = df.apply(lambda x: x['TABLE'] if x['cat_a'] > 0 else '', axis=1)
df_table_acc = df.groupby(['SYSTIME'], as_index=False).agg(
    {'TT': 'max', 'REC': 'sum', 'TABLE': ';'.join,
     'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc.TABLE = df_table_acc.TABLE.apply(lambda x: re.sub(';+', ';', x).strip(';'))
# Quick explanation: the re part avoids repeated ";" e.g. "A;;C;D;;G" -> "A;C;D;G"
# The strip removes leading/trailing separators e.g. ";A;B;" -> "A;B"

Make sure you don't need the TABLE column for anything else before using the second method, or use a dummy column like TABLE2 or something.


