Pandas Groupby: How to Get a Union of Strings

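The session below doesn't show how data was defined; a minimal setup that reproduces the frame (the imports and the exact string are assumptions reconstructed from the output shown) would be:

from io import StringIO
from pandas import read_csv, Series

data = """A B C
1 0.749065 This
2 0.301084 is
3 0.463468 a
4 0.643961 random
1 0.866521 string
2 0.120737 !"""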

In [4]: df = read_csv(StringIO(data),sep='\s+')

In [5]: df
Out[5]:
A B C
0 1 0.749065 This
1 2 0.301084 is
2 3 0.463468 a
3 4 0.643961 random
4 1 0.866521 string
5 2 0.120737 !

In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object

When you apply your own function, there is no automatic exclusion of non-numeric columns. This is slower, though, than applying .sum() directly to the groupby.

In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
A B C
A
1 2 1.615586 Thisstring
2 4 0.421821 is!
3 3 0.463468 a
4 4 0.643961 random

sum by default concatenates

In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object

You can do pretty much what you want

In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object

Doing this on a whole frame, one group at a time. The key is to return a Series:

def f(x):
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))

In [14]: df.groupby('A').apply(f)
Out[14]:
A B C
A
1 2 1.615586 {This, string}
2 4 0.421821 {is, !}
3 3 0.463468 {a}
4 4 0.643961 {random}

Data Error Using function and groupby to union strings in pandas dataframe

I think some columns are numeric and need to be converted to strings.

So use astype, and if you need to remove NaNs, add dropna:

def f(x):
    return pd.Series(dict(A = x['Entry'].sum(),
                          B = ''.join(x['Address'].dropna().astype(str)),
                          C = '; '.join(x['ShortOrdDesc'].astype(str))))

myobj = ordersToprint.groupby('Entry').apply(f)
print (myobj)
A B C
Entry
988 988 Fake Address 1 SC_M_W_3_1
989 989 Fake Address 2 SC_M_W_3_3
992 2976 Fake Address 3 nan_2; SC_M_G_1_1; SC_M_O_1_1

Another solution uses agg, but then it is necessary to rename the columns:

f = {'Entry': 'sum',
     'Address': lambda x: ''.join(x.dropna().astype(str)),
     'ShortOrdDesc': lambda x: '; '.join(x.astype(str))}
cols = {'Entry':'A','Address':'B','ShortOrdDesc':'C'}
myobj = ordersToprint.groupby('Entry').agg(f).rename(columns=cols)[['A','B','C']]
print (myobj)
A B C
Entry
988 988 Fake Address 1 SC_M_W_3_1
989 989 Fake Address 2 SC_M_W_3_3
992 2976 Fake Address 3 nan_2; SC_M_G_1_1; SC_M_O_1_1
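
If your pandas is new enough (0.25+), named aggregation lets you skip the rename step; a sketch using the same frame, which should match the table above:

myobj = ordersToprint.groupby('Entry').agg(
    A=('Entry', 'sum'),
    B=('Address', lambda x: ''.join(x.dropna().astype(str))),
    C=('ShortOrdDesc', lambda x: '; '.join(x.astype(str))))
print (myobj)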

Pandas Dataframe Groupby join string whilst preserving order of strings

Use the sort=False parameter in groupby, and drop_duplicates instead of a set:

df = df.sort_values(
    ['id', 'order_column']
).groupby('id', sort=False).agg(
    {
        'channel': lambda x: ' > '.join(x.drop_duplicates()),
        'value': np.sum
    }
)
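
A quick illustration with made-up data (the column names match the snippet above; the values are only for demonstration). drop_duplicates keeps the first occurrence, so the join preserves first-seen order within each id:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2],
                   'order_column': [1, 2, 3, 1, 2],
                   'channel': ['email', 'web', 'email', 'web', 'web'],
                   'value': [10, 5, 2, 7, 3]})

out = df.sort_values(['id', 'order_column']).groupby('id', sort=False).agg(
    {'channel': lambda x: ' > '.join(x.drop_duplicates()),
     'value': np.sum})
print(out)
# id 1 -> channel 'email > web' (duplicate 'email' dropped), value 17
# id 2 -> channel 'web', value 10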

How to use groupby to concatenate strings in python pandas?

You can apply join on your column after groupby (here 'index' is the name of the grouping column; the example below uses 'id'):

df.groupby('index')['words'].apply(','.join)

Example:

In [326]:
df = pd.DataFrame({'id':['a','a','b','c','c'], 'words':['asd','rtr','s','rrtttt','dsfd']})
df

Out[326]:
id words
0 a asd
1 a rtr
2 b s
3 c rrtttt
4 c dsfd

In [327]:
df.groupby('id')['words'].apply(','.join)

Out[327]:
id
a asd,rtr
b s
c rrtttt,dsfd
Name: words, dtype: object

Python Pandas: Groupby Sum AND Concatenate Strings

Let us make it into one line:

df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))
Out[1510]:
ID Name COMMENT1 COMMENT2 NUM
0 1 dan hi you hello friend 3.0
1 2 jon dog cat 0.5
2 3 jon yeah yes nope no 3.1
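
The input frame isn't shown above; one made-up frame consistent with that output (string columns joined with spaces, NUM summed) would be:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 3, 3],
                   'Name': ['dan', 'dan', 'jon', 'jon', 'jon'],
                   'COMMENT1': ['hi', 'you', 'dog', 'yeah', 'yes'],
                   'COMMENT2': ['hello', 'friend', 'cat', 'nope', 'no'],
                   'NUM': [1.0, 2.0, 0.5, 3.0, 0.1]})

# The lambda sees one column at a time: float64 columns are summed,
# everything else is joined with spaces.
df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))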

pandas groupby and join lists

object dtype is a catch-all dtype that basically means not int, float, bool, datetime, or timedelta, so the lists are stored as plain Python objects. convert_objects tries to convert a column to one of those dtypes.

You want

In [63]: df
Out[63]:
a b c
0 1 [1, 2, 3] foo
1 1 [2, 5] bar
2 2 [5, 6] baz

In [64]: df.groupby('a').agg({'b': 'sum', 'c': lambda x: ' '.join(x)})
Out[64]:
c b
a
1 foo bar [1, 2, 3, 2, 5]
2 baz [5, 6]

This groups the data frame by the values in column a. Read more about groupby.

This is doing a regular list sum (concatenation) just like [1, 2, 3] + [2, 5] with the result [1, 2, 3, 2, 5]
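
For reference, one way to build a frame like that (this construction is an assumption; the original question created the lists differently) and run the same aggregation:

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2],
                   'b': [[1, 2, 3], [2, 5], [5, 6]],
                   'c': ['foo', 'bar', 'baz']})

# 'sum' on an object column falls back to Python +, so the lists are concatenated.
print(df.groupby('a').agg({'b': 'sum', 'c': ' '.join}))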

Pandas: Union strings in dataframe

This gets real close. Not sure if getting that order correct is important to you.

Also, I made an assumption that I should group by ID. This means that if the same ID spans across another ID and is still in the same subdomain, I'll aggregate the active_seconds.

def proc_id(df):
    cond = df.subdomain != df.subdomain.shift()
    part = cond.cumsum()
    df_ = df.groupby(part).first()
    df_.active_seconds = df.groupby(part).active_seconds.sum()
    return df_

df.groupby('ID').apply(proc_id).reset_index(drop=True)
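
For concreteness, here is made-up data of the shape the answer assumes (ID, subdomain, active_seconds) and what proc_id produces for it:

import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'subdomain': ['a.example', 'a.example', 'b.example', 'b.example', 'b.example'],
                   'active_seconds': [10, 20, 5, 7, 3]})

# Consecutive rows with the same subdomain (within an ID) collapse into one row,
# keeping the first row's values and summing active_seconds:
#    ID  subdomain  active_seconds
# 0   1  a.example              30
# 1   1  b.example               5
# 2   2  b.example              10
print(df.groupby('ID').apply(proc_id).reset_index(drop=True))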


Conditionally concatenate strings within a groupby aggregate function

I can't find a way to do this within agg, so if anyone does, please do say.

However it's easily done outside of agg, with:

df_table_acc = df.groupby(['SYSTIME'], as_index=False).agg(    # Remove TABLE from first agg
    {'TT': 'max', 'REC': 'sum', 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc = pd.merge(df_table_acc, df[df['cat_a'] > 0].copy().groupby(['SYSTIME'], as_index=False).agg(
    {'TABLE': ';'.join}), how='left', on='SYSTIME')

This was edited for indexing issues. We are now using merge on SYSTIME to make sure the TABLE matches the SYSTIME.
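
To see what the merge buys you, here is a tiny made-up frame (column names match the snippet; the values are only illustrative). Only rows with cat_a > 0 contribute to TABLE, and SYSTIME groups with no such rows end up with NaN after the left merge:

import pandas as pd

df = pd.DataFrame({'SYSTIME': ['t1', 't1', 't2'],
                   'TT': [5, 9, 4],
                   'REC': [2, 3, 1],
                   'TABLE': ['A', 'B', 'C'],
                   'cat_a': [1, 0, 0],
                   'cat_b': [0, 1, 0],
                   'cat_c': [0, 0, 1]})

df_table_acc = df.groupby(['SYSTIME'], as_index=False).agg(
    {'TT': 'max', 'REC': 'sum', 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc = pd.merge(df_table_acc,
                        df[df['cat_a'] > 0].groupby(['SYSTIME'], as_index=False).agg({'TABLE': ';'.join}),
                        how='left', on='SYSTIME')
print(df_table_acc)
#   SYSTIME  TT  REC  cat_a  cat_b  cat_c TABLE
# 0      t1   9    5      1      1      0     A
# 1      t2   4    1      0      0      1   NaN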

Alternatively, you can do it by changing the data, with a bit of cleanup afterwards (EDIT: fixed this part and added better separation):

import re

df['TABLE'] = df.apply(lambda x: x['TABLE'] if x['cat_a'] > 0 else '', axis=1)
df_table_acc = df.groupby(['SYSTIME'], as_index=False).agg(
    {'TT': 'max', 'REC': 'sum', 'TABLE': ';'.join,
     'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc.TABLE = df_table_acc.TABLE.apply(lambda x: re.sub(';+', ';', x).strip(';'))
# Quick explanation: the re part avoids repeated ";" e.g. "A;;C;D;;G" -> "A;C;D;G"
# The strip removes leading/trailing separators e.g. ";A;B;" -> "A;B"

Make sure you don't need the TABLE column for anything else before using the second method, or use a dummy column like TABLE2 or something.


