Pandas groupby: How to get a union of strings
In [4]: df = read_csv(StringIO(data),sep=r'\s+')
In [5]: df
Out[5]:
   A         B       C
0  1  0.749065    This
1  2  0.301084      is
2  3  0.463468       a
3  4  0.643961  random
4  1  0.866521  string
5  2  0.120737       !
In [6]: df.dtypes
Out[6]:
A int64
B float64
C object
dtype: object
When you apply your own function, non-numeric columns are not automatically excluded. This is slower, though, than calling .sum() directly on the groupby.
In [8]: df.groupby('A').apply(lambda x: x.sum())
Out[8]:
   A         B           C
A
1  2  1.615586  Thisstring
2  4  0.421821         is!
3  3  0.463468           a
4  4  0.643961      random
sum concatenates strings by default:
In [9]: df.groupby('A')['C'].apply(lambda x: x.sum())
Out[9]:
A
1 Thisstring
2 is!
3 a
4 random
dtype: object
You can do pretty much what you want
In [11]: df.groupby('A')['C'].apply(lambda x: "{%s}" % ', '.join(x))
Out[11]:
A
1 {This, string}
2 {is, !}
3 {a}
4 {random}
dtype: object
To do this on a whole frame, one group at a time, the key is to return a Series:
def f(x):
    return Series(dict(A = x['A'].sum(),
                       B = x['B'].sum(),
                       C = "{%s}" % ', '.join(x['C'])))
In [14]: df.groupby('A').apply(f)
Out[14]:
   A         B               C
A
1  2  1.615586  {This, string}
2  4  0.421821         {is, !}
3  3  0.463468             {a}
4  4  0.643961        {random}
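The same per-column result can also be sketched with named aggregation, which avoids building a Series by hand for each group. This is an alternative to the apply approach above, not a replacement for it; the frame is rebuilt from the six rows shown at the top, and the group key A is left as the index rather than summed.

```python
import pandas as pd

# Same six rows as the frame shown at the top of this answer.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 1, 2],
    "B": [0.749065, 0.301084, 0.463468, 0.643961, 0.866521, 0.120737],
    "C": ["This", "is", "a", "random", "string", "!"],
})

# Named aggregation: output_column=(source_column, aggregation).
result = df.groupby("A").agg(
    B=("B", "sum"),
    C=("C", lambda s: "{%s}" % ", ".join(s)),
)
print(result)
```

Each keyword names one output column, so the numeric column can use the built-in sum while only the string column goes through a Python-level function.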
Data Error Using function and groupby to union strings in pandas dataframe
I think some column is numeric and needs to be a string, so use astype(str); if you also need to remove NaNs, add dropna:
def f(x):
    return pd.Series(dict(A = x['Entry'].sum(),
                          B = ''.join(x['Address'].dropna().astype(str)),
                          C = '; '.join(x['ShortOrdDesc'].astype(str))))
myobj = ordersToprint.groupby('Entry').apply(f)
print (myobj)
          A               B                              C
Entry
988     988  Fake Address 1                     SC_M_W_3_1
989     989  Fake Address 2                     SC_M_W_3_3
992    2976  Fake Address 3  nan_2; SC_M_G_1_1; SC_M_O_1_1
Another solution uses agg, but then the columns need to be renamed:
f = {'Entry':'sum',
     'Address' : lambda x: ''.join(x.dropna().astype(str)),
     'ShortOrdDesc' : lambda x: '; '.join(x.astype(str))}
cols = {'Entry':'A','Address':'B','ShortOrdDesc':'C'}
myobj = ordersToprint.groupby('Entry').agg(f).rename(columns=cols)[['A','B','C']]
print (myobj)
          A               B                              C
Entry
988     988  Fake Address 1                     SC_M_W_3_1
989     989  Fake Address 2                     SC_M_W_3_3
992    2976  Fake Address 3  nan_2; SC_M_G_1_1; SC_M_O_1_1
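To see why the cast matters, here is a small self-contained sketch. The frame orders and the helper join_text are hypothetical stand-ins (the real ordersToprint is not shown); the point is only that str.join raises a TypeError on NaN or numbers, so missing values are dropped and the rest are cast before joining.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ordersToprint; the real frame is not shown.
orders = pd.DataFrame({
    "Entry": [988, 992, 992, 992],
    "Address": ["Fake Address 1", "Fake Address 3", np.nan, np.nan],
    "ShortOrdDesc": ["SC_M_W_3_1", np.nan, "SC_M_G_1_1", 7],
})

def join_text(values, sep):
    # Drop missing values first, then cast so str.join only sees strings.
    return sep.join(values.dropna().astype(str))

out = orders.groupby("Entry").agg(
    Address=("Address", lambda s: join_text(s, "")),
    ShortOrdDesc=("ShortOrdDesc", lambda s: join_text(s, "; ")),
)
print(out)
```

Calling dropna before astype(str) keeps the literal text "nan" out of the joined result; casting first would turn each NaN into the string "nan".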
Pandas Dataframe Groupby join string whilst preserving order of strings
Use the sort=False parameter in groupby, and drop_duplicates instead of set:
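df = df.sort_values(
    ['id', 'order_column']
).groupby('id', sort=False).agg(
    {
        # drop_duplicates keeps first occurrences in their sorted order
        'channel': lambda x: ' > '.join(x.drop_duplicates()),
        # the string 'sum' avoids needing a NumPy import
        'value': 'sum'
    }
)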
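A runnable sketch of the same idea on made-up data (the column values here are assumptions chosen only to show the ordering): rows are sorted first, and drop_duplicates then keeps the first occurrence of each channel in that order, which set would not guarantee.

```python
import pandas as pd

# Made-up data: rows arrive unsorted, and channels repeat within an id.
df = pd.DataFrame({
    "id": [2, 1, 2, 1, 2],
    "order_column": [2, 1, 1, 2, 3],
    "channel": ["email", "ads", "search", "ads", "search"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

out = (
    df.sort_values(["id", "order_column"])
      .groupby("id", sort=False)
      .agg({
          # Keep the first occurrence of each channel, in sorted row order.
          "channel": lambda x: " > ".join(x.drop_duplicates()),
          "value": "sum",
      })
)
print(out)
```

For id 2 the sorted channels are search, email, search, so the joined path is "search > email".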
How to use groupby to concatenate strings in python pandas?
You can apply join to your column after the groupby:
df.groupby('index')['words'].apply(','.join)
Example:
In [326]:
df = pd.DataFrame({'id':['a','a','b','c','c'], 'words':['asd','rtr','s','rrtttt','dsfd']})
df
Out[326]:
  id   words
0  a     asd
1  a     rtr
2  b       s
3  c  rrtttt
4  c    dsfd
In [327]:
df.groupby('id')['words'].apply(','.join)
Out[327]:
id
a asd,rtr
b s
c rrtttt,dsfd
Name: words, dtype: object
Python Pandas: Groupby Sum AND Concatenate Strings
Let us make it into one line
df.groupby(['ID','Name'],as_index=False).agg(lambda x : x.sum() if x.dtype=='float64' else ' '.join(x))
Out[1510]:
   ID Name  COMMENT1      COMMENT2  NUM
0   1  dan    hi you  hello friend  3.0
1   2  jon       dog           cat  0.5
2   3  jon  yeah yes       nope no  3.1
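The one-liner only sums float64 columns, so an integer column would fall through to ' '.join and fail. A slightly more defensive sketch, assuming a hypothetical helper sum_or_join and made-up data with an extra integer column COUNT, checks for any numeric dtype instead:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up data; COUNT is an integer column added to show the dtype check.
df = pd.DataFrame({
    "ID": [1, 1, 2],
    "Name": ["dan", "dan", "jon"],
    "COMMENT1": ["hi", "hello", "dog"],
    "NUM": [1.0, 2.0, 0.5],
    "COUNT": [1, 2, 3],
})

def sum_or_join(col):
    # Sum any numeric column (int or float); join everything else as text.
    return col.sum() if is_numeric_dtype(col) else " ".join(col)

out = df.groupby(["ID", "Name"], as_index=False).agg(sum_or_join)
print(out)
```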
pandas groupby and join lists
object dtype is a catch-all dtype that basically means not int, float, bool, datetime, or timedelta, so the column is storing the values as lists. convert_objects tries to convert a column to one of those dtypes.
You want:
In [63]: df
Out[63]:
   a          b    c
0  1  [1, 2, 3]  foo
1  1     [2, 5]  bar
2  2     [5, 6]  baz
In [64]: df.groupby('a').agg({'b': 'sum', 'c': lambda x: ' '.join(x)})
Out[64]:
         c                b
a
1  foo bar  [1, 2, 3, 2, 5]
2      baz           [5, 6]
This groups the data frame by the values in column a.
This is doing a regular list sum (concatenation), just like [1, 2, 3] + [2, 5], with the result [1, 2, 3, 2, 5].
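If you would rather not rely on sum for lists, one possible variant is to flatten each group explicitly with itertools.chain. This is a sketch of an alternative, not the method used above; it rebuilds the same three-row frame and should produce the same lists.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    "a": [1, 1, 2],
    "b": [[1, 2, 3], [2, 5], [5, 6]],
    "c": ["foo", "bar", "baz"],
})

out = df.groupby("a").agg(
    # Flatten the lists of each group into one list, preserving row order.
    b=("b", lambda s: list(chain.from_iterable(s))),
    c=("c", lambda s: " ".join(s)),
)
print(out)
```

chain.from_iterable walks each list once, whereas repeated + builds a new list at every step, which may matter for groups with many rows.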
Pandas: Union strings in dataframe
This gets really close; I'm not sure whether getting the order exactly right is important to you.
Also, I assumed that I should group by ID. This means that if the same ID spans across another ID and is still in the same subdomain, I'll aggregate the active_seconds.
def proc_id(df):
    cond = df.subdomain != df.subdomain.shift()
    part = cond.cumsum()
    df_ = df.groupby(part).first()
    df_.active_seconds = df.groupby(part).active_seconds.sum()
    return df_
df.groupby('ID').apply(proc_id).reset_index(drop=True)
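The core of proc_id is the shift/cumsum step, which labels each run of consecutive equal values. A minimal sketch on made-up data for a single ID (the values and the name run are assumptions) shows what those labels look like and how they drive the aggregation:

```python
import pandas as pd

# Made-up rows for one ID; subdomain "a" appears in two separate runs.
df = pd.DataFrame({
    "subdomain": ["a", "a", "b", "a", "a"],
    "active_seconds": [10, 5, 3, 7, 1],
})

# True wherever the subdomain differs from the previous row (a run starts).
starts = df["subdomain"] != df["subdomain"].shift()
# The running count of run starts gives every row its run label.
run = starts.cumsum().rename("run")

out = df.groupby(run).agg(
    subdomain=("subdomain", "first"),
    active_seconds=("active_seconds", "sum"),
)
print(run.tolist())
print(out)
```

Because the two runs of "a" get different labels, they are summed separately instead of being merged into one row.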
Conditionally concatenate strings within a groupby aggregate function
I can't find a way to do this within agg, so if anyone does, please say so.
However, it's easily done outside of agg with:
df_table_acc=df.groupby(['SYSTIME'],as_index=False).agg( #Remove TABLE from first agg
{'TT' : 'max','REC' : 'sum', 'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc = pd.merge(df_table_acc, df[df['cat_a']>0].copy().groupby(['SYSTIME'],as_index=False).agg(
{'TABLE':';'.join}),how='left',on='SYSTIME')
This was edited for indexing issues. We are now using merge on SYSTIME to make sure the TABLE matches the SYSTIME.
Alternatively, by changing the data, with a bit of cleanup afterwards (EDIT: fixed this part and added better separation)
import re
df['TABLE'] = df.apply(lambda x: x['TABLE'] if x['cat_a']>0 else '', axis=1)
df_table_acc=df.groupby(['SYSTIME'],as_index=False).agg(
{'TT' : 'max','REC' : 'sum','TABLE': ';'.join,
'cat_a': 'sum', 'cat_b': 'sum', 'cat_c': 'sum'})
df_table_acc.TABLE = df_table_acc.TABLE.apply(lambda x: re.sub(';+',';',x).strip(';'))
#Quick explanation: re.sub collapses repeated ";", e.g. "A;;C;D;;G" -> "A;C;D;G"
#The strip removes leading and trailing ";", e.g. ";A;B;" -> "A;B"
Make sure you don't need the TABLE column for anything else before using the second method, or use a dummy column like TABLE2 or something.
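A third possibility, sketched here on made-up data, is to mask the unwanted names with where and drop them inside the join, which leaves TABLE untouched and needs no regex cleanup. The helper column TABLE_kept is a hypothetical name, and only some of the original columns are included to keep the example short.

```python
import pandas as pd

# Made-up data with the same column names as above.
df = pd.DataFrame({
    "SYSTIME": [1, 1, 1, 2, 2],
    "TABLE": ["A", "B", "C", "D", "E"],
    "REC": [1, 2, 3, 4, 5],
    "cat_a": [1, 0, 2, 0, 0],
})

# Keep TABLE only where cat_a > 0; every other row becomes missing.
kept = df["TABLE"].where(df["cat_a"] > 0)

out = (
    df.assign(TABLE_kept=kept)
      .groupby("SYSTIME", as_index=False)
      .agg(
          REC=("REC", "sum"),
          cat_a=("cat_a", "sum"),
          # Missing entries are dropped, so no stray separators appear.
          TABLE=("TABLE_kept", lambda s: ";".join(s.dropna())),
      )
)
print(out)
```

A group with no qualifying rows ends up with an empty string here, whereas the merge version leaves NaN for such groups.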