Keep Other Columns When Doing Groupby

Python Keep other columns when using sum() with groupby

Something like this? (Assuming otherstuff1 and otherstuff2 are constant within each name.)

df.groupby(['name','otherstuff1','otherstuff2'], as_index=False).sum()
Out[121]:
   name  otherstuff1  otherstuff2  value1  value2
0  Jack         1.19         2.39       2       3
1  Luke         1.08         1.08       1       1
2  Mark         3.45         3.45       0       1
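As a self-contained sketch of this trick (the sample data is invented, not from the question): because the extra columns are constant within each name, adding them to the groupby keys does not change the groups, but it does keep them in the result.

```python
import pandas as pd

# Hypothetical sample data: otherstuff1/otherstuff2 are constant per name
df = pd.DataFrame({
    'name':        ['Jack', 'Jack', 'Luke', 'Mark'],
    'otherstuff1': [1.19, 1.19, 1.08, 3.45],
    'otherstuff2': [2.39, 2.39, 1.08, 3.45],
    'value1':      [1, 1, 1, 0],
    'value2':      [1, 2, 1, 1],
})

# The constant columns become part of the key, so they survive the sum
out = df.groupby(['name', 'otherstuff1', 'otherstuff2'], as_index=False).sum()
```

If the "other" columns are not constant per group, this silently splits the groups, so it only works under that assumption.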

Keep other columns when doing groupby

Method #1: use idxmin() to get the indices of the elements of minimum diff, and then select those:

>>> df.loc[df.groupby("item")["diff"].idxmin()]
   item  diff  otherstuff
1     1     1           2
6     2    -6           2
7     3     0           0

[3 rows x 3 columns]

Method #2: sort by diff, and then take the first element in each item group:

>>> df.sort_values("diff").groupby("item", as_index=False).first()
   item  diff  otherstuff
0     1     1           2
1     2    -6           2
2     3     0           0

[3 rows x 3 columns]

Note that the resulting indices are different even though the row content is the same.
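Both methods can be reproduced end to end with a small frame constructed to match the output above (the exact input data is assumed, since the question's frame is not shown):

```python
import pandas as pd

# Hypothetical data: several rows per item, with a 'diff' to minimize
df = pd.DataFrame({
    'item':       [1, 1, 1, 2, 2, 2, 2, 3],
    'diff':       [2, 1, 3, -1, 1, 4, -6, 0],
    'otherstuff': [1, 2, 7, 0, 3, 9, 2, 0],
})

# Method 1: idxmin gives the index label of the minimum per group,
# and .loc pulls back the full rows, other columns included
res1 = df.loc[df.groupby('item')['diff'].idxmin()]

# Method 2: sort first, then take the first row of each group
res2 = df.sort_values('diff').groupby('item', as_index=False).first()
```

As the answer notes, `res1` keeps the original index labels while `res2` gets a fresh 0..n-1 index, but the row content is identical.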

Pandas group by: sum specific columns and keep other columns

Pandas has supported missing values in groupby keys since version 1.1, via the dropna parameter.

The first idea is to create a helper column new that replaces missing values with some string, e.g. miss, group by new, aggregate with GroupBy.agg (using GroupBy.first for the kept column), and finally remove the helper level with reset_index:

df = (df.assign(new=df['ColToKeep'].fillna('miss'))
        .groupby(['User', 'new'], sort=False)
        .agg({'Col1ToSum': 'sum', 'Col2ToSum': 'sum', 'ColToKeep': 'first'})
        .reset_index(level=1, drop=True)
        .reset_index())
print(df)
  User  Col1ToSum  Col2ToSum  ColToKeep
0  ABC         40        650      1.015
1  ABA        180        100      2.240
2  AAA         60         20        NaN
3  BBB         10         15        NaN
4  XYZ         10         10      1.100
5  XYZ         10         10      1.500

Another idea is to group by the filled column directly and replace miss back to NaN afterwards:

df = (df.assign(ColToKeep=df['ColToKeep'].fillna('miss'))
        .groupby(['User', 'ColToKeep'], sort=False)[['Col1ToSum', 'Col2ToSum']].sum()
        .reset_index()
        .replace({'ColToKeep': {'miss': np.nan}}))
print(df)
  User  ColToKeep  Col1ToSum  Col2ToSum
0  ABC      1.015         40        650
1  ABA      2.240        180        100
2  AAA        NaN         60         20
3  BBB        NaN         10         15
4  XYZ      1.100         10         10
5  XYZ      1.500         10         10
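On pandas 1.1 or newer, the helper column is unnecessary: `dropna=False` keeps NaN keys in the groupby directly. A minimal sketch with invented data (a subset of the shape above):

```python
import numpy as np
import pandas as pd

# Hypothetical data with NaN in the column to keep
df = pd.DataFrame({
    'User':      ['ABC', 'ABC', 'AAA', 'XYZ'],
    'ColToKeep': [1.015, 1.015, np.nan, 1.1],
    'Col1ToSum': [10, 30, 60, 10],
    'Col2ToSum': [150, 500, 20, 10],
})

# dropna=False (pandas >= 1.1) keeps groups whose key is NaN
out = (df.groupby(['User', 'ColToKeep'], dropna=False, sort=False)
         [['Col1ToSum', 'Col2ToSum']].sum()
         .reset_index())
```

The NaN key survives into the result without any fillna/replace round trip.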

How to GroupBy a Dataframe in Pandas and keep Columns

You want the following:

In [20]:
df.groupby(['Name','Type','ID']).count().reset_index()

Out[20]:
    Name   Type  ID  Count
0  Book1  ebook   1      2
1  Book2  paper   2      2
2  Book3  paper   3      1

In your case the 'Name', 'Type' and 'ID' cols match in values so we can groupby on these, call count and then reset_index.

An alternative approach would be to add the 'Count' column using transform and then call drop_duplicates:

In [25]:
df['Count'] = df.groupby(['Name'])['ID'].transform('count')
df.drop_duplicates()

Out[25]:
    Name   Type  ID  Count
0  Book1  ebook   1      2
1  Book2  paper   2      2
2  Book3  paper   3      1
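Both variants can be run against a reconstructed input (the original frame is assumed from the output; the pre-existing 'Count' column is a guess, needed for .count() to have something to tally):

```python
import pandas as pd

# Hypothetical input: duplicate rows per book, plus a 'Count' column to tally
df = pd.DataFrame({
    'Name':  ['Book1', 'Book1', 'Book2', 'Book2', 'Book3'],
    'Type':  ['ebook', 'ebook', 'paper', 'paper', 'paper'],
    'ID':    [1, 1, 2, 2, 3],
    'Count': [1, 1, 1, 1, 1],
})

# Variant 1: group on all matching columns, count, restore them via reset_index
out = df.groupby(['Name', 'Type', 'ID']).count().reset_index()

# Variant 2: transform broadcasts the count to every row, then duplicates
# collapse because the rows are now identical
df['Count'] = df.groupby('Name')['ID'].transform('count')
out2 = df.drop_duplicates()
```

Note that variant 2 only deduplicates cleanly because the rows within each group are identical in every column.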

Pandas groupby apply on one column and keeping the other columns

There is groupby().agg:

df.groupby('name').agg({
    'value1': complex_function,
    'otherstuff1': 'first',
    'otherstuff2': 'first',
})
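The answer leaves `complex_function` undefined; here is a runnable sketch with a stand-in (sum of squares) and invented sample data, showing that agg accepts any callable that maps a group's Series to a scalar:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'name':        ['Jack', 'Jack', 'Luke'],
    'value1':      [1, 2, 3],
    'otherstuff1': [1.19, 1.19, 1.08],
    'otherstuff2': [2.39, 2.39, 1.08],
})

def complex_function(s):
    # Placeholder aggregation: sum of squares of the group's values
    return (s ** 2).sum()

# Custom aggregation on one column, 'first' keeps the others
out = df.groupby('name').agg({
    'value1': complex_function,
    'otherstuff1': 'first',
    'otherstuff2': 'first',
})
```

Mixing callables and string aliases in one dict is allowed; each column gets its own aggregation.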

Pandas groupby multiple columns and retain all other columns

I was able to get the desired result by including the other columns in the agg function with 'first', while 'QtyOrdered' and 'QtyShipped' are aggregated with 'sum'.

ActualOrders = (PreActualOrders.groupby(['OrderNo', 'ItemCode'])
                .agg({'OrderDate': 'first', 'LineNo': 'first', 'ClientNo': 'first',
                      'QtyOrdered': 'sum', 'QtyShipped': 'sum'})
                .reset_index())

This yields the desired result:

       OrderNo  ItemCode  OrderDate  LineNo  ClientNo  QtyOrdered  QtyShipped
28255   543734   1038324  2/27/2017       3   1254787           1           1
28256   543734  10137992  2/27/2017       1   1254787           1           1
28257   543734  10137993  2/27/2017       2   1254787           1           1
28258   543735   1041106  2/27/2017       4   1816460           1           1
28259   543735   1041108  2/27/2017       3   1816460           1           1
28260   543735  10135359  2/27/2017       2   1816460           1           1
28261   543735  10137993  2/27/2017       1   1816460           1           1

The output example doesn't show any difference between QtyOrdered and QtyShipped because the number of matching cancels is very small; the rows that do have a corresponding cancel are correctly adjusted.

Groupby multiple columns and get the sum of two other columns while keeping the first occurrence of every other column

I think that using two separate operations on the groupby object and joining them afterwards is clearer than a one-liner. Here is a minimal example, grouping on one column:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0, 5.5, 1),
        ("bird", "Psittaciformes", 24.0, 4.5, 2),
        ("mammal", "Carnivora", 80.2, 33.3, 1),
        ("mammal", "Primates", np.nan, 33.7, 2),
        ("mammal", "Carnivora", 58, 23, 3),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "family", "max_speed", "height", "order"),
)
print(df, "\n")

grouped = df.groupby('class')
df_sum = grouped[['max_speed', 'height']].agg('sum')
df_first = grouped['order'].first()
df_out = pd.concat([df_sum, df_first], axis=1)
print(df_out)

Output:

          class          family  max_speed  height  order
falcon     bird   Falconiformes      389.0     5.5      1
parrot     bird  Psittaciformes       24.0     4.5      2
lion     mammal       Carnivora       80.2    33.3      1
monkey   mammal        Primates        NaN    33.7      2
leopard  mammal       Carnivora       58.0    23.0      3

        max_speed  height  order
class
bird        413.0    10.0      1
mammal      138.2    90.0      1

Is there a way I can use groupby.sum and keep other columns?

You can partition by columns while keeping the other columns using transform:

df['sum'] = df.groupby([1, 2, 4])[5].transform('sum')

This will simply add a column that has the aggregation at the grouped level for all rows in the original dataframe.
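A runnable sketch of this (the frame with integer column labels is assumed, matching the answer's notation):

```python
import pandas as pd

# Hypothetical frame whose column labels are integers, as in the answer
df = pd.DataFrame({
    1: ['a', 'a', 'b'],
    2: ['x', 'x', 'y'],
    4: ['p', 'p', 'q'],
    5: [10, 20, 5],
})

# transform broadcasts each group's sum back onto every original row,
# so no rows are collapsed and all other columns are kept as-is
df['sum'] = df.groupby([1, 2, 4])[5].transform('sum')
```

This is the right tool when you want the aggregate alongside the detail rows rather than one row per group.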


