Pandas Groupby Without Turning Grouped by Column into Index

pandas groupby without turning grouped by column into index

df.groupby(['col2','col3'], as_index=False).sum()

DataFrame 'groupby' is fixing group columns with index

Try either of the following:

df = df.groupby(['col1', 'col2', 'col3'], as_index=False).sum()
# or
df = df.groupby(['col1', 'col2', 'col3']).sum().reset_index()
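
As a quick check that the two forms above are interchangeable, here is a minimal sketch on a made-up frame. Only the column names col1, col2 and col3 come from the answer; the val column and all of the data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data; only the names col1-col3 come from the answer above.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": ["x", "x", "y", "y"],
    "col3": [1, 1, 2, 3],
    "val": [10, 20, 30, 40],
})

keys = ["col1", "col2", "col3"]

# Option 1: keep the group keys as ordinary columns from the start.
flat = df.groupby(keys, as_index=False).sum()

# Option 2: let the keys become a MultiIndex, then move them back to columns.
via_reset = df.groupby(keys).sum().reset_index()

print(flat)
print(flat.equals(via_reset))
```

Both results carry the group keys as regular columns on a default integer index, so either can be passed straight to code that expects plain columns.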

pandas: groupby and aggregate without losing the column which was grouped

If you don't want the group keys to become the index, pass as_index=False; it saves a separate reset_index call:

df.groupby('Id', as_index=False).agg(lambda x: set(x))
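
To see what that call returns, here is a small sketch. The Id column name comes from the answer; the tag column and the values are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data; only the Id column name comes from the answer above.
df = pd.DataFrame({"Id": [1, 1, 2], "tag": ["x", "y", "x"]})

# as_index=False keeps Id as a regular column in the result,
# while each remaining column is collapsed to a set per group.
out = df.groupby("Id", as_index=False).agg(lambda x: set(x))

print(out.columns.tolist())
print(out)
```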

groupby with multi level column index python

I would recommend restructuring d1 a bit first...

d1 = d1.set_index([('id','-'),('group','-')]).stack([0,1]).reset_index()
d1.columns = ['id','group','level_1','level_2','category']

    id  group  level_1  level_2  category
0   i1      a       g1        1       dog
1   i1      a       g1        2     mouse
2   i1      a       g2        1       cat
3   i1      a       g2        2     mouse
4   i2      a       g1        1       cat
5   i2      a       g1        2     mouse
6   i2      a       g2        1       dog
7   i2      a       g2        2       dog
8   i3      a       g1        1       dog
9   i3      a       g1        2       dog
10  i3      a       g2        1       cat
11  i3      a       g2        2       dog
12  i4      b       g1        1       cat
13  i4      b       g1        2       dog
14  i4      b       g2        1       dog
15  i4      b       g2        2       cat

...and then using either pivot_table or groupby (result is the same)...

# pivot_table
d2 = pd.pivot_table(d1, index=['group', 'category'], columns=['level_1','level_2'], aggfunc='count', fill_value=0).droplevel(0, axis=1).rename_axis([None,None], axis=1)

# groupby
d2 = d1.groupby(['group','category','level_1','level_2'])['id'].count().unstack(['level_1','level_2'], fill_value=0).rename_axis([None,None], axis=1).sort_index(axis=1)

                g1     g2
                 1  2   1  2
group category
a     cat        1  0   2  0
      dog        2  1   1  2
      mouse      0  2   0  1
b     cat        1  0   0  1
      dog        0  1   1  0

Pandas: assign an index to each group identified by groupby

Here's a concise way using drop_duplicates and merge to get a unique identifier.

group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )

   a  b  index
0  1  1      0
1  1  1      0
2  1  2      2
3  2  1      3
4  2  1      3
5  2  2      5

The identifier here runs 0, 2, 3, 5 (a leftover of the original index), but it can easily be changed to 0, 1, 2, 3 with an additional reset_index(drop=True).

Update: pandas 0.20.2 and later offer a simpler way to do this with the ngroup method, as noted in a comment on the question by @Constantino and in a subsequent answer by @CalumYou. I'll leave this here as an alternative approach, but ngroup seems like the better choice in most cases.
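
For comparison, here is a minimal sketch of both routes to consecutive identifiers. The frame is an assumption chosen to match the a and b values in the output above, and the group_id column name is hypothetical.

```python
import pandas as pd

# Hypothetical frame matching the a/b values shown in the output above.
df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": [1, 1, 2, 1, 1, 2]})
group_vars = ["a", "b"]

# merge approach: renumber the unique key rows 0..n-1 before merging back
keys = df[group_vars].drop_duplicates().reset_index(drop=True).reset_index()
merged = df.merge(keys, on=group_vars)

# ngroup approach: label every row with the number of its group
df["group_id"] = df.groupby(group_vars).ngroup()

print(merged)
print(df)
```

One difference to keep in mind: drop_duplicates numbers groups in order of first appearance, while ngroup numbers them in sorted key order by default. The two agree here only because the sample keys already appear in sorted order.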

How to keep original index of a DataFrame after groupby 2 columns?

I think you are looking for transform in this situation:

df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
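
A short sketch of why transform fits here: it returns one value per original row, so the original index is untouched. The column names follow the answer; the data and the custom index are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with a non-default index to show that it survives.
df = pd.DataFrame(
    {"col1": ["a", "a", "b"], "col2": ["x", "x", "y"], "col3": [1.0, 2.0, 3.0]},
    index=[10, 11, 12],
)

# transform broadcasts each group's count back onto that group's rows,
# aligned on the original index rather than on the group keys.
df["count"] = df.groupby(["col1", "col2"])["col3"].transform("count")

print(df)
```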

Problem with a column in my groupby new object

Your question is similar to this question: pandas groupby without turning grouped by column into index

When you group by a column, the column you group by ceases to be a column, and is instead the index of the resulting operation. The index is not a column, it is an index. If you set as_index=False, pandas keeps the column over which you are grouping as a column, instead of moving it to the index.

The second problem is that the .agg() call also aggregates occ over trip_departure_date and moves trip_departure_date into the index. You don't need this second call to get the mean of occ grouped by trip_departure_date.

import pandas as pd

df1 = pd.read_csv("trip_departures.txt")

df1_agg = df1.groupby(['trip_departure_date'],as_index=False).mean()

Or if you only want to aggregate the occ column:

df1_agg = df1.groupby(['trip_departure_date'],as_index=False)['occ'].mean()
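
The difference between the default behaviour and as_index=False can be seen on a small stand-in frame. The contents of trip_departures.txt are not shown, so the data below is an assumption; only the column names trip_departure_date and occ come from the answer.

```python
import pandas as pd

# Hypothetical stand-in for trip_departures.txt; the real file is not shown.
df1 = pd.DataFrame({
    "trip_departure_date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "occ": [0.5, 0.7, 0.9],
})

# Default: the group key moves into the index, so it is not a column.
indexed = df1.groupby("trip_departure_date")["occ"].mean()

# as_index=False: the group key stays a column, ready to be used as x in a plot.
flat = df1.groupby(["trip_departure_date"], as_index=False)["occ"].mean()

print(indexed.index.name)
print(flat.columns.tolist())
```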

df1_agg.plot(x = 'trip_departure_date', y = 'occ', figsize = (8,5), color = 'purple')

Pandas: How to remove the index column after groupby and unstack?

This is an alternative to your solution; the key step is adding rename_axis(columns=None), because date is the name of the columns axis:

(example[["date", "customer", "sales"]]
    .groupby(["date", "customer"])
    .sum()
    .unstack("date")
    .droplevel(0, axis="columns")
    .rename_axis(columns=None)
    .reset_index())

   customer  2016-12  2017-01  2017-02
0       123     10.5      6.8     29.5
1       456     25.2     23.4     33.9
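
To make the role of rename_axis visible, the sketch below inspects the name of the columns axis before and after the call. The example frame is an assumption built for illustration, since the original data is not shown; only the column names follow the answer.

```python
import pandas as pd

# Hypothetical data; column names follow the answer above.
example = pd.DataFrame({
    "date": ["2016-12", "2017-01", "2016-12", "2017-01"],
    "customer": [123, 123, 456, 456],
    "sales": [10.5, 6.8, 25.2, 23.4],
})

wide = (
    example[["date", "customer", "sales"]]
    .groupby(["date", "customer"])
    .sum()
    .unstack("date")
    .droplevel(0, axis="columns")
)

# After unstack, the columns axis still carries the name "date".
print(wide.columns.name)

# Clearing that name removes the stray label from the printed header.
flat = wide.rename_axis(columns=None).reset_index()
print(flat.columns.name)
print(flat.columns.tolist())
```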

How to create a groupby dataframe without a multi-level index

  • a is a DataFrame, but with a 2-level index, so my interpretation is you want a dataframe without a multi-level index.
    • The index can't be reset when the name in the index and the column are the same.
    • Use pandas.Series.reset_index, and set name='normalized_bin', to rename the bin column.
      • This would not work with the implementation in the OP, because that is a dataframe.
      • This works with the following implementation, because a pandas.Series is created with .groupby.
  • The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random # for test data
import numpy as np # for test data

# setup a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=(rows)),
'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)

# display(df.head())
   bin  season
0    2  summer
1    4  winter
2    1  summer
3    5  winter
4    2  spring

# groupby, normalize and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')

# display(a.head(15))
    season  bin  normalized_bin
0     fall    2         0.15600
1     fall    9         0.11600
2     fall    3         0.10800
3     fall    4         0.10400
4     fall    6         0.10000
5     fall    0         0.09600
6     fall    8         0.09600
7     fall    5         0.08400
8     fall    7         0.08000
9     fall    1         0.06000
10  spring    0         0.11524
11  spring    8         0.11524
12  spring    9         0.11524
13  spring    3         0.11152
14  spring    1         0.10037

Using the OP code for a

  • As already noted above, use normalize=True to get normalized values
  • The solution in the OP creates a DataFrame, because the .groupby is wrapped in the DataFrame constructor, pandas.DataFrame.
    • To reset the index, you must first rename the bin column with pandas.DataFrame.rename, and then use pandas.DataFrame.reset_index.
a = pd.DataFrame(df.groupby('season')['bin'].value_counts()/df.groupby('season')['bin'].count()).rename(columns={'bin': 'normalized_bin'}).reset_index()
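
As a sanity check that normalize=True gives the same proportions as dividing the counts by each group's size, here is a minimal sketch on a tiny made-up frame. The data is an assumption, and the explicit div with level="season" is a swapped-in way to spell the division so the alignment is unambiguous; it is not the OP's exact expression.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame using the same column names as above.
df = pd.DataFrame({
    "season": ["fall", "fall", "fall", "spring", "spring"],
    "bin": [1, 1, 2, 3, 4],
})

# Built-in normalization: proportions within each season.
built_in = df.groupby("season")["bin"].value_counts(normalize=True).sort_index()

# Manual normalization: counts divided by each season's row count,
# broadcasting the per-season totals across the "season" index level.
counts = df.groupby("season")["bin"].value_counts().sort_index()
totals = df.groupby("season")["bin"].count()
manual = counts.div(totals, level="season").sort_index()

print(np.allclose(built_in.to_numpy(), manual.to_numpy()))
```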

Other Resources

  • See Pandas unable to reset index because name exist to reset by a level.

Plotting

  • It is easier to plot from the multi-index Series by calling pandas.Series.unstack() and then pandas.DataFrame.plot.bar.
  • For side-by-side bars, set stacked=False.
  • Each stacked bar sums to 1, because this is normalized data.
import matplotlib.pyplot as plt  # needed for plt.legend below

s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()

# plot a stacked bar
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')
