pandas groupby without turning grouped by column into index
df.groupby(['col2','col3'], as_index=False).sum()
DataFrame 'groupby' is fixing group columns with index
Try -
df = df.groupby(['col1', 'col2', 'col3'], as_index = False).sum()
#or
df = df.groupby(['col1', 'col2', 'col3']).sum().reset_index()
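As a rough check that the two forms above give the same flat result, here is a minimal sketch. The sample data and the val column are made up for illustration; only the col1, col2 and col3 names come from the snippet above.

```python
import pandas as pd

# Hypothetical sample data; only col1/col2/col3 come from the answer above.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b"],
    "col2": ["x", "x", "y", "y"],
    "col3": [1, 1, 2, 2],
    "val": [10, 20, 30, 40],
})
keys = ["col1", "col2", "col3"]

# Default behaviour: the group keys become a MultiIndex.
indexed = df.groupby(keys).sum()

# Two ways to keep the group keys as ordinary columns.
flat_a = df.groupby(keys, as_index=False).sum()
flat_b = df.groupby(keys).sum().reset_index()

print(indexed.index.names)    # the three keys live in the index
print(flat_a)                 # keys are regular columns, default integer index
print(flat_a.equals(flat_b))  # both approaches give the same frame
```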
pandas: groupby and aggregate without losing the column which was grouped
If you don't want the group keys as the index, groupby has an argument for that, which avoids a later reset_index:
df.groupby('Id', as_index=False).agg(lambda x: set(x))
groupby with multi level column index python
I would recommend restructuring d1 a bit first...
d1 = d1.set_index([('id','-'),('group','-')]).stack([0,1]).reset_index()
d1.columns = ['id','group','level_1','level_2','category']
id group level_1 level_2 category
0 i1 a g1 1 dog
1 i1 a g1 2 mouse
2 i1 a g2 1 cat
3 i1 a g2 2 mouse
4 i2 a g1 1 cat
5 i2 a g1 2 mouse
6 i2 a g2 1 dog
7 i2 a g2 2 dog
8 i3 a g1 1 dog
9 i3 a g1 2 dog
10 i3 a g2 1 cat
11 i3 a g2 2 dog
12 i4 b g1 1 cat
13 i4 b g1 2 dog
14 i4 b g2 1 dog
15 i4 b g2 2 cat
...and then using either pivot_table or groupby (result is the same)...
# pivot_table
d2 = pd.pivot_table(d1, index=['group', 'category'], columns=['level_1','level_2'], aggfunc='count', fill_value=0).droplevel(0, axis=1).rename_axis([None,None], axis=1)
# groupby
d2 = d1.groupby(['group','category','level_1','level_2'])['id'].count().unstack(['level_1','level_2'], fill_value=0).rename_axis([None,None], axis=1).sort_index(axis=1)
                g1    g2
                1  2  1  2
group category
a     cat       1  0  2  0
      dog       2  1  1  2
      mouse     0  2  0  1
b     cat       1  0  0  1
      dog       0  1  1  0
Pandas: assign an index to each group identified by groupby
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of the original index), but this could easily be changed to 0,1,2,3 with an additional reset_index(drop=True).
Update: Newer versions of pandas (0.20.2) offer a simpler way to do this with the ngroup method, as noted in a comment to the question above by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach, but ngroup seems like the better way to do this in most cases.
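For reference, a minimal sketch of the ngroup approach mentioned in the update. The data is reconstructed from the output shown above, and the group_id column name is only illustrative.

```python
import pandas as pd

# Data reconstructed from the printed output above (columns a and b).
df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2], "b": [1, 1, 2, 1, 1, 2]})
group_vars = ["a", "b"]

# ngroup() numbers each group 0, 1, 2, ... in sorted group-key order
# and returns a Series aligned with the original rows.
df["group_id"] = df.groupby(group_vars).ngroup()
print(df)
```

The identifiers come out consecutive without any extra reset_index step.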
How to keep original index of a DataFrame after groupby 2 columns?
I think you are looking for transform in this situation:
df['count'] = df.groupby(['col1', 'col2'])['col3'].transform('count')
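A small sketch of why transform fits here: it returns one value per original row, so the original index is preserved. The data and the non-default index labels below are hypothetical.

```python
import pandas as pd

# Hypothetical data with a non-default index, to show that it survives.
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": ["x", "x", "y", "y", "z"],
        "col3": [1, 2, 3, 4, 5],
    },
    index=[10, 11, 12, 13, 14],
)

# transform broadcasts each group's count back onto that group's rows.
df["count"] = df.groupby(["col1", "col2"])["col3"].transform("count")
print(df)
```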
Problem with a column in my groupby new object
Your question is similar to this question: pandas groupby without turning grouped by column into index
When you group by a column, the column you group by ceases to be a column, and is instead the index of the resulting operation. The index is not a column, it is an index. If you set as_index=False, pandas keeps the column over which you are grouping as a column, instead of moving it to the index.
The second problem is that the .agg() function is also aggregating occ over trip_departure_date, and moving trip_departure_date to an index. You don't need this second function to get the mean of occ grouped by trip_departure_date.
import pandas as pd
df1 = pd.read_csv("trip_departures.txt")
df1_agg = df1.groupby(['trip_departure_date'],as_index=False).mean()
Or if you only want to aggregate the occ column:
df1_agg = df1.groupby(['trip_departure_date'],as_index=False)['occ'].mean()
df1_agg.plot(x = 'trip_departure_date', y = 'occ', figsize = (8,5), color = 'purple')
Pandas: How to remove the index column after groupby and unstack?
Here is an alternative to your solution; the key is just to add rename_axis(columns=None), since date is the name of the columns axis:
(example[["date", "customer", "sales"]]
.groupby(["date", "customer"])
.sum()
.unstack("date")
.droplevel(0, axis="columns")
.rename_axis(columns=None)
.reset_index())
customer 2016-12 2017-01 2017-02
0 123 10.5 6.8 29.5
1 456 25.2 23.4 33.9
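To see what rename_axis(columns=None) changes, here is a minimal sketch. The example frame is an assumption: it has one row per date and customer, chosen so the sums match the output above; the real data in the question may differ.

```python
import pandas as pd

# Hypothetical input, chosen to reproduce the output shown above.
example = pd.DataFrame({
    "date": ["2016-12", "2016-12", "2017-01", "2017-01", "2017-02", "2017-02"],
    "customer": [123, 456, 123, 456, 123, 456],
    "sales": [10.5, 25.2, 6.8, 23.4, 29.5, 33.9],
})

wide = (example[["date", "customer", "sales"]]
        .groupby(["date", "customer"])
        .sum()
        .unstack("date")
        .droplevel(0, axis="columns"))

# After unstack, the columns axis carries the name "date"; that name is
# what shows up as a stray label above the index in the printed frame.
print(wide.columns.name)

# Clearing the axis name and resetting the index gives a plain flat frame.
flat = wide.rename_axis(columns=None).reset_index()
print(flat.columns.name)
print(flat)
```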
How to create a groupby dataframe without a multi-level index
a is a DataFrame, but with a 2-level index, so my interpretation is that you want a DataFrame without a multi-level index.
- The index can't be reset when the name in the index and the column are the same.
- Use pandas.Series.reset_index, and set name='normalized_bin', to rename the bin column.
  - This would not work with the implementation in the OP, because that is a DataFrame.
  - This works with the following implementation, because a pandas.Series is created with .groupby.
- The correct way to normalize the column is to use the normalize=True parameter in .value_counts.
import pandas as pd
import random # for test data
import numpy as np # for test data
# setup a dataframe with test data
np.random.seed(365)
random.seed(365)
rows = 1100
data = {'bin': np.random.randint(10, size=(rows)),
'season': [random.choice(['fall', 'winter', 'summer', 'spring']) for _ in range(rows)]}
df = pd.DataFrame(data)
# display(df.head())
bin season
0 2 summer
1 4 winter
2 1 summer
3 5 winter
4 2 spring
# groupby, normalize and reset the Series index
a = df.groupby(['season'])['bin'].value_counts(normalize=True).reset_index(name='normalized_bin')
# display(a.head(15))
season bin normalized_bin
0 fall 2 0.15600
1 fall 9 0.11600
2 fall 3 0.10800
3 fall 4 0.10400
4 fall 6 0.10000
5 fall 0 0.09600
6 fall 8 0.09600
7 fall 5 0.08400
8 fall 7 0.08000
9 fall 1 0.06000
10 spring 0 0.11524
11 spring 8 0.11524
12 spring 9 0.11524
13 spring 3 0.11152
14 spring 1 0.10037
Using the OP code for a
- As already noted above, use normalize=True to get normalized values.
- The solution in the OP creates a DataFrame, because the .groupby is wrapped with the DataFrame constructor, pandas.DataFrame.
  - To reset the index, you must first pandas.DataFrame.rename the bin column, and then use pandas.DataFrame.reset_index.
a = pd.DataFrame(df.groupby('season')['bin'].value_counts()/df.groupby('season')['bin'].count()).rename(columns={'bin': 'normalized_bin'}).reset_index()
Other Resources
- See Pandas unable to reset index because name exist to reset by a level.
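The name clash described in this answer can be reproduced with a hand-built Series. This is only a sketch: the Series below is constructed explicitly so that its name and one index level are both called bin, because the name that groupby(...).value_counts() gives its result varies between pandas versions. The values are made up.

```python
import pandas as pd

# Hypothetical Series whose name equals one of its index level names.
idx = pd.MultiIndex.from_tuples(
    [("fall", 2), ("fall", 9), ("spring", 0)], names=["season", "bin"]
)
s = pd.Series([0.6, 0.4, 1.0], index=idx, name="bin")

# A plain reset_index() tries to insert a second "bin" column and fails.
try:
    s.reset_index()
    clash = False
except ValueError as err:
    clash = True
    print(err)

# Naming the value column during the reset avoids the collision.
fixed = s.reset_index(name="normalized_bin")
print(fixed)
```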
Plotting
- It is easier to plot from the multi-index Series, by using pandas.Series.unstack(), and then use pandas.DataFrame.plot.bar.
- For side-by-side bars, set stacked=False.
- The bars are all equal to 1, because this is normalized data.
import matplotlib.pyplot as plt  # needed for plt.legend below

s = df.groupby(['season'])['bin'].value_counts(normalize=True).unstack()
# plot a stacked bar
s.plot.bar(stacked=True, figsize=(8, 6))
plt.legend(title='bin', bbox_to_anchor=(1.05, 1), loc='upper left')