Difference between groupby and pivot_table
Both pivot_table and groupby are used to aggregate your dataframe. The only difference is the shape of the result.
Using pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum), a table is created where a is on the row axis, b is on the column axis, and the values are the sums of c.
Example:
df = pd.DataFrame({"a": [1,2,3,1,2,3], "b":[1,1,1,2,2,2], "c":np.random.rand(6)})
pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)
b         1         2
a
1  0.528470  0.484766
2  0.187277  0.144326
3  0.866832  0.650100
Using groupby, the grouping keys are placed into the index, and a row is created for each unique combination of those keys. In this example, we create a series of the sums of values c, grouped by all unique combinations of a and b.
df.groupby(['a','b'])['c'].sum()
a  b
1  1    0.528470
   2    0.484766
2  1    0.187277
   2    0.144326
3  1    0.866832
   2    0.650100
Name: c, dtype: float64
A similar usage of groupby arises if we omit the ['c']. In that case, it creates a dataframe (not a series) of the sums of all remaining columns, grouped by the unique combinations of a and b.
df.groupby(["a","b"]).sum()
            c
a b
1 1  0.528470
  2  0.484766
2 1  0.187277
  2  0.144326
3 1  0.866832
  2  0.650100
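The two shapes are interconvertible: unstacking the groupby result reproduces the pivot table. A minimal sketch (with a seeded random c so the run is reproducible):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible random values
df = pd.DataFrame({"a": [1, 2, 3, 1, 2, 3],
                   "b": [1, 1, 1, 2, 2, 2],
                   "c": np.random.rand(6)})

# groupby result: a Series with a MultiIndex (a, b)
s = df.groupby(["a", "b"])["c"].sum()

# unstacking 'b' moves it to the column axis -- the pivot_table shape
wide = s.unstack("b")

# the same table produced directly by pivot_table
pt = pd.pivot_table(df, index="a", columns="b", values="c", aggfunc="sum")

print(wide.equals(pt))  # True
```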
Pivot table based on groupby in Pandas
You may be looking for crosstab:
>>> pd.crosstab([df.customer_id,df.date], df.category)
category computer drinks food toys
customer_id date
1 2017-2-1 0 1 0 1
2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id,df.date],
df.category).reset_index(level=1)
category date computer drinks food toys
customer_id
1 2017-2-1 0 1 0 1
1 2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id, df.date],
df.category).reset_index(level=1, drop=True)
category computer drinks food toys
customer_id
1 0 1 0 1
1 0 0 1 0
2 0 0 1 1
3 1 0 0 0
>>>
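crosstab is not limited to counts; given a values/aggfunc pair it aggregates like pivot_table. A sketch, assuming an input frame that reproduces the counts above (the question's frame is not shown, and the amount column is invented for illustration):

```python
import pandas as pd

# Reconstructed input -- an assumption that reproduces the counts above
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "date": ["2017-2-1", "2017-2-1", "2017-3-1", "2017-2-1", "2017-2-1", "2017-2-2"],
    "category": ["drinks", "toys", "food", "food", "toys", "computer"],
})

# pure co-occurrence counts, as in the answer above
counts = pd.crosstab([df.customer_id, df.date], df.category)

# crosstab can also aggregate a value column, much like pivot_table
df["amount"] = [5.0, 2.0, 3.0, 4.0, 7.0, 10.0]  # hypothetical values
totals = pd.crosstab(df.customer_id, df.category,
                     values=df.amount, aggfunc="sum")
print(counts)
print(totals)
```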
How to group by column in a dataframe and create pivot tables in a loop
Use:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2], 'CATEG': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'B'], 'LEVEL': [3.0, 3.0, 2.0, 1.0, 1.0, 2.0, np.nan, np.nan, np.nan, 3.0, 2.0, 3.0, 2.0, 1.0, 1.0, np.nan, 2.0, 1.0, 1.0, 2.0, 3.0, np.nan, np.nan, 3.0, np.nan], 'COLS': ['Apple', 'Orange', 'Orange', 'Orange', 'Apple', 'Apple', 'Berry', 'Car', 'Berry', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Apple', 'Car', 'Orange', 'Orange', 'Apple', 'Apple', 'Apple', 'Berry', 'Car', 'Orange', 'Car'], 'VALUE': [388, 204, 322, 716, 282, 555, 289, 316, 297, 756, 460, 497, 831, 225, 395, 486, 320, 208, 464, 613, 369, 474, 888, 345, 664], 'COMMENT': ['comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment2']})
# check missing values
mask = df['LEVEL'].isna()
#split DataFrames for different processing
df1 = df[~mask]
df2 = df[mask]
# pivoting with different columns parameters
df1 = df1.pivot_table(index=['ID','COMMENT','CATEG'],
                      columns=['COLS','LEVEL'],
                      values='VALUE')
# print (df1)
df2 = df2.pivot_table(index=['ID','COMMENT','CATEG'], columns='COLS',values='VALUE')
# print (df2)
with pd.ExcelWriter('Values.xlsx') as writer:
    # group by the first 2 levels ID, COMMENT
    for (ids, comments), sample_df in df1.groupby(['ID','COMMENT']):
        # drop the COMMENT index level, then remove all-NaN columns
        df = sample_df.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
        # new sheet names via f-strings
        name = f'{ids}_{comments}'
        # write to file
        df.to_excel(writer, sheet_name=name)
    for (ids, comments), sample_df in df2.groupby(['ID','COMMENT']):
        df = sample_df.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
        name = f'{ids}_NULL_{comments}'
        df.to_excel(writer, sheet_name=name)
Another solution without repeating code:
mask = df['LEVEL'].isna()
dfs = {'no_null': df[~mask], 'null': df[mask]}
with pd.ExcelWriter('Values.xlsx') as writer:
    for k, v in dfs.items():
        if k == 'no_null':
            add = ''
            cols = ['COLS','LEVEL']
        else:
            add = 'NULL_'
            cols = 'COLS'
        df = v.pivot_table(index=['ID','COMMENT','CATEG'], columns=cols, values='VALUE')
        for (ids, comments), sample_df in df.groupby(['ID','COMMENT']):
            df = sample_df.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
            name = f'{ids}_{add}{comments}'
            df.to_excel(writer, sheet_name=name)
Pivot Tables or Group By for Pandas?
Here are a couple of ways to reshape your data df
In [27]: df
Out[27]:
Col X Col Y
0 class 1 cat 1
1 class 2 cat 1
2 class 3 cat 2
3 class 2 cat 3
1) Using pd.crosstab()
In [28]: pd.crosstab(df['Col X'], df['Col Y'])
Out[28]:
Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0
2) Or, use groupby on ['Col X','Col Y'] with unstack over Col Y, then fill NaNs with zeros.
In [29]: df.groupby(['Col X','Col Y']).size().unstack('Col Y', fill_value=0)
Out[29]:
Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0
3) Or, use pd.pivot_table() with index='Col X' and columns='Col Y'
In [30]: pd.pivot_table(df, index=['Col X'], columns=['Col Y'], aggfunc=len, fill_value=0)
Out[30]:
Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0
4) Or, use set_index with unstack
In [492]: df.assign(v=1).set_index(['Col X', 'Col Y'])['v'].unstack(fill_value=0)
Out[492]:
Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0
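For reference, all four reshapes produce the same table, which can be checked directly. One caveat: recent pandas versions want an explicit value column for pivot_table, so this sketch counts a helper column instead of passing aggfunc=len:

```python
import pandas as pd

df = pd.DataFrame({"Col X": ["class 1", "class 2", "class 3", "class 2"],
                   "Col Y": ["cat 1", "cat 1", "cat 2", "cat 3"]})

t1 = pd.crosstab(df["Col X"], df["Col Y"])
t2 = df.groupby(["Col X", "Col Y"]).size().unstack("Col Y", fill_value=0)
# helper column v is counted, avoiding aggfunc=len on an empty values set
t3 = pd.pivot_table(df.assign(v=1), index="Col X", columns="Col Y",
                    values="v", aggfunc="count", fill_value=0)
t4 = df.assign(v=1).set_index(["Col X", "Col Y"])["v"].unstack(fill_value=0)

print(t1)
```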
Grouping pivot table dates by week in pandas
You can use groupby with DataFrameGroupBy.resample; the output is:
f = lambda x: x.nunique()
df = df.set_index('date').groupby(['category','subcategory'])['order_id'].resample('W').agg(f).unstack()
print (df)
date 2021-05-09 2021-05-16 2021-05-23 2021-05-30 \
category subcategory
A aa 1.0 0.0 0.0 0.0
dd NaN NaN NaN NaN
ff NaN NaN NaN NaN
B aa 1.0 NaN NaN NaN
C cc NaN NaN NaN NaN
D aa NaN NaN NaN NaN
date 2021-06-06 2021-06-13 2021-06-20 2021-06-27 \
category subcategory
A aa 0.0 0.0 0.0 0.0
dd 1.0 NaN NaN NaN
ff NaN NaN NaN NaN
B aa NaN NaN NaN NaN
C cc NaN NaN NaN NaN
D aa NaN NaN NaN NaN
date 2021-07-04 2021-07-11 2021-10-10
category subcategory
A aa 0.0 1.0 NaN
dd NaN NaN NaN
ff NaN 1.0 NaN
B aa NaN NaN NaN
C cc NaN 1.0 NaN
D aa NaN NaN 1.0
Or, if you use Grouper, the output is:
f = lambda x: x.nunique()
df = df.groupby(['category','subcategory', pd.Grouper(freq='W', key='date')])['order_id'].agg(f).unstack()
print (df)
date 2021-05-09 2021-07-11 2021-06-06 2021-10-10
category subcategory
A aa 1.0 1.0 NaN NaN
dd NaN NaN 1.0 NaN
ff NaN 1.0 NaN NaN
B aa 1.0 NaN NaN NaN
C cc NaN 1.0 NaN NaN
D aa NaN NaN NaN 1.0
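The difference between the two outputs above is which weekly bins appear: resample('W') emits every week between the first and last date per group, while pd.Grouper keeps only the weeks that actually occur. A small sketch of the Grouper behaviour with invented data (the question's frame is not shown):

```python
import pandas as pd

# Hypothetical input, since the original frame is not shown in the question
df = pd.DataFrame({
    "category": ["A", "A", "B"],
    "subcategory": ["aa", "aa", "aa"],
    "order_id": [101, 102, 103],
    "date": pd.to_datetime(["2021-05-05", "2021-07-06", "2021-05-04"]),
})

# pd.Grouper bins dates into week-ending-Sunday buckets, keeping only
# the weeks that actually occur in each group
out = (df.groupby(["category", "subcategory",
                   pd.Grouper(freq="W", key="date")])["order_id"]
         .nunique()
         .unstack())
print(out)
```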
pandas pivot table for multiple groups
Perhaps a pivot table is not the right way to solve this issue.
A minimal solution could look like the code below, iterating over all the cohorts.
Is there a possibility for a more efficient solution? My input file is 120 GB as an uncompressed CSV; compressed via gzip, 3 GB remain, which translates to about 35 GB of memory requirements for pandas.
%pylab inline
import seaborn as sns
sns.set(color_codes=True)
import pandas as pd
import numpy as np
df = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   "is_sick": [0, 0, 0, 0, 0, 0, 1, 1, 1],
                   "sex": ["male", "female", "male", "female", "female",
                           "male", "male", "female", "female"],
                   "age_group": ["young", "old", "old", "young", "small",
                                 "old", "young", "young", "old"],
                   "metric_1": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
df['date'] = '2019-01-01'
df['qcut_metric_1'] = pd.qcut(df.metric_1, [0, .25, .5, .66, .75, .97, 1])
# make some more data
df_2 = df.copy()
df_2['date'] = '2019-02-01'
df = pd.concat([df, df_2])
cohorts = [['sex', 'age_group'], ['sex'], ['age_group']]
for cohort in cohorts:
    cohort_name = '_'.join(cohort)
    # print(cohort_name)
    agg = df.groupby(cohort).agg({'user_id': pd.Series.nunique, 'is_sick': pd.Series.mean})
    sick_percentage_column = f'sick_percentage__{cohort_name}'
    agg.columns = ['unique_users', sick_percentage_column]
    merged = df.merge(agg, on=cohort)  # inner (default) join ok, as agg is derived from the total => no values lost
    groupings = ['qcut_metric_1']
    groupings.extend(cohort)
    result = (merged.groupby(groupings)
                    .apply(lambda x: np.average(x[sick_percentage_column], weights=x['unique_users']))
                    .reset_index()
                    .rename({0: sick_percentage_column}, axis=1))
    display(result)
Pivot table based on the first value of the group in Pandas
You can use pivot and then dropna for each column:
>>> df.pivot(columns='col1', values='col2').apply(lambda x: x.dropna().tolist()).astype(int)
col1 a b c
0 1 2 9
1 4 5 0
2 6 8 7
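An alternative that avoids the dropna trick is to number the rows inside each group with cumcount and pivot on that counter; unlike the approach above, it also copes with unequal group sizes. A sketch, with the input reconstructed from the output shown:

```python
import pandas as pd

# Reconstructed input matching the table above
df = pd.DataFrame({"col1": list("abcabcabc"),
                   "col2": [1, 2, 9, 4, 5, 0, 6, 8, 7]})

# Number the rows within each group, then pivot on that counter;
# with unequal group sizes the missing spots simply become NaN
out = (df.assign(n=df.groupby("col1").cumcount())
         .pivot(index="n", columns="col1", values="col2"))
print(out)
```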
(Pandas) How to repeat item labels in pandas group by function?
This can be solved by resetting the index:
group_by_meta = df.groupby(['meta_weight','meta_weight_type', 'weight_conversion'])['Sales_Unit'].sum()
print(group_by_meta)
group_by_meta = group_by_meta.reset_index()
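The same effect can be had in one step by passing as_index=False to groupby, which keeps the group keys as ordinary columns. A sketch with a toy frame (the original data is not shown):

```python
import pandas as pd

# Toy frame standing in for the original data, which isn't shown
df = pd.DataFrame({"meta_weight": [1, 1, 2],
                   "meta_weight_type": ["kg", "kg", "kg"],
                   "weight_conversion": [1.0, 1.0, 2.2],
                   "Sales_Unit": [10, 5, 3]})

# as_index=False keeps the group keys as regular columns,
# so the separate reset_index() call is no longer needed
out = df.groupby(["meta_weight", "meta_weight_type", "weight_conversion"],
                 as_index=False)["Sales_Unit"].sum()
print(out)
```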