Difference between groupby and pivot_table
Both pivot_table and groupby are used to aggregate your dataframe. The only difference is the shape of the result.
Using pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum), a table is created where a is on the row axis, b is on the column axis, and the values are the sums of c.
Example:
df = pd.DataFrame({"a": [1,2,3,1,2,3], "b":[1,1,1,2,2,2], "c":np.random.rand(6)})
pd.pivot_table(df, index=["a"], columns=["b"], values=["c"], aggfunc=np.sum)
b         1         2
a
1  0.528470  0.484766
2  0.187277  0.144326
3  0.866832  0.650100
Using groupby, the grouping keys are placed into the index, and a row is created for each unique combination of those keys. In this example, we create a series of the sums of values c, grouped by all unique combinations of a and b.
df.groupby(['a','b'])['c'].sum()
a  b
1  1    0.528470
   2    0.484766
2  1    0.187277
   2    0.144326
3  1    0.866832
   2    0.650100
Name: c, dtype: float64
A similar usage of groupby arises if we omit the ['c']. In that case, it creates a dataframe (not a series) of the sums of all remaining columns, grouped by the unique combinations of a and b.
df.groupby(["a","b"]).sum()
            c
a b
1 1  0.528470
  2  0.484766
2 1  0.187277
  2  0.144326
3 1  0.866832
  2  0.650100
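The two shapes are interconvertible: unstacking the groupby result reproduces the pivot table. A minimal sketch (with a seeded random c so the run is reproducible):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # reproducible random values
df = pd.DataFrame({"a": [1, 2, 3, 1, 2, 3],
                   "b": [1, 1, 1, 2, 2, 2],
                   "c": np.random.rand(6)})

# groupby result: a Series with a MultiIndex (a, b)
s = df.groupby(["a", "b"])["c"].sum()

# unstacking 'b' moves it to the column axis -- the pivot_table shape
wide = s.unstack("b")

# the same table produced directly by pivot_table
pt = pd.pivot_table(df, index="a", columns="b", values="c", aggfunc="sum")

print(wide.equals(pt))  # True
```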
Pivot table based on groupby in Pandas
You may be looking for crosstab:
>>> pd.crosstab([df.customer_id,df.date], df.category)
category computer drinks food toys
customer_id date
1 2017-2-1 0 1 0 1
2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id,df.date],
df.category).reset_index(level=1)
category date computer drinks food toys
customer_id
1 2017-2-1 0 1 0 1
1 2017-3-1 0 0 1 0
2 2017-2-1 0 0 1 1
3 2017-2-2 1 0 0 0
>>>
>>> pd.crosstab([df.customer_id, df.date],
df.category).reset_index(level=1, drop=True)
category computer drinks food toys
customer_id
1 0 1 0 1
1 0 0 1 0
2 0 0 1 1
3 1 0 0 0
>>>
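crosstab is not limited to counts; given a values/aggfunc pair it aggregates like pivot_table. A sketch, assuming an input frame that reproduces the counts above (the question's frame is not shown, and the amount column is invented for illustration):

```python
import pandas as pd

# Reconstructed input -- an assumption that reproduces the counts above
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "date": ["2017-2-1", "2017-2-1", "2017-3-1", "2017-2-1", "2017-2-1", "2017-2-2"],
    "category": ["drinks", "toys", "food", "food", "toys", "computer"],
})

# pure co-occurrence counts, as in the answer above
counts = pd.crosstab([df.customer_id, df.date], df.category)

# crosstab can also aggregate a value column, much like pivot_table
df["amount"] = [5.0, 2.0, 3.0, 4.0, 7.0, 10.0]  # hypothetical values
totals = pd.crosstab(df.customer_id, df.category,
                     values=df.amount, aggfunc="sum")
print(counts)
print(totals)
```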
How to group by column in a dataframe and create pivot tables in a loop
Use:
df = pd.DataFrame({'ID': [1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2], 'CATEG': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'B'], 'LEVEL': [3.0, 3.0, 2.0, 1.0, 1.0, 2.0, np.nan, np.nan, np.nan, 3.0, 2.0, 3.0, 2.0, 1.0, 1.0, np.nan, 2.0, 1.0, 1.0, 2.0, 3.0, np.nan, np.nan, 3.0, np.nan], 'COLS': ['Apple', 'Orange', 'Orange', 'Orange', 'Apple', 'Apple', 'Berry', 'Car', 'Berry', 'Apple', 'Apple', 'Orange', 'Orange', 'Orange', 'Apple', 'Car', 'Orange', 'Orange', 'Apple', 'Apple', 'Apple', 'Berry', 'Car', 'Orange', 'Car'], 'VALUE': [388, 204, 322, 716, 282, 555, 289, 316, 297, 756, 460, 497, 831, 225, 395, 486, 320, 208, 464, 613, 369, 474, 888, 345, 664], 'COMMENT': ['comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment1', 'comment2']})
# check missing values
mask = df['LEVEL'].isna()
#split DataFrames for different processing
df1 = df[~mask]
df2 = df[mask]
# pivoting with different columns parameters
df1 = df1.pivot_table(index=['ID','COMMENT','CATEG'],
                      columns=['COLS','LEVEL'],
                      values='VALUE')
# print (df1)
df2 = df2.pivot_table(index=['ID','COMMENT','CATEG'], columns='COLS',values='VALUE')
# print (df2)
with pd.ExcelWriter('Values.xlsx') as writer:
    # group by the first 2 levels ID, COMMENT
    for (ids, comments), sample_df in df1.groupby(['ID','COMMENT']):
        # drop the COMMENT index level, then remove all-NaN columns
        df = sample_df.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
        # new sheet names via f-strings
        name = f'{ids}_{comments}'
        # write to file
        df.to_excel(writer, sheet_name=name)
    for (ids, comments), sample_df in df2.groupby(['ID','COMMENT']):
        df = sample_df.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
        name = f'{ids}_NULL_{comments}'
        df.to_excel(writer, sheet_name=name)
Another solution without repeating code:
mask = df['LEVEL'].isna()
dfs = {'no_null': df[~mask], 'null': df[mask]}
with pd.ExcelWriter('Values.xlsx') as writer:
    for k, v in dfs.items():
        if k == 'no_null':
            add = ''
            cols = ['COLS','LEVEL']
        else:
            add = 'NULL_'
            cols = 'COLS'
        df = v.pivot_table(index=['ID','COMMENT','CATEG'], columns=cols, values='VALUE')
        for (ids, comments), sample_df in df.groupby(['ID','COMMENT']):
            df = sample_df.reset_index(level=[1], drop=True).dropna(how='all', axis=1)
            name = f'{ids}_{add}{comments}'
            df.to_excel(writer, sheet_name=name)
Pivot Tables or Group By for Pandas?
Here are a couple of ways to reshape your data df
In [27]: df
Out[27]:
Col X Col Y
0 class 1 cat 1
1 class 2 cat 1
2 class 3 cat 2
3 class 2 cat 3
1) Using pd.crosstab()
In [28]: pd.crosstab(df['Col X'], df['Col Y'])
Out[28]:
Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0
2) Or, use groupby on ['Col X','Col Y'] with unstack over Col Y, then fill NaNs with zeros.
In [29]: df.groupby(['Col X','Col Y']).size().unstack('Col Y', fill_value=0)
Out[29]:
Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0
3) Or, use pd.pivot_table() with index='Col X' and columns='Col Y'
In [30]: pd.pivot_table(df, index=['Col X'], columns=['Col Y'], aggfunc=len, fill_value=0)
Out[30]:
Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0
4) Or, use set_index with unstack
In [492]: df.assign(v=1).set_index(['Col X', 'Col Y'])['v'].unstack(fill_value=0)
Out[492]:
Col Y cat 1 cat 2 cat 3
Col X
class 1 1 0 0
class 2 1 0 1
class 3 0 1 0
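For reference, all four reshapes produce the same table, which can be checked directly. One caveat: recent pandas versions want an explicit value column for pivot_table, so this sketch counts a helper column instead of passing aggfunc=len:

```python
import pandas as pd

df = pd.DataFrame({"Col X": ["class 1", "class 2", "class 3", "class 2"],
                   "Col Y": ["cat 1", "cat 1", "cat 2", "cat 3"]})

t1 = pd.crosstab(df["Col X"], df["Col Y"])
t2 = df.groupby(["Col X", "Col Y"]).size().unstack("Col Y", fill_value=0)
# helper column v is counted, avoiding aggfunc=len on an empty values set
t3 = pd.pivot_table(df.assign(v=1), index="Col X", columns="Col Y",
                    values="v", aggfunc="count", fill_value=0)
t4 = df.assign(v=1).set_index(["Col X", "Col Y"])["v"].unstack(fill_value=0)

print(t1)
```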
Grouping pivot table dates by week in pandas
You can use groupby with DataFrameGroupBy.resample; the output is:
f = lambda x: x.nunique()
df = df.set_index('date').groupby(['category','subcategory'])['order_id'].resample('W').agg(f).unstack()
print (df)
date 2021-05-09 2021-05-16 2021-05-23 2021-05-30 \
category subcategory
A aa 1.0 0.0 0.0 0.0
dd NaN NaN NaN NaN
ff NaN NaN NaN NaN
B aa 1.0 NaN NaN NaN
C cc NaN NaN NaN NaN
D aa NaN NaN NaN NaN
date 2021-06-06 2021-06-13 2021-06-20 2021-06-27 \
category subcategory
A aa 0.0 0.0 0.0 0.0
dd 1.0 NaN NaN NaN
ff NaN NaN NaN NaN
B aa NaN NaN NaN NaN
C cc NaN NaN NaN NaN
D aa NaN NaN NaN NaN
date 2021-07-04 2021-07-11 2021-10-10
category subcategory
A aa 0.0 1.0 NaN
dd NaN NaN NaN
ff NaN 1.0 NaN
B aa NaN NaN NaN
C cc NaN 1.0 NaN
D aa NaN NaN 1.0
Or, if you use Grouper, the output is:
f = lambda x: x.nunique()
df = df.groupby(['category','subcategory', pd.Grouper(freq='W', key='date')])['order_id'].agg(f).unstack()
print (df)
date 2021-05-09 2021-07-11 2021-06-06 2021-10-10
category subcategory
A aa 1.0 1.0 NaN NaN
dd NaN NaN 1.0 NaN
ff NaN 1.0 NaN NaN
B aa 1.0 NaN NaN NaN
C cc NaN 1.0 NaN NaN
D aa NaN NaN NaN 1.0
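The difference between the two outputs above is which weekly bins appear: resample('W') emits every week between the first and last date per group, while pd.Grouper keeps only the weeks that actually occur. A small sketch of the Grouper behaviour with invented data (the question's frame is not shown):

```python
import pandas as pd

# Hypothetical input, since the original frame is not shown in the question
df = pd.DataFrame({
    "category": ["A", "A", "B"],
    "subcategory": ["aa", "aa", "aa"],
    "order_id": [101, 102, 103],
    "date": pd.to_datetime(["2021-05-05", "2021-07-06", "2021-05-04"]),
})

# pd.Grouper bins dates into week-ending-Sunday buckets, keeping only
# the weeks that actually occur in each group
out = (df.groupby(["category", "subcategory",
                   pd.Grouper(freq="W", key="date")])["order_id"]
         .nunique()
         .unstack())
print(out)
```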
pandas pivot table for multiple groups
Perhaps a pivot table is not the right way to solve this issue.
A minimal solution could look like the code below, iterating over all the cohorts.
Is there a possibility for a more efficient solution? My input file is 120 GB as an uncompressed CSV; compressed via gzip, 3 GB remain, which translates to about 35 GB of memory requirements for pandas.
%pylab inline
import seaborn as sns
sns.set(color_codes=True)
import pandas as pd
import numpy as np
df = pd.DataFrame({"user_id": [1, 2, 3, 4, 5, 6, 7, 8, 9],
                   "is_sick": [0, 0, 0, 0, 0, 0, 1, 1, 1],
                   "sex": ["male", "female", "male", "female", "female",
                           "male", "male", "female", "female"],
                   "age_group": ["young", "old", "old", "young", "small",
                                 "old", "young", "young", "old"],
                   "metric_1": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
df['date'] = '2019-01-01'
df['qcut_metric_1'] = pd.qcut(df.metric_1, [0, .25, .5, .66, .75, .97, 1])
# make some more data
df_2 = df.copy()
df_2['date'] = '2019-02-01'
df = pd.concat([df, df_2])
cohorts = [['sex', 'age_group'], ['sex'], ['age_group']]
for cohort in cohorts:
    cohort_name = '_'.join(cohort)
    # print(cohort_name)
    agg = df.groupby(cohort).agg({'user_id': pd.Series.nunique, 'is_sick': pd.Series.mean})
    sick_percentage_column = f'sick_percentage__{cohort_name}'
    agg.columns = ['unique_users', sick_percentage_column]
    merged = df.merge(agg, on=cohort)  # inner (default) join ok, as agg is derived from the total => no values lost
    groupings = ['qcut_metric_1']
    groupings.extend(cohort)
    result = (merged.groupby(groupings)
                    .apply(lambda x: np.average(x[sick_percentage_column], weights=x['unique_users']))
                    .reset_index()
                    .rename({0: sick_percentage_column}, axis=1))
    display(result)
Pivot table based on the first value of the group in Pandas
You can use pivot and then dropna for each column:
>>> df.pivot(columns='col1', values='col2').apply(lambda x: x.dropna().tolist()).astype(int)
col1 a b c
0 1 2 9
1 4 5 0
2 6 8 7
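An alternative that avoids the dropna trick is to number the rows inside each group with cumcount and pivot on that counter; unlike the approach above, it also copes with unequal group sizes. A sketch, with the input reconstructed from the output shown:

```python
import pandas as pd

# Reconstructed input matching the table above
df = pd.DataFrame({"col1": list("abcabcabc"),
                   "col2": [1, 2, 9, 4, 5, 0, 6, 8, 7]})

# Number the rows within each group, then pivot on that counter;
# with unequal group sizes the missing spots simply become NaN
out = (df.assign(n=df.groupby("col1").cumcount())
         .pivot(index="n", columns="col1", values="col2"))
print(out)
```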
(Pandas) How to repeat item labels in pandas group by function?
This can be solved by resetting the index:
group_by_meta = df.groupby(['meta_weight','meta_weight_type', 'weight_conversion'])['Sales_Unit'].sum()
print(group_by_meta)
group_by_meta = group_by_meta.reset_index()
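The same effect can be had in one step by passing as_index=False to groupby, which keeps the group keys as ordinary columns. A sketch with a toy frame (the original data is not shown):

```python
import pandas as pd

# Toy frame standing in for the original data, which isn't shown
df = pd.DataFrame({"meta_weight": [1, 1, 2],
                   "meta_weight_type": ["kg", "kg", "kg"],
                   "weight_conversion": [1.0, 1.0, 2.2],
                   "Sales_Unit": [10, 5, 3]})

# as_index=False keeps the group keys as regular columns,
# so the separate reset_index() call is no longer needed
out = df.groupby(["meta_weight", "meta_weight_type", "weight_conversion"],
                 as_index=False)["Sales_Unit"].sum()
print(out)
```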