How to Pivot a Dataframe in Pandas

How to pivot a dataframe in Pandas?

You can use pivot_table:

pd.pivot_table(df, values = 'Value', index=['Country','Year'], columns = 'Indicator').reset_index()

this outputs:

 Indicator  Country     Year    1   2   3   4   5
 0          Angola      2005    6   13  10  11  5
 1          Angola      2006    3   2   7   3   6

Pandas Dataframe converting to pivot table

You can do

out = df.pivot_table(index='ID', columns='Type',values='Score').add_prefix('Score_').reset_index()
Out[355]: 
Type  ID  Score_A  Score_B
0      1      0.3      NaN
1      2      0.2      0.1
2      3      1.1      NaN
3      4      2.0      NaN

Pivot Wider Pandas DataFrame

Try this:

df = sample_df.pivot(columns='month', index='id')
df.columns = df.columns.swaplevel().map(' '.join)

Output:

>>> df
    Feb 2021 sales  Jan 2021 sales  Mar 2021 sales  Feb 2021 profit  Jan 2021 profit  Mar 2021 profit
id                                                                                                   
1              200             100             300               20               10               30
2              500             400             600               50               40               60
3              800             700             900               80               70               90

Python pandas: pivot DataFrame columns in rows

You were right on with using a pivot_table to solve this:

df = df.pivot_table(index='Name', values='Score', columns='Subject') \
    .reset_index() \
    .rename_axis(None, axis=1)
print(df)

Output:

  Name     A     B     C
0  Ali   NaN  96.0  97.0
1  Bob  94.0   NaN  95.0
2  Tom  91.0  92.0  93.0

Pivot also works in this case:

df = df.pivot(index='Name', values='Score', columns='Subject') \
    .reset_index() \
    .rename_axis(None, axis=1)

How can I pivot a dataframe?

We start by answering the first question:

Question 1

Why do I get ValueError: Index contains duplicate entries, cannot reshape

This occurs because pandas is attempting to reindex either a columns or index object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot. I know there are duplicate entries that share the row and col values:

df.duplicated(['row', 'col']).any()

True

So when I pivot using

df.pivot(index='row', columns='col', values='val0')

I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:

df.set_index(['row', 'col'])['val0'].unstack()

Here is a list of idioms we can use to pivot

pd.DataFrame.groupby + pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you unstack the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- A glorified version of groupby with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers.
- Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
pd.DataFrame.set_index + pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the groupby paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We then unstack the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to set_index in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values for index, columns, values.
- Similar to the pivot_table method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
pd.crosstab
- This a specialized version of pivot_table and in its purest form is the most intuitive way to perform several tasks.
pd.factorize + np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies + pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.

Examples

What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table. Then I'll provide alternatives to perform the same task.

Question 3

How do I pivot df such that the col values are columns, row values are the index, mean of val0 are the values, and missing values are 0?

pd.DataFrame.pivot_table

fill_value is not set by default. I tend to set it appropriately. In this case I set it to 0. Notice I skipped question 2 as it's the same as this answer without the fill_value

aggfunc='mean' is the default and I didn't have to set it. I included it to be explicit.

    df.pivot_table(
        values='val0', index='row', columns='col',
        fill_value=0, aggfunc='mean')

    col   col0   col1   col2   col3  col4
    row
    row0  0.77  0.605  0.000  0.860  0.65
    row2  0.13  0.000  0.395  0.500  0.25
    row3  0.00  0.310  0.000  0.545  0.00
    row4  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

  df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)

pd.crosstab

  pd.crosstab(
      index=df['row'], columns=df['col'],
      values=df['val0'], aggfunc='mean').fillna(0)

Question 4

Can I get something other than mean, like maybe sum?

pd.DataFrame.pivot_table

  df.pivot_table(
      values='val0', index='row', columns='col',
      fill_value=0, aggfunc='sum')

  col   col0  col1  col2  col3  col4
  row
  row0  0.77  1.21  0.00  0.86  0.65
  row2  0.13  0.00  0.79  0.50  0.50
  row3  0.00  0.31  0.00  1.09  0.00
  row4  0.00  0.10  0.79  1.52  0.24

pd.DataFrame.groupby

  df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)

pd.crosstab

  pd.crosstab(
      index=df['row'], columns=df['col'],
      values=df['val0'], aggfunc='sum').fillna(0)

Question 5

Can I do more that one aggregation at a time?

Notice that for pivot_table and crosstab I needed to pass list of callables. On the other hand, groupby.agg is able to take strings for a limited number of special functions. groupby.agg would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.

pd.DataFrame.pivot_table

  df.pivot_table(
      values='val0', index='row', columns='col',
      fill_value=0, aggfunc=[np.size, np.mean])

       size                      mean
  col  col0 col1 col2 col3 col4  col0   col1   col2   col3  col4
  row
  row0    1    2    0    1    1  0.77  0.605  0.000  0.860  0.65
  row2    1    0    2    1    2  0.13  0.000  0.395  0.500  0.25
  row3    0    1    0    2    0  0.00  0.310  0.000  0.545  0.00
  row4    0    1    2    2    1  0.00  0.100  0.395  0.760  0.24

pd.DataFrame.groupby

  df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)

pd.crosstab

  pd.crosstab(
      index=df['row'], columns=df['col'],
      values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')

Question 6

Can I aggregate over multiple value columns?

pd.DataFrame.pivot_table we pass values=['val0', 'val1'] but we could've left that off completely

  df.pivot_table(
      values=['val0', 'val1'], index='row', columns='col',
      fill_value=0, aggfunc='mean')

        val0                             val1
  col   col0   col1   col2   col3  col4  col0   col1  col2   col3  col4
  row
  row0  0.77  0.605  0.000  0.860  0.65  0.01  0.745  0.00  0.010  0.02
  row2  0.13  0.000  0.395  0.500  0.25  0.45  0.000  0.34  0.440  0.79
  row3  0.00  0.310  0.000  0.545  0.00  0.00  0.230  0.00  0.075  0.00
  row4  0.00  0.100  0.395  0.760  0.24  0.00  0.070  0.42  0.300  0.46

pd.DataFrame.groupby

  df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)

Question 7

Can Subdivide by multiple columns?

pd.DataFrame.pivot_table

  df.pivot_table(
      values='val0', index='row', columns=['item', 'col'],
      fill_value=0, aggfunc='mean')

  item item0             item1                         item2
  col   col2  col3  col4  col0  col1  col2  col3  col4  col0   col1  col3  col4
  row
  row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.605  0.86  0.65
  row2  0.35  0.00  0.37  0.00  0.00  0.44  0.00  0.00  0.13  0.000  0.50  0.13
  row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.000  0.28  0.00
  row4  0.15  0.64  0.00  0.00  0.10  0.64  0.88  0.24  0.00  0.000  0.00  0.00

pd.DataFrame.groupby

  df.groupby(
      ['row', 'item', 'col']
  )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)

Question 8

Can Subdivide by multiple columns?

pd.DataFrame.pivot_table

  df.pivot_table(
      values='val0', index=['key', 'row'], columns=['item', 'col'],
      fill_value=0, aggfunc='mean')

  item      item0             item1                         item2
  col        col2  col3  col4  col0  col1  col2  col3  col4  col0  col1  col3  col4
  key  row
  key0 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.86  0.00
       row2  0.00  0.00  0.37  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.50  0.00
       row3  0.00  0.00  0.00  0.00  0.31  0.00  0.81  0.00  0.00  0.00  0.00  0.00
       row4  0.15  0.64  0.00  0.00  0.00  0.00  0.00  0.24  0.00  0.00  0.00  0.00
  key1 row0  0.00  0.00  0.00  0.77  0.00  0.00  0.00  0.00  0.00  0.81  0.00  0.65
       row2  0.35  0.00  0.00  0.00  0.00  0.44  0.00  0.00  0.00  0.00  0.00  0.13
       row3  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.28  0.00
       row4  0.00  0.00  0.00  0.00  0.10  0.00  0.00  0.00  0.00  0.00  0.00  0.00
  key2 row0  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.40  0.00  0.00
       row2  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.13  0.00  0.00  0.00
       row4  0.00  0.00  0.00  0.00  0.00  0.64  0.88  0.00  0.00  0.00  0.00  0.00

pd.DataFrame.groupby

  df.groupby(
      ['key', 'row', 'item', 'col']
  )['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)

pd.DataFrame.set_index because the set of keys are unique for both rows and columns

  df.set_index(
      ['key', 'row', 'item', 'col']
  )['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)

Question 9

Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?

pd.DataFrame.pivot_table

  df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')

      col   col0  col1  col2  col3  col4
  row
  row0     1     2     0     1     1
  row2     1     0     2     1     2
  row3     0     1     0     2     0
  row4     0     1     2     2     1

pd.DataFrame.groupby

  df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)

pd.crosstab
```
  pd.crosstab(df['row'], df['col'])
```

pd.factorize + np.bincount

  # get integer factorization `i` and unique values `r`
  # for column `'row'`
  i, r = pd.factorize(df['row'].values)
  # get integer factorization `j` and unique values `c`
  # for column `'col'`
  j, c = pd.factorize(df['col'].values)
  # `n` will be the number of rows
  # `m` will be the number of columns
  n, m = r.size, c.size
  # `i * m + j` is a clever way of counting the
  # factorization bins assuming a flat array of length
  # `n * m`.  Which is why we subsequently reshape as `(n, m)`
  b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
  # BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
  pd.DataFrame(b, r, c)

        col3  col2  col0  col1  col4
  row3     2     0     0     1     0
  row2     1     2     1     0     2
  row0     1     0     1     2     1
  row4     2     2     0     1     1

pd.get_dummies

  pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))

        col0  col1  col2  col3  col4
  row0     1     2     0     1     1
  row2     1     0     2     1     2
  row3     0     1     0     2     0
  row4     0     1     2     2     1

Question 10

How do I convert a DataFrame from long to wide by pivoting on ONLY two
columns?

DataFrame.pivot

The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using GroupBy.cumcount:

  df2.insert(0, 'count', df2.groupby('A').cumcount())
  df2

     count  A   B
  0      0  a   0
  1      1  a  11
  2      2  a   2
  3      3  a  11
  4      0  b  10
  5      1  b  10
  6      2  b  14
  7      0  c   7

The second step is to use the newly created column as the index to call DataFrame.pivot.

  df2.pivot(*df2)
  # df2.pivot(index='count', columns='A', values='B')

  A         a     b    c
  count
  0       0.0  10.0  7.0
  1      11.0  10.0  NaN
  2       2.0  14.0  NaN
  3      11.0   NaN  NaN

DataFrame.pivot_table

Whereas DataFrame.pivot only accepts columns, DataFrame.pivot_table also accepts arrays, so the GroupBy.cumcount can be passed directly as the index without creating an explicit column.

  df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')

  A         a     b    c
  0       0.0  10.0  7.0
  1      11.0  10.0  NaN
  2       2.0  14.0  NaN
  3      11.0   NaN  NaN

Question 11

How do I flatten the multiple index to single index after pivot

If columns type object with string join

df.columns = df.columns.map('|'.join)

else format

df.columns = df.columns.map('{0[0]}|{0[1]}'.format)

pivot pandas dataframe and count true and false values

You could melt it first in order to unpivot, then reconstruct the pivot table in the way you want.

import pandas as pd
df = pd.DataFrame({'year': {0: 2020, 1: 2020, 2: 2021},
 'a': {0: True, 1: False, 2: False},
 'b': {0: False, 1: False, 2: False},
 'c': {0: True, 1: True, 2: True},
 'd': {0: False, 1: False, 2: False},
 'e': {0: True, 1: True, 2: True}})

df.melt(id_vars='year', 
    var_name='col').pivot_table(index=['year','col'],
                                columns='value',
                                aggfunc=lambda x:len(x), fill_value=0).add_prefix('count_')

Output

value     count_False  count_True
year col                         
2020 a              1           1
     b              2           0
     c              0           2
     d              2           0
     e              0           2
2021 a              1           0
     b              1           0
     c              0           1
     d              1           0
     e              0           1

Pivot pandas dataframe from single row to one row per item

Here's a way:

df = df.T
df.index = pd.MultiIndex.from_arrays([[x[x.find('.')+1:] for x in df.index], [x[:x.find('.')] for x in df.index]])
df = df.unstack()

Input:

   id_prop.0  id_prop.1  id_prop.2  prop_number.0  prop_number.1  prop_number.2  prop_value.0  prop_value.1  prop_value.2
0          1          2          3            123            325            754             1             1             1

Output:

  id_prop prop_number prop_value
0       1         123          1
1       2         325          1
2       3         754          1

Explanation:

transpose so we can work with the index instead of the columns
parse each label into the desired label (prefix) and the result number (suffix) split by the . character
update the df's index to be a MultiIndex with two levels: an list of result numbers and a list of desired labels
call unstack to pivot a level of the MultiIndex (the desired labels) to be column headings

UPDATE: To handle labels where result number is the second . separated token with additional tokens to its right (as described in OP's comment), we can do this:

import pandas as pd
json = {
    "building.0.description.bedrooms":{"0":"qrs"}, 
    "building.1.description.bedrooms":{"0":"tuv"}, 
    "building.2.description.bedrooms":{"0":"xyz"}, 

    "id_prop.0":{"0":1},"id_prop.1":{"0":2},"id_prop.2":{"0":3},
    "prop_number.0":{"0":123},"prop_number.1":{"0":325},"prop_number.2":{"0":754},
    "prop_value.0":{"0":1},"prop_value.1":{"0":1},"prop_value.2":{"0":1}}
df = pd.DataFrame.from_dict(json, orient='columns')
print(df.to_string())

df = df.T
df.index = pd.MultiIndex.from_arrays([[x.split('.')[1] for x in df.index], ['.'.join(x.split('.')[0:1] + x.split('.')[2:]) for x in df.index]])
df = df.unstack()
df.columns = df.columns.get_level_values(1)
print(df)

Input:

  building.0.description.bedrooms building.1.description.bedrooms building.2.description.bedrooms  id_prop.0  id_prop.1  id_prop.2  prop_number.0  prop_number.1  prop_number.2  prop_value.0  prop_value.1  prop_value.2
0                             qrs                             tuv                             xyz          1          2          3            123            325            754             1             1             1

Output:

  building.description.bedrooms id_prop prop_number prop_value
0                           qrs       1         123          1
1                           tuv       2         325          1
2                           xyz       3         754          1

How to Pivot a Dataframe in Pandas