How to pivot a dataframe in Pandas?
You can use pivot_table
:
pd.pivot_table(df, values = 'Value', index=['Country','Year'], columns = 'Indicator').reset_index()
this outputs:
Indicator Country Year 1 2 3 4 5
0 Angola 2005 6 13 10 11 5
1 Angola 2006 3 2 7 3 6
Pandas Dataframe converting to pivot table
You can do
out = df.pivot_table(index='ID', columns='Type',values='Score').add_prefix('Score_').reset_index()
Out[355]:
Type ID Score_A Score_B
0 1 0.3 NaN
1 2 0.2 0.1
2 3 1.1 NaN
3 4 2.0 NaN
Pivot Wider Pandas DataFrame
Try this:
df = sample_df.pivot(columns='month', index='id')
df.columns = df.columns.swaplevel().map(' '.join)
Output:
>>> df
Feb 2021 sales Jan 2021 sales Mar 2021 sales Feb 2021 profit Jan 2021 profit Mar 2021 profit
id
1 200 100 300 20 10 30
2 500 400 600 50 40 60
3 800 700 900 80 70 90
Python pandas: pivot DataFrame columns in rows
You were right on with using a pivot_table to solve this:
df = df.pivot_table(index='Name', values='Score', columns='Subject') \
.reset_index() \
.rename_axis(None, axis=1)
print(df)
Output:
Name A B C
0 Ali NaN 96.0 97.0
1 Bob 94.0 NaN 95.0
2 Tom 91.0 92.0 93.0
Pivot also works in this case:
df = df.pivot(index='Name', values='Score', columns='Subject') \
.reset_index() \
.rename_axis(None, axis=1)
How can I pivot a dataframe?
We start by answering the first question:
Question 1
Why do I get
ValueError: Index contains duplicate entries, cannot reshape
This occurs because pandas is attempting to reindex either a columns
or index
object with duplicate entries. There are varying methods to use that can perform a pivot. Some of them are not well suited to when there are duplicates of the keys in which it is being asked to pivot on. For example. Consider pd.DataFrame.pivot
. I know there are duplicate entries that share the row
and col
values:
df.duplicated(['row', 'col']).any()
True
So when I pivot
using
df.pivot(index='row', columns='col', values='val0')
I get the error mentioned above. In fact, I get the same error when I try to perform the same task with:
df.set_index(['row', 'col'])['val0'].unstack()
Here is a list of idioms we can use to pivot
pd.DataFrame.groupby
+pd.DataFrame.unstack
- Good general approach for doing just about any type of pivot
- You specify all columns that will constitute the pivoted row levels and column levels in one group by. You follow that by selecting the remaining columns you want to aggregate and the function(s) you want to perform the aggregation. Finally, you
unstack
the levels that you want to be in the column index.
pd.DataFrame.pivot_table
- A glorified version of
groupby
with more intuitive API. For many people, this is the preferred approach. And is the intended approach by the developers. - Specify row level, column levels, values to be aggregated, and function(s) to perform aggregations.
- A glorified version of
pd.DataFrame.set_index
+pd.DataFrame.unstack
- Convenient and intuitive for some (myself included). Cannot handle duplicate grouped keys.
- Similar to the
groupby
paradigm, we specify all columns that will eventually be either row or column levels and set those to be the index. We thenunstack
the levels we want in the columns. If either the remaining index levels or column levels are not unique, this method will fail.
pd.DataFrame.pivot
- Very similar to
set_index
in that it shares the duplicate key limitation. The API is very limited as well. It only takes scalar values forindex
,columns
,values
. - Similar to the
pivot_table
method in that we select rows, columns, and values on which to pivot. However, we cannot aggregate and if either rows or columns are not unique, this method will fail.
- Very similar to
pd.crosstab
- This a specialized version of
pivot_table
and in its purest form is the most intuitive way to perform several tasks.
- This a specialized version of
pd.factorize
+np.bincount
- This is a highly advanced technique that is very obscure but is very fast. It cannot be used in all circumstances, but when it can be used and you are comfortable using it, you will reap the performance rewards.
pd.get_dummies
+pd.DataFrame.dot
- I use this for cleverly performing cross tabulation.
Examples
What I'm going to do for each subsequent answer and question is to answer it using pd.DataFrame.pivot_table
. Then I'll provide alternatives to perform the same task.
Question 3
How do I pivot
df
such that thecol
values are columns,row
values are the index, mean ofval0
are the values, and missing values are0
?
pd.DataFrame.pivot_table
fill_value
is not set by default. I tend to set it appropriately. In this case I set it to0
. Notice I skipped question 2 as it's the same as this answer without thefill_value
aggfunc='mean'
is the default and I didn't have to set it. I included it to be explicit.df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='mean')
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65
row2 0.13 0.000 0.395 0.500 0.25
row3 0.00 0.310 0.000 0.545 0.00
row4 0.00 0.100 0.395 0.760 0.24
pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].mean().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='mean').fillna(0)
Question 4
Can I get something other than
mean
, like maybesum
?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc='sum')
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].sum().unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc='sum').fillna(0)
Question 5
Can I do more that one aggregation at a time?
Notice that for pivot_table
and crosstab
I needed to pass list of callables. On the other hand, groupby.agg
is able to take strings for a limited number of special functions. groupby.agg
would also have taken the same callables we passed to the others, but it is often more efficient to leverage the string function names as there are efficiencies to be gained.
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns='col',
fill_value=0, aggfunc=[np.size, np.mean])
size mean
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 1 2 0 1 1 0.77 0.605 0.000 0.860 0.65
row2 1 0 2 1 2 0.13 0.000 0.395 0.500 0.25
row3 0 1 0 2 0 0.00 0.310 0.000 0.545 0.00
row4 0 1 2 2 1 0.00 0.100 0.395 0.760 0.24pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].agg(['size', 'mean']).unstack(fill_value=0)
pd.crosstab
pd.crosstab(
index=df['row'], columns=df['col'],
values=df['val0'], aggfunc=[np.size, np.mean]).fillna(0, downcast='infer')
Question 6
Can I aggregate over multiple value columns?
pd.DataFrame.pivot_table
we passvalues=['val0', 'val1']
but we could've left that off completelydf.pivot_table(
values=['val0', 'val1'], index='row', columns='col',
fill_value=0, aggfunc='mean')
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 0.000 0.860 0.65 0.01 0.745 0.00 0.010 0.02
row2 0.13 0.000 0.395 0.500 0.25 0.45 0.000 0.34 0.440 0.79
row3 0.00 0.310 0.000 0.545 0.00 0.00 0.230 0.00 0.075 0.00
row4 0.00 0.100 0.395 0.760 0.24 0.00 0.070 0.42 0.300 0.46pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0', 'val1'].mean().unstack(fill_value=0)
Question 7
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index='row', columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.605 0.86 0.65
row2 0.35 0.00 0.37 0.00 0.00 0.44 0.00 0.00 0.13 0.000 0.50 0.13
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.000 0.28 0.00
row4 0.15 0.64 0.00 0.00 0.10 0.64 0.88 0.24 0.00 0.000 0.00 0.00pd.DataFrame.groupby
df.groupby(
['row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)
Question 8
Can Subdivide by multiple columns?
pd.DataFrame.pivot_table
df.pivot_table(
values='val0', index=['key', 'row'], columns=['item', 'col'],
fill_value=0, aggfunc='mean')
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
key row
key0 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.86 0.00
row2 0.00 0.00 0.37 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.50 0.00
row3 0.00 0.00 0.00 0.00 0.31 0.00 0.81 0.00 0.00 0.00 0.00 0.00
row4 0.15 0.64 0.00 0.00 0.00 0.00 0.00 0.24 0.00 0.00 0.00 0.00
key1 row0 0.00 0.00 0.00 0.77 0.00 0.00 0.00 0.00 0.00 0.81 0.00 0.65
row2 0.35 0.00 0.00 0.00 0.00 0.44 0.00 0.00 0.00 0.00 0.00 0.13
row3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.28 0.00
row4 0.00 0.00 0.00 0.00 0.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00
key2 row0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.40 0.00 0.00
row2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.13 0.00 0.00 0.00
row4 0.00 0.00 0.00 0.00 0.00 0.64 0.88 0.00 0.00 0.00 0.00 0.00pd.DataFrame.groupby
df.groupby(
['key', 'row', 'item', 'col']
)['val0'].mean().unstack(['item', 'col']).fillna(0).sort_index(1)pd.DataFrame.set_index
because the set of keys are unique for both rows and columnsdf.set_index(
['key', 'row', 'item', 'col']
)['val0'].unstack(['item', 'col']).fillna(0).sort_index(1)
Question 9
Can I aggregate the frequency in which the column and rows occur together, aka "cross tabulation"?
pd.DataFrame.pivot_table
df.pivot_table(index='row', columns='col', fill_value=0, aggfunc='size')
col col0 col1 col2 col3 col4
row
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1pd.DataFrame.groupby
df.groupby(['row', 'col'])['val0'].size().unstack(fill_value=0)
pd.crosstab
pd.crosstab(df['row'], df['col'])
pd.factorize
+np.bincount
# get integer factorization `i` and unique values `r`
# for column `'row'`
i, r = pd.factorize(df['row'].values)
# get integer factorization `j` and unique values `c`
# for column `'col'`
j, c = pd.factorize(df['col'].values)
# `n` will be the number of rows
# `m` will be the number of columns
n, m = r.size, c.size
# `i * m + j` is a clever way of counting the
# factorization bins assuming a flat array of length
# `n * m`. Which is why we subsequently reshape as `(n, m)`
b = np.bincount(i * m + j, minlength=n * m).reshape(n, m)
# BTW, whenever I read this, I think 'Bean, Rice, and Cheese'
pd.DataFrame(b, r, c)
col3 col2 col0 col1 col4
row3 2 0 0 1 0
row2 1 2 1 0 2
row0 1 0 1 2 1
row4 2 2 0 1 1pd.get_dummies
pd.get_dummies(df['row']).T.dot(pd.get_dummies(df['col']))
col0 col1 col2 col3 col4
row0 1 2 0 1 1
row2 1 0 2 1 2
row3 0 1 0 2 0
row4 0 1 2 2 1
Question 10
How do I convert a DataFrame from long to wide by pivoting on ONLY two
columns?
DataFrame.pivot
The first step is to assign a number to each row - this number will be the row index of that value in the pivoted result. This is done using
GroupBy.cumcount
:df2.insert(0, 'count', df2.groupby('A').cumcount())
df2
count A B
0 0 a 0
1 1 a 11
2 2 a 2
3 3 a 11
4 0 b 10
5 1 b 10
6 2 b 14
7 0 c 7The second step is to use the newly created column as the index to call
DataFrame.pivot
.df2.pivot(*df2)
# df2.pivot(index='count', columns='A', values='B')
A a b c
count
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaNDataFrame.pivot_table
Whereas
DataFrame.pivot
only accepts columns,DataFrame.pivot_table
also accepts arrays, so theGroupBy.cumcount
can be passed directly as theindex
without creating an explicit column.df2.pivot_table(index=df2.groupby('A').cumcount(), columns='A', values='B')
A a b c
0 0.0 10.0 7.0
1 11.0 10.0 NaN
2 2.0 14.0 NaN
3 11.0 NaN NaN
Question 11
How do I flatten the multiple index to single index after
pivot
If columns
type object
with string join
df.columns = df.columns.map('|'.join)
else format
df.columns = df.columns.map('{0[0]}|{0[1]}'.format)
pivot pandas dataframe and count true and false values
You could melt it first in order to unpivot, then reconstruct the pivot table in the way you want.
import pandas as pd
df = pd.DataFrame({'year': {0: 2020, 1: 2020, 2: 2021},
'a': {0: True, 1: False, 2: False},
'b': {0: False, 1: False, 2: False},
'c': {0: True, 1: True, 2: True},
'd': {0: False, 1: False, 2: False},
'e': {0: True, 1: True, 2: True}})
df.melt(id_vars='year',
var_name='col').pivot_table(index=['year','col'],
columns='value',
aggfunc=lambda x:len(x), fill_value=0).add_prefix('count_')
Output
value count_False count_True
year col
2020 a 1 1
b 2 0
c 0 2
d 2 0
e 0 2
2021 a 1 0
b 1 0
c 0 1
d 1 0
e 0 1
Pivot pandas dataframe from single row to one row per item
Here's a way:
df = df.T
df.index = pd.MultiIndex.from_arrays([[x[x.find('.')+1:] for x in df.index], [x[:x.find('.')] for x in df.index]])
df = df.unstack()
Input:
id_prop.0 id_prop.1 id_prop.2 prop_number.0 prop_number.1 prop_number.2 prop_value.0 prop_value.1 prop_value.2
0 1 2 3 123 325 754 1 1 1
Output:
id_prop prop_number prop_value
0 1 123 1
1 2 325 1
2 3 754 1
Explanation:
- transpose so we can work with the index instead of the columns
- parse each label into the desired label (prefix) and the result number (suffix) split by the
.
character - update the df's index to be a MultiIndex with two levels: an list of result numbers and a list of desired labels
- call
unstack
to pivot a level of the MultiIndex (the desired labels) to be column headings
UPDATE: To handle labels where result number is the second .
separated token with additional tokens to its right (as described in OP's comment), we can do this:
import pandas as pd
json = {
"building.0.description.bedrooms":{"0":"qrs"},
"building.1.description.bedrooms":{"0":"tuv"},
"building.2.description.bedrooms":{"0":"xyz"},
"id_prop.0":{"0":1},"id_prop.1":{"0":2},"id_prop.2":{"0":3},
"prop_number.0":{"0":123},"prop_number.1":{"0":325},"prop_number.2":{"0":754},
"prop_value.0":{"0":1},"prop_value.1":{"0":1},"prop_value.2":{"0":1}}
df = pd.DataFrame.from_dict(json, orient='columns')
print(df.to_string())
df = df.T
df.index = pd.MultiIndex.from_arrays([[x.split('.')[1] for x in df.index], ['.'.join(x.split('.')[0:1] + x.split('.')[2:]) for x in df.index]])
df = df.unstack()
df.columns = df.columns.get_level_values(1)
print(df)
Input:
building.0.description.bedrooms building.1.description.bedrooms building.2.description.bedrooms id_prop.0 id_prop.1 id_prop.2 prop_number.0 prop_number.1 prop_number.2 prop_value.0 prop_value.1 prop_value.2
0 qrs tuv xyz 1 2 3 123 325 754 1 1 1
Output:
building.description.bedrooms id_prop prop_number prop_value
0 qrs 1 123 1
1 tuv 2 325 1
2 xyz 3 754 1
Related Topics
How to Make Ipython Notebook Matplotlib Plot Inline
How to Set Sys.Stdout Encoding in Python 3
How to Use a Python Script in the Command Line Without Cd-Ing to Its Directory? Is It the Pythonpath
Convert Integer to String in Python
Importerror: Dll Load Failed: %1 Is Not a Valid Win32 Application. But the Dll's Are There
How to Convert a Utc Datetime to a Local Datetime Using Only Standard Library
Importerror: No Module Named 'Encodings'
How to Convert a Pil Image into a Numpy Array
How to Get the Original Variable Name of Variable Passed to a Function
How to Do a Recursive Sub-Folder Search and Return Files in a List
Error After Upgrading Pip: Cannot Import Name 'Main'
How to Find Median and Quantiles Using Spark
Python Recursion with List Returns None
Why Does Checking a Variable Against Multiple Values with 'Or' Only Check the First Value
How to Capture Stdout Output from a Python Function Call
How to Load Default Profile in Chrome Using Python Selenium Webdriver