Reshape Wide to Long in Pandas

Reshape wide to long in pandas

Update

As George Liu has shown in another answer, pd.melt is the idiomatic, flexible and fast solution to this problem. Do not use unstack for this.

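For reference, a minimal melt sketch (assuming the same wide frame used below, with the dates in the index and one column per label such as AA, BB, CC; the names 'id' and 'value' are purely illustrative):

long_df = (df.reset_index()                # move the dates out of the index
             .melt(id_vars='date',         # keep 'date' as an identifier column
                   var_name='id',          # former column labels (AA, BB, CC)
                   value_name='value'))    # the measurements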

unstack returns a Series with a MultiIndex:

In [38]: df.unstack()
Out[38]:
    date
AA  05/03    1
    06/03    4
    07/03    7
    08/03    5
BB  05/03    2
    06/03    5
    07/03    8
    08/03    7
CC  05/03    3
    06/03    6
    07/03    9
    08/03    1
dtype: int64

You can call reset_index on the resulting Series:

In [39]: df.unstack().reset_index()
Out[39]:
   level_0   date  0
0       AA  05/03  1
1       AA  06/03  4
2       AA  07/03  7
3       AA  08/03  5
4       BB  05/03  2
5       BB  06/03  5
6       BB  07/03  8
7       BB  08/03  7
8       CC  05/03  3
9       CC  06/03  6
10      CC  07/03  9
11      CC  08/03  1
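
The default labels produced by reset_index (level_0, date, 0) can be renamed in the same chain; a small sketch with illustrative names:

(df.unstack()
   .reset_index(name='value')               # name the values column
   .rename(columns={'level_0': 'id'}))      # name the former column level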

Or construct a DataFrame with a MultiIndex:

In [40]: pd.DataFrame(df.unstack())
Out[40]:
          0
   date
AA 05/03  1
   06/03  4
   07/03  7
   08/03  5
BB 05/03  2
   06/03  5
   07/03  8
   08/03  7
CC 05/03  3
   06/03  6
   07/03  9
   08/03  1

Pandas long to wide reshape, by two variables

A simple pivot might be sufficient for your needs, but this is what I did to reproduce your desired output:

df['idx'] = df.groupby('Salesman').cumcount()

Just adding a within-group counter/index will get you most of the way there, but the column labels will not be as you desired:

print(df.pivot(index='Salesman',columns='idx')[['product','price']])

         product             price
idx            0     1     2      0    1    2
Salesman
Knut         bat  ball  wand      5    1    3
Steve        pen   NaN   NaN      2  NaN  NaN

To get closer to your desired output I added the following:

df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)

product = df.pivot(index='Salesman',columns='prod_idx',values='product')
prc = df.pivot(index='Salesman',columns='prc_idx',values='price')

reshape = pd.concat([product,prc],axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print(reshape)

         product_0 product_1 product_2  price_0  price_1  price_2  Height
Salesman
Knut           bat      ball      wand        5        1        3       6
Steve          pen       NaN       NaN        2      NaN      NaN       5

Edit: if you want to generalize the procedure to more variables, I think you could do something like the following (although it might not be efficient enough):

df['idx'] = df.groupby('Salesman').cumcount()

tmp = []
for var in ['product','price']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='Salesman',columns='tmp_idx',values=var))

reshape = pd.concat(tmp,axis=1)

@Luke said:

I think Stata can do something like this with the reshape command.

You can, but I think you also need a within-group counter for Stata's reshape to produce your desired output:

     +-------------------------------------------+
     | salesman   idx   height   product   price |
     |-------------------------------------------|
  1. |     Knut     0        6       bat       5 |
  2. |     Knut     1        6      ball       1 |
  3. |     Knut     2        6      wand       3 |
  4. |    Steve     0        5       pen       2 |
     +-------------------------------------------+

If you add idx then you can do the reshape in Stata:

reshape wide product price, i(salesman) j(idx)

Reshape wide to long for many columns with a common prefix

A couple of options to do this:

With pd.wide_to_long, you need to reorder the name positions around the delimiter; in this case we move the a, b, ... to the front and the p1, p2 to the back before reshaping:

temp = df.copy()
temp = temp.rename(columns=lambda col: ".".join(col.split(".")[::-1]))
(pd.wide_to_long(temp.reset_index(),
                 stubnames = ["a", "b", "c", "d", "e", "f", "g"],
                 sep = ".",
                 suffix = ".+",
                 i = "index",
                 j = "side")
   .droplevel('index')
   .reset_index())

  side  a   b   c   d   e   f   g
0   p1  4   1   2   3   4   5   6
1   p1  0   4   8  12  16  20  24
2   p2  0   3   6   9  12  15  18
3   p2  0  12  24  36  48  60  72

One limitation of pd.wide_to_long is the need to reorder the name positions; another is that the stubnames have to be explicitly specified.
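
If listing the stubnames by hand is impractical, they can usually be derived from the column labels instead; a small sketch, assuming the reordered temp frame from the snippet above (columns like 'a.p1'):

stubs = sorted({col.split(".")[0] for col in temp.columns})   # ['a', 'b', ..., 'g']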

Another option is via stack, where the columns are split based on the delimiter and reshaped:

temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)

temp.stack(0).droplevel(0).rename_axis('side').reset_index()

  side  a   b   c   d   e   f   g
0   p1  4   1   2   3   4   5   6
1   p2  0   3   6   9  12  15  18
2   p1  0   4   8  12  16  20  24
3   p2  0  12  24  36  48  60  72

stack is quite flexible, and did not require us to list the column names. The limitation of stack is that it fails if the index is not unique.

Another option is pivot_longer from pyjanitor, which abstracts the process:

# pip install pyjanitor
import janitor
df.pivot_longer(index = None,
                names_to = ("side", ".value"),
                names_sep = ".")

  side  a   b   c   d   e   f   g
0   p1  4   1   2   3   4   5   6
1   p1  0   4   8  12  16  20  24
2   p2  0   3   6   9  12  15  18
3   p2  0  12  24  36  48  60  72

The key here is .value. It tells the function that anything after the . should remain as a column name, while anything before the . should be collected into a new column (side). Note that, unlike wide_to_long, the stubnames do not need to be stated; pivot_longer abstracts that away. It can also handle duplicate indices, since it uses pd.melt under the hood.
One limitation of pivot_longer is that you have to install the pyjanitor library.

For the other example, I'll use stack and pivot_longer; you can still use pd.wide_to_long to solve it.

With stack:

First, split the columns and convert them into a MultiIndex:

temp = df.copy()
temp.columns = temp.columns.str.split(".", expand = True)

Reshape the data:

temp = temp.stack(1).droplevel(0).rename_axis('side')

Merge the column names:

temp.columns = temp.columns.map(".".join)

Reset the index:

temp.reset_index()

  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p2      0      3      6      9     12     15     18
2   p1      0      4      8     12     16     20     24
3   p2      0     12     24     36     48     60     72

With pivot_longer, one option is to reorder the columns, before reshaping:

temp = df.copy()

temp.columns = ["".join([first, last, middle])
                for first, middle, last in
                temp.columns.str.split(r'(\.p\d)')]

(
    temp
    .pivot_longer(
        index = None,
        names_to = ('.value', 'side'),
        names_pattern = r"(.+)\.(p\d)")
)

  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p1      0      4      8     12     16     20     24
2   p2      0      3      6      9     12     15     18
3   p2      0     12     24     36     48     60     72

In the dev version, however, the column reorder is not necessary; we can simply use multiple .value entries to reshape the dataframe. Note that you'll have to install from the repo to get the latest dev version:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
(df
 .pivot_longer(
     index = None,
     names_to = ('.value', 'side', '.value'),
     names_pattern = r"(.+)\.(.\d)(.+)")
)

  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p1      0      4      8     12     16     20     24
2   p2      0      3      6      9     12     15     18
3   p2      0     12     24     36     48     60     72

Another option with names_sep:

(df
 .pivot_longer(
     index = None,
     names_to = ('.value', 'side', '.value'),
     names_sep = r'\.(p\d)')
)

  side  foo.a  foo.b  foo.c  foo.d  foo.e  foo.f  foo.g
0   p1      4      1      2      3      4      5      6
1   p1      0      4      8     12     16     20     24
2   p2      0      3      6      9     12     15     18
3   p2      0     12     24     36     48     60     72

Pandas long to wide (unmelt or similar?)

  1. Make a helper column that records which rows are associated with name-1/val-1 and which with name-2/val-2
  2. Using this new column we can now pivot the data without encountering the "duplicate entries" error

Clean the output:


  1. Sort the columns to match your output
  2. The pivot will return a dataframe with a column MultiIndex; we'll need to flatten it by combining levels

pandas > 1.0.5

id_vars = ["varA", "varB"]

# Create an id column that tracks the id_vars
# Pivot using the id column
# Sort the columns to match OP expected output
pivoted_df = (
    df.assign(
        id=df.groupby(id_vars).cumcount().add(1).astype(str)
    )
    .pivot(index=id_vars, columns="id", values=["name", "val"])
    .sort_index(level=1, axis=1)
)

# flatten the column multiindex, insert the row index as values
flattened_columns = pivoted_df.columns.map("-".join)
pivoted_df = (
    pivoted_df.set_axis(flattened_columns, axis=1)
    .reset_index()
)

print(pivoted_df)
   varA  varB name-1  val-1 name-2  val-2
0     1   200      A      4      B      1
1     2   250      A      8      B      0

pandas <= 1.0.5

  • pivot in these versions did not support passing lists as arguments. A workaround is to use pivot_table instead, with a non-aggregating aggfunc to prevent accidental aggregation.

id_vars = ["varA", "varB"]

# Create an id column that tracks the id_vars
# Pivot using the id column
# Sort the columns to match OP expected output
pivoted_df = (
    df.assign(
        id=df.groupby(id_vars).cumcount().add(1).astype(str)
    )
    .pivot_table(index=id_vars, columns="id", values=["name", "val"], aggfunc=lambda x: x)
    .sort_index(level=1, axis=1)
)

# flatten the column multiindex, insert the row index as values
flattened_columns = pivoted_df.columns.map("-".join)
pivoted_df = (
    pivoted_df.set_axis(flattened_columns, axis=1)
    .reset_index()
)

Wide to long data transform in pandas

You can use melt for reshaping, then split the variable column, drop the helper columns, and sort_values. You can cast the year column to int with astype and, finally, change the order of the columns by subsetting:

df1 = (pd.melt(df, id_vars=['county','area'], value_name='pop'))
df1[['tmp','year']] = df1.variable.str.split('_', expand=True)
df1 = df1.drop(['variable', 'tmp'],axis=1).sort_values(['county','year'])
df1['year'] = df1.year.astype(int)
df1 = df1[['county','year','pop','area']]
print (df1)
   county  year   pop  area
0    1001  2006  1037   275
3    1001  2007  1052   275
6    1001  2008  1102   275
1    1003  2006  2399   394
4    1003  2007  2424   394
7    1003  2008  2438   394
2    1005  2006  1638   312
5    1005  2007  1647   312
8    1005  2008  1660   312

print (df1.dtypes)
county    int64
year      int32
pop       int64
area      int64
dtype: object

Another solution with set_index, stack and reset_index:

df2 = df.set_index(['county','area']).stack().reset_index(name='pop')
df2[['tmp','year']] = df2.level_2.str.split('_', expand=True)
df2 = df2.drop(['level_2', 'tmp'],axis=1)
df2['year'] = df2.year.astype(int)
df2 = df2[['county','year','pop','area']]

print (df2)
   county  year   pop  area
0    1001  2006  1037   275
1    1001  2007  1052   275
2    1001  2008  1102   275
3    1003  2006  2399   394
4    1003  2007  2424   394
5    1003  2008  2438   394
6    1005  2006  1638   312
7    1005  2007  1647   312
8    1005  2008  1660   312

print (df2.dtypes)
county    int64
year      int32
pop       int64
area      int64
dtype: object

python pandas reshape data from long to wide

Group the dataframe by id and status columns, then take the first values for value column, finally unstack the resulting series:

>>> df.groupby(['id', 'status']).value.first().unstack().reset_index()

status  id item_ordered item_received item_setup
0        1     complete      complete   complete
1        2     complete           NaN   complete

Complicated (for me) reshaping from wide to long in Pandas

In fact, pandas has a wide_to_long command that can conveniently do what you intend to do.

import numpy as np
import pandas as pd

df = pd.DataFrame({'location': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'dist_to_A': [0, 0, 0, 50, 50, 50],
                   'dist_to_B': [50, 50, 50, 0, 0, 0],
                   'location_var': [10, 10, 10, 14, 14, 14],
                   'ind_var': [3, 8, 10, 1, 3, 4]})

df['ind'] = df.index

#The `location` and `location_var` corresponds to the choices,
#record them as dictionaries and drop them
#(Just realized you had a cleaner way; copied from yours).

ind_to_loc = dict(df['location'])
loc_dict = dict(df.groupby('location').agg(lambda x : int(np.mean(x)))['location_var'])
df.drop(['location_var', 'location'], axis = 1, inplace = True)
# now reshape
df_long = pd.wide_to_long(df, ['dist_to_'], i = 'ind', j = 'location')

# use the dictionaries to get variables `choice` and `location_var` back.

df_long['choice'] = df_long.index.map(lambda x: ind_to_loc[x[0]])
df_long['location_var'] = df_long.index.map(lambda x : loc_dict[x[1]])
print(df_long.sort_index())

This gives you the table you asked for:

              ind_var  dist_to_ choice  location_var
ind location
0   A               3         0      A            10
    B               3        50      A            14
1   A               8         0      A            10
    B               8        50      A            14
2   A              10         0      A            10
    B              10        50      A            14
3   A               1        50      B            10
    B               1         0      B            14
4   A               3        50      B            10
    B               3         0      B            14
5   A               4        50      B            10
    B               4         0      B            14

Of course you can generate a choice variable that takes 0 and 1 if that's what you want.
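
A minimal sketch of that, assuming df_long from above (the column name 'chosen' is only for illustration): mark a row as 1 when its location level matches the recorded choice.

df_long['chosen'] = (df_long['choice'] == df_long.index.get_level_values('location')).astype(int)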

Convert dataframe from wide to long - pandas

There is pandas.wide_to_long, which is nice when the columns have stubs like that.

import pandas as pd

df.reset_index(inplace=True,drop=True)
df['id'] = df.index
df = pd.wide_to_long(df, stubnames='S_', i='id', j='num').reset_index().rename(columns={'S_':'S'})

#    id  num  index    S
# 0   0    1      0  1.0
# 1   1    1      1  1.0
# 2   0    2      0  0.0
# 3   1    2      1  1.0
# 4   0    3      0  0.0
# 5   1    3      1  NaN
# 6   0    4      0  1.0
# 7   1    4      1  NaN

Reshaping long format dataframe to wide format according to the value of the elements in columns

Reshape the dataframe using pivot, then subtract the values from 5 with rsub and add the prefix n to the column names:

df.pivot(index='group', columns='ID', values='rank').rsub(5).add_prefix('n')


ID       n1   n2   n3   n4
group
1       3.0  2.0  4.0  1.0
2       3.0  4.0  NaN  NaN
3       4.0  1.0  2.0  3.0
4       4.0  NaN  NaN  NaN
5       2.0  3.0  4.0  NaN
6       4.0  NaN  NaN  NaN

