Unnest (Explode) a Pandas Series

How to unnest (explode) a column in a pandas DataFrame, into multiple rows

I know object dtype columns makes the data hard to convert with pandas functions. When I receive data like this, the first thing that came to mind was to "flatten" or unnest the columns.

I am using pandas and Python functions for this type of question. If you are worried about the speed of the above solutions, check out user3483203's answer, since it's using numpy and most of the time numpy is faster. I recommend Cython or numba if speed matters.


Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:

df.explode('B')

A B
0 1 1
1 1 2
0 2 1
1 2 2

Given a dataframe with an empty list or a NaN in the column. An empty list will not cause an issue, but a NaN will need to be filled with a list

df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode('B')

A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN

Method 1
apply + pd.Series (easy to understand but in terms of performance not recommended . )

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2

Method 2
Using repeat with DataFrame constructor , re-create your dataframe (good at performance, not good at multiple columns )

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2

Method 2.1
for example besides A we have A.1 .....A.n. If we still use the method(Method 2) above it is hard for us to re-create the columns one by one .

Solution : join or merge with the index after 'unnest' the single columns

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B',1),how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2

If you need the column order exactly the same as before, add reindex at the end.

s.join(df.drop('B',1),how='left').reindex(columns=df.columns)

Method 3
recreate the list

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2

If more than two columns, use

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]

Method 4
using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))

Method 5
when the list only contains unique values:

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2

Method 6
using numpy for high performance:

newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2

Method 7
using base function itertools cycle and chain: Pure python solution just for fun

from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2

Generalizing to multiple columns

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]

Self-def function:

def unnesting(df, explode):
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([
pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
df1.index = idx

return df1.join(df.drop(explode, 1), how='left')

unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2


Column-wise Unnesting

All above method is talking about the vertical unnesting and explode , If you do need expend the list horizontal, Check with pd.DataFrame constructor

df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4

Updated function

def unnesting(df, explode, axis):
if axis==1:
idx = df.index.repeat(df[explode[0]].str.len())
df1 = pd.concat([
pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
df1.index = idx

return df1.join(df.drop(explode, 1), how='left')
else :
df1 = pd.concat([
pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')

Test Output

unnesting(df, ['B','C'], axis=0)
Out[36]:
B0 B1 C0 C1 A
0 1 2 1 2 1
1 3 4 3 4 2

Update 2021-02-17 with original explode function

def unnesting(df, explode, axis):
if axis==1:
df1 = pd.concat([df[x].explode() for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')
else :
df1 = pd.concat([
pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
return df1.join(df.drop(explode, 1), how='left')

Unnest (explode) a Pandas Series

Using list + str.join and np.repeat -

pd.DataFrame(
{
'col1' : list(''.join(df.col1)),
'col2' : df.col2.values.repeat(df.col1.str.len(), axis=0)
})

col1 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3

A generalised solution for any number of columns is easily achievable, without much change to the solution -

i = list(''.join(df.col1))
j = df.drop('col1', 1).values.repeat(df.col1.str.len(), axis=0)

df = pd.DataFrame(j, columns=df.columns.difference(['col1']))
df.insert(0, 'col1', i)

df

col1 col2
0 a 1
1 s 1
2 d 1
3 f 1
4 x 2
5 y 2
6 q 3

Performance

df = pd.concat([df] * 100000, ignore_index=True)

# MaxU's solution

%%timeit
df.col1.str.extractall(r'(.)') \
.reset_index(level=1, drop=True) \
.join(df['col2']) \
.reset_index(drop=True)

1 loop, best of 3: 1.98 s per loop

# piRSquared's solution

%%timeit
pd.DataFrame(
[[x] + b for a, *b in df.values for x in a],
columns=df.columns
)

1 loop, best of 3: 1.68 s per loop

# Wen's solution

%%timeit
v = df.col1.apply(list)
pd.DataFrame({'col1':np.concatenate(v.values),'col2':df.col2.repeat(v.apply(len))})

1 loop, best of 3: 835 ms per loop

# Alexander's solution

%%timeit
pd.DataFrame([(letter, i)
for letters, i in zip(df['col1'], df['col2'])
for letter in letters],
columns=df.columns)

1 loop, best of 3: 316 ms per loop

%%timeit
pd.DataFrame(
{
'col1' : list(''.join(df.col1)),
'col2' : df.col2.values.repeat(df.col1.str.len(), axis=0)
})

10 loops, best of 3: 124 ms per loop

I tried timing Vaishali's, but it took too long on this dataset.

Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

pandas >= 0.25

Assuming all columns have the same number of lists, you can call Series.explode on each column.

df.set_index(['A']).apply(pd.Series.explode).reset_index()

A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8

The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.


It's also faster.

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
%%timeit
(df.set_index('A')
.apply(lambda x: x.apply(pd.Series).stack())
.reset_index()
.drop('level_1', 1))

2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

pandas DataFrame explode column contents

Here's how I would do it. First remove your 1 column (so we dont mess the naming):

df['id'] = df[1]
df = df.drop(1, axis = 1)

Then create an objs, with what we want to concat, and concat:

objs = [df, pd.DataFrame(df[0].tolist())]
pd.concat(objs, axis=1)

0 id 0 1 2
0 [1, 2, 3] 1 1 2 3
1 [2, 2, 1] 1 2 2 1
2 [1, 2, 1] 1 1 2 1

Can you separate a list of strings in a Pandas column to individual strings per row

You can do that using df.explode('colG')

Pandas explode multiple columns

You could set col1 as index and apply pd.Series.explode across the columns:

df.set_index('col1').apply(pd.Series.explode).reset_index()

Or:

df.apply(pd.Series.explode)

col1 col2 col3
0 aa 1 1.1
1 aa 2 2.2
2 aa 3 3.3
3 bb 4 4.4
4 bb 5 5.5
5 bb 6 6.6
6 cc 7 7.7
7 cc 8 8.8
8 cc 9 9.9
9 cc 7 7.7
10 cc 8 8.8
11 cc 9 9.9

How to convert dataframe columns with list of values into rows in Pandas DataFrame

You can use explode method chained like this,

df.explode('A').explode('B').explode('C').reset_index(drop=True)

A B C
0 X 1 aa
1 X 1 bb
2 X 1 cc
3 Y 2 xx
4 Y 2 yy

Alternatively, you can apply pd.Series.explode on the dataframe like this,

df.apply(pd.Series.explode).reset_index(drop=True)

In pandas 1.3+ you can use a list of columns to explode on,

So the code will look like,

df.explode(['A', 'B', 'C']).reset_index(drop=True)


Related Topics



Leave a reply



Submit