How to unnest (explode) a column in a pandas DataFrame, into multiple rows

I know that object dtype columns make the data hard to work with using pandas functions. When I receive data like this, the first thing that comes to mind is to "flatten" or unnest the column.

I am using pandas and plain Python functions for this type of question. If you are worried about the speed of the solutions below, check out user3483203's answer, since it uses numpy and numpy is faster most of the time. I recommend Cython or numba if speed really matters.
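The examples below assume these imports and a small sample frame along the following lines (reconstructed from the outputs shown, so the exact values are illustrative; a few methods redefine df as they go):

import numpy as np
import pandas as pd

# a list-valued object column B next to a scalar column A
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})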


Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:

df.explode('B')

A B
0 1 1
0 1 2
1 2 1
1 2 2
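If you prefer a fresh 0..n-1 index instead of the repeated original index, pandas >= 1.1 also accepts ignore_index (a later addition, not part of the 0.25 API):

df.explode('B', ignore_index=True)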

Given a dataframe with an empty list or a NaN in the column: an empty list will not cause an issue, but a NaN needs to be replaced with an empty list first

df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode('B')

A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN

Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)

df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2

Method 2
Using repeat with the DataFrame constructor, re-create your dataframe (good performance, but not good with multiple columns)

df=pd.DataFrame({'A':df.A.repeat(df.B.str.len()),'B':np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2

Method 2.1
Suppose that besides A we also have A.1 ... A.n. If we still use Method 2 above, it is hard to re-create the columns one by one.

Solution: join or merge on the index after unnesting the single column

s=pd.DataFrame({'B':np.concatenate(df.B.values)},index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B', axis=1), how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2

If you need the column order exactly the same as before, add reindex at the end.

s.join(df.drop('B', axis=1), how='left').reindex(columns=df.columns)

Method 3
recreate the list

pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2

If there are more than two columns, use

s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]

Method 4
using reindex or loc

df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2

#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))

Method 5
when the lists only contain unique values:

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2

Method 6
using numpy for high performance:

newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2

Method 7
Using the itertools functions cycle and chain: a pure Python solution, just for fun

from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2

Generalizing to multiple columns

df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]

Self-defined function:

def unnesting(df, explode):
    # repeat the index by the list lengths, then rebuild the exploded columns
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')


unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2


Column-wise Unnesting

All of the methods above deal with vertical unnesting (exploding). If you need to expand the lists horizontally, use the pd.DataFrame constructor:

df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4

Updated function

def unnesting(df, explode, axis):
    if axis == 1:
        # vertical unnesting: repeat the index and rebuild the exploded columns
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
        return df1.join(df.drop(explode, axis=1), how='left')
    else:
        # horizontal unnesting: expand each list into its own prefixed columns
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')

Test Output

unnesting(df, ['B','C'], axis=0)
Out[36]:
B0 B1 C0 C1 A
0 1 2 1 2 1
1 3 4 3 4 2

Update 2021-02-17, using the built-in explode function

def unnesting(df, explode, axis):
    if axis == 1:
        # vertical unnesting via the built-in Series.explode
        df1 = pd.concat([df[x].explode() for x in explode], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')
    else:
        # horizontal unnesting: expand each list into its own prefixed columns
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')
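As a quick sanity check (a sketch, reusing the two-list-column df defined under "Generalizing to multiple columns"), the explode-based version should reproduce the earlier result:

unnesting(df, ['B', 'C'], axis=1)
#    B  C  A
# 0  1  1  1
# 0  2  2  1
# 1  3  3  2
# 1  4  4  2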

Efficient way to unnest (explode) multiple list columns in a pandas DataFrame

pandas >= 0.25

Assuming all columns hold lists of equal length row-wise, you can call Series.explode on each column.
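For reference, a minimal frame of the shape assumed here (reconstructed from the output below, so treat the exact values as illustrative):

df = pd.DataFrame({
    'A': ['x1', 'x2', 'x3', 'x4'],
    'B': [['v1', 'v2'], ['v3', 'v4'], ['v5', 'v6'], ['v7', 'v8']],
    'C': [['c1', 'c2'], ['c3', 'c4'], ['c5', 'c6'], ['c7', 'c8']],
    'D': [['d1', 'd2'], ['d3', 'd4'], ['d5', 'd6'], ['d7', 'd8']],
    'E': [['e1', 'e2'], ['e3', 'e4'], ['e5', 'e6'], ['e7', 'e8']],
})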

df.set_index(['A']).apply(pd.Series.explode).reset_index()

A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8

The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
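For instance, with a second identifier column (a hypothetical key column), you would list both columns in set_index so that neither gets exploded:

df.set_index(['A', 'key']).apply(pd.Series.explode).reset_index()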


It's also faster.

%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()

%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', axis=1))


2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Splitting multiple columns into rows in pandas dataframe
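The input is assumed to look roughly like this (reconstructed from the output below; value and date hold comma-separated strings, possibly with surrounding whitespace):

df = pd.DataFrame({
    'ticker': ['aa', 'bb'],
    'account': ['assets', 'liabilities'],
    'value': ['100 , 200', '50 , 150'],
    'date': ['20121231 , 20131231', '20141231 , 20131231'],
})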

You can first split the columns, create Series with stack, and remove the whitespace with strip:

s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)

Then concat both Series into df1:

df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])

Remove the old value and date columns and join:

print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
ticker account value date
0 aa assets 100 20121231
1 aa assets 200 20131231
2 bb liabilities 50 20141231
3 bb liabilities 150 20131231

Explode cell containing dict into multiple rows in Pandas

You can use explode after extracting the keys with keys(). keys() returns a dict view, so I unpack it with [*...] to turn it into a list, which is a format explode handles. You can also use list():

df = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'CODES' : [{"1407273790":5,"1801032636":20,"1174813554":1,"1215470448":2,"1053754655":4,"1891751228":1},
{"1497066526":19,"1639360563":16,"1235107087":11,"1033522925":18},
{"1154348191":8,"1568410355":4},]})
df['CODES'] = df['CODES'].apply(lambda x: [*x.keys()]) # or lambda x: list(x.keys()))
df = df.explode('CODES')
df
Out[1]:
ID CODES
0 A 1407273790
0 A 1801032636
0 A 1174813554
0 A 1215470448
0 A 1053754655
0 A 1891751228
1 B 1497066526
1 B 1639360563
1 B 1235107087
1 B 1033522925
2 C 1154348191
2 C 1568410355

Per SammyWemmy's comment, you can compare the performance of the two approaches with:

%timeit df['CODES'].apply(lambda x: list(x.keys()))
%timeit [entry.keys() for entry in df.CODES]

How to do explode and keep a fair proportion of number value for each new row Pandas

You can explode 'A', then group by the index and transform with 'count' (to count how many rows each original index produced), and divide the elements in 'B' by the corresponding count.
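The input frame here is assumed to be something like the following (reconstructed from the output below, so treat it as illustrative):

df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [3, 4]], 'B': [1, 1, 1]})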

out = df.explode('A')
out['B'] /= out['B'].groupby(level=0).transform('count')

Output:

     A         B
0    0  0.333333
0    1  0.333333
0    2  0.333333
1  foo  1.000000
2    3  0.500000
2    4  0.500000

Splitting a pandas column on multiple delimiters and then converting them into rows
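The value column is assumed to hold strings of str/value pairs like the following (reconstructed from the outputs below):

df = pd.DataFrame({
    'ID': [1, 2],
    'value': ['str1/value1,str2/value21,value22,value23,str3/value31,value32',
              'str4/value4,str5/value51,value52,value53,str6/value61,value62'],
})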

IIUC, you could use a double split with a regex:

df['str'] = df['value'].str.split(r',(?=[^/,]+/)')
df = df.explode('str')
df[['str', 'value']] = df['str'].str.split('/', expand=True)

output:

   ID                    value   str
0   1                   value1  str1
0   1  value21,value22,value23  str2
0   1          value31,value32  str3
1   2                   value4  str4
1   2  value51,value52,value53  str5
1   2          value61,value62  str6

Or, generating a new dataframe, with extractall and named capturing groups + a join:

df2 = (df[['ID']]
       .join(df['value']
             .str.extractall(r'(?P<str>[^,/]+)/(?P<value>[^/]+?)(?=,[^,]+/|$)')
             .droplevel(1))
      )

output:

   ID   str                    value
0   1  str1                   value1
0   1  str2  value21,value22,value23
0   1  str3          value31,value32
1   2  str4                   value4
1   2  str5  value51,value52,value53
1   2  str6          value61,value62



