How to unnest (explode) a column in a pandas DataFrame, into multiple rows
I know object
dtype columns make the data hard to convert with pandas functions. When I receive data like this, the first thing that comes to mind is to "flatten" or unnest the columns.
I am using pandas and Python functions for this type of question. If you are worried about the speed of the solutions below, check out user3483203's answer, since it uses numpy, and most of the time numpy is faster. I recommend Cython or numba if speed really matters.
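The examples below assume the usual aliases and a small sample frame; this is my reconstruction of the data implied by the outputs that follow:

```python
import numpy as np
import pandas as pd

# Sample data assumed by the examples below:
# column B holds Python lists, so it has object dtype.
df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
print(df.B.dtype)  # object
```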
Method 0 [pandas >= 0.25]
Starting from pandas 0.25, if you only need to explode one column, you can use the pandas.DataFrame.explode function:
df.explode('B')
   A  B
0  1  1
0  1  2
1  2  1
1  2  2
Given a dataframe with an empty list or a NaN in the column: an empty list will not cause an issue, but a NaN will need to be filled with a list first
df = pd.DataFrame({'A': [1, 2, 3, 4],'B': [[1, 2], [1, 2], [], np.nan]})
df.B = df.B.fillna({i: [] for i in df.index}) # replace NaN with []
df.explode('B')
A B
0 1 1
0 1 2
1 2 1
1 2 2
2 3 NaN
3 4 NaN
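Since pandas 1.1, explode also accepts ignore_index=True if you would rather get a fresh 0..n-1 index than the repeated original index (a small convenience on top of the method above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [1, 2]]})
out = df.explode('B', ignore_index=True)  # requires pandas >= 1.1
print(out)
#    A  B
# 0  1  1
# 1  1  2
# 2  2  1
# 3  2  2
```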
Method 1
apply + pd.Series (easy to understand, but not recommended in terms of performance)
df.set_index('A').B.apply(pd.Series).stack().reset_index(level=0).rename(columns={0:'B'})
Out[463]:
A B
0 1 1
1 1 2
0 2 1
1 2 2
Method 2
Using repeat with the DataFrame constructor to re-create your dataframe (good performance, but not ideal for multiple columns):
df = pd.DataFrame({'A': df.A.repeat(df.B.str.len()), 'B': np.concatenate(df.B.values)})
df
Out[465]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
Method 2.1
For example, besides A we might have A.1, ..., A.n. If we still used Method 2 above, it would be hard to re-create the columns one by one.
Solution: join or merge with the index after 'unnesting' the single column:
s = pd.DataFrame({'B': np.concatenate(df.B.values)}, index=df.index.repeat(df.B.str.len()))
s.join(df.drop('B', axis=1), how='left')
Out[477]:
B A
0 1 1
0 2 1
1 1 2
1 2 2
If you need the column order exactly the same as before, add reindex
at the end.
s.join(df.drop('B', axis=1), how='left').reindex(columns=df.columns)
Method 3
Re-create the list with a comprehension:
pd.DataFrame([[x] + [z] for x, y in df.values for z in y],columns=df.columns)
Out[488]:
A B
0 1 1
1 1 2
2 2 1
3 2 2
If there are more than two columns, use
s=pd.DataFrame([[x] + [z] for x, y in zip(df.index,df.B) for z in y])
s.merge(df,left_on=0,right_index=True)
Out[491]:
0 1 A B
0 0 1 1 [1, 2]
1 0 2 1 [1, 2]
2 1 1 2 [1, 2]
3 1 2 2 [1, 2]
Method 4
Using reindex or loc:
df.reindex(df.index.repeat(df.B.str.len())).assign(B=np.concatenate(df.B.values))
Out[554]:
A B
0 1 1
0 1 2
1 2 1
1 2 2
#df.loc[df.index.repeat(df.B.str.len())].assign(B=np.concatenate(df.B.values))
Method 5
when the list only contains unique values:
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]]})
from collections import ChainMap
d = dict(ChainMap(*map(dict.fromkeys, df['B'], df['A'])))
pd.DataFrame(list(d.items()),columns=df.columns[::-1])
Out[574]:
B A
0 1 1
1 2 1
2 3 2
3 4 2
Method 6
Using numpy for high performance:
newvalues=np.dstack((np.repeat(df.A.values,list(map(len,df.B.values))),np.concatenate(df.B.values)))
pd.DataFrame(data=newvalues[0],columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Method 7
Using the base-library itertools functions cycle and chain: a pure-Python solution, just for fun
from itertools import cycle,chain
l=df.values.tolist()
l1=[list(zip([x[0]], cycle(x[1])) if len([x[0]]) > len(x[1]) else list(zip(cycle([x[0]]), x[1]))) for x in l]
pd.DataFrame(list(chain.from_iterable(l1)),columns=df.columns)
A B
0 1 1
1 1 2
2 2 1
3 2 2
Generalizing to multiple columns
df=pd.DataFrame({'A':[1,2],'B':[[1,2],[3,4]],'C':[[1,2],[3,4]]})
df
Out[592]:
A B C
0 1 [1, 2] [1, 2]
1 2 [3, 4] [3, 4]
Self-defined function:
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
unnesting(df,['B','C'])
Out[609]:
B C A
0 1 1 1
0 2 2 1
1 3 3 2
1 4 4 2
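For completeness: on pandas >= 1.3 the multi-column case handled by the helper above is also covered natively, since explode accepts a list of columns (a modern equivalent, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [[1, 2], [3, 4]], 'C': [[1, 2], [3, 4]]})

# Lists in B and C must have equal lengths row-wise.
out = df.explode(['B', 'C'])  # requires pandas >= 1.3
print(out)
#    A  B  C
# 0  1  1  1
# 0  1  2  2
# 1  2  3  3
# 1  2  4  4
```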
Column-wise Unnesting
All the methods above cover vertical unnesting (explode). If you need to expand the lists horizontally, use the pd.DataFrame constructor:
df.join(pd.DataFrame(df.B.tolist(),index=df.index).add_prefix('B_'))
Out[33]:
A B C B_0 B_1
0 1 [1, 2] [1, 2] 1 2
1 2 [3, 4] [3, 4] 3 4
Updated function:
def unnesting(df, explode, axis):
    if axis == 1:
        idx = df.index.repeat(df[explode[0]].str.len())
        df1 = pd.concat([
            pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
        df1.index = idx
        return df1.join(df.drop(explode, axis=1), how='left')
    else:
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')
Test Output
unnesting(df, ['B','C'], axis=0)
Out[36]:
B0 B1 C0 C1 A
0 1 2 1 2 1
1 3 4 3 4 2
Update 2021-02-17, with the original explode function:
def unnesting(df, explode, axis):
    if axis == 1:
        df1 = pd.concat([df[x].explode() for x in explode], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')
    else:
        df1 = pd.concat([
            pd.DataFrame(df[x].tolist(), index=df.index).add_prefix(x) for x in explode], axis=1)
        return df1.join(df.drop(explode, axis=1), how='left')
Efficient way to unnest (explode) multiple list columns in a pandas DataFrame
pandas >= 0.25
Assuming all columns contain lists of the same length row-wise, you can call Series.explode on each column.
df.set_index(['A']).apply(pd.Series.explode).reset_index()
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
It's also faster.
%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
(df.set_index('A')
 .apply(lambda x: x.apply(pd.Series).stack())
 .reset_index()
 .drop('level_1', axis=1))
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Splitting multiple columns into rows in pandas dataframe
You can first split the columns, create a Series with stack, and remove whitespace with strip:
s1 = df.value.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
s2 = df.date.str.split(',', expand=True).stack().str.strip().reset_index(level=1, drop=True)
Then concat both Series into df1:
df1 = pd.concat([s1,s2], axis=1, keys=['value','date'])
Remove the old columns value and date, and join:
print (df.drop(['value','date'], axis=1).join(df1).reset_index(drop=True))
ticker account value date
0 aa assets 100 20121231
1 aa assets 200 20131231
2 bb liabilities 50 20141231
3 bb liabilities 150 20131231
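On pandas >= 1.3 there is also a shorter route, since DataFrame.explode accepts a list of columns. A sketch, with the sample frame reconstructed from the output above:

```python
import pandas as pd

df = pd.DataFrame({'ticker': ['aa', 'bb'],
                   'account': ['assets', 'liabilities'],
                   'value': ['100, 200', '50, 150'],
                   'date': ['20121231, 20131231', '20141231, 20131231']})

# Split both string columns into lists, then explode them together
# (the lists must have equal lengths row-wise).
out = (df.assign(value=df['value'].str.split(r'\s*,\s*'),
                 date=df['date'].str.split(r'\s*,\s*'))
         .explode(['value', 'date'])  # requires pandas >= 1.3
         .reset_index(drop=True))
print(out)
```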
Explode cell containing dict into multiple rows in Pandas
You can use explode after getting the keys(). dict.keys() returns a view object rather than a list, so I wrap it with [*] to convert it to a list, which is the format required for explode. You can also use list():
df = pd.DataFrame({'ID' : ['A', 'B', 'C'],
'CODES' : [{"1407273790":5,"1801032636":20,"1174813554":1,"1215470448":2,"1053754655":4,"1891751228":1},
{"1497066526":19,"1639360563":16,"1235107087":11,"1033522925":18},
{"1154348191":8,"1568410355":4},]})
df['CODES'] = df['CODES'].apply(lambda x: [*x.keys()]) # or lambda x: list(x.keys()))
df = df.explode('CODES')
df
Out[1]:
ID CODES
0 A 1407273790
0 A 1801032636
0 A 1174813554
0 A 1215470448
0 A 1053754655
0 A 1891751228
1 B 1497066526
1 B 1639360563
1 B 1235107087
1 B 1033522925
2 C 1154348191
2 C 1568410355
Per SammyWemmy's comment, you can compare the performance of the two methods with:
%timeit df['CODES'].apply(lambda x: list(x.keys()))
%timeit [entry.keys() for entry in df.CODES]
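If you need the dict values as well as the keys, one sketch (my extension, not part of the original answer) is to explode the items() pairs and then split them into two columns:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B'],
                   'CODES': [{'1407273790': 5, '1801032636': 20},
                             {'1497066526': 19}]})

# Turn each dict into a list of (key, value) pairs, then explode.
out = (df.assign(CODES=df['CODES'].apply(lambda d: list(d.items())))
         .explode('CODES', ignore_index=True))
# Split the (key, value) tuples into two new columns.
pairs = pd.DataFrame(out['CODES'].tolist(), columns=['CODE', 'COUNT'])
out = pd.concat([out.drop(columns='CODES'), pairs], axis=1)
print(out)
#   ID        CODE  COUNT
# 0  A  1407273790      5
# 1  A  1801032636     20
# 2  B  1497066526     19
```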
How to do explode and keep a fair proportion of number value for each new row Pandas
You can explode "A"; then groupby the index and transform with count (to count how many rows each original index produced), and divide the elements of 'B' by their corresponding count.
out = df.explode('A')
out['B'] /= out['B'].groupby(level=0).transform('count')
Output:
A B
0 0 0.333333
0 1 0.333333
0 2 0.333333
1 foo 1.000000
2 3 0.500000
2 4 0.500000
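A self-contained sketch of the above; the input frame is my reconstruction from the output shown:

```python
import pandas as pd

df = pd.DataFrame({'A': [[0, 1, 2], ['foo'], [3, 4]], 'B': [1, 1, 1]})

out = df.explode('A')
# Each exploded group shares its original index label, so counting
# per index level splits B evenly across the new rows.
out['B'] /= out['B'].groupby(level=0).transform('count')
print(out)
```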
Splitting a pandas column on multiple delimiters and then converting them into rows
IIUC, you could use a double split with a regex:
df['str'] = df['value'].str.split(r',(?=[^/,]+/)')
df = df.explode('str')
df[['str', 'value']] = df['str'].str.split('/', expand=True)
output:
ID value str
0 1 value1 str1
0 1 value21,value22,value23 str2
0 1 value31,value32 str3
1 2 value4 str4
1 2 value51,value52,value53 str5
1 2 value61,value62 str6
Or, generating a new dataframe with extractall, named capturing groups, and a join:
df2 = (df[['ID']]
.join(df['value']
.str.extractall(r'(?P<str>[^,/]+)/(?P<value>[^/]+?)(?=,[^,]+/|$)')
.droplevel(1))
)
output:
ID str value
0 1 str1 value1
0 1 str2 value21,value22,value23
0 1 str3 value31,value32
1 2 str4 value4
1 2 str5 value51,value52,value53
1 2 str6 value61,value62
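For reference, a self-contained version of the extractall approach; the input frame is my reconstruction from the two outputs above:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2],
    'value': ['str1/value1,str2/value21,value22,value23,str3/value31,value32',
              'str4/value4,str5/value51,value52,value53,str6/value61,value62'],
})

# Each match is "<str>/<value...>"; the lookahead stops the lazy value
# group right before the next "<str>/" token or at the end of the string.
df2 = (df[['ID']]
       .join(df['value']
             .str.extractall(r'(?P<str>[^,/]+)/(?P<value>[^/]+?)(?=,[^,]+/|$)')
             .droplevel(1)))
print(df2)
```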