Split a Pandas Column of Lists into Multiple Columns

Split a Pandas column of lists into multiple columns

You can use the DataFrame constructor with lists created by to_list:

import pandas as pd

d1 = {'teams': [['SF', 'NYG'], ['SF', 'NYG'], ['SF', 'NYG'],
                ['SF', 'NYG'], ['SF', 'NYG'], ['SF', 'NYG'],
                ['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print(df2)
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]


df2[['team1', 'team2']] = pd.DataFrame(df2.teams.tolist(), index=df2.index)
print(df2)
       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2  [SF, NYG]    SF   NYG
3  [SF, NYG]    SF   NYG
4  [SF, NYG]    SF   NYG
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG

And for a new DataFrame:

df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
print(df3)
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

A solution with apply(pd.Series) is very slow:

# 7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [121]: %timeit df2['teams'].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
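If you want to reproduce the comparison yourself, here is a minimal, self-contained sketch using the stdlib timeit module; absolute figures will differ by machine, but the gap should stay large:

```python
import timeit

import pandas as pd

# Rebuild the 7k-row frame used in the timings above
df2 = pd.DataFrame({'teams': [['SF', 'NYG']] * 7000})

# timeit.timeit returns total seconds for `number` calls
slow = timeit.timeit(lambda: df2['teams'].apply(pd.Series), number=1)
fast = timeit.timeit(
    lambda: pd.DataFrame(df2['teams'].to_list(), columns=['team1', 'team2']),
    number=1,
)
print(f'apply(pd.Series): {slow:.3f} s, DataFrame(to_list): {fast:.5f} s')
```

The difference comes from apply(pd.Series) constructing a new Series object for every row, while the DataFrame constructor processes the whole list of lists at once.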

Split list in a column to multiple columns

You can map ast.literal_eval over the items of df2["1"] (each item is the string representation of a list), build a DataFrame from the parsed lists, and join it to df1:

import ast
out = df1.join(pd.DataFrame(map(ast.literal_eval, df2["1"].tolist())).add_prefix('feature_'))

Output:

                          Text    Topic  feature_0  feature_1  feature_2
0  Where is the party tonight?    Party  -0.011571  -0.010117   0.062448
1                  Let's dance    Party  -0.082682  -0.001614   0.020942
2                  Hello world    Other  -0.063768  -0.015903   0.020942
3            It is rainy today  Weather   0.063796  -0.028781   0.056791
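The frames df1 and df2 themselves are not shown in the answer; here is a self-contained sketch with hypothetical inputs reconstructed from the output, assuming df2["1"] stores each list as its string representation:

```python
import ast

import pandas as pd

# Hypothetical inputs: df1 holds the text/topic columns, df2["1"] holds
# each feature list serialized as a string
df1 = pd.DataFrame({
    'Text': ['Where is the party tonight?', "Let's dance"],
    'Topic': ['Party', 'Party'],
})
df2 = pd.DataFrame({'1': ['[-0.011571, -0.010117, 0.062448]',
                          '[-0.082682, -0.001614, 0.020942]']})

# ast.literal_eval safely parses each string back into a Python list
out = df1.join(
    pd.DataFrame(map(ast.literal_eval, df2['1'].tolist())).add_prefix('feature_')
)
print(out)
```

ast.literal_eval is the safe choice here: unlike eval, it only accepts Python literals.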

Splitting a list in a Pandas cell into multiple columns

You can loop through the Series with the apply() function and convert each list to a Series; this automatically expands the lists into columns:

df[0].apply(pd.Series)

#    0   1   2
# 0  8  10  12
# 1  7   9  11

Update: To keep other columns of the data frame, you can concatenate the result with the columns you want to keep:

pd.concat([df[0].apply(pd.Series), df[1]], axis=1)

#    0   1   2  1
# 0  8  10  12  A
# 1  7   9  11  B
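The frame df is not shown in the answer; here is a self-contained sketch (column names 0 and 1 are assumed from the output), including a rename step to avoid the duplicate 1 label that appears in the output above:

```python
import pandas as pd

# Hypothetical frame matching the output: column 0 holds lists,
# column 1 holds labels to keep
df = pd.DataFrame({0: [[8, 10, 12], [7, 9, 11]], 1: ['A', 'B']})

expanded = df[0].apply(pd.Series)
out = pd.concat([expanded, df[1]], axis=1)
print(out)  # note: two columns are both named 1

# Renaming first avoids the duplicate label
out2 = pd.concat([expanded.add_prefix('item_'), df[1].rename('label')], axis=1)
print(out2)
```

Duplicate column labels are legal in pandas but make later selection by name ambiguous, so renaming before the concat is usually worth the extra line.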

Split a koalas column of lists into multiple columns

One way I found that uses only functions operating on the workers, without collecting all the data to the driver:

df['teams'] \
    .astype(str) \
    .str.replace(r'\[|\]', '') \
    .str.split(pat=',', n=1, expand=True)

#     0    1
# 0  SF  NYG
# 1  SF  NYG
# 2  SF  NYG
# 3  SF  NYG
# 4  SF  NYG
# 5  SF  NYG
# 6  SF  NYG

I had to cast the column to string first because it held NumPy arrays, which PySpark cannot operate on.


To keep the other columns of the initial DataFrame alongside the split ones, you can use a simple concat:

import databricks.koalas as ks

ks.concat([
    df['teams'].astype(str).str.replace(r'\[|\]', '').str.split(pat=',', n=1, expand=True),
    df.drop(columns='teams')
], axis=1)

#     0    1  teams1
# 0  SF  NYG       2
# 1  SF  NYG       2
# 2  SF  NYG       1
# 3  SF  NYG       1
# 4  SF  NYG       7
# 5  SF  NYG       8
# 6  SF  NYG       6
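If you do not have a Spark cluster at hand, the same chain runs on plain pandas, since Koalas mirrors the pandas string API. A sketch with a hypothetical frame shaped like the example above; note that str() of a Python list keeps the quotes around each element, so the regex below also strips them:

```python
import pandas as pd

# Hypothetical frame shaped like the Koalas example
df = pd.DataFrame({'teams': [['SF', 'NYG']] * 3, 'teams1': [2, 2, 1]})

split = (df['teams']
         .astype(str)                              # "['SF', 'NYG']"
         .str.replace(r"[\[\]']", '', regex=True)  # drop brackets and quotes
         .str.split(pat=', ', n=1, expand=True))

out = pd.concat([split, df.drop(columns='teams')], axis=1)
print(out)
```

This string round-trip is fragile (it breaks if a team name contains a comma or quote), so on plain pandas the to_list()-based constructor shown earlier is the better choice; the trick is only worthwhile when the data must stay distributed.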

Pandas split a column of unequal length lists into multiple boolean columns

An alternative approach using str.get_dummies, which is much more efficient than apply + pd.Series (see the timings below):

df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)


       a      b      c      d      e
0   True   True  False  False  False
1  False  False   True  False  False
2   True   True  False   True  False
3  False  False  False  False   True

Timings:

df1.shape
(40000, 2)

%%timeit
df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)
286 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
pd.get_dummies(df1['col2'].apply(pd.Series).stack()).sum(level=0)
9.43 s ± 499 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
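The frame df1 is not shown in the answer; here is a self-contained sketch with hypothetical lists matching the boolean output above:

```python
import pandas as pd

# Hypothetical frame: col2 holds lists of labels of unequal length
df1 = pd.DataFrame({'col2': [['a', 'b'], ['c'], ['a', 'b', 'd'], ['e']]})

# join() turns each list into one comma-separated string, then
# get_dummies builds one indicator column per distinct label
out = df1['col2'].str.join(',').str.get_dummies(sep=',').astype(bool)
print(out)
```

The join/get_dummies pair works on whole string columns at once, which is why it beats the row-by-row apply(pd.Series) route by such a wide margin. The separator just needs to be a character that never appears inside a label.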

