How to Split a Column of List-Values into Multiple Columns

Split a Pandas column of lists into multiple columns

You can use the DataFrame constructor with lists created by to_list:

import pandas as pd

d1 = {'teams': [['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],
['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG'],['SF', 'NYG']]}
df2 = pd.DataFrame(d1)
print (df2)
       teams
0  [SF, NYG]
1  [SF, NYG]
2  [SF, NYG]
3  [SF, NYG]
4  [SF, NYG]
5  [SF, NYG]
6  [SF, NYG]


df2[['team1','team2']] = pd.DataFrame(df2.teams.tolist(), index= df2.index)
print (df2)
       teams team1 team2
0  [SF, NYG]    SF   NYG
1  [SF, NYG]    SF   NYG
2  [SF, NYG]    SF   NYG
3  [SF, NYG]    SF   NYG
4  [SF, NYG]    SF   NYG
5  [SF, NYG]    SF   NYG
6  [SF, NYG]    SF   NYG

And for a new DataFrame:

df3 = pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
print (df3)
  team1 team2
0    SF   NYG
1    SF   NYG
2    SF   NYG
3    SF   NYG
4    SF   NYG
5    SF   NYG
6    SF   NYG

A solution with apply(pd.Series) is very slow:

#7k rows
df2 = pd.concat([df2]*1000).reset_index(drop=True)

In [121]: %timeit df2['teams'].apply(pd.Series)
1.79 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [122]: %timeit pd.DataFrame(df2['teams'].to_list(), columns=['team1','team2'])
1.63 ms ± 54.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Split list in a column to multiple columns

You could map ast.literal_eval to items in df2["1"]; build a DataFrame and join it to df1:

import ast
out = df1.join(pd.DataFrame(map(ast.literal_eval, df2["1"].tolist())).add_prefix('feature_'))

Output:

                          Text    Topic  feature_0  feature_1  feature_2
0  Where is the party tonight?    Party  -0.011571  -0.010117   0.062448
1                  Let's dance    Party  -0.082682  -0.001614   0.020942
2                  Hello world    Other  -0.063768  -0.015903   0.020942
3            It is rainy today  Weather   0.063796  -0.028781   0.056791
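The snippet above relies on df1 and df2 from the original question. A self-contained sketch with made-up sample data (the column name "1" matches the question; the numeric values are illustrative only):

```python
import ast

import pandas as pd

# Hypothetical stand-ins for the question's frames: df2["1"] holds
# string-encoded lists such as "[-0.01, -0.01, 0.06]".
df1 = pd.DataFrame({'Text': ["Where is the party tonight?", "Let's dance"],
                    'Topic': ['Party', 'Party']})
df2 = pd.DataFrame({'1': ['[-0.011571, -0.010117, 0.062448]',
                          '[-0.082682, -0.001614, 0.020942]']})

# Parse each string into a real list, build a DataFrame from the lists,
# prefix the resulting 0..n columns, and join back on the shared index
out = df1.join(pd.DataFrame(map(ast.literal_eval, df2['1'].tolist())).add_prefix('feature_'))
print(out)
```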

Splitting a list in a Pandas cell into multiple columns

You can loop through the Series with the apply() function and convert each list to a Series; this automatically expands the list into columns:

df[0].apply(pd.Series)

#    0   1   2
# 0  8  10  12
# 1  7   9  11

Update: To keep other columns of the data frame, you can concatenate the result with the columns you want to keep:

pd.concat([df[0].apply(pd.Series), df[1]], axis = 1)

#    0   1   2  1
# 0  8  10  12  A
# 1  7   9  11  B
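The answer above assumes a df that is not shown; a self-contained sketch reconstructed from the outputs (the sample values are an assumption):

```python
import pandas as pd

# Column 0 holds lists, column 1 holds a label column to keep
df = pd.DataFrame({0: [[8, 10, 12], [7, 9, 11]], 1: ['A', 'B']})

# apply(pd.Series) turns each list into a row of a new DataFrame,
# producing one column per list element
expanded = df[0].apply(pd.Series)

# Concatenate sideways to keep the other column(s)
result = pd.concat([expanded, df[1]], axis=1)
print(result)
```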

Pandas: split column of lists of unequal length into multiple columns

Try:

pd.DataFrame(df.codes.values.tolist()).add_prefix('code_')

   code_0   code_1   code_2
0   71020      NaN      NaN
1   77085      NaN      NaN
2   36415      NaN      NaN
3   99213  99287.0      NaN
4   99233  99233.0  99233.0

Include the index

pd.DataFrame(df.codes.values.tolist(), df.index).add_prefix('code_')

   code_0   code_1   code_2
1   71020      NaN      NaN
2   77085      NaN      NaN
3   36415      NaN      NaN
4   99213  99287.0      NaN
5   99233  99233.0  99233.0

We can nail down all the formatting with this:

f = lambda x: 'code_{}'.format(x + 1)
pd.DataFrame(
    df.codes.values.tolist(),
    df.index, dtype=object
).fillna('').rename(columns=f)

  code_1 code_2 code_3
1  71020
2  77085
3  36415
4  99213  99287
5  99233  99233  99233
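The df above is not shown; a self-contained sketch reconstructed from the outputs (the codes and the 1-based index are taken from the answer's printout):

```python
import pandas as pd

# Sample frame with ragged lists of codes, index starting at 1
df = pd.DataFrame({'codes': [[71020], [77085], [36415],
                             [99213, 99287], [99233, 99233, 99233]]},
                  index=range(1, 6))

# The DataFrame constructor pads shorter lists with NaN automatically;
# passing df.index keeps the original row labels
wide = pd.DataFrame(df.codes.tolist(), index=df.index).add_prefix('code_')
print(wide)
```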

Split one column into multiple columns using a delimiter

Use str.split to split the dates on the '+' delimiter, after converting the 'NULL' placeholder to NaN:

import numpy as np

df[['date', 'date2', 'date3']] = df['date'].replace('NULL', np.nan).str.split('+', expand=True)

and count to count the non-missing dates per row:

df['number of dates'] = df[['date', 'date2', 'date3']].count(axis=1)

print(df)

     ID  date date2 date3  number of dates
0  3009  2016  2017  None                2
1   129  2015  None  None                1
2   119  2014  2019  2020                3
3   120  2020  None  None                1
4   121   NaN   NaN   NaN                0
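The input df is not shown above; a self-contained sketch reconstructed from the output (the sample dates are an assumption):

```python
import numpy as np
import pandas as pd

# Dates joined by '+', with the string 'NULL' marking missing values
df = pd.DataFrame({'ID': [3009, 129, 119, 120, 121],
                   'date': ['2016+2017', '2015', '2014+2019+2020', '2020', 'NULL']})

# Convert 'NULL' to NaN so str.split leaves those rows as NaN,
# then split on '+' into up to three columns
df[['date', 'date2', 'date3']] = (df['date'].replace('NULL', np.nan)
                                            .str.split('+', expand=True))

# count(axis=1) counts non-missing cells per row
df['number of dates'] = df[['date', 'date2', 'date3']].count(axis=1)
print(df)
```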

How to split a pandas column with a list of dicts into separate columns for each key

  • The columns are lists of dicts.
    • Each dict in the list can be moved to a separate row by using pandas.explode().
    • Convert the column of dicts to a dataframe, where the keys are column headers and the values are observations, by using pandas.json_normalize(), then .join() this back to df.
  • Use .drop() to remove the unneeded column.
  • If the column contains lists of dicts that are strings (e.g. "[{key: value}]"), refer to this solution in Splitting dictionary/list inside a Pandas Column into Separate Columns, and use:
    • df.col2 = df.col2.apply(literal_eval), with from ast import literal_eval.
import pandas as pd

# create sample dataframe
df = pd.DataFrame({'col1': ['x', 'y'], 'col2': [[{"target": "NAge", "segment": "21 and older"}, {"target": "MinAge", "segment": "21"}, {"target": "Retargeting", "segment": "people who may be similar to their customers"}, {"target": "Region", "segment": "the United States"}], [{"target": "NAge", "segment": "18 and older"}, {"target": "Location Type", "segment": "HOME"}, {"target": "Interest", "segment": "Hispanic culture"}, {"target": "Interest", "segment": "Republican Party (United States)"}, {"target": "Location Granularity", "segment": "country"}, {"target": "Country", "segment": "the United States"}, {"target": "MinAge", "segment": 18}]]})

# display(df)
col1 col2
0 x [{'target': 'NAge', 'segment': '21 and older'}, {'target': 'MinAge', 'segment': '21'}, {'target': 'Retargeting', 'segment': 'people who may be similar to their customers'}, {'target': 'Region', 'segment': 'the United States'}]
1 y [{'target': 'NAge', 'segment': '18 and older'}, {'target': 'Location Type', 'segment': 'HOME'}, {'target': 'Interest', 'segment': 'Hispanic culture'}, {'target': 'Interest', 'segment': 'Republican Party (United States)'}, {'target': 'Location Granularity', 'segment': 'country'}, {'target': 'Country', 'segment': 'the United States'}, {'target': 'MinAge', 'segment': 18}]

# use explode to give each dict in a list a separate row
df = df.explode('col2').reset_index(drop=True)

# normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
df = df.join(pd.json_normalize(df.col2)).drop(columns=['col2'])

display(df)

   col1                target                                       segment
0     x                  NAge                                  21 and older
1     x                MinAge                                            21
2     x           Retargeting  people who may be similar to their customers
3     x                Region                             the United States
4     y                  NAge                                  18 and older
5     y         Location Type                                          HOME
6     y              Interest                              Hispanic culture
7     y              Interest              Republican Party (United States)
8     y  Location Granularity                                       country
9     y               Country                             the United States
10    y                MinAge                                            18

Get count

  • If the goal is to get the count for each 'target' and associated 'segment'
counts = df.groupby(['target', 'segment']).count()

Updated

  • This update applies the approach to the full file:
import pandas as pd
from ast import literal_eval

# load the file
df = pd.read_csv('en-US.csv')

# replace NaNs with '[]', otherwise literal_eval will error
df.targets = df.targets.fillna('[]')

# replace null with None, otherwise literal_eval will error
df.targets = df.targets.str.replace('null', 'None')

# convert the strings to lists of dicts
df.targets = df.targets.apply(literal_eval)

# use explode to give each dict in a list a separate row
df = df.explode('targets').reset_index(drop=True)

# fillna with {} is required for json_normalize
df.targets = df.targets.fillna({i: {} for i in df.index})

# normalize the column of dicts, join back to the remaining dataframe columns, and drop the unneeded column
normalized = pd.json_normalize(df.targets)

# get the counts
counts = normalized.groupby(['target', 'segment']).segment.count().reset_index(name='counts')
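The pipeline above reads 'en-US.csv', which is not available here; a self-contained sketch of the same steps against a tiny inline stand-in (the CSV contents are invented for illustration):

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

# Inline stand-in for 'en-US.csv': a 'targets' column of stringified
# lists of dicts, including an empty row to exercise the NaN handling
csv_data = '''id,targets
1,"[{'target': 'MinAge', 'segment': '21'}, {'target': 'Region', 'segment': 'the United States'}]"
2,
3,"[{'target': 'MinAge', 'segment': '21'}]"
'''
df = pd.read_csv(StringIO(csv_data))

df.targets = df.targets.fillna('[]')                 # literal_eval can't parse NaN
df.targets = df.targets.str.replace('null', 'None')  # JSON null -> Python None
df.targets = df.targets.apply(literal_eval)          # strings -> lists of dicts
df = df.explode('targets').reset_index(drop=True)    # one dict per row
df.targets = df.targets.fillna({i: {} for i in df.index})  # json_normalize needs dicts
normalized = pd.json_normalize(df.targets)

# Rows that came from empty lists normalize to NaN and drop out of the groupby
counts = normalized.groupby(['target', 'segment']).segment.count().reset_index(name='counts')
print(counts)
```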

Split/Parse Values in One Column and create multiple Columns in Python

Hope this will give you the solution you want.

Original Data:

df = pd.DataFrame({'A': ['Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small .Amount:4596.35  .Booked Date 2021-06-14']})

Replace ' .' (whitespace followed by a dot) with ':', then split on ':':

df = df['A'].replace(to_replace ='\s[.]', value = ':', regex = True).str.split(':', expand = True)

Final dataset; rename the columns as needed.

print(df)
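Since the answer stops before renaming, here is a self-contained sketch that finishes the job; the final column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Order ID:0001ACW120I .Record ID:01160000000UAxCCW .Type:Small .Amount:4596.35  .Booked Date 2021-06-14']})

# ' .' separates fields and ':' separates keys from values; normalizing
# ' .' to ':' makes one split yield alternating keys and values
parts = df['A'].replace(to_replace=r'\s[.]', value=':', regex=True).str.split(':', expand=True)

# Keep only the value columns and give them readable (hypothetical) names
result = parts.iloc[:, [1, 3, 5, 7, 8]].copy()
result.columns = ['Order ID', 'Record ID', 'Type', 'Amount', 'Booked Date']
print(result)
```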

