How to Split a Column of Tuples in a Pandas Dataframe

How can I split a column of tuples in a Pandas dataframe?

You can do this by calling pd.DataFrame(col.tolist()) on that column:

In [2]: df = pd.DataFrame({'a':[1,2], 'b':[(1,2), (3,4)]})

In [3]: df
Out[3]:
   a       b
0  1  (1, 2)
1  2  (3, 4)

In [4]: df['b'].tolist()
Out[4]: [(1, 2), (3, 4)]

In [5]: pd.DataFrame(df['b'].tolist(), index=df.index)
Out[5]:
   0  1
0  1  2
1  3  4

In [6]: df[['b1', 'b2']] = pd.DataFrame(df['b'].tolist(), index=df.index)

In [7]: df
Out[7]:
   a       b  b1  b2
0  1  (1, 2)   1   2
1  2  (3, 4)   3   4
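
The index=df.index argument matters once the dataframe no longer has the default 0..n-1 index, because pandas aligns on index labels when assigning a column. A minimal sketch, assuming a hypothetical index of [10, 20] (not part of the original example):

```python
import pandas as pd

# Same data as above, but with a non-default index (an assumption for illustration)
df = pd.DataFrame({'a': [1, 2], 'b': [(1, 2), (3, 4)]}, index=[10, 20])

aligned = pd.DataFrame(df['b'].tolist(), index=df.index)  # index 10, 20
unaligned = pd.DataFrame(df['b'].tolist())                # index 0, 1

# Column assignment aligns on index labels
df['b1'] = aligned[0]              # labels match, values are kept
df['b1_unaligned'] = unaligned[0]  # no matching labels, so every value is NaN

print(df)
```

Passing the index explicitly keeps the split values attached to the rows they came from.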

Note: an earlier version of this answer recommended df['b'].apply(pd.Series) instead of pd.DataFrame(df['b'].tolist(), index=df.index). That also works (it turns each tuple into a Series, which then becomes a row of the resulting dataframe), but it is slower and uses more memory than the tolist version, as noted by the other answers here (thanks to denfromufa).
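
To check that the two approaches agree on a given dataframe, the results can be compared directly. This is only a small sanity-check sketch on the example data, not a benchmark:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [(1, 2), (3, 4)]})

via_tolist = pd.DataFrame(df['b'].tolist(), index=df.index)
via_apply = df['b'].apply(pd.Series)

# Compare the underlying values only; the column index type may differ
same_values = via_tolist.to_numpy().tolist() == via_apply.to_numpy().tolist()
print(same_values)
```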

Split tuples columns in pandas dataframe

You can use:

df = pd.DataFrame(data={0: ['Neck', 'RShoulder', 'LShoulder', 'RElbow', 'RWrist', 'LElbow'],
                        1: [None, None, (840, 183), None, None, (936, 255)]})

df[['new_col_1', 'new_col_2']] = df[1].apply(pd.Series)

Output:

           0           1  new_col_1  new_col_2
0       Neck        None        NaN        NaN
1  RShoulder        None        NaN        NaN
2  LShoulder  (840, 183)      840.0      183.0
3     RElbow        None        NaN        NaN
4     RWrist        None        NaN        NaN
5     LElbow  (936, 255)      936.0      255.0
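
If apply(pd.Series) is too slow on a large frame, one possible alternative is to split only the rows that actually hold a tuple and join the result back. This is a sketch of a different technique from the answer above; the names mask, parts and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame(data={0: ['Neck', 'RShoulder', 'LShoulder', 'RElbow', 'RWrist', 'LElbow'],
                        1: [None, None, (840, 183), None, None, (936, 255)]})

# Rows where column 1 holds a tuple rather than None
mask = df[1].notna()

# Split only those rows, keeping their original index labels
parts = pd.DataFrame(df.loc[mask, 1].tolist(),
                     index=df.index[mask],
                     columns=['new_col_1', 'new_col_2'])

# join aligns on the index, so rows without a tuple get NaN
result = df.join(parts)
print(result)
```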

Pandas splitting Columns and creating Columns of tuples

Code

def merge(row):
    return pd.Series({
        "colAA": (row.colB, row.colC),
        "colBB": (row.colC, row.colA),
    })

df['colB'] = df['colB'].str.split(';')
df = df.explode('colB')
newDf = df.apply(merge, axis=1).reset_index(drop=True)

Explanation

First split colB on ';' to get a list of values, then call explode so that each value gets its own row:

df['colB'] = df['colB'].str.split(';')
df = df.explode('colB')

# output
   colA colB colC
0   rqp  129    a
1   pot  217    u
1   pot  345    u
2  ghay  716    b
3  rbba  217    d

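Then define a merge function that builds the two tuple columns for each row. After explode, colB already holds a single value, so no further splitting is needed:

def merge(row):
    return pd.Series({
        "colAA": (row.colB, row.colC),
        "colBB": (row.colC, row.colA),
    })

Then apply this function to the dataframe row by row: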

newDf = df.apply(merge, axis=1).reset_index(drop=True)

# output
      colAA      colBB
0  (129, a)   (a, rqp)
1  (217, u)   (u, pot)
2  (345, u)   (u, pot)
3  (716, b)  (b, ghay)
4  (217, d)  (d, rbba)
5  (345, d)  (d, rbba)
6  (612, a)  (a, tary)
7  (811, a)  (a, tary)
8  (760, a)  (a, tary)
9  (716, t)  (t, kals)
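
The row-wise apply can probably be avoided by zipping the columns together after explode. The sketch below uses a small hypothetical input frame based on the first rows shown above; it is an alternative to the merge function, not the original answer's method:

```python
import pandas as pd

# Hypothetical input, reconstructed from the first rows of the example
df = pd.DataFrame({'colA': ['rqp', 'pot'],
                   'colB': ['129', '217;345'],
                   'colC': ['a', 'u']})

# Split colB into lists, then give every list element its own row
exploded = (df.assign(colB=df['colB'].str.split(';'))
              .explode('colB')
              .reset_index(drop=True))

# Build the tuple columns by pairing values column-wise
newDf = pd.DataFrame({
    'colAA': list(zip(exploded['colB'], exploded['colC'])),
    'colBB': list(zip(exploded['colC'], exploded['colA'])),
})
print(newDf)
```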

Splitting strings of tuples of different lengths to columns in Pandas DF

You can do it this way. Rows with fewer values get None in the positions that are missing. You can then attach df1 to df (for example with pd.concat along axis=1).

import ast

import pandas as pd

d = {'id': [1, 2, 3],
     'human_id': ["('apples', '2022-12-04', 'a5ted')",
                  "('bananas', '2012-2-14')",
                  "('2012-2-14', 'reda21', 'ss')"]}

df = pd.DataFrame(data=d)

# Parse each string into a real tuple.
# ast.literal_eval only accepts Python literals, which makes it safer than eval.
newList = [ast.literal_eval(val) for val in df['human_id']]

df1 = pd.DataFrame(newList, columns=['col1', 'col2', 'col3'])

print(df1)

Output

        col1        col2   col3
0     apples  2022-12-04  a5ted
1    bananas   2012-2-14   None
2  2012-2-14      reda21     ss
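
Attaching the split columns to the original dataframe could look like the sketch below. It generates the column names from the widest tuple instead of hard-coding three names; the col1, col2, ... naming and the result variable are assumptions for illustration:

```python
import ast

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'human_id': ["('apples', '2022-12-04', 'a5ted')",
                                "('bananas', '2012-2-14')",
                                "('2012-2-14', 'reda21', 'ss')"]})

# Parse the strings into tuples (literals only, no arbitrary code execution)
parsed = [ast.literal_eval(val) for val in df['human_id']]

# Shorter tuples are padded with missing values
df1 = pd.DataFrame(parsed, index=df.index)
df1.columns = [f'col{i + 1}' for i in df1.columns]

# Place the new columns next to the original ones
result = pd.concat([df, df1], axis=1)
print(result)
```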

Pandas: How to split a column of string of multiple tuples to multiple columns of individual string of tuple

You can use str.extract() with regex, as follows:

df['data'].str.extract(r'(\(\d+,\s*\d+\))\s*,\s*(\(\d+,\s*\d+\))')

or use str.split(), as follows:

df['data'].str.split(r'(?<=\))\s*,\s*', expand=True)

Here the regex uses a positive lookbehind, (?<=\)), so a comma only matches when a closing parenthesis ) comes before it. As a result, the split happens on the commas between tuples and not on the commas inside them.

Result:

       0      1
0  (0,1)  (1,2)
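
Both variants can be tried on a small self-contained frame. The sample strings below are assumptions chosen to match the patterns, and the final conversion to real tuples with ast.literal_eval is an optional extra step, not part of the answer above:

```python
import ast

import pandas as pd

# Hypothetical sample data: each cell is a string holding two tuples
df = pd.DataFrame({'data': ['(0,1), (1,2)', '(3, 4),(5, 6)']})

# Variant 1: capture each tuple with its own group
extracted = df['data'].str.extract(r'(\(\d+,\s*\d+\))\s*,\s*(\(\d+,\s*\d+\))')

# Variant 2: split only on commas that follow a closing parenthesis
split = df['data'].str.split(r'(?<=\))\s*,\s*', expand=True)

# Optionally turn the string pieces into real Python tuples
as_tuples = split[0].map(ast.literal_eval)

print(split)
print(as_tuples.tolist())
```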

Split a Pandas column with lists of tuples into separate columns

Try explode followed by apply(pd.Series), then merge the result back into the DataFrame:

import pandas as pd

df = pd.DataFrame({'ID': ['A', 'B', 'C'],
                   'col': [[('123', '456', '111', False),
                            ('124', '456', '111', True),
                            ('125', '456', '111', False)],
                           [],
                           [('123', '555', '333', True)]]
                   })
# Explode into Rows
new_df = df.explode('col').reset_index(drop=True)

# Merge Back Together
new_df = new_df.merge(
    # Turn into Multiple Columns
    new_df['col'].apply(pd.Series),
    left_index=True,
    right_index=True) \
    .drop(columns=['col'])  # Drop Old Col Column

# Rename Columns
new_df.columns = ['ID', 'col1', 'col2', 'col3', 'col4']

# For Display
print(new_df)

Output:

  ID col1 col2 col3   col4
0  A  123  456  111  False
1  A  124  456  111   True
2  A  125  456  111  False
3  B  NaN  NaN  NaN    NaN
4  C  123  555  333   True

How to split tuples in all columns of a dataframe

Considering the following toy dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        0: {
            0: None,
            1: None,
            2: None,
            3: ("bartenbach gmbh rinner strasse 14 aldrans", 96, 1050),
            4: (
                "ait austrian institute of technology gmbh giefinggasse 4 wien",
                70,
                537,
            ),
        },
        1: {0: None, 1: None, 2: None, 3: None, 4: None},
        2: {0: None, 1: None, 2: None, 3: None, 4: None},
    }
)

print(df)
# Outputs
                                                   0     1     2
0                                               None  None  None
1                                               None  None  None
2                                               None  None  None
3  (bartenbach gmbh rinner strasse 14 aldrans, 96...  None  None
4  (ait austrian institute of technology gmbh gie...  None  None

You could iterate over each column, then over each value, unpack the tuple and populate a new dataframe, like this:

new_df = pd.DataFrame()

# df.items() replaces df.iteritems(), which was removed in pandas 2.0
for col_num, series in df.items():
    for i, value in enumerate(series.values):
        try:
            name, score, id_num = value
            new_df.loc[i, f"Name{col_num}"] = name
            new_df.loc[i, f"Score{col_num}"] = score
            new_df.loc[i, f"ID{col_num}"] = id_num
        except TypeError:
            # None cannot be unpacked, so skip empty cells
            continue
new_df = new_df.reset_index(drop=True)

print(new_df)
# Outputs
                                               Name0  Score0     ID0
0          bartenbach gmbh rinner strasse 14 aldrans    96.0  1050.0
1  ait austrian institute of technology gmbh gief...    70.0   537.0
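
Filling a dataframe cell by cell with .loc can get slow as the data grows. A possible alternative, sketched here on a shortened hypothetical version of the toy data, is to drop the empty cells per column and build each block with tolist. Unlike the loop above, this keeps the scores and IDs as integers:

```python
import pandas as pd

# Shortened, hypothetical version of the toy dataframe
df = pd.DataFrame({
    0: [None, None, ("bartenbach gmbh", 96, 1050), ("ait austrian institute", 70, 537)],
    1: [None, None, None, None],
})

parts = []
for col_num, series in df.items():
    valid = series.dropna()  # keep only the cells that hold a tuple
    if valid.empty:
        continue             # nothing to split in this column
    parts.append(pd.DataFrame(
        valid.tolist(),
        index=valid.index,
        columns=[f"Name{col_num}", f"Score{col_num}", f"ID{col_num}"],
    ))

# Put the per-column blocks side by side and renumber the rows
new_df = pd.concat(parts, axis=1).reset_index(drop=True)
print(new_df)
```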

How can I split pandas dataframe into column of tuple, quickly?

One idea is to use a list comprehension:

s = pd.Series('a_1, a_2, a_3, b_1'.split(', '))
#4k rows
s = pd.concat([s] * 1000, ignore_index=True)

In [195]: %timeit s.str.split("_").apply(tuple)
2.49 ms ± 41.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [196]: %timeit [tuple(x.split('_')) for x in s]
1.46 ms ± 79.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [197]: %timeit pd.Index(s).str.split("_", expand=True).tolist()
4.31 ms ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


s = pd.Series('a_1, a_2, a_3, b_1'.split(', '))
#400k rows
s = pd.concat([s] * 100000, ignore_index=True)

In [199]: %timeit s.str.split("_").apply(tuple)
252 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [200]: %timeit [tuple(x.split('_')) for x in s]
180 ms ± 370 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [201]: %timeit pd.Index(s).str.split("_", expand=True).tolist()
379 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
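
Before comparing timings it is worth confirming that the three variants return the same tuples. A short check on the four-element series, with illustrative variable names:

```python
import pandas as pd

s = pd.Series('a_1, a_2, a_3, b_1'.split(', '))

via_apply = s.str.split('_').apply(tuple).tolist()
via_comprehension = [tuple(x.split('_')) for x in s]
# expand=True on an Index yields a MultiIndex, whose tolist() gives tuples
via_index = pd.Index(s).str.split('_', expand=True).tolist()

print(via_comprehension)
print(via_apply == via_comprehension == via_index)
```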

Split several columns with tuples into separate columns

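Here's a solution, assuming spl is a dataframe with two-level column labels whose cells are 2-tuples, and that numpy is imported as np: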

new_df = pd.concat([pd.DataFrame(spl[c].tolist()).add_prefix(c[-1]) for c in spl], axis=1)
new_df.columns = pd.MultiIndex.from_arrays([np.repeat(spl.columns.get_level_values(0), 2), new_df.columns])

Output:

>>> new_df
   a           e
  b0 b1 c0 c1 b0 b1
0  0  1  0  1  0  1
1  1  2  2  3  2  3
2  2  3  4  5  4  5

One-big-liner :)

new_df = pd.concat([pd.DataFrame(spl[c].tolist()).add_prefix(c[-1]) for c in spl], axis=1).pipe(lambda x: x.set_axis(pd.MultiIndex.from_arrays([np.repeat(spl.columns.get_level_values(0), 2), x.columns]), axis=1))
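
Since the definition of spl is not shown, the following sketch builds a hypothetical input chosen to be consistent with the output above and runs the two-step version on it:

```python
import numpy as np
import pandas as pd

# Hypothetical input: two-level column labels, every cell is a 2-tuple
spl = pd.DataFrame({
    ('a', 'b'): [(0, 1), (1, 2), (2, 3)],
    ('a', 'c'): [(0, 1), (2, 3), (4, 5)],
    ('e', 'b'): [(0, 1), (2, 3), (4, 5)],
})

# Split each column; c[-1] is the second-level label, used as a prefix
new_df = pd.concat(
    [pd.DataFrame(spl[c].tolist()).add_prefix(c[-1]) for c in spl],
    axis=1,
)

# Repeat each top-level label twice, once per tuple element
new_df.columns = pd.MultiIndex.from_arrays(
    [np.repeat(spl.columns.get_level_values(0), 2), new_df.columns]
)
print(new_df)
```

The factor 2 in np.repeat is tied to the tuple length, so it would need adjusting for tuples of a different size.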

