Fast Way to Split Column into Multiple Rows in Pandas

TBH I think we need a fast built-in way of normalizing elements like this, although I've been out of the loop for a while, so for all I know there is one by now and I just don't know about it. :-) In the meantime I've been using methods like this:

import pandas as pd

def create(n):
    # Build the 4-row sample frame and repeat it n times.
    df = pd.DataFrame({'gene': ["foo", "bar // lal", "qux", "woz"],
                       'cell1': [5, 9, 1, 7],
                       'cell2': [12, 90, 13, 87]})
    df = df[["gene", "cell1", "cell2"]]
    df = pd.concat([df] * n)
    df = df.reset_index(drop=True)
    return df

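def orig(df):
    # Expand every split list into its own Series (slow: one Series per row).
    s = df["gene"].str.split(' // ').apply(pd.Series).stack()
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]  # mutates the frame that was passed in
    return df.join(s)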

def faster(df):
    s = df["gene"].str.split(' // ', expand=True).stack()
    i = s.index.get_level_values(0)
    df2 = df.loc[i].copy()
    df2["gene"] = s.values
    return df2

which gives me

>>> df = create(1)
>>> df
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
>>> %time orig(df.copy())
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.2 ms
   cell1  cell2 Genes
0      5     12   foo
1      9     90   bar
1      9     90   lal
2      1     13   qux
3      7     87   woz
>>> %time faster(df.copy())
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 12.4 ms
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
1  lal      9     90
2  qux      1     13
3  woz      7     87

for comparable speeds at low sizes, and

>>> df = create(10000)
>>> %timeit z = orig(df.copy())
1 loops, best of 3: 14.2 s per loop
>>> %timeit z = faster(df.copy())
1 loops, best of 3: 231 ms per loop

a roughly 60-fold speedup in the larger case. Note that the only reason I'm using df.copy() here is that orig is destructive: it deletes the gene column from the frame it is given.
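For what it's worth, pandas 0.25 and later ship DataFrame.explode, which looks like the kind of built-in the answer is wishing for. The following is only a minimal sketch under that version assumption; with_explode is a name made up here, and it has not been timed against faster above.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": ["foo", "bar // lal", "qux", "woz"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})

def with_explode(df):
    # Hypothetical helper: turn each string into a list of parts,
    # then emit one row per list element. The input frame is not modified.
    return df.assign(gene=df["gene"].str.split(" // ")).explode("gene")

out = with_explode(df)
print(out)
```

As with faster, the original index labels are repeated for rows that came from the same input row; add reset_index(drop=True) if a fresh index is wanted.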

Split pandas dataframe rows into multiple rows

IIUC, group the IDs by chunks of 2 using a list comprehension, then explode the two IDs/distance columns:

df['IDs'] = [[l[i:i+2] for i in range(0,len(l),2)] for l in df['IDs']]
df = df.explode(['IDs', 'distance'])

NB. this requires len(IDs) to be 2 times len(distance) for each row!

output:

                        IDs distance
2022-01-01 12:00:00  [A, B]        1
2022-01-01 12:00:01  [A, B]      1.1
2022-01-01 12:00:01  [A, C]      2.8
2022-01-01 12:00:02  [A, B]        1
2022-01-01 12:00:02  [A, D]        3
2022-01-01 12:00:02  [C, D]      0.5
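The question's input frame is not shown, so here is a self-contained sketch with a hypothetical input guessed from the output above. It assumes pandas 1.3 or later, where explode accepts a list of columns, and it checks the length requirement up front instead of letting explode fail.

```python
import pandas as pd

# Hypothetical input, reconstructed from the output shown above.
df = pd.DataFrame(
    {
        "IDs": [["A", "B"], ["A", "B", "A", "C"], ["A", "B", "A", "D", "C", "D"]],
        "distance": [[1], [1.1, 2.8], [1, 3, 0.5]],
    },
    index=pd.to_datetime(
        ["2022-01-01 12:00:00", "2022-01-01 12:00:01", "2022-01-01 12:00:02"]
    ),
)

# Each row must hold exactly two IDs per distance value.
assert all(len(ids) == 2 * len(d) for ids, d in zip(df["IDs"], df["distance"]))

# Chunk the flat ID list into pairs, then explode both columns in lockstep.
df["IDs"] = [[l[i:i + 2] for i in range(0, len(l), 2)] for l in df["IDs"]]
df = df.explode(["IDs", "distance"])
print(df)
```

explode only flattens one level, so each cell of IDs ends up holding one pair rather than a single ID.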

How do I Split a DataFrame Row into Multiple Rows?

Since no columns are defined, append() assumes each VarX is a new row (i.e. single column). You need to first create a dataframe with the relevant number of columns and then append.

df = pd.DataFrame(columns=[1,2,3,4])
Var1 = "X"
Var2 = 300
Var3 = Var2*15
Var4 = Var3*0.75
df = df.append({1:Var1, 2:Var2, 3:Var3, 4:Var4}, ignore_index=True)

Var5 = "Y"
Var6 = 650
Var7 = Var2*20
df = df.append({1:Var5, 2:Var6, 3:Var7}, ignore_index=True)

Output:

   1    2     3       4
0  X  300  4500  3375.0
1  Y  650  6000     NaN
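Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the snippet above only runs on older versions. A possible equivalent for newer pandas, sketched here with the same variables, is to collect the rows as dicts and build the frame once; the rows list is a name introduced for this example.

```python
import pandas as pd

Var1 = "X"
Var2 = 300
Var3 = Var2 * 15
Var4 = Var3 * 0.75

Var5 = "Y"
Var6 = 650
Var7 = Var2 * 20

# One dict per row, keyed by column label.
rows = [
    {1: Var1, 2: Var2, 3: Var3, 4: Var4},
    {1: Var5, 2: Var6, 3: Var7},  # no value for column 4, so it becomes NaN
]

# Passing columns= fixes the column order, as in the original snippet.
df = pd.DataFrame(rows, columns=[1, 2, 3, 4])
print(df)
```

Building the frame in one call also avoids copying the data on every added row, which is what repeated append calls did.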

Split (explode) pandas dataframe string entry to separate rows

How about something like this:

In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                    for _, row in a.iterrows()]).reset_index()
Out[55]:
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Then you just have to rename the columns.
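Putting the pieces together, including the rename, might look like the sketch below. The input frame a is not shown in the answer, so the one here is a guess based on the output; the column names var1 and var2 come from the snippet itself.

```python
import pandas as pd

# Hypothetical input, guessed from the output shown above.
a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

# One Series per input row: the split parts become the index,
# and the scalar var2 value is broadcast across them.
out = pd.concat(
    [pd.Series(row["var2"], row["var1"].split(",")) for _, row in a.iterrows()]
).reset_index()

# reset_index leaves the columns named "index" and 0; restore the real names.
out.columns = ["var1", "var2"]
print(out)
```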

Split column into multiple columns when a row starts with a string

Try this:

pd.concat([sub.reset_index(drop=True) for _, sub in df.groupby(
    df.Group.str.contains(r'^Group\s+123').cumsum())], axis=1)

Output:

            Group            Group            Group
0  Group 123 nv-1  Group 123 mt-d2  Group 123 id-01
1            a, v             b, v              n,m
2             s,b              NaN             x, y
3            y, i              NaN             z, m
4             NaN              NaN              l,b
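To see why this works, it can help to look at the grouping key on its own. The sketch below uses a hypothetical single-column input guessed from the output above; block and wide are names introduced for this example.

```python
import pandas as pd

# Hypothetical input: header rows start with "Group 123", the rest are values.
df = pd.DataFrame({"Group": [
    "Group 123 nv-1", "a, v", "s,b", "y, i",
    "Group 123 mt-d2", "b, v",
    "Group 123 id-01", "n,m", "x, y", "z, m", "l,b",
]})

# True on every header row; the running sum then gives each row
# the number of the block it belongs to.
block = df.Group.str.contains(r"^Group\s+123").cumsum()
print(block.tolist())

# Re-index each block from 0 and place the blocks side by side.
wide = pd.concat(
    [sub.reset_index(drop=True) for _, sub in df.groupby(block)],
    axis=1,
)
print(wide)
```

Shorter blocks are padded with NaN because concat aligns the pieces on their reset index.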

