Fast Way to Split Column into Multiple Rows in Pandas

TBH, I think we need a fast built-in way of normalizing elements like this, although I've been out of the loop for a while, so for all I know there is one by now and I just don't know it. :-) In the meantime, I've been using methods like this:

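import pandas as pd

def create(n):
    # Build the 4-row sample frame and repeat it n times.
    df = pd.DataFrame({'gene': ["foo",
                                "bar // lal",
                                "qux",
                                "woz"],
                       'cell1': [5, 9, 1, 7], 'cell2': [12, 90, 13, 87]})
    df = df[["gene", "cell1", "cell2"]]
    df = pd.concat([df] * n)
    df = df.reset_index(drop=True)
    return df

def orig(df):
    # Slow: builds one pd.Series per row. Destructive: deletes df["gene"].
    s = df["gene"].str.split(' // ').apply(pd.Series).stack()
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]
    return df.join(s)

def faster(df):
    # Split into columns in one vectorized call, then stack to one piece per row.
    s = df["gene"].str.split(' // ', expand=True).stack()
    # Level 0 of the stacked index is the original row label of each piece.
    i = s.index.get_level_values(0)
    df2 = df.loc[i].copy()
    df2["gene"] = s.values
    return df2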

which gives me

>>> df = create(1)
>>> df
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
>>> %time orig(df.copy())
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.2 ms
   cell1  cell2 Genes
0      5     12   foo
1      9     90   bar
1      9     90   lal
2      1     13   qux
3      7     87   woz
>>> %time faster(df.copy())
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 12.4 ms
  gene  cell1  cell2
0  foo      5     12
1  bar      9     90
1  lal      9     90
2  qux      1     13
3  woz      7     87

for comparable speeds at low sizes, and

>>> df = create(10000)
>>> %timeit z = orig(df.copy())
1 loops, best of 3: 14.2 s per loop
>>> %timeit z = faster(df.copy())
1 loops, best of 3: 231 ms per loop

a roughly 60-fold speedup in the larger case. Note that the only reason I'm using df.copy() here is that orig is destructive (it deletes the gene column from the frame it is given).
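
For what it's worth, newer pandas (0.25 and later) does ship a built-in for this, DataFrame.explode. Below is a minimal sketch of the same split using it, assuming the same sample frame as create(1). The name explode_genes is just picked for illustration, and I haven't timed it against faster, so treat it as an alternative to try rather than a benchmark result:

```python
import pandas as pd

def explode_genes(df):
    # Turn each "a // b" string into a list, then emit one row per list element.
    # assign() returns a new frame, so the caller's df is left untouched.
    return df.assign(gene=df["gene"].str.split(" // ")).explode("gene")

df = pd.DataFrame({
    "gene": ["foo", "bar // lal", "qux", "woz"],
    "cell1": [5, 9, 1, 7],
    "cell2": [12, 90, 13, 87],
})
out = explode_genes(df)
print(out)
```

Like faster, this keeps the original row labels, so the row that was split shows up twice under index 1. Add .reset_index(drop=True) if you want a fresh 0..n-1 index.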

Split pandas dataframe rows into multiple rows

IIUC, group the IDs into chunks of 2 using a list comprehension, then explode the IDs and distance columns together (passing a list of columns to explode needs pandas 1.3 or later):

df['IDs'] = [[l[i:i+2] for i in range(0,len(l),2)] for l in df['IDs']]
df = df.explode(['IDs', 'distance'])

NB. this requires len(IDs) to be 2 times len(distance) for each row!

output:

                        IDs distance
2022-01-01 12:00:00  [A, B]        1
2022-01-01 12:00:01  [A, B]      1.1
2022-01-01 12:00:01  [A, C]      2.8
2022-01-01 12:00:02  [A, B]        1
2022-01-01 12:00:02  [A, D]        3
2022-01-01 12:00:02  [C, D]      0.5
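
The question's input isn't shown here, so the sketch below uses a hypothetical frame reconstructed from the output above (three timestamped rows, flat ID lists, and one distance per ID pair). It also adds a length check up front, which is my addition and not part of the answer, so that a row breaking the 2:1 rule fails with a readable message:

```python
import pandas as pd

# Hypothetical input, reconstructed from the output shown above.
df = pd.DataFrame(
    {
        "IDs": [["A", "B"], ["A", "B", "A", "C"], ["A", "B", "A", "D", "C", "D"]],
        "distance": [[1], [1.1, 2.8], [1, 3, 0.5]],
    },
    index=pd.to_datetime(
        ["2022-01-01 12:00:00", "2022-01-01 12:00:01", "2022-01-01 12:00:02"]
    ),
)

# Each row needs exactly two IDs per distance, otherwise explode cannot align them.
bad = df[df["IDs"].map(len) != 2 * df["distance"].map(len)]
if not bad.empty:
    raise ValueError(f"rows with mismatched lengths: {list(bad.index)}")

# Chunk the flat ID list into pairs, then explode both columns in lockstep.
df["IDs"] = [[ids[i:i + 2] for i in range(0, len(ids), 2)] for ids in df["IDs"]]
out = df.explode(["IDs", "distance"])
print(out)
```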

How do I Split a DataFrame Row into Multiple Rows?

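Since no columns are defined, append() assumes each VarX is a new row (i.e. a single column). You need to first create a dataframe with the relevant number of columns and then append to it. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the code below only runs on older versions.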

df = pd.DataFrame(columns=[1,2,3,4])
Var1 = "X"
Var2 = 300
Var3 = Var2*15
Var4 = Var3*0.75
df = df.append({1:Var1, 2:Var2, 3:Var3, 4:Var4}, ignore_index=True)

Var5 = "Y"
Var6 = 650
Var7 = Var2*20
df = df.append({1:Var5, 2:Var6, 3:Var7}, ignore_index=True)

Output:

   1    2     3       4
0  X  300  4500  3375.0
1  Y  650  6000     NaN
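
On pandas versions without DataFrame.append, one way to get the same frame is to collect the rows as dicts and build the DataFrame once at the end. This is a sketch of that swapped-in approach, not the answer's original method; the variable names and values are copied from the code above, and rows is a name introduced here:

```python
import pandas as pd

Var1 = "X"
Var2 = 300
Var3 = Var2 * 15
Var4 = Var3 * 0.75

Var5 = "Y"
Var6 = 650
Var7 = Var2 * 20  # as in the answer above, this uses Var2, not Var6

# One dict per row, keyed by column label. A missing key becomes NaN.
rows = [
    {1: Var1, 2: Var2, 3: Var3, 4: Var4},
    {1: Var5, 2: Var6, 3: Var7},
]
df = pd.DataFrame(rows, columns=[1, 2, 3, 4])
print(df)
```

If the rows really do arrive one at a time, pd.concat([df, pd.DataFrame([row])], ignore_index=True) is the closest drop-in for append, but building the list first avoids copying the frame on every step.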

Split (explode) pandas dataframe string entry to separate rows

How about something like this:

In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                    for _, row in a.iterrows()]).reset_index()
Out[55]: 
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Then you just have to rename the columns.
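
Here is the whole thing end to end, including the rename. The frame a isn't shown in the answer, so the one below is a hypothetical input inferred from the output (two rows, comma-separated var1, integer var2). The last statement is an alternative using DataFrame.explode (pandas 0.25+), which is not what the answer used:

```python
import pandas as pd

# Hypothetical input inferred from the output above.
a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

# The answer's approach: one Series per row, indexed by the split pieces.
out = pd.concat(
    [pd.Series(row["var2"], row["var1"].split(",")) for _, row in a.iterrows()]
).reset_index()
out.columns = ["var1", "var2"]  # rename "index" and 0
print(out)

# Alternative without iterrows: split to lists, then explode.
alt = (
    a.assign(var1=a["var1"].str.split(","))
     .explode("var1")
     .reset_index(drop=True)
)
print(alt)
```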

Split column into multiple columns when a row starts with a string

Try this:

pd.concat([sub.reset_index(drop=True) for _, sub in df.groupby(
    df.Group.str.contains(r'^Group\s+123').cumsum())], axis=1)

Output:

    Group           Group           Group
0   Group 123 nv-1  Group 123 mt-d2 Group 123 id-01
1   a, v            b, v            n,m
2   s,b             NaN             x, y
3   y, i            NaN             z, m
4   NaN             NaN             l,b
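
To see why this works, here is a runnable sketch with a hypothetical single-column input reconstructed from the output above. The cumulative sum of the "starts with Group 123" flag gives every row the number of the block it belongs to, and that number is what groupby splits on. The name block_id is introduced here for readability:

```python
import pandas as pd

# Hypothetical input, reconstructed from the output shown above.
df = pd.DataFrame({"Group": [
    "Group 123 nv-1", "a, v", "s,b", "y, i",
    "Group 123 mt-d2", "b, v",
    "Group 123 id-01", "n,m", "x, y", "z, m", "l,b",
]})

# True on each header row; cumsum turns that into a running block number.
block_id = df["Group"].str.contains(r"^Group\s+123").cumsum()
print(block_id.tolist())

# One sub-frame per block, re-indexed from 0 and placed side by side.
out = pd.concat(
    [sub.reset_index(drop=True) for _, sub in df.groupby(block_id)],
    axis=1,
)
print(out)
```

Shorter blocks are padded with NaN because concat aligns the pieces on their 0-based index. All resulting columns are still named Group, so rename them if you need to address them by label.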
