Fast way to split column into multiple rows in Pandas
TBH I think we need a fast built-in way of normalizing elements like this... although, since I've been out of the loop for a bit, for all I know there is one by now and I just don't know it. :-) In the meantime I've been using methods like this:
import pandas as pd

def create(n):
    # Build a small test frame and repeat it n times.
    df = pd.DataFrame({'gene': ["foo",
                                "bar // lal",
                                "qux",
                                "woz"],
                       'cell1': [5, 9, 1, 7], 'cell2': [12, 90, 13, 87]})
    df = df[["gene", "cell1", "cell2"]]
    df = pd.concat([df]*n)
    df = df.reset_index(drop=True)
    return df

def orig(df):
    # Slow: builds one Series per row. Also deletes "gene" from df in place.
    s = df["gene"].str.split(' // ').apply(pd.Series).stack()
    s.index = s.index.droplevel(-1)
    s.name = "Genes"
    del df["gene"]
    return df.join(s)

def faster(df):
    # Vectorized split into columns, then stack back into one long Series.
    s = df["gene"].str.split(' // ', expand=True).stack()
    i = s.index.get_level_values(0)
    df2 = df.loc[i].copy()
    df2["gene"] = s.values
    return df2
which gives me
>>> df = create(1)
>>> df
         gene  cell1  cell2
0         foo      5     12
1  bar // lal      9     90
2         qux      1     13
3         woz      7     87
>>> %time orig(df.copy())
CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.2 ms
   cell1  cell2  Genes
0      5     12    foo
1      9     90    bar
1      9     90    lal
2      1     13    qux
3      7     87    woz
>>> %time faster(df.copy())
CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 12.4 ms
   gene  cell1  cell2
0   foo      5     12
1   bar      9     90
1   lal      9     90
2   qux      1     13
3   woz      7     87
for comparable speeds at low sizes, and
>>> df = create(10000)
>>> %timeit z = orig(df.copy())
1 loops, best of 3: 14.2 s per loop
>>> %timeit z = faster(df.copy())
1 loops, best of 3: 231 ms per loop
a roughly 60-fold speedup in the larger case. Note that the only reason I'm using df.copy() here is because orig is destructive (it deletes the gene column from the frame it is given).
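For what it's worth, newer pandas does ship a built-in for this: DataFrame.explode, available since pandas 0.25. Below is a minimal sketch on the same toy data. The name with_explode is made up for this example, and it is not timed here against faster, so treat it as an alternative to try rather than a benchmark result.

```python
import pandas as pd

def create(n):
    # Same toy frame as above, repeated n times.
    df = pd.DataFrame({
        "gene": ["foo", "bar // lal", "qux", "woz"],
        "cell1": [5, 9, 1, 7],
        "cell2": [12, 90, 13, 87],
    })
    return pd.concat([df] * n).reset_index(drop=True)

def with_explode(df):
    # Hypothetical helper: split each string into a list, then emit one row
    # per list element. assign() returns a new frame, so df is left untouched.
    return df.assign(gene=df["gene"].str.split(" // ")).explode("gene")

result = with_explode(create(1))
print(result)
```

Like faster, this keeps the original index labels, so the row that was split shows up twice under label 1; add reset_index(drop=True) if you want a fresh index.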
Split pandas dataframe rows into multiple rows
IIUC, group the IDs into chunks of 2 using a list comprehension, then explode the IDs and distance columns together:
df['IDs'] = [[l[i:i+2] for i in range(0,len(l),2)] for l in df['IDs']]
df = df.explode(['IDs', 'distance'])
NB. This requires len(IDs) to be 2 times len(distance) for each row! Exploding several columns at once also needs pandas 1.3 or later.
output:
                        IDs  distance
2022-01-01 12:00:00  [A, B]         1
2022-01-01 12:00:01  [A, B]       1.1
2022-01-01 12:00:01  [A, C]       2.8
2022-01-01 12:00:02  [A, B]         1
2022-01-01 12:00:02  [A, D]         3
2022-01-01 12:00:02  [C, D]       0.5
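To try this end to end, here is a self-contained sketch. The input frame is an assumption, reconstructed from the output above (the original question's data is not shown here), and the length check is an extra added to make the NB explicit.

```python
import pandas as pd

# Hypothetical input, reconstructed from the output shown above.
df = pd.DataFrame(
    {
        "IDs": [["A", "B"], ["A", "B", "A", "C"], ["A", "B", "A", "D", "C", "D"]],
        "distance": [[1], [1.1, 2.8], [1, 3, 0.5]],
    },
    index=pd.to_datetime(
        ["2022-01-01 12:00:00", "2022-01-01 12:00:01", "2022-01-01 12:00:02"]
    ),
)

# Guard for the length requirement: two IDs per distance in every row.
ok = df["IDs"].map(len) == 2 * df["distance"].map(len)
if not ok.all():
    raise ValueError(f"rows with mismatched lengths: {list(df.index[~ok])}")

# Chunk the flat ID list into pairs, then explode both columns together.
df["IDs"] = [[ids[i:i + 2] for i in range(0, len(ids), 2)] for ids in df["IDs"]]
out = df.explode(["IDs", "distance"])
print(out)
```

Without the guard, a row with mismatched lengths would make the multi-column explode raise its own ValueError, just with a less specific message.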
How do I Split a DataFrame Row into Multiple Rows?
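Since no columns are defined, append() treats each VarX as a new row (i.e. a single column). You need to first create a DataFrame with the relevant number of columns and then append. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the code below needs an older pandas.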
df = pd.DataFrame(columns=[1,2,3,4])
Var1 = "X"
Var2 = 300
Var3 = Var2*15
Var4 = Var3*0.75
df = df.append({1:Var1, 2:Var2, 3:Var3, 4:Var4}, ignore_index=True)
Var5 = "Y"
Var6 = 650
Var7 = Var2*20
df = df.append({1:Var5, 2:Var6, 3:Var7}, ignore_index=True)
Output:
   1    2     3       4
0  X  300  4500  3375.0
1  Y  650  6000     NaN
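On pandas 2.0 and later, where DataFrame.append no longer exists, the same result can be had with pd.concat. This is a sketch of one way to do it, reusing the variable names from the snippet above; new_row is a name introduced only for this example.

```python
import pandas as pd

Var1, Var2 = "X", 300
Var3 = Var2 * 15
Var4 = Var3 * 0.75
Var5, Var6 = "Y", 650
Var7 = Var2 * 20

# Start with the first row, with all four columns defined up front.
df = pd.DataFrame([{1: Var1, 2: Var2, 3: Var3, 4: Var4}], columns=[1, 2, 3, 4])

# Wrap the second row in a one-row frame; the missing column 4 becomes NaN.
new_row = pd.DataFrame([{1: Var5, 2: Var6, 3: Var7}])
df = pd.concat([df, new_row], ignore_index=True)
print(df)
```

If you are adding many rows, it is usually cheaper to collect the dicts in a list and build the DataFrame once at the end than to concatenate inside a loop.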
Split (explode) pandas dataframe string entry to separate rows
How about something like this:
In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                    for _, row in a.iterrows()]).reset_index()
Out[55]:
   index  0
0      a  1
1      b  1
2      c  1
3      d  2
4      e  2
5      f  2
Then you just have to rename the columns (index and 0 in the output above).
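The renaming step can be folded into the same chain. In this sketch the input frame a is an assumption inferred from the output above, and pieces and result are names introduced only for illustration.

```python
import pandas as pd

# Hypothetical input matching the output shown above.
a = pd.DataFrame({"var1": ["a,b,c", "d,e,f"], "var2": [1, 2]})

# One small Series per row: the split pieces become the index, var2 the value.
pieces = [
    pd.Series(row["var2"], index=row["var1"].split(","))
    for _, row in a.iterrows()
]

# rename() names the values, rename_axis() names the index; reset_index()
# then turns the named index into a regular column.
result = pd.concat(pieces).rename("var2").rename_axis("var1").reset_index()
print(result)
```

On pandas 0.25 or later, a.assign(var1=a["var1"].str.split(",")).explode("var1") gives the same rows without iterating, although it keeps the original index labels.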
Split column into multiple columns when a row starts with a string
Try this:
pd.concat([sub.reset_index(drop=True) for _, sub in df.groupby(
df.Group.str.contains(r'^Group\s+123').cumsum())], axis=1)
Output:
            Group            Group            Group
0  Group 123 nv-1  Group 123 mt-d2  Group 123 id-01
1            a, v             b, v              n,m
2             s,b              NaN             x, y
3            y, i              NaN             z, m
4             NaN              NaN              l,b
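To see why this works, it helps to pull the grouping key out into its own variable. The single-column input below is an assumption reconstructed from the output above; block_id, blocks and wide are names made up for this sketch.

```python
import pandas as pd

# Hypothetical single-column input, reconstructed from the output above.
df = pd.DataFrame({"Group": [
    "Group 123 nv-1", "a, v", "s,b", "y, i",
    "Group 123 mt-d2", "b, v",
    "Group 123 id-01", "n,m", "x, y", "z, m", "l,b",
]})

# True on header rows; cumsum turns that into a running block number,
# so every row carries the number of the most recent header above it.
block_id = df["Group"].str.contains(r"^Group\s+123").cumsum()

# Reset each block's index so the blocks line up side by side;
# shorter blocks are padded with NaN.
blocks = [sub.reset_index(drop=True) for _, sub in df.groupby(block_id)]
wide = pd.concat(blocks, axis=1)
print(wide)
```

All three resulting columns are named Group, so select them by position (wide.iloc[:, 0]) or assign new column names before selecting by label.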