Apply Pandas Function to Column to Create Multiple New Columns

Apply pandas function to column to create multiple new columns?

Building on user1827356's answer, you can do the assignment in one pass using df.merge:

df.merge(df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1})), 
left_index=True, right_index=True)

    textcol  feature1  feature2
0  0.772692  1.772692 -0.227308
1  0.857210  1.857210 -0.142790
2  0.065639  1.065639 -0.934361
3  0.819160  1.819160 -0.180840
4  0.088212  1.088212 -0.911788

EDIT:
Be aware that this approach uses a lot of memory and is slow: https://ys-l.github.io/posts/2015/08/28/how-not-to-use-pandas-apply/
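For simple arithmetic like the example above, the apply call can usually be skipped altogether. Here is a minimal sketch, assuming the same feature1/feature2 definitions and a made-up textcol column (the values are illustrative, not the ones from the output above):

```python
import pandas as pd

# Hypothetical input; only the column name matches the answer above
df = pd.DataFrame({"textcol": [0.5, 1.5, 2.5]})

# Column arithmetic is vectorized, so no per-row pd.Series objects are built
df = df.assign(
    feature1=df["textcol"] + 1,
    feature2=df["textcol"] - 1,
)

print(df)
```

This only works when the per-element logic can be written as whole-column operations; for arbitrary Python functions the merge/apply version is still needed.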

Pandas apply row-wise a function and create multiple new columns

To do this, you need to:

  1. Transpose df2 so that its columns are correct for concatenation
  2. Index it with the df1["sic"] column to get the correct rows
  3. Reset the index of the obtained rows of df2 using .reset_index(drop=True), so that the dataframes can be concatenated correctly. (This replaces the current index e.g. 5, 6, 3, 8, 12, 6 with a new one e.g. 0, 1, 2, 3, 4, 5 while keeping the actual values the same. pd.concat aligns rows by index label, so without this step the rows of the two dataframes would not line up.)
  4. Concatenate the two dataframes

Note: I used a method based off of this to read in the dataframe, and it assumed that the columns of df2 were strings but the values of the sic column of df1 were ints. Therefore I used .astype(str) to get step 2 working. If this is not actually the case, you may need to remove the .astype(str).

Here is the single line of code to do these things:

merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)

Here is the full code I used:

from io import StringIO
import pandas as pd

df1 = pd.read_csv(StringIO("""
sic data1 data2 data3 data4 data5
5 0.90783598 0.84722083 0.47149924 0.98724123 0.50654476
6 0.53442684 0.59730371 0.92486887 0.61531646 0.62784041
3 0.56806423 0.09619383 0.33846097 0.71878313 0.96316724
8 0.86933042 0.64965755 0.94549745 0.08866519 0.92156389
12 0.651328 0.37193774 0.9679044 0.36898991 0.15161838
6 0.24555531 0.50195983 0.79114578 0.9290596 0.10672607
"""), sep=r"\s+")  # originally sep="\t"; the data as shown here is space-separated
df2 = pd.read_csv(StringIO("""
1 2 3 4 5 6 7 8 9 10 11 12
c_bar 0.4955329 0.92970292 0.68049726 0.91325006 0.55578465 0.78056519 0.53954711 0.90335326 0.93986402 0.0204794 0.51575764 0.61144255
a1_bar 0.75781444 0.81052669 0.99910449 0.62181902 0.11797144 0.40031316 0.08561665 0.35296894 0.14445697 0.93799762 0.80641802 0.31379671
a2_bar 0.41432552 0.36313911 0.13091618 0.39251953 0.66249636 0.31221897 0.15988528 0.1620938 0.55143589 0.66571044 0.68198944 0.23806947
a3_bar 0.38918855 0.83689178 0.15838139 0.39943204 0.48615188 0.06299899 0.86343819 0.47975619 0.05300611 0.15080875 0.73088725 0.3500239
a4_bar 0.47201384 0.90874121 0.50417142 0.70047698 0.24820601 0.34302454 0.4650635 0.0992668 0.55142391 0.82947194 0.28251699 0.53170308
"""), sep=r"\s+", index_col=[0])  # originally sep="\t"; the data as shown here is space-separated

merged = pd.concat([df1, df2.T.loc[df1["sic"].astype(str)].reset_index(drop=True)], axis=1)

print(merged)

which produces the output:

   sic     data1     data2     data3  ...    a1_bar    a2_bar    a3_bar    a4_bar
0    5  0.907836  0.847221  0.471499  ...  0.117971  0.662496  0.486152  0.248206
1    6  0.534427  0.597304  0.924869  ...  0.400313  0.312219  0.062999  0.343025
2    3  0.568064  0.096194  0.338461  ...  0.999104  0.130916  0.158381  0.504171
3    8  0.869330  0.649658  0.945497  ...  0.352969  0.162094  0.479756  0.099267
4   12  0.651328  0.371938  0.967904  ...  0.313797  0.238069  0.350024  0.531703
5    6  0.245555  0.501960  0.791146  ...  0.400313  0.312219  0.062999  0.343025

[6 rows x 11 columns]
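If the one-liner is hard to follow, the four steps can also be written out one at a time. This is a sketch on a much smaller, made-up pair of frames (the values and the two-code lookup table are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical miniature versions of df1 and df2
df1 = pd.DataFrame({"sic": [2, 1, 2], "data1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame(
    {"1": [10.0, 20.0], "2": [30.0, 40.0]},  # columns are sic codes stored as strings
    index=["c_bar", "a1_bar"],
)

lookup = df2.T                              # step 1: one row per sic code
rows = lookup.loc[df1["sic"].astype(str)]   # step 2: pick a row for each sic value
rows = rows.reset_index(drop=True)          # step 3: index 0, 1, 2 to match df1
merged = pd.concat([df1, rows], axis=1)     # step 4: glue the columns together

print(merged)
```

Duplicate sic values are fine here: .loc simply returns the same lookup row more than once.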

Return multiple columns from pandas apply()

You can return a Series from the applied function that contains the new data, preventing the need to iterate three times. Passing axis=1 to the apply function applies the function sizes to each row of the dataframe, returning a series to add to a new dataframe. This series, s, contains the new values, as well as the original data.

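import locale

def sizes(s):
    # locale.format was removed in Python 3.12; locale.format_string replaces it
    s['size_kb'] = locale.format_string("%.1f", s['size'] / 1024.0, grouping=True) + ' KB'
    s['size_mb'] = locale.format_string("%.1f", s['size'] / 1024.0 ** 2, grouping=True) + ' MB'
    s['size_gb'] = locale.format_string("%.1f", s['size'] / 1024.0 ** 3, grouping=True) + ' GB'
    return s

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
# This assumes rows_list is a list of dicts, one per row.
df_test = pd.concat([df_test, pd.DataFrame(rows_list)])
df_test = df_test.apply(sizes, axis=1)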
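A variant of the same idea is to return only the new values as a Series and join them back, which leaves the original row untouched. This is a self-contained sketch with made-up sizes; it uses f-string formatting instead of locale (my substitution, so the thousands separator is always a comma):

```python
import pandas as pd

def sizes(row):
    # The keys of the returned Series become the new column names
    size = row["size"]
    return pd.Series({
        "size_kb": f"{size / 1024:,.1f} KB",
        "size_mb": f"{size / 1024 ** 2:,.1f} MB",
        "size_gb": f"{size / 1024 ** 3:,.1f} GB",
    })

# Hypothetical data: 1 KiB, 1.5 MiB and 3 GiB in bytes
df_test = pd.DataFrame({"size": [1024, 1572864, 3221225472]})

# One pass over the rows produces all three columns
df_test = df_test.join(df_test.apply(sizes, axis=1))

print(df_test)
```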

Apply function on multiple columns and create new column based on condition

I first had to add the columns and fill them with zeros, then apply the function.

def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"


lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]

# Add the placeholder result columns, filled with zeros
for item in lst2:
    df[str(item) + "_2"] = 0

# Fill each placeholder column by comparing the matching pair of columns
i = 0
for item in df.columns[-5:]:
    df[item] = df.apply(lambda x: conditions(x, column1=lst1[i], column2=lst2[i]), axis=1)
    i = i + 1
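Since the condition is a plain comparison between two columns, the same result can probably be had without apply or placeholder columns. This is a sketch using np.where on a made-up frame with two column pairs (the data and the shortened pair list are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical data with two of the column pairs
df = pd.DataFrame({
    "col1": [1, 2, 3],
    "col1_1": [1, 5, 3],
    "col2": ["a", "b", "c"],
    "col2_2": ["a", "b", "x"],
})

pairs = [("col1", "col1_1"), ("col2", "col2_2")]

for left, right in pairs:
    # Compare whole columns at once; the new column is created on assignment
    df[f"{right}_2"] = np.where(df[left] != df[right], "incorrect", "correct")

print(df)
```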

pandas apply function to multiple columns with condition and create new columns

First, convert the string representations of the lists into real lists with ast.literal_eval, and remove the casts to strings so that the length checks count elements rather than characters. If you need one-element lists instead of scalars, wrap fruit[0] and fruit[1] in []. Finally, reorder the len(fruit) == 1 conditions, and change len(fruit) > 3 to len(fruit) > 2 so that the first row matches:

import ast

def fruits_vegetable(row):

    fruit = ast.literal_eval(row['fruit_code'])
    vege = ast.literal_eval(row['vegetable_code'])

    if len(fruit) == 1 and len(vege) > 1: # write "all" in new_col_1
        row['new_col_1'] = 'all'
    elif len(fruit) > 2 and len(vege) == 1: # vegetable_code in new_col_1
        row['new_col_1'] = vege
    elif len(fruit) > 2 and len(vege) > 1: # write "all" in new_col_1
        row['new_col_1'] = 'all'
    elif len(fruit) == 2 and len(vege) >= 0: # fruit 1 new_col_1 & fruit 2 new_col_2
        row['new_col_1'] = [fruit[0]]
        row['new_col_2'] = [fruit[1]]
    elif len(fruit) == 1: # fruit_code in new_col_1
        row['new_col_1'] = fruit
    return row

df = df.apply(fruits_vegetable, axis=1)


print (df)
   ID        date        fruit_code  new_col_1  new_col_2  supermarket  \
0   1  2022-01-01      [100,99,300]        all        NaN           xy
1   2  2022-01-01       [67,200,87]     [5000]        NaN         z, m
2   3  2021-01-01    [100,5,300,78]        all        NaN        wf, z
3   4  2020-01-01              [77]       [77]        NaN          NaN
4   5  2022-15-01  [100,200,546,33]        all        NaN        t, wf
5   6  2002-12-01            [64,2]       [64]        [2]            k
6   7  2018-12-01               [5]        all        NaN            p

   supermarkt    vegetable_code
0         NaN  [1000,2000,3000]
1         NaN            [5000]
2         NaN  [7000,2000,3000]
3          wf            [1000]
4         NaN  [4000,2000,3000]
5         NaN  [6000,8000,1000]
6         NaN  [6000,8000,1000]
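To see why the ast.literal_eval step matters, here is a small sketch on a made-up fruit_code column. Without parsing, len counts the characters of the string; after parsing, it counts the list elements:

```python
import ast
import pandas as pd

# Hypothetical column holding lists as strings, as in the question
df = pd.DataFrame({"fruit_code": ["[100,99,300]", "[77]", "[64,2]"]})

# Length of the raw strings: number of characters, not number of codes
raw_lengths = df["fruit_code"].apply(len)

# Parse each string into a real Python list, then measure it
parsed = df["fruit_code"].apply(ast.literal_eval)
list_lengths = parsed.apply(len)

print(raw_lengths.tolist())
print(list_lengths.tolist())
```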

Pandas Apply Function That returns two new columns

Based on your latest error, you can avoid it by returning the new columns as a Series:

def myfunc1(row):
    C = row['A'] + 10
    D = row['A'] + 50
    return pd.Series([C, D])

df[['C', 'D']] = df.apply(myfunc1, axis=1)
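Another option, if you would rather return a plain tuple, is the result_type="expand" argument of DataFrame.apply, which spreads list-like results across columns. This is a sketch with a made-up column A:

```python
import pandas as pd

# Hypothetical input frame
df = pd.DataFrame({"A": [1, 2, 3]})

def myfunc1(row):
    # Return a plain tuple; apply expands it into two columns
    return row["A"] + 10, row["A"] + 50

df[["C", "D"]] = df.apply(myfunc1, axis=1, result_type="expand")

print(df)
```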

Applying function with multiple arguments to create a new pandas column

Alternatively, you can use the underlying NumPy function:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [10,20,30], "B": [20, 30, 10]})
>>> df['new_column'] = np.multiply(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300

Or, in the general case, vectorize an arbitrary function:

>>> def fx(x, y):
...     return x*y
...
>>> df['new_column'] = np.vectorize(fx)(df['A'], df['B'])
>>> df
    A   B  new_column
0  10  20         200
1  20  30         600
2  30  10         300
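Keep in mind that np.vectorize is a convenience wrapper: the NumPy documentation describes its implementation as essentially a for loop, so it will not be faster than apply by much. When the function contains a simple branch, np.where is usually a truly vectorized replacement. This sketch uses a made-up fx with a condition to compare the two:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30], "B": [20, 30, 10]})

def fx(x, y):
    # Hypothetical scalar function with a branch
    return x * y if x > y else x + y

# Element-by-element call through np.vectorize
df["looped"] = np.vectorize(fx)(df["A"], df["B"])

# Same logic expressed on whole columns
df["vectorized"] = np.where(df["A"] > df["B"], df["A"] * df["B"], df["A"] + df["B"])

print(df)
```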

Apply pandas function to column to create multiple new columns error

In my opinion, combining zip with apply is not recommended. To add multiple new columns, you can use:

df = pd.DataFrame([[i] for i in range(5)], columns=['num'])

def powers(x):
    return pd.Series([x, x**2, x**3, x**4, x**5, x**6])

df[['p1','p2','p3','p4','p5','p6']] = df['num'].apply(powers)

print (df)
   num  p1  p2  p3   p4    p5    p6
0    0   0   0   0    0     0     0
1    1   1   1   1    1     1     1
2    2   2   4   8   16    32    64
3    3   3   9  27   81   243   729
4    4   4  16  64  256  1024  4096

To pass a one-column DataFrame, you can use:

df = pd.DataFrame([[i] for i in range(5)], columns=['num'])

def powers(x):
    return [x, x**2, x**3, x**4, x**5, x**6]

df[['p1','p2','p3','p4','p5','p6']] = df[['num']].pipe(powers)

print (df)
   num  p1  p2  p3   p4    p5    p6
0    0   0   0   0    0     0     0
1    1   1   1   1    1     1     1
2    2   2   4   8   16    32    64
3    3   3   9  27   81   243   729
4    4   4  16  64  256  1024  4096

For multiple columns:

df = pd.DataFrame([[i] for i in range(5)], columns=['num'])
df['new'] = df['num'] * 2

def powers(x):
    return [x, x**2, x**3, x**4, x**5, x**6]

df = pd.concat(df[['num','new']].pipe(powers), axis=1, keys=['p1','p2','p3','p4','p5','p6'])
df.columns = df.columns.map(lambda x: f'{x[0]}_{x[1]}')
print (df)
   p1_num  p1_new  p2_num  p2_new  p3_num  p3_new  p4_num  p4_new  p5_num  \
0       0       0       0       0       0       0       0       0       0
1       1       2       1       4       1       8       1      16       1
2       2       4       4      16       8      64      16     256      32
3       3       6       9      36      27     216      81    1296     243
4       4       8      16      64      64     512     256    4096    1024

   p5_new  p6_num  p6_new
0       0       0       0
1      32       1      64
2    1024      64    4096
3    7776     729   46656
4   32768    4096  262144
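For the single-column powers example, one more way to write it is to build the new columns directly from column arithmetic and pass them to assign. This is a sketch under the same column names as above, not a replacement for the general Series-returning pattern:

```python
import pandas as pd

df = pd.DataFrame({"num": range(5)})

# Each power is computed on the whole column; the dict keys name the new columns
df = df.assign(**{f"p{k}": df["num"] ** k for k in range(1, 7)})

print(df)
```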

Create multiple pandas DataFrame columns from applying a function with multiple returns

Wrap the function's return value in pd.Series:

df[['sum', 'difference']] = df.apply(
    lambda row: pd.Series(add_subtract(row['a'], row['b'])), axis=1)
df

yields

   a  b  sum  difference
0  1  4    5          -3
1  2  5    7          -3
2  3  6    9          -3
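The snippet above does not show add_subtract or the input frame. For a runnable version, this sketch assumes add_subtract returns a (sum, difference) tuple and rebuilds a frame that matches the output shown:

```python
import pandas as pd

def add_subtract(a, b):
    # Assumed definition: returns two values as a tuple
    return a + b, a - b

# Input reconstructed from the columns a and b in the output above
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df[["sum", "difference"]] = df.apply(
    lambda row: pd.Series(add_subtract(row["a"], row["b"])), axis=1)

print(df)
```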

how to create multiple columns at once with apply?

Returning a Series is possibly the most readable solution.

def myfunc(var):
    return pd.Series([['small list'], var + 2, ['another list']])

df[['my small list', 'my numeric', 'other list']] = df.var1.apply(lambda x: myfunc(x))

However, for larger dataframes you should prefer either the zip or the dataframe approach.

import pandas as pd  # 1.2.2
import perfplot

def setup(n):
    return pd.DataFrame(dict(
        var1=list(range(n))
    ))

def with_series(df):
    def myfunc(var):
        return pd.Series([['small list'], var + 2, ['other list']])
    out = pd.DataFrame()
    out[['small list', 'numeric', 'other list']] = df.var1.apply(lambda x: myfunc(x))

def with_zip(df):
    def myfunc(var):
        return [['small list'], var + 2, ['other list']]
    out = pd.DataFrame()
    out['small list'], out['numeric'], out['other list'] = list(zip(*df.var1.apply(lambda x: myfunc(x))))

def with_dataframe(df):
    def myfunc(var):
        return [['small list'], var + 2, ['other list']]
    out = pd.DataFrame()
    out[['small list', 'numeric', 'other list']] = pd.DataFrame(df.var1.apply(myfunc).to_list())

perfplot.show(
    setup=setup,
    kernels=[
        with_series,
        with_zip,
        with_dataframe,
    ],
    labels=["series", "zip", "df"],
    n_range=[2 ** k for k in range(20)],
    xlabel="len(df)",
    equality_check=None,
)

[Image: perfplot output comparing the series, zip, and df kernels against len(df)]
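Because the benchmark runs with equality_check=None, it does not verify that the faster approaches give the same result. This small sketch (with a made-up four-row frame) checks that the zip and DataFrame approaches produce the same columns:

```python
import pandas as pd

def myfunc(var):
    return [["small list"], var + 2, ["other list"]]

# Hypothetical small input
df = pd.DataFrame({"var1": range(4)})
names = ["small list", "numeric", "other list"]

# Apply once; the result is a Series whose elements are 3-item lists
results = df["var1"].apply(myfunc)

# zip approach: transpose the lists into three tuples, one per column
small, numeric, other = zip(*results)
via_zip = pd.DataFrame({"small list": small, "numeric": numeric, "other list": other})

# DataFrame approach: let the constructor split the lists into columns
via_df = pd.DataFrame(results.to_list(), columns=names)

print(via_zip)
print(via_df)
```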


