How to Apply a Function on Every Row on a Dataframe

How to apply a function on every row on a dataframe?

The following should work:

import math

def EOQ(D, p, ck, ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q

ch = 0.2
ck = 5
df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
df

If all you're doing is calculating the square root of some result, use np.sqrt instead. It is vectorised and will be significantly faster:

In [80]:
df['Q'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))

df
Out[80]:
    D   p          Q
0  10  20   5.000000
1  20  30   5.773503
2  30  10  12.247449

Timings

For a 30k row df:

In [92]:

import math
ch=0.2
ck=5
def EOQ(D,p,ck,ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q

%timeit np.sqrt((2*df['D']*ck)/(ch*df['p']))
%timeit df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
1000 loops, best of 3: 622 µs per loop
1 loops, best of 3: 1.19 s per loop

You can see that the np method is ~1900x faster (1.19 s versus 622 µs).
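If you want to reproduce the comparison outside IPython, something like the sketch below should do. The synthetic data (random integers from 1 to 99) and the helper names vectorised and row_wise are my own assumptions, not part of the original benchmark, and the absolute timings will differ by machine.

```python
import math
import timeit

import numpy as np
import pandas as pd

ck, ch = 5, 0.2

def EOQ(D, p, ck, ch):
    return math.sqrt((2 * D * ck) / (ch * p))

# Synthetic 30k-row frame; the value ranges are arbitrary assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "D": rng.integers(1, 100, size=30_000),
    "p": rng.integers(1, 100, size=30_000),
})

def vectorised():
    # Operates on whole columns at once.
    return np.sqrt((2 * df["D"] * ck) / (ch * df["p"]))

def row_wise():
    # Calls the Python function once per row.
    return df.apply(lambda row: EOQ(row["D"], row["p"], ck, ch), axis=1)

# Both approaches should agree up to floating-point rounding.
same = np.allclose(vectorised(), row_wise())

t_vec = timeit.timeit(vectorised, number=10) / 10  # average seconds per call
t_row = timeit.timeit(row_wise, number=1)          # seconds for one call
print(f"vectorised: {t_vec:.6f} s, apply: {t_row:.3f} s, same result: {same}")
```

The point is only the order of magnitude: the row-wise version pays Python call overhead for every row, while the vectorised one does not.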

How to apply a function on all rows of a DataFrame

Use pandas apply with axis = 1. It will apply the function to each row and return a Series.

df['new'] = df.apply(igroups, axis = 1)
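The body of igroups isn't shown here, so the version below is a hypothetical stand-in (as are the column names), just to make the pattern runnable end to end:

```python
import pandas as pd

# Hypothetical stand-in for igroups: it receives one row as a Series
# and returns a single value for that row.
def igroups(row):
    return "high" if row["score"] >= 50 else "low"

df = pd.DataFrame({"name": ["a", "b", "c"], "score": [10, 50, 90]})

# axis=1 passes each row to igroups; the results come back as a Series
# aligned with df's index, so it can be assigned as a new column.
df["new"] = df.apply(igroups, axis=1)
print(df)
```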

Apply function row wise to pandas dataframe

Since you have hilbert(df.col_1, df.col_2) in the apply, that immediately calls your function with the full pd.Series objects for those two columns, which triggers the error. What you should be doing is:

df.apply(lambda x: hilbert(x['col_1'], x['col_2']), axis=1)

so that the lambda function given will be applied to each row.
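To make the difference concrete, here is a minimal sketch. The real hilbert isn't shown, so this stand-in (and its failure mode, a scalar comparison inside an if) is an assumption about the kind of function that breaks when handed whole columns:

```python
import pandas as pd

# Hypothetical stand-in for hilbert: it only works on two scalars.
def hilbert(a, b):
    if a > b:          # ambiguous (raises ValueError) if a and b are Series
        return a - b
    return a + b

df = pd.DataFrame({"col_1": [1, 5, 3], "col_2": [4, 2, 3]})

# Calling hilbert(df.col_1, df.col_2) would pass whole columns and fail on
# the `if`. The lambda below passes one row's scalars at a time instead.
result = df.apply(lambda x: hilbert(x["col_1"], x["col_2"]), axis=1)
print(result.tolist())
```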

Apply Function to Every Row in pandas DataFrame Using Multiple Column Values as Parameters

You can do this without changing your example_hash() method:

Just use np.vectorize

In [2204]: import numpy as np

In [2200]: def example_hash(name: str, age: int) -> str:
      ...:     return "In 10 years {} will be {}".format(name, age+10)
      ...:

In [2202]: df['new'] = np.vectorize(example_hash)(df['name'], df['age'])

In [2203]: df
Out[2203]:
    name  age  height                           new
0    Bob   20     2.0    In 10 years Bob will be 30
1  Alice   40     2.1  In 10 years Alice will be 50

OR use df.apply with lambda like this without changing your custom method:

In [2207]: df['new'] = df.apply(lambda x: example_hash(x['name'], x['age']), axis=1)

In [2208]: df
Out[2208]:
    name  age  height                           new
0    Bob   20     2.0    In 10 years Bob will be 30
1  Alice   40     2.1  In 10 years Alice will be 50
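One caveat worth knowing: the NumPy docs describe np.vectorize as a convenience that is essentially a for loop, so don't expect the speed-up of real vectorised operations. A plain list comprehension over zip is a third option that also leaves example_hash untouched. The sketch below rebuilds the two-row frame from the output above and checks that all three give the same strings:

```python
import numpy as np
import pandas as pd

def example_hash(name: str, age: int) -> str:
    return "In 10 years {} will be {}".format(name, age + 10)

df = pd.DataFrame({"name": ["Bob", "Alice"], "age": [20, 40], "height": [2.0, 2.1]})

# Three ways to call the unchanged function once per row.
via_vectorize = np.vectorize(example_hash)(df["name"], df["age"])
via_apply = df.apply(lambda x: example_hash(x["name"], x["age"]), axis=1)
via_zip = [example_hash(n, a) for n, a in zip(df["name"], df["age"])]

df["new"] = via_zip
print(df)
```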

Applying a function to every row in a dataframe column in Pandas

When you're working with vanilla strings, you call the functions directly. When working with pandas columns directly, use the str accessor methods.

Case 1
As mentioned in my comment, use the str methods:

df

            Text
0        I am me
1   I am not you
2  I will be him

df['Text'] = df['Text'].str.split().str[:-1].str.join(' ')

        Text
0       I am
1   I am not
2  I will be

Case 2
Alternatively, when using apply on a single column, the lambda receives each value as a plain string (not a pd.Series), so you call ordinary string methods and the .str accessor isn't involved.
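For the same sample data as Case 1, a Case 2 version might look like this (a sketch, assuming the goal is still to drop the last word of each string):

```python
import pandas as pd

df = pd.DataFrame({"Text": ["I am me", "I am not you", "I will be him"]})

# Each s is a plain Python str, so ordinary str methods apply:
# split into words, drop the last one, and join the rest back together.
df["Text"] = df["Text"].apply(lambda s: " ".join(s.split()[:-1]))
print(df)
```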

Apply function to each DataFrame row, without returning a Series

This operation can be vectorized directly, so you can avoid .apply() altogether. The vectorized form will be tremendously faster.

Canonical Answer for How to iterate over rows in a DataFrame in Pandas?

You won't be able to avoid using memory for the results because they need to go somewhere, but you could throw out columns you no longer need before or after performing the calculation.

Just keeping the results in a dataframe column (Series) rather than a list of native ints will save memory, but you may find that explicitly setting or reducing the datatypes of your dataframe is a big saving if they're not already in their most efficient types (for example, from int64 to uint16 or even uint8, which will still hold the example values).

>>> df = pd.DataFrame({"col1": [2,10], "col2": [3,12], "col3": [5,4]})
>>> df
   col1  col2  col3
0     2     3     5
1    10    12     4
>>> df["2xy"] = 2 * df["col2"] * df["col3"]
>>> df
   col1  col2  col3  2xy
0     2     3     5   30
1    10    12     4   96
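As a rough illustration of the datatype point, the sketch below casts the same example frame to uint8 and compares the memory used by the three input columns. The variable names before, after and small are mine, and uint8 is only safe here because every value, including the largest result (96), fits in 0..255:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [2, 10], "col2": [3, 12], "col3": [5, 4]})
before = df.memory_usage(index=False).sum()  # bytes used by the default integer columns

# uint8 holds 0..255, which covers the example values and the result (max 96).
# Arithmetic in uint8 wraps silently on overflow, so check the range first.
small = df.astype(np.uint8)
after = small.memory_usage(index=False).sum()  # bytes used after the cast

# The vectorised calculation works the same way on the smaller dtype.
small["2xy"] = 2 * small["col2"] * small["col3"]
print(before, after)
print(small)
```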

How to apply a function to every element in a dataframe?

Since your problem requires access to both the index and column labels of your df you probably want df.apply().

df.apply() passes a pandas.Series representing each row or column (depending on the axis argument) to your function, so you have access to the column name and index. df.applymap(), by contrast, operates on each individual value of df, so you wouldn't necessarily have access to the index and column name as required.

Example

import numpy as np
import pandas as pd

def foo(name, index):
    return name - index

x = np.arange(0, 2.01, 0.25)
y = np.arange(10, 30, 5.0)

df = pd.DataFrame(index=x, columns=y)

df.apply(lambda x: foo(x.name, x.index))

Output

       10.0   15.0   20.0   25.0
0.00  10.00  15.00  20.00  25.00
0.25   9.75  14.75  19.75  24.75
0.50   9.50  14.50  19.50  24.50
0.75   9.25  14.25  19.25  24.25
1.00   9.00  14.00  19.00  24.00
1.25   8.75  13.75  18.75  23.75
1.50   8.50  13.50  18.50  23.50
1.75   8.25  13.25  18.25  23.25
2.00   8.00  13.00  18.00  23.00

In the above example, the column name and index of each Series constituting df are passed to foo() by way of df.apply(). Within foo(), each value is computed by subtracting its own index value from its own column name value. The index values are accessed using x.index and the column name using x.name, inside the lambda passed to df.apply().

Update

Many thanks to @SyntaxError for pointing out that x.index and x.name could be passed to foo() within df.apply() instead of feeding the entire Series (x) into the function and accessing the values manually therein. As mentioned, this seems to fit OP's use case in a much neater manner than my original response, which was largely the same but passed each x Series into foo(), which then had responsibility for extracting x.name and x.index.
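For this particular calculation (column label minus index label) you may not need apply at all. The sketch below is an alternative I'm suggesting, not part of the answer above: it builds the same table with NumPy broadcasting, reusing the x and y arrays from the example.

```python
import numpy as np
import pandas as pd

x = np.arange(0, 2.01, 0.25)   # row labels
y = np.arange(10, 30, 5.0)     # column labels

# y[None, :] has shape (1, 4) and x[:, None] has shape (9, 1); broadcasting
# yields a (9, 4) array whose entry [i, j] is y[j] - x[i].
values = y[None, :] - x[:, None]
df = pd.DataFrame(values, index=x, columns=y)
print(df)
```

This only works because the result depends on nothing but the labels. If the function also needs the cell values, the df.apply() approach is the more general one.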

R apply() custom function to every row in data frame

Another approach is modifying your existing function such that it is vectorised.

t.test2 <- function(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=FALSE)
{
    if(!equal.variance)
    {
        se <- sqrt( (s1^2/n1) + (s2^2/n2) )
        # welch-satterthwaite df
        df <- ( (s1^2/n1 + s2^2/n2)^2 )/( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
    } else
    {
        # pooled standard deviation, scaled by the sample sizes
        se <- sqrt( (1/n1 + 1/n2) * ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2) )
        df <- n1+n2-2
    }
    t <- (m1-m2-m0)/se
    dat <- vapply(seq_len(length(m1)),
                  function(x){c(m1[x]-m2[x], se[x], t[x], 2*pt(-abs(t[x]),df[x]))},
                  numeric(4)) # the 2*pt(-abs(t), df) term is the two-tailed p-value
    dat <- t(dat)
    dat <- as.data.frame(dat)
    names(dat) <- c("Difference of means", "Std Error", "t", "p-value")
    return(dat)
}

This approach allows you to pass in vectors for your various inputs, and it returns a data frame with one row per input element. It uses the vapply function to return a vector of length 4 for each value provided.

Under this approach, you can simply go

t.test2(MPAmeans$reference_mean, MPAmeans$MPA_mean, MPAmeans$sd_reference, MPAmeans$sd_MPA, MPAmeans$n_reference, MPAmeans$n_MPA)

(or whatever you end up calling your variables)
