How to apply a function on every row on a dataframe?
The following should work:
import math

def EOQ(D, p, ck, ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q

ch = 0.2
ck = 5
df['Q'] = df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
df
If all you're doing is calculating the square root of some result, use np.sqrt instead. It is vectorised and will be significantly faster:
In [80]:
df['Q'] = np.sqrt((2*df['D']*ck)/(ch*df['p']))
df
Out[80]:
    D   p          Q
0  10  20   5.000000
1  20  30   5.773503
2  30  10  12.247449
Timings
For a 30k row df:
In [92]:
import math
ch=0.2
ck=5
def EOQ(D, p, ck, ch):
    Q = math.sqrt((2*D*ck)/(ch*p))
    return Q
%timeit np.sqrt((2*df['D']*ck)/(ch*df['p']))
%timeit df.apply(lambda row: EOQ(row['D'], row['p'], ck, ch), axis=1)
1000 loops, best of 3: 622 µs per loop
1 loops, best of 3: 1.19 s per loop
You can see that the np.sqrt method is ~1900x faster (622 µs versus 1.19 s).
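To reproduce the comparison outside IPython, something along these lines could be used. It is only a sketch: the synthetic 30k-row frame, the random seed, and the use of time.perf_counter instead of %timeit are assumptions made for illustration, and absolute timings will vary by machine and pandas version.

```python
import math
import time

import numpy as np
import pandas as pd


def EOQ(D, p, ck, ch):
    return math.sqrt((2 * D * ck) / (ch * p))


ch = 0.2
ck = 5

# Synthetic data (an assumption; the original 30k-row frame is not shown).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "D": rng.integers(1, 100, size=30_000),
    "p": rng.integers(1, 100, size=30_000),
})

# Vectorised: one NumPy call over whole columns.
start = time.perf_counter()
q_vectorised = np.sqrt((2 * df["D"] * ck) / (ch * df["p"]))
t_vectorised = time.perf_counter() - start

# Row-wise: one Python function call per row.
start = time.perf_counter()
q_apply = df.apply(lambda row: EOQ(row["D"], row["p"], ck, ch), axis=1)
t_apply = time.perf_counter() - start

# Both approaches should agree up to floating-point rounding.
same = np.allclose(q_vectorised, q_apply)
print(f"vectorised: {t_vectorised:.4f} s, apply: {t_apply:.4f} s, same result: {same}")
```

A single perf_counter measurement is noisy, so treat the printed numbers as a rough indication rather than a benchmark.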
How to apply a function on all rows of a DataFrame
Use DataFrame.apply with axis=1. It applies the function to each row and returns a Series.
df['new'] = df.apply(igroups, axis = 1)
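Since igroups comes from the question and is not shown here, the following sketch uses a hypothetical row function, row_range, and made-up data to show what the function receives when axis=1 is used.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 9]})


def row_range(row):
    # `row` is a Series indexed by the column labels ("a" and "b").
    return row.max() - row.min()


# One call per row; the results come back as a Series aligned to df.index.
df["new"] = df.apply(row_range, axis=1)
print(df)
```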
Apply function row wise to pandas dataframe
Since you have hilbert(df.col_1, df.col_2) in the apply, that's immediately trying to call your function with the full pd.Series objects for those two columns, triggering that error. What you should be doing is:
df.apply(lambda x: hilbert(x['col_1'], x['col_2']), axis=1)
so that the lambda function given will be applied to each row.
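The difference can be seen with a small stand-in. The hilbert function from the question is not shown, so safe_ratio below is a hypothetical scalar-only function, and the data is made up; the point is only that a function written for scalars breaks when handed whole columns and works when called per row.

```python
import pandas as pd


def safe_ratio(a, b):
    # Hypothetical stand-in for hilbert(): the `if` only works on scalars.
    if b == 0:
        return 0.0
    return a / b


df = pd.DataFrame({"col_1": [1.0, 4.0, 9.0], "col_2": [2.0, 0.0, 3.0]})

# Passing whole columns fails, because `b == 0` is a Series, not a bool.
try:
    safe_ratio(df.col_1, df.col_2)
    whole_column_call_failed = False
except ValueError:
    whole_column_call_failed = True

# Passing one row at a time gives the function the scalars it expects.
result = df.apply(lambda x: safe_ratio(x["col_1"], x["col_2"]), axis=1)
print(whole_column_call_failed)
print(result)
```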
Apply Function to Every Row in pandas DataFrame Using Multiple Column Values as Parameters
You can do this without changing your example_hash() method. Just use np.vectorize:
In [2204]: import numpy as np
In [2200]: def example_hash(name: str, age: int) -> str:
      ...:     return "In 10 years {} will be {}".format(name, age+10)
      ...:
In [2202]: df['new'] = np.vectorize(example_hash)(df['name'], df['age'])
In [2203]: df
Out[2203]:
    name  age  height                           new
0    Bob   20     2.0    In 10 years Bob will be 30
1  Alice   40     2.1  In 10 years Alice will be 50
Or use df.apply with a lambda like this, without changing your custom method:
In [2207]: df['new'] = df.apply(lambda x: example_hash(x['name'], x['age']), axis=1)
In [2208]: df
Out[2208]:
    name  age  height                           new
0    Bob   20     2.0    In 10 years Bob will be 30
1  Alice   40     2.1  In 10 years Alice will be 50
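For a version that runs outside IPython, the two options can be put side by side as below. The frame is rebuilt from the output shown above. Note that the NumPy documentation describes np.vectorize as a convenience whose implementation is essentially a for loop, so neither option should be expected to perform like a truly vectorised operation.

```python
import numpy as np
import pandas as pd


def example_hash(name: str, age: int) -> str:
    return "In 10 years {} will be {}".format(name, age + 10)


# Frame rebuilt from the output shown in the answer.
df = pd.DataFrame({"name": ["Bob", "Alice"], "age": [20, 40], "height": [2.0, 2.1]})

# Option 1: np.vectorize passes the columns element by element.
via_vectorize = np.vectorize(example_hash)(df["name"], df["age"])

# Option 2: df.apply passes one row (a Series) at a time.
via_apply = df.apply(lambda x: example_hash(x["name"], x["age"]), axis=1)

# Both routes call example_hash once per row and give the same strings.
print(list(via_vectorize) == list(via_apply))
```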
Applying a function to every row in a dataframe column in Pandas
When you're working with vanilla strings, you call the functions directly. When working with pandas columns directly, use the str accessor methods.
Case 1
As mentioned in my comment, use the str methods:
df
            Text
0        I am me
1   I am not you
2  I will be him

df['Text'] = df['Text'].str.split().str[:-1].str.join(' ')
        Text
0       I am
1   I am not
2  I will be
Case 2
Alternatively, when working with apply on a single column, the lambda receives a string (not a pd.Series), so .str accessor methods aren't involved.
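A minimal sketch of Case 2, using the same three strings as Case 1. The lambda body is one possible way to drop the last word with plain Python string methods; it is not taken from the original answer.

```python
import pandas as pd

df = pd.DataFrame({"Text": ["I am me", "I am not you", "I will be him"]})

# Each `s` is a plain Python str, so ordinary string methods apply
# and no .str accessor is needed.
df["Text"] = df["Text"].apply(lambda s: " ".join(s.split()[:-1]))
print(df)
```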
Apply function to each DataFrame row, without returning a Series
This operation can already be vectorized directly, so you can avoid .apply() entirely, which will be tremendously faster.
See the canonical answer for "How to iterate over rows in a DataFrame in Pandas?"
You won't be able to avoid using memory for the results because they need to go somewhere, but you could throw out columns you no longer need before or after performing the calculation.
Just keeping the results in a dataframe column (Series) rather than a list of native ints will save memory, but you may find that explicitly setting or reducing the datatypes of your dataframe is a big saving if they're not in their most efficient types already (for example, from int64 to uint16 or even uint8, which will still contain the example values).
>>> df = pd.DataFrame({"col1": [2,10], "col2": [3,12], "col3": [5,4]})
>>> df
   col1  col2  col3
0     2     3     5
1    10    12     4
>>> df["2xy"] = 2 * df["col2"] * df["col3"]
>>> df
   col1  col2  col3  2xy
0     2     3     5   30
1    10    12     4   96
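The dtype reduction mentioned above could look roughly like this. It is a sketch under the assumption that all values are small non-negative integers, as in the example; pd.to_numeric with downcast="unsigned" picks the smallest unsigned type that fits each column.

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 10], "col2": [3, 12], "col3": [5, 4]})
df["2xy"] = 2 * df["col2"] * df["col3"]

bytes_before = df.memory_usage(index=False).sum()

# Downcast each column to the smallest unsigned integer type that holds its values.
small = df.apply(pd.to_numeric, downcast="unsigned")

bytes_after = small.memory_usage(index=False).sum()
print(small.dtypes)
print(bytes_before, "->", bytes_after)
```

Here the arithmetic is done before downcasting. Fixed-width integer types wrap around on overflow, so multiplying columns that are already uint8 is only safe when the products also fit in that type.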
How to apply a function to every element in a dataframe?
Since your problem requires access to both the index and column labels of your df, you probably want df.apply().
df.apply() has access to a pandas.Series representing each row/column (dependent on the axis argument value), so you will have access to the column name and index; whereas df.applymap() utilises each individual value of df at runtime, so you wouldn't necessarily have access to the index and column name as required.
Example
import numpy as np
import pandas as pd

def foo(name, index):
    return name - index

x = np.arange(0, 2.01, 0.25)
y = np.arange(10, 30, 5.0)
df = pd.DataFrame(index = x, columns = y)
df.apply(lambda x: foo(x.name, x.index))
Output
       10.0   15.0   20.0   25.0
0.00  10.00  15.00  20.00  25.00
0.25   9.75  14.75  19.75  24.75
0.50   9.50  14.50  19.50  24.50
0.75   9.25  14.25  19.25  24.25
1.00   9.00  14.00  19.00  24.00
1.25   8.75  13.75  18.75  23.75
1.50   8.50  13.50  18.50  23.50
1.75   8.25  13.25  18.25  23.25
2.00   8.00  13.00  18.00  23.00
In the above example, the column name and index of each Series constituting df are passed to foo() by way of df.apply(). Within foo(), each value is defined by subtracting its own index value from its own column name value. Here you can see that the index value for each row is accessed using x.index and the column value is accessed using x.name within the call to df.apply().
Update
Many thanks to @SyntaxError for pointing out that x.index and x.name could be passed to foo() within df.apply() instead of feeding the entire Series (x) into the function and accessing the values manually therein. As mentioned, this seems to fit OP's use case in a much neater manner than my original response, which was largely the same but passed each x series into foo(), which then had responsibility for extracting x.name and x.index.
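The same pattern can be written so that the per-column result is an explicit Series, which keeps the row labels unambiguous. The sketch below uses a smaller grid than the example above and, as a cross-check that is not part of the original answer, builds the same values with plain NumPy broadcasting.

```python
import numpy as np
import pandas as pd

x = np.array([0.0, 0.25, 0.5])   # row labels
y = np.array([10.0, 15.0])       # column labels
df = pd.DataFrame(index=x, columns=y)


def foo(name, index):
    return name - index


# With the default axis=0, each `col` is one column: col.name is its label
# and col.index holds the row labels.
via_apply = df.apply(
    lambda col: pd.Series(foo(col.name, col.index.to_numpy()), index=col.index)
)

# The same grid via NumPy broadcasting, with no apply at all.
via_broadcast = pd.DataFrame(y[None, :] - x[:, None], index=x, columns=y)

print(via_apply)
```

When the computation depends only on the labels, as here, the broadcasting form avoids the per-column Python call altogether.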
R apply() custom function to every row in data frame
Another approach is modifying your existing function such that it is vectorised.
t.test2 <- function(m1, m2, s1, s2, n1, n2, m0 = 0, equal.variance = FALSE)
{
  if (!equal.variance)
  {
    se <- sqrt( (s1^2/n1) + (s2^2/n2) )
    # welch-satterthwaite df
    df <- ( (s1^2/n1 + s2^2/n2)^2 )/( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
  } else
  {
    # pooled standard deviation, scaled by the sample sizes
    se <- sqrt( (1/n1 + 1/n2) * ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2) )
    df <- n1+n2-2
  }
  t <- (m1-m2-m0)/se
  # one result vector per input element; the p-value as written is two-tailed
  dat <- vapply(seq_len(length(m1)),
                function(x){c(m1[x]-m2[x], se[x], t[x], 2*pt(-abs(t[x]),df[x]))},
                numeric(4))
  dat <- t(dat)
  dat <- as.data.frame(dat)
  names(dat) <- c("Difference of means", "Std Error", "t", "p-value")
  return(dat)
}
This approach allows you to pass in vectors for your various inputs, and it will return a data frame with one row per input element. It uses the vapply function to return a vector of length 4 for each value provided.
Under this approach, you can simply call
t.test2(MPAmeans$reference_mean, MPAmeans$MPA_mean, MPAmeans$sd_reference, MPAmeans$sd_MPA, MPAmeans$n_reference, MPAmeans$n_MPA)
(or whatever you end up calling your variables).
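For readers working in pandas rather than R, the same vectorised idea can be sketched with NumPy. This is an illustrative port of the unequal-variance branch only: the function name welch_from_stats and the sample numbers are hypothetical, and the p-value is left out because it needs a t-distribution CDF, which NumPy itself does not provide.

```python
import numpy as np
import pandas as pd


def welch_from_stats(m1, m2, s1, s2, n1, n2, m0=0.0):
    # Element-wise Welch statistics; each input may be a scalar or an array.
    m1, m2, s1, s2, n1, n2 = (
        np.atleast_1d(np.asarray(a, dtype=float)) for a in (m1, m2, s1, s2, n1, n2)
    )
    v1, v2 = s1**2 / n1, s2**2 / n2
    se = np.sqrt(v1 + v2)
    # Welch-Satterthwaite degrees of freedom
    dof = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    t = (m1 - m2 - m0) / se
    return pd.DataFrame(
        {"Difference of means": m1 - m2, "Std Error": se, "t": t, "df": dof}
    )


# Made-up summary statistics for two comparisons.
out = welch_from_stats([10, 12], [8, 12], [2, 2], [2, 2], [4, 4], [4, 4])
print(out)
```

Because every step is an array operation, no per-row loop or apply is needed; the whole table is computed in one pass.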