What Is the Most Efficient Way to Loop Through Dataframes with Pandas


pandas includes a built-in method, DataFrame.iterrows(), for iterating over rows:

for index, row in df.iterrows():
    # do some logic here

Or, if you want it to run faster, use itertuples().

But unutbu's suggestion to use NumPy functions to avoid iterating over rows will produce the fastest code.
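
To make that ranking concrete, here is a minimal sketch of mine (the two-column frame is invented for illustration) comparing the three approaches; the vectorized version pushes the multiply-and-sum into compiled NumPy code instead of a Python-level loop:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# iterrows: each row is boxed into a Series (slowest)
total = sum(row['a'] * row['b'] for _, row in df.iterrows())

# itertuples: rows are lightweight namedtuples (faster)
total = sum(row.a * row.b for row in df.itertuples(index=False))

# vectorized: the multiply and sum run in compiled NumPy code (fastest)
total = (df['a'] * df['b']).sum()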

Best way to iterate through elements of pandas Series

TL;DR

Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.
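
For instance, the same column-wise computation can usually be written in any of these forms; a rough sketch on a toy Series (illustrative, not from the original answer):

import pandas as pd

s = pd.Series([1, 2, 3, 4])

doubled = s * 2                         # vectorizing
doubled = s.apply(lambda x: x * 2)      # applying (still a Python-level loop)
total = s.agg('sum')                    # aggregating
doubled = s.transform(lambda x: x * 2)  # transforming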

However, if Series iteration is absolutely necessary, performance will depend on the dtype and index:

Index      Fastest if numpy dtype          Fastest if pandas dtype    Idiomatic
Unneeded   in s.to_numpy()                 in s.array                 in s
Default    in enumerate(s.to_numpy())      in enumerate(s.array)      in s.items()
Custom     in zip(s.index, s.to_numpy())   in s.items()               in s.items()
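
A short sketch of the idioms from the table, on a small numeric Series invented for illustration:

import pandas as pd

s = pd.Series([10, 20, 30])

# Index unneeded: iterate over values only
for value in s.to_numpy():    # fastest for numpy dtypes
    print(value)

# Index needed: iterate over (index, value) pairs
for idx, value in s.items():  # idiomatic
    print(idx, value)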

Faster way to iterate through Pandas Dataframe?

Use str.extract with a regex pattern to avoid a loop:

import re
import pandas as pd

fruit_list = ['apple', 'banana', 'coconut']
df = pd.DataFrame({'fruit_source': ['Apple farm', 'Banana field', 'Coconut beach', 'corn field'],
                   'value': [10, 15, 14, 10]})

pattern = fr"({'|'.join(fruit_list)})"
df['fruit'] = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE) \
                                .fillna('fruit not found')

Output:

>>> df
fruit_source value fruit
0 Apple farm 10 Apple
1 Banana field 15 Banana
2 Coconut beach 14 Coconut
3 corn field 10 fruit not found

>>> pattern
'(apple|banana|coconut)'

How to iterate over rows in a DataFrame in Pandas

DataFrame.iterrows is a generator which yields both the index and row (as a Series):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index() # make sure indexes pair with number of rows

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output:

10 100
11 110
12 120
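
One caveat worth adding (documented pandas behavior): because iterrows boxes each row into a Series, dtypes are not preserved across the row, so mixed-type frames get upcast:

df2 = pd.DataFrame({'i': [1], 'f': [1.5]})
_, row = next(df2.iterrows())
print(type(row['i']))  # <class 'numpy.float64'> -- the int was upcast to float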

Most efficient way to loop through and update rows in a large pandas dataframe

I believe it is possible to use:

# thanks @Chris A for another solution
t = pd.to_datetime(df['timestamp'], unit='ms')

# alternatives (dividing milliseconds by 1000 gives seconds, so pass unit='s'):
# t = pd.to_datetime(df['timestamp'].astype(int) / 1000, unit='s')
# t = pd.to_datetime(df['timestamp'].apply(int) / 1000, unit='s')
# t = pd.to_datetime([int(x) / 1000 for x in df['timestamp']], unit='s')

df['Time'] = t.dt.strftime('%H:%M:%S')
df['Hour'] = t.dt.hour
df['ChatDate'] = t.dt.strftime('%d-%m-%Y')
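
As a quick sanity check, a sketch assuming the timestamps are Unix epochs in milliseconds (the sample values are invented):

import pandas as pd

df = pd.DataFrame({'timestamp': [1577836800000, 1577880000000]})
t = pd.to_datetime(df['timestamp'], unit='ms')
print(t.dt.strftime('%H:%M:%S').tolist())  # ['00:00:00', '12:00:00']
print(t.dt.hour.tolist())                  # [0, 12]
print(t.dt.strftime('%d-%m-%Y').tolist())  # ['01-01-2020', '01-01-2020']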

Efficient way to iterate through a large dataframe

Let's set aside for a moment the potential issues with this methodology (consider how your results would look if 100k shares traded in a 50-51 range and another 100k traded in a 50-59 range: every price bucket a trade spans receives the trade's full volume, so the wider trade is counted ten times over while the narrower one is counted twice).

Below is a set of commented steps that should achieve your goal:

import pandas as pd

# Initialize DataFrame.
df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

# Initialize a price dictionary spanning the range of potential prices.
d = {price: 0 for price in range(min(df.low), max(df.high) + 1)}

# Create a helper function that adds volume to a given price bucket.
def add_volume(price_dict, price, volume):
    price_dict[price] += volume

# Use a nested list comprehension to call the function and populate the dictionary.
_ = [[add_volume(d, price, volume) for price in range(low, high + 1)]
     for low, high, volume in zip(df.low, df.high, df.volume)]

# Convert the dictionary to a DataFrame and output to csv.
idx = pd.Index(d.keys(), name='price')
df = pd.DataFrame(d.values(), index=idx, columns=['total_volume_traded'])
df.to_csv('output.csv')

>>> df
total_volume_traded
price
10 45667
11 45667
12 45667
13 45667
14 45667
15 302232
16 302232
17 302232
18 302232
19 302232
20 302232
21 256565
22 256565
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 0
41 45645
42 45645
43 45645
44 45645
45 45645
46 45645
47 45645
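
In the same vectorizing spirit as the earlier answers, the nested comprehension could also be replaced by a broadcasted NumPy computation; this is a sketch of my own on the same toy data, not part of the original answer (the full price-by-row mask trades memory for speed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

lo, hi, vol = (df[c].to_numpy() for c in ('low', 'high', 'volume'))
prices = np.arange(lo.min(), hi.max() + 1)

# mask[p, r] is True when price level p falls inside row r's [low, high] range
mask = (prices[:, None] >= lo) & (prices[:, None] <= hi)
totals = (mask * vol).sum(axis=1)

out = pd.DataFrame({'total_volume_traded': totals},
                   index=pd.Index(prices, name='price'))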

