What Is the Most Efficient Way to Loop Through Dataframes with Pandas


pandas includes a built-in method, DataFrame.iterrows(), for iterating over rows:

for index, row in df.iterrows():
    # do some logic here

Or, if you want it to run faster, use itertuples().

But unutbu's suggestion to use NumPy functions to avoid iterating over rows will produce the fastest code.
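
To make that ranking concrete, here is a minimal sketch of mine (the two-column frame is invented for illustration) comparing the three approaches; the vectorized version pushes the multiply-and-sum into compiled NumPy code instead of a Python-level loop:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# iterrows: each row is boxed into a Series (slowest)
total = sum(row['a'] * row['b'] for _, row in df.iterrows())

# itertuples: rows are lightweight namedtuples (faster)
total = sum(row.a * row.b for row in df.itertuples(index=False))

# vectorized: the multiply and sum run in compiled NumPy code (fastest)
total = (df['a'] * df['b']).sum()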

Best way to iterate through elements of pandas Series

TL;DR

Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.
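
For instance, the same column-wise computation can usually be written in any of these forms; a rough sketch on a toy Series (illustrative, not from the original answer):

import pandas as pd

s = pd.Series([1, 2, 3, 4])

doubled = s * 2                         # vectorizing
doubled = s.apply(lambda x: x * 2)      # applying (still a Python-level loop)
total = s.agg('sum')                    # aggregating
doubled = s.transform(lambda x: x * 2)  # transforming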

However, if Series iteration is absolutely necessary, performance will depend on the dtype and index:

Index      Fastest if numpy dtype          Fastest if pandas dtype    Idiomatic
Unneeded   in s.to_numpy()                 in s.array                 in s
Default    in enumerate(s.to_numpy())      in enumerate(s.array)      in s.items()
Custom     in zip(s.index, s.to_numpy())   in s.items()               in s.items()
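
A short sketch of the idioms from the table, on a small numeric Series invented for illustration:

import pandas as pd

s = pd.Series([10, 20, 30])

# Index unneeded: iterate over values only
for value in s.to_numpy():    # fastest for numpy dtypes
    print(value)

# Index needed: iterate over (index, value) pairs
for idx, value in s.items():  # idiomatic
    print(idx, value)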

Faster way to iterate through Pandas Dataframe?

Use str.extract with a regex pattern to avoid a loop:

import re
import pandas as pd

fruit_list = ['apple', 'banana', 'coconut']
df = pd.DataFrame({'fruit_source': ['Apple farm', 'Banana field', 'Coconut beach', 'corn field'],
                   'value': [10, 15, 14, 10]})

pattern = fr"({'|'.join(fruit_list)})"
df['fruit'] = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE) \
                                .fillna('fruit not found')

Output:

>>> df
fruit_source value fruit
0 Apple farm 10 Apple
1 Banana field 15 Banana
2 Coconut beach 14 Coconut
3 corn field 10 fruit not found

>>> pattern
'(apple|banana|coconut)'

How to iterate over rows in a DataFrame in Pandas

DataFrame.iterrows is a generator which yields both the index and row (as a Series):

import pandas as pd

df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index() # make sure indexes pair with number of rows

for index, row in df.iterrows():
    print(row['c1'], row['c2'])

Output:

10 100
11 110
12 120
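
One caveat worth adding (documented pandas behavior): because iterrows boxes each row into a Series, dtypes are not preserved across the row, so mixed-type frames get upcast:

df2 = pd.DataFrame({'i': [1], 'f': [1.5]})
_, row = next(df2.iterrows())
print(type(row['i']))  # <class 'numpy.float64'> -- the int was upcast to float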

Most efficient way to loop through and update rows in a large pandas dataframe

I believe it is possible to use:

# thanks @Chris A for another solution
t = pd.to_datetime(df['timestamp'], unit='ms')

# alternatives (dividing milliseconds by 1000 gives seconds, so pass unit='s'):
# t = pd.to_datetime(df['timestamp'].astype(int) / 1000, unit='s')
# t = pd.to_datetime(df['timestamp'].apply(int) / 1000, unit='s')
# t = pd.to_datetime([int(x) / 1000 for x in df['timestamp']], unit='s')

df['Time'] = t.dt.strftime('%H:%M:%S')
df['Hour'] = t.dt.hour
df['ChatDate'] = t.dt.strftime('%d-%m-%Y')
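
As a quick sanity check, a sketch assuming the timestamps are Unix epochs in milliseconds (the sample values are invented):

import pandas as pd

df = pd.DataFrame({'timestamp': [1577836800000, 1577880000000]})
t = pd.to_datetime(df['timestamp'], unit='ms')
print(t.dt.strftime('%H:%M:%S').tolist())  # ['00:00:00', '12:00:00']
print(t.dt.hour.tolist())                  # [0, 12]
print(t.dt.strftime('%d-%m-%Y').tolist())  # ['01-01-2020', '01-01-2020']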

Efficient way to iterate through a large dataframe

Let's set aside for a moment the potential issues with this methodology (consider how your results would look if 100k shares traded in a 50-51 range and another 100k traded in a 50-59 range: every price bucket a trade spans receives the trade's full volume, so the wider trade is counted ten times over while the narrower one is counted twice).

Below is a set of commented steps that should achieve your goal:

import pandas as pd

# Initialize DataFrame.
df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

# Initialize a price dictionary spanning the range of potential prices.
d = {price: 0 for price in range(min(df.low), max(df.high) + 1)}

# Create a helper function that adds volume to a given price bucket.
def add_volume(price_dict, price, volume):
    price_dict[price] += volume

# Use a nested list comprehension to call the function and populate the dictionary.
_ = [[add_volume(d, price, volume) for price in range(low, high + 1)]
     for low, high, volume in zip(df.low, df.high, df.volume)]

# Convert the dictionary to a DataFrame and output to csv.
idx = pd.Index(d.keys(), name='price')
df = pd.DataFrame(d.values(), index=idx, columns=['total_volume_traded'])
df.to_csv('output.csv')

>>> df
total_volume_traded
price
10 45667
11 45667
12 45667
13 45667
14 45667
15 302232
16 302232
17 302232
18 302232
19 302232
20 302232
21 256565
22 256565
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 0
41 45645
42 45645
43 45645
44 45645
45 45645
46 45645
47 45645
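
In the same vectorizing spirit as the earlier answers, the nested comprehension could also be replaced by a broadcasted NumPy computation; this is a sketch of my own on the same toy data, not part of the original answer (the full price-by-row mask trades memory for speed):

import numpy as np
import pandas as pd

df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})

lo, hi, vol = (df[c].to_numpy() for c in ('low', 'high', 'volume'))
prices = np.arange(lo.min(), hi.max() + 1)

# mask[p, r] is True when price level p falls inside row r's [low, high] range
mask = (prices[:, None] >= lo) & (prices[:, None] <= hi)
totals = (mask * vol).sum(axis=1)

out = pd.DataFrame({'total_volume_traded': totals},
                   index=pd.Index(prices, name='price'))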

