What is the most efficient way to loop through dataframes with pandas?
pandas includes a built-in method for iterating over rows:
for index, row in df.iterrows():
    # do some logic here
    pass
Or, if you want it faster, use itertuples().
However, unutbu's suggestion to use NumPy functions to avoid iterating over rows will produce the fastest code.
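As a rough illustration of the two faster options mentioned above, the sketch below computes the same row-wise sum once with itertuples() and once with whole-column NumPy arithmetic. The DataFrame and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"c1": [10, 11, 12], "c2": [100, 110, 120]})

# Row-wise with itertuples(): each row is a namedtuple, fields read by attribute.
totals_loop = [row.c1 + row.c2 for row in df.itertuples(index=False)]

# Vectorized: add the columns as NumPy arrays, with no Python-level loop.
totals_vec = df["c1"].to_numpy() + df["c2"].to_numpy()

print(totals_loop)
print(totals_vec.tolist())
```

Both approaches give the same values; the vectorized form is usually the one to reach for first when the logic can be written as column arithmetic.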
Best way to iterate through elements of a pandas Series
TL;DR
Iterating in pandas is an antipattern and can usually be avoided by vectorizing, applying, aggregating, transforming, or cythonizing.
However, if Series iteration is absolutely necessary, performance will depend on the dtype and index:
Index | Fastest if numpy dtype | Fastest if pandas dtype | Idiomatic |
---|---|---|---|
Unneeded | in s.to_numpy() | in s.array | in s |
Default | in enumerate(s.to_numpy()) | in enumerate(s.array) | in s.items() |
Custom | in zip(s.index, s.to_numpy()) | in s.items() | in s.items() |
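To make the table entries concrete, here is a small sketch of a few of the patterns. The Series contents are hypothetical, and the example only shows the syntax; it does not measure speed.

```python
import pandas as pd

# Hypothetical data: a NumPy-backed Series with a custom index.
s = pd.Series([1.5, 2.5, 3.5], index=["a", "b", "c"])

# Index unneeded, NumPy dtype: iterate over the underlying array.
values = [x for x in s.to_numpy()]

# Custom index, NumPy dtype: pair the index labels with the raw values.
pairs_fast = list(zip(s.index, s.to_numpy()))

# Idiomatic equivalent: items() yields (label, value) pairs.
pairs_idiomatic = list(s.items())

# pandas (nullable) dtype with a default index: iterate over s.array.
s_nullable = pd.Series([1, 2, None], dtype="Int64")
positions = [(i, x) for i, x in enumerate(s_nullable.array)]
```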
Faster way to iterate through Pandas Dataframe?
Use str.extract with a regex pattern to avoid a loop:
import re
pattern = fr"({'|'.join(fruit_list)})"
df['fruit'] = df['fruit_source'].str.extract(pattern, flags=re.IGNORECASE) \
                                .fillna('fruit not found')
Output:
>>> df
    fruit_source  value            fruit
0     Apple farm     10            Apple
1   Banana field     15           Banana
2  Coconut beach     14          Coconut
3     corn field     10  fruit not found
>>> pattern
'(apple|banana|coconut)'
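The snippet above relies on df and fruit_list being defined already. A self-contained sketch follows, with the input reconstructed from the output shown (so the exact data is an assumption). It also adds two things that are not in the original answer: re.escape, in case a fruit name contains regex metacharacters, and expand=False, so that str.extract returns a Series.

```python
import re
import pandas as pd

# Assumed inputs, reconstructed from the output shown above.
fruit_list = ["apple", "banana", "coconut"]
df = pd.DataFrame({
    "fruit_source": ["Apple farm", "Banana field", "Coconut beach", "corn field"],
    "value": [10, 15, 14, 10],
})

# One capture group containing every fruit name as an alternative.
pattern = fr"({'|'.join(map(re.escape, fruit_list))})"

# Extract the first case-insensitive match per row; rows with no match become NaN.
df["fruit"] = (
    df["fruit_source"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .fillna("fruit not found")
)
print(df)
```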
How to iterate over rows in a DataFrame in Pandas
DataFrame.iterrows is a generator which yields both the index and row (as a Series):
import pandas as pd
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index()  # make sure indexes pair with number of rows
for index, row in df.iterrows():
    print(row['c1'], row['c2'])
10 100
11 110
12 120
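Because each row is returned as a Series, iterrows() does not preserve column dtypes across a row: values from columns of different types are converted to a common type. The sketch below, using made-up column names, shows an integer column coming back as a float, while itertuples() keeps the integer.

```python
import pandas as pd

# Hypothetical data: one integer column and one float column.
df = pd.DataFrame({"qty": [1, 2], "price": [0.5, 1.5]})

# iterrows() packs each row into one Series, so both values share a float dtype.
_, first_row = next(df.iterrows())
print(first_row.dtype)

# itertuples() returns a namedtuple, so qty stays an integer.
first_tuple = next(df.itertuples(index=False))
print(type(first_tuple.qty))
```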
Most efficient way to loop through and update rows in a large pandas dataframe
I believe it is possible to use:
# thanks @Chris A for another solution
t = pd.to_datetime(df['timestamp'], unit='ms')
# alternatives: convert milliseconds to seconds first, then pass unit='s'
# t = pd.to_datetime(df['timestamp'].astype(int) / 1000, unit='s')
# t = pd.to_datetime(df['timestamp'].apply(int) / 1000, unit='s')
# t = pd.to_datetime([int(x) / 1000 for x in df['timestamp']], unit='s')
df['Time'] = t.dt.strftime('%H:%M:%S')
df['Hour'] = t.dt.hour
df['ChatDate'] = t.dt.strftime('%d-%m-%Y')
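For a runnable version of the same idea, the sketch below assumes the timestamp column holds epoch milliseconds stored as strings. That input format is a guess, since the question's data is not shown here, and the sample values are invented.

```python
import pandas as pd

# Hypothetical input: epoch timestamps in milliseconds, stored as strings.
df = pd.DataFrame({"timestamp": ["1546300800000", "1546304461000"]})

# Convert the whole column at once; unit="ms" says the integers are milliseconds.
t = pd.to_datetime(df["timestamp"].astype("int64"), unit="ms")

# Derive the formatted columns with the vectorized .dt accessor.
df["Time"] = t.dt.strftime("%H:%M:%S")
df["Hour"] = t.dt.hour
df["ChatDate"] = t.dt.strftime("%d-%m-%Y")
print(df)
```

The resulting times are in UTC, because epoch values carry no time zone of their own.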
Efficient way to iterate through a large dataframe
Let's set aside for a moment the potential issues with the methodology (think about how your results would look if 100k shares traded at a price of 50-51 and 100k traded at 50-59).
Below is a set of commented steps that should achieve your goal:
# Initialize DataFrame.
df = pd.DataFrame({'low': [10, 15, 41, 30],
                   'high': [20, 22, 47, 39],
                   'volume': [45667, 256565, 45645, 547343]})
# Initialize a price dictionary spanning range of potential prices.
d = {price: 0 for price in range(min(df.low), max(df.high) + 1)}
# Create helper function to add volume to given price bucket.
def add_volume(price_dict, price, volume):
    price_dict[price] += volume
# Use a nested list comprehension to call the function and populate the dictionary.
_ = [[add_volume(d, price, volume) for price in range(low, high + 1)]
     for low, high, volume in zip(df.low, df.high, df.volume)]
# Convert the dictionary to a DataFrame and output to csv.
idx = pd.Index(d.keys(), name='price')
df = pd.DataFrame(d.values(), index=idx, columns=['total_volume_traded'])
df.to_csv('output.csv')
>>> df
total_volume_traded
price
10 45667
11 45667
12 45667
13 45667
14 45667
15 302232
16 302232
17 302232
18 302232
19 302232
20 302232
21 256565
22 256565
23 0
24 0
25 0
26 0
27 0
28 0
29 0
30 547343
31 547343
32 547343
33 547343
34 547343
35 547343
36 547343
37 547343
38 547343
39 547343
40 0
41 45645
42 45645
43 45645
44 45645
45 45645
46 45645
47 45645
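If the nested comprehension above becomes slow on a large frame, one possible alternative (not part of the original answer) is a difference array: add each row's volume at its low price, subtract it just past its high price, and take a cumulative sum. The sketch below uses the same sample data and should give the same totals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"low": [10, 15, 41, 30],
                   "high": [20, 22, 47, 39],
                   "volume": [45667, 256565, 45645, 547343]})

lo, hi = int(df["low"].min()), int(df["high"].max())

# Difference array: +volume where a range starts, -volume one step past its end.
delta = np.zeros(hi - lo + 2, dtype=np.int64)
np.add.at(delta, df["low"].to_numpy() - lo, df["volume"].to_numpy())
np.subtract.at(delta, df["high"].to_numpy() - lo + 1, df["volume"].to_numpy())

# The running sum recovers the total volume at each price from lo to hi.
totals = pd.Series(np.cumsum(delta[:-1]),
                   index=pd.RangeIndex(lo, hi + 1, name="price"),
                   name="total_volume_traded")
print(totals)
```

np.add.at and np.subtract.at are used instead of plain indexed assignment so that rows sharing the same low or high price accumulate correctly.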