How to iterate over rows in a DataFrame in Pandas
DataFrame.iterrows is a generator which yields both the index and row (as a Series):
import pandas as pd
df = pd.DataFrame({'c1': [10, 11, 12], 'c2': [100, 110, 120]})
df = df.reset_index() # make sure indexes pair with number of rows
for index, row in df.iterrows():
    print(row['c1'], row['c2'])
10 100
11 110
12 120
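Because each row comes back as a single Series, values from columns with different dtypes are coerced to one common dtype. A minimal sketch of that effect, assuming a hypothetical frame that mixes an integer and a float column (the name `mixed` and its values are illustrative, not part of the example above):

```python
import pandas as pd

# Hypothetical frame: one integer column and one float column.
mixed = pd.DataFrame({"c1": [10, 11, 12], "c2": [1.5, 2.5, 3.5]})

# Take the first (index, row) pair produced by the generator.
index, row = next(mixed.iterrows())

print(type(row).__name__)   # Series
print(mixed["c1"].dtype)    # the column itself keeps its integer dtype
print(row.dtype)            # float64: the integer was upcast to fit the row
```

If the original dtypes matter, this is one reason to prefer column-wise access or itertuples over iterrows.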
Iterate over rows in pandas DataFrame and create a dict
You could use groupby and a dictionary comprehension:
d = {k:list(v) for k,v in df.groupby('name')['val']}
output:
{'p1': [0.0, 1.0], 'p2': [nan, 1.0, 0.0]}
Using iterrows (not my favorite option):
NB: this will be considerably slower on large DataFrames.
from collections import defaultdict
d = defaultdict(list)
for _, row in df.iterrows():
    d[row['name']].append(row['val'])
dict(d)
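The input frame is not shown in this answer, so here is a self-contained sketch that runs both approaches side by side. The data below is an assumption, chosen only so that it is consistent with the output shown above; the names `by_groupby` and `by_loop` are illustrative.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical input, chosen to be consistent with the output shown above.
df = pd.DataFrame({
    "name": ["p1", "p1", "p2", "p2", "p2"],
    "val": [0.0, 1.0, float("nan"), 1.0, 0.0],
})

# Approach 1: groupby plus a dictionary comprehension.
by_groupby = {k: list(v) for k, v in df.groupby("name")["val"]}

# Approach 2: iterrows plus a defaultdict of lists.
by_loop = defaultdict(list)
for _, row in df.iterrows():
    by_loop[row["name"]].append(row["val"])
by_loop = dict(by_loop)

print(by_groupby)
print(by_loop)
```

Both dictionaries hold the same values per key. Note that comparing them with == can still report False, because NaN never compares equal to NaN.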
Python Pandas iterate over rows and access column names
I also like itertuples()
for row in df.itertuples():
    print(row.A)
    print(row.Index)
Since row is a named tuple, this should be MUCH faster if you mean to access the values in each row.
Speed run:
import time
df = pd.DataFrame([x for x in range(1000*1000)], columns=['A'])
st = time.time()
for index, row in df.iterrows():
    row.A
print(time.time() - st)
45.05799984931946
st = time.time()
for row in df.itertuples():
    row.A
print(time.time() - st)
0.48400020599365234
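itertuples also accepts the index and name parameters, which control whether the index is included and whether named or plain tuples are produced. A small sketch, assuming a hypothetical two-column frame (the column names and values are illustrative):

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
df = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "z"]})

# Default: named tuples with the index as the first field.
first = next(df.itertuples())
print(first._fields)    # ('Index', 'A', 'B')
print(first.A)          # 1

# index=False drops the index; name=None yields plain tuples.
plain = list(df.itertuples(index=False, name=None))
print(plain)            # [(1, 'x'), (2, 'y'), (3, 'z')]
```

Plain tuples skip the named-tuple construction, which can help a little more when attribute access by name is not needed.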
Iterating over rows in a dataframe in Pandas: is there a difference between using df.index and df.iterrows() as iterators?
When looping over df.index, getting the data requires an additional loc lookup:
for index in df.index:
    value = df.loc[index, 'col']
When using df.iterrows, each row is handed to you directly:
for index, row in df.iterrows():
    value = row['col']
Since you are already using pandas, neither of them is recommended unless you need a function that cannot be vectorized.
However, IMO, I prefer df.index.
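To make the vectorization point concrete, here is a minimal sketch comparing a loop over df.index with a single column-wise expression. The frame, the column name 'col', and the doubling operation are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
df = pd.DataFrame({"col": [1.0, 2.0, 3.0]})

# Loop over df.index with one loc lookup per row.
looped = []
for index in df.index:
    looped.append(df.loc[index, "col"] * 2)

# Vectorized: one expression over the whole column, no Python-level loop.
vectorized = df["col"] * 2

print(vectorized.tolist())   # [2.0, 4.0, 6.0]
```

Both give the same values; the vectorized form is the one to reach for whenever the operation can be expressed on whole columns.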
Iterate over rows polars rust
If you activate the rows feature in polars, you can try DataFrame::get_row and DataFrame::get_row_amortized.
The latter is preferred, as that reduces heap allocations by reusing the row buffer.
Anti-pattern
This will be slow. Asking for rows from columnar data storage will incur many cache misses and go through several layers of indirection.
Slightly better
What would be slightly better is using Rust iterators. This will have less indirection than the get_row methods.
df.as_single_chunk_par();
let mut iters = df.columns(["foo", "bar", "ham"])?
    .iter().map(|s| s.iter()).collect::<Vec<_>>();
for row in 0..df.height() {
    for iter in &mut iters {
        let value = iter.next().expect("should have as many iterations as rows");
        // process value
    }
}
If your DataFrame consists of a single data type, you should downcast the Series to a ChunkedArray; this will speed up iteration.
In the snippet below, we'll assume the data type is Float64.
let mut iters = df.columns(["foo", "bar", "ham"])?
    .iter().map(|s| Ok(s.f64()?.into_iter())).collect::<Result<Vec<_>>>()?;
for row in 0..df.height() {
    for iter in &mut iters {
        let value = iter.next().expect("should have as many iterations as rows");
        // process value
    }
}
Iterate over rows of a dataframe based on index in python
If you just want to normalise, you can write the expression directly, using Series.min and Series.max:
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
However, if you want the difference between successive elements, you can use Series.diff:
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
Testing:
df = pd.DataFrame({'time': [0.000000, 0.020373, 0.040598], 'velocity': [0.136731, 0.244889, 0.386443]})
print(df)
#        time  velocity
# 0  0.000000  0.136731
# 1  0.020373  0.244889
# 2  0.040598  0.386443
m = df['time'].min()
df['normtime'] = (df['time'] - m) / (df['time'].max() - m)
df['difftime'] = df['time'].diff() / (df['time'].max() - df['time'].min())
print(df)
#        time  velocity  normtime  difftime
# 0  0.000000  0.136731  0.000000       NaN
# 1  0.020373  0.244889  0.501823  0.501823
# 2  0.040598  0.386443  1.000000  0.498177