Calculate Differences Between Rows Faster Than a for Loop

Calculate differences between rows faster than a for loop?

This should work if the dates are in order within id.

id<-c(123,123,124,124)
date<-as.Date(c('2010-01-15','2010-01-01','2010-03-05','2010-01-05'))
score<-c(10,15,20,30)
data<-data.frame(id,date,score)

data <- data[order(data$id,data$date),]
data$dayssincelast<-do.call(c,by(data$date,data$id,function(x) c(NA,diff(x))))
# Or, even more concisely
data$dayssincelast<-unlist(by(data$date,data$id,function(x) c(NA,diff(x))))

Time difference between rows of data. Is there a faster way to do this than a for loop?

After some quick research on Stack Overflow, I found that converting the data into a list and using do.call (as suggested in an answer to a related question) was much faster:

start.time.A <- Sys.time()
L <- list(days.seq)                # days.seq is the date vector from the question
LS <- do.call(diff, L)             # differences between consecutive dates
attributes(LS) <- NULL             # drop the difftime attributes
LS <- rbind(0, as.data.frame(LS))  # pad the first row with 0
str(LS)
start.time.B <- Sys.time()
difftime(start.time.B, start.time.A, units = "secs")

pandas - iterate over rows and calculate - faster

IIUC, you can do:

import numpy as np  # needed for np.nan below

df['overlap_count'] = 0
for i in range(1, start_at_nr + 1):
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))

# mask the first few rows
df.iloc[:start_at_nr, -1] = np.nan

Output:

   col1  col2  overlap_count
0    20    39            NaN
1    23    32            NaN
2    40    42            NaN
3    41    50            1.0
4    48    63            1.0
5    49    68            2.0
6    50    68            3.0
7    50    69            3.0

Takes about 11 ms for 800 rows and start_at_nr=3.
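For reference, here is a minimal sketch that reproduces the output above; the col1/col2 values are taken from the table, start_at_nr = 3 as stated, and the cast to float is only there so the masked rows can hold NaN:

import numpy as np
import pandas as pd

# small frame with the values from the output table above
df = pd.DataFrame({'col1': [20, 23, 40, 41, 48, 49, 50, 50],
                   'col2': [39, 32, 42, 50, 63, 68, 68, 69]})
start_at_nr = 3

df['overlap_count'] = 0
for i in range(1, start_at_nr + 1):
    # count how many of the previous start_at_nr col2 values are >= this row's col1
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))

# cast to float so the first rows can store NaN, then mask them
df['overlap_count'] = df['overlap_count'].astype(float)
df.iloc[:start_at_nr, -1] = np.nan
print(df)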

Pandas - Iterate over rows and compare previous values - faster

I think you can use numba to improve performance. It only works with numeric values, so -1 is used instead of 'x' and the new column is filled with 0 instead of an empty string:

df["overlap_count"] = 0  #create new column
n = 3 #if x >= n, then value = 0

a = df[['col1','col2','overlap_count']].values

from numba import njit

@njit
def custom_sum(arr, n):
for row in range(arr.shape[0]):
x = (arr[0:row, 1] > arr[row, 0]).sum()
arr[row, 2] = x
if x >= n:
arr[row, 1] = 0
arr[row, 2] = -1
return arr

df1 = pd.DataFrame(custom_sum(a, n), columns=df.columns)
print (df1)
    col1  col2  overlap_count
0     20    39              0
1     23    32              1
2     40    42              0
3     41    50              1
4     46    63              1
5     47    67              2
6     48     0             -1
7     49     0             -1
8     50    68              2
9     50     0             -1
10    52     0             -1
11    55     0             -1
12    56     0             -1
13    69    71              0
14    70    66              1

Performance:

d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
     'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)

# 4500 rows
df = pd.concat([df] * 300, ignore_index=True)

print(df)
In [115]: %%timeit
     ...: pd.DataFrame(custom_sum(a, n), columns=df.columns)
     ...:
8.11 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [116]: %%timeit
     ...: for row in range(len(df)):
     ...:     x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
     ...:     df["overlap_count"].loc[row] = x
     ...:
     ...:     if x >= n:
     ...:         df["col2"].loc[row] = 0
     ...:         df["overlap_count"].loc[row] = 'x'
     ...:
     ...:
7.84 s ± 442 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Efficient and Faster way to calculate the below problem without memory usage of python

Two changes can be applied to speed up this code:

  • combination.append is very slow because it recreates a new dataframe for each appended line. Append the rows to a plain Python list instead and build the final dataframe once from that list; this should be much faster.
  • The inner m-based loop can be vectorized using NumPy: compute calc_val by working directly on whole columns rather than on individual values, and use NumPy's where to filter the elements. A combined sketch is shown below.
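For illustration, here is a minimal sketch of both changes. The column names a and b and the calc_val expression are hypothetical stand-ins, since the original loop isn't shown here:

import numpy as np
import pandas as pd

# hypothetical input; 'a' and 'b' stand in for the question's columns
df = pd.DataFrame({"a": np.random.rand(1000), "b": np.random.rand(1000)})

rows = []  # appending to a plain list is cheap; no dataframe is rebuilt per row
for _, row in df.iterrows():
    # vectorized replacement for the inner m-based loop: operate on the whole
    # column at once and let np.where pick which elements contribute
    calc_val = np.where(df["b"] > row["a"], df["b"] - row["a"], 0.0).sum()
    rows.append({"a": row["a"], "calc_val": calc_val})

# build the result dataframe once, instead of combination.append inside the loop
combination = pd.DataFrame(rows)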

A fast, efficient way to calculate time differences between groups of rows in pandas?

Using native pandas methods on a df.groupby should give a significant performance boost over a "native Python" loop:

df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

Here's a small benchmark (on my laptop, YMMV...) using 99 cars with 31 days each, showing an almost 10x performance boost:

import pandas as pd
import timeit

data = [{"carId": carId, "refill_date": "2020-3-"+str(day)} for carId in range(1,100) for day in range(1,32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])

def original_method():
    for c in df['carId'].unique():
        df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c, 'refill_date'].diff()

def using_groupby():
    df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()

time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)

print(time1)
print(time2)
print(time1/time2)

Output:

16.6183732
1.7910263000000022
9.278687420726307
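As a side note, groupby(...).diff() on a datetime column returns Timedelta values; if plain day counts are preferred, the .dt.days accessor converts them. A small optional addition, using the same df and column names as above (the new column name is arbitrary):

# optional: convert the Timedelta differences to whole days
df['days_elapsed'] = df.groupby('carId')['refill_date'].diff().dt.days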

More efficient way to extract and subtract rows R in different dataframes

Here's an approach using tidyverse packages that I expect should be much faster than the loop solution in the OP. The speed comes from relying more on database-style join operations (base merge or dplyr's left_join, for example) to connect the two tables.

library(tidyverse)

# First, use the first few columns from the `games` table, and convert to long format with
# a row for each team, and a label column `team_cat` telling us if it's a teamA or teamB.
stat_differences <- games %>%
  select(row, Season, teamA, teamB) %>%
  gather(team_cat, teamID, teamA:teamB) %>%

  # Join to the teamStats table to bring in the team's total stats for that year
  left_join(teamStats %>% select(-row),  # We don't care about this "row"
            by = c("teamID", "Season" = "Year")) %>%

  # Now I want to reverse the stats' sign if it's a teamB. To make this simpler, I gather
  # all the stats into long format so that we can do the reversal on all of them, and
  # then spread back out.
  gather(stat, value, G:L) %>%
  mutate(value = if_else(team_cat == "teamB", value * -1, value * 1)) %>%
  spread(stat, value) %>%

  # Get the difference in stats for each row in the original games table.
  group_by(row) %>%
  summarise_at(vars(G:W), sum)

# Finally, add the output to the original table
output <- games %>%
  left_join(stat_differences)

To test this, I altered the given sample data so that the two tables would relate to each other:

games <- read.table(header = T, stringsAsFactors = F,
                    text = "row Season teamA teamB winner scoreA scoreB
                            108123 2010 1143 1293 A 75 70
                            108124 2010 1198 1314 B 72 88
                            108125 2010 1108 1326 B 60 100")

teamStats <- read.table(header = T, stringsAsFactors = F,
                        text = "row School Year teamID G W L
                                1 abilene_christian 2010 1143 32 16 16
                                2 air_force 2010 1293 31 12 19
                                3 akron 2010 1314 32 14 18
                                4 alabama_a&m 2010 1198 31 3 28
                                5 alabama-birmingham 2010 1108 33 20 13
                                6 made_up_team 2018 1326 160 150 10 # To confirm getting right season
                                7 made_up_team 2010 1326 60 50 10")

Then I get the following output, which seems to make sense.
(I just realized that the gather/mutate/spread I applied changed the order of the columns; if I have time I might try to use a mutate_if to preserve the order.)

> output
     row Season teamA teamB winner scoreA scoreB   G  L   W
1 108123   2010  1143  1293      A     75     70   1 -3   4
2 108124   2010  1198  1314      B     72     88  -1 10 -11
3 108125   2010  1108  1326      B     60    100 -27  3 -30

Speed up the processing time of for loop for big data in R

This should speed things up considerably.

On my system, the speed gain is about a factor of 5.

# import data
id1 <- "199TNlYFwqzzWpi1iY5qX1-M11UoC51Cp"
id2 <- "1TeFCkqLDtEBz0JMBHh8goNWEjYol4O2z"

library(data.table)
# use fread for reading, fast and get a nice progress bar as bonus
bdd_cases <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id1))
bdd_control <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id2))

# Put everything in a list
L <- lapply(unique(bdd_cases$cluster_case), function(x) {
  temp <- rbind(bdd_cases[cluster_case == x, ],
                bdd_control[subset == bdd_cases[cluster_case == x, ]$subset])
  temp[, cluster_case := x]
  temp[, `:=`(age_diff = abs(age - age[case_control == "case"]),
              fup_diff = foll_up - foll_up[case_control == "case"])]
  temp[age_diff <= 2 & fup_diff == 0, ]
})

# Rowbind the list
final <- rbindlist(L, use.names = TRUE, fill = TRUE)

