Calculate differences between rows faster than a for loop?
This should work if the dates are in order within each id.
id<-c(123,123,124,124)
date<-as.Date(c('2010-01-15','2010-01-01','2010-03-05','2010-01-05'))
score<-c(10,15,20,30)
data<-data.frame(id,date,score)
data <- data[order(data$id,data$date),]
data$dayssincelast<-do.call(c,by(data$date,data$id,function(x) c(NA,diff(x))))
# Or, even more concisely
data$dayssincelast<-unlist(by(data$date,data$id,function(x) c(NA,diff(x))))
Time difference between rows of data. Is there a faster way to do this than a for loop?
After some quick research on Stack Overflow, I found that converting the data into a list and using do.call (as suggested in an answer to a related question) was much faster:
start.time.A <- Sys.time()
L <- list(days.seq)
LS <- do.call(diff,L)
attributes(LS) <- NULL
LS <- rbind(0,as.data.frame(LS))
str(LS)
start.time.B <- Sys.time()
difftime(start.time.B, start.time.A, units = "secs") # third positional arg is tz, so units must be named
pandas - iterate over rows and calculate - faster
IIUC, you can do:
import numpy as np

df['overlap_count'] = 0
for i in range(1, start_at_nr + 1):
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))
# mask the first few rows
df.iloc[:start_at_nr, -1] = np.nan
Output:
col1 col2 overlap_count
0 20 39 NaN
1 23 32 NaN
2 40 42 NaN
3 41 50 1.0
4 48 63 1.0
5 49 68 2.0
6 50 68 3.0
7 50 69 3.0
Takes about 11 ms for 800 rows and start_at_nr=3.
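For reference, here is a self-contained version of the approach above. The data is reconstructed from the output table shown earlier, not taken from the original question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [20, 23, 40, 41, 48, 49, 50, 50],
                   "col2": [39, 32, 42, 50, 63, 68, 68, 69]})
start_at_nr = 3

# Float column so the NaN mask below does not force a dtype upcast.
df["overlap_count"] = 0.0
for i in range(1, start_at_nr + 1):
    # True where col1 <= col2 from i rows earlier; booleans sum as 0/1.
    # shift() pads with NaN, and comparisons against NaN count as False.
    df["overlap_count"] += df["col1"].le(df["col2"].shift(i))

# The first start_at_nr rows have no full window to compare against.
df.iloc[:start_at_nr, -1] = np.nan
```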
Pandas - Iterate over rows and compare previous values - faster
I think you can use numba to improve performance. It only works with numeric values, so instead of x the value -1 is added, and the new column is filled with 0 instead of an empty string:
from numba import njit

df["overlap_count"] = 0  # create new column
n = 3                    # if x >= n, then value = 0
a = df[['col1','col2','overlap_count']].values

@njit
def custom_sum(arr, n):
    for row in range(arr.shape[0]):
        x = (arr[0:row, 1] > arr[row, 0]).sum()
        arr[row, 2] = x
        if x >= n:
            arr[row, 1] = 0
            arr[row, 2] = -1
    return arr

df1 = pd.DataFrame(custom_sum(a, n), columns=df.columns)
print (df1)
col1 col2 overlap_count
0 20 39 0
1 23 32 1
2 40 42 0
3 41 50 1
4 46 63 1
5 47 67 2
6 48 0 -1
7 49 0 -1
8 50 68 2
9 50 0 -1
10 52 0 -1
11 55 0 -1
12 56 0 -1
13 69 71 0
14 70 66 1
Performance:
d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)
# 4500 rows
df = pd.concat([df] * 300, ignore_index=True)
print (df)
In [115]: %%timeit
...: pd.DataFrame(custom_sum(a, n), columns=df.columns)
...:
8.11 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [116]: %%timeit
...: for row in range(len(df)):
...: x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
...: df["overlap_count"].loc[row] = x
...:
...: if x >= n:
...: df["col2"].loc[row] = 0
...: df["overlap_count"].loc[row] = 'x'
...:
...:
7.84 s ± 442 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Efficient and faster way to calculate the problem below without heavy memory usage in Python
Two changes can be applied to speed up this code:
- combination.append is very slow because it recreates a new dataframe for each appended line. Append the rows to a Python list instead, then create the final dataframe from that list. This should be much, much faster.
- The inner m-based loop can be vectorized using NumPy. You can compute calc_val by working directly on columns rather than individual values, and use NumPy's where to filter the elements.
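As a rough sketch of both ideas (the names combination and calc_val come from the question, but the loop body and data here are made up for illustration):

```python
import numpy as np
import pandas as pd

# 1) Collect rows in a plain Python list, then build the frame once.
rows = []
for i in range(1000):
    rows.append({"a": i, "b": i * 2})
combination = pd.DataFrame(rows)  # one allocation instead of one per append

# 2) Vectorize the inner loop with NumPy: operate on whole columns,
#    and use np.where instead of a per-element if/else.
m = combination["a"].to_numpy()
calc_val = np.where(m % 2 == 0, m * 2, m + 1)  # even -> double, odd -> +1
```

The same pattern applies whatever the per-element calculation is, as long as it can be expressed with column-wise arithmetic and boolean masks.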
A fast, efficient way to calculate time differences between groups of rows in pandas?
Using native pandas methods on a df.groupby should give a significant performance boost over a "native Python" loop:
df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
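To make the effect concrete, here is a tiny sketch with two cars and made-up refill dates:

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2],
    "refill_date": pd.to_datetime(
        ["2020-03-01", "2020-03-04", "2020-03-10", "2020-03-02", "2020-03-05"]),
})

# diff() runs within each carId group: the first row of every
# group is NaT, the rest are per-car time gaps.
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()
print(df)
```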
Here's a small benchmark (on my laptop, YMMV...) using 100 cars with 31 days each,
showing an almost 10x performance boost:
import pandas as pd
import timeit
data = [{"carId": carId, "refill_date": "2020-3-"+str(day)} for carId in range(1,100) for day in range(1,32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])
def original_method():
for c in df['carId'].unique():
df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
'refill_date'].diff()
def using_groupby():
df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)
print(time1)
print(time2)
print(time1/time2)
Output:
16.6183732
1.7910263000000022
9.278687420726307
More efficient way to extract and subtract rows R in different dataframes
Here's an approach using tidyverse packages that I expect to be much faster than the loop solution in the OP. The speed comes from relying more on database join operations (base merge or dplyr's left_join, for example) to connect the two tables.
library(tidyverse)
# First, use the first few columns from the `games` table, and convert to long format with
# a row for each team, and a label column `team_cat` telling us if it's a teamA or teamB.
stat_differences <- games %>%
select(row, Season, teamA, teamB) %>%
gather(team_cat, teamID, teamA:teamB) %>%
# Join to the teamStats table to bring in the team's total stats for that year
left_join(teamStats %>% select(-row), # We don't care about this "row"
by = c("teamID", "Season" = "Year")) %>%
# Now I want to reverse the stats' sign if it's a teamB. To make this simpler, I gather
# all the stats into long format so that we can do the reversal on all of them, and
# then spread back out.
gather(stat, value, G:L) %>%
mutate(value = if_else(team_cat == "teamB", value * -1, value * 1)) %>%
spread(stat, value) %>%
# Get the difference in stats for each row in the original games table.
group_by(row) %>%
summarise_at(vars(G:W), sum)
# Finally, add the output to the original table
output <- games %>%
left_join(stat_differences)
To test this, I altered the given sample data so that the two tables would relate to each other:
games <- read.table(header = T, stringsAsFactors = F,
text = "row Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100")
teamStats <- read.table(header = T, stringsAsFactors = F,
text = "row School Year teamID G W L
1 abilene_christian 2010 1143 32 16 16
2 air_force 2010 1293 31 12 19
3 akron 2010 1314 32 14 18
4 alabama_a&m 2010 1198 31 3 28
5 alabama-birmingham 2010 1108 33 20 13
6 made_up_team 2018 1326 160 150 10 # To confirm getting right season
7 made_up_team 2010 1326 60 50 10"
)
Then I get the following output, which seems to make sense.
(I just realized that the gather/mutate/spread I applied changed the order of the columns; if I have time I might try to use a mutate_if to preserve the order.)
> output
row Season teamA teamB winner scoreA scoreB G L W
1 108123 2010 1143 1293 A 75 70 1 -3 4
2 108124 2010 1198 1314 B 72 88 -1 10 -11
3 108125 2010 1108 1326 B 60 100 -27 3 -30
Speed up the processing time of for loop for big data in R
This should speed things up considerably. On my system, the speed gain is about a factor of 5.
#import data
id1 <- "199TNlYFwqzzWpi1iY5qX1-M11UoC51Cp"
id2 <- "1TeFCkqLDtEBz0JMBHh8goNWEjYol4O2z"
library(data.table)
# use fread for reading, fast and get a nice progress bar as bonus
bdd_cases <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id1))
bdd_control <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id2))
#Put everything in a list
L <- lapply(unique(bdd_cases$cluster_case), function(x){
temp <- rbind(bdd_cases[cluster_case == x, ],
bdd_control[subset == bdd_cases[cluster_case == x, ]$subset])
temp[, cluster_case := x]
temp[, `:=`(age_diff = abs(age - age[case_control=="case"]),
fup_diff = foll_up - foll_up[case_control=="case"])]
temp[age_diff <= 2 & fup_diff == 0, ]
})
#Rowbind the list
final <- rbindlist(L, use.names = TRUE, fill = TRUE)