Calculate differences between rows faster than a for loop?
This should work if the dates are in order within each id.
id<-c(123,123,124,124)
date<-as.Date(c('2010-01-15','2010-01-01','2010-03-05','2010-01-05'))
score<-c(10,15,20,30)
data<-data.frame(id,date,score)
data <- data[order(data$id,data$date),]
data$dayssincelast<-do.call(c,by(data$date,data$id,function(x) c(NA,diff(x))))
# Or, even more concisely
data$dayssincelast<-unlist(by(data$date,data$id,function(x) c(NA,diff(x))))
Time difference between rows of data. Is there a faster way to do this than a for loop?
After some quick research on Stack Overflow, I found that converting the data into a list and using do.call (as suggested in an answer to a related question) was much faster:
start.time.A <- Sys.time()
L <- list(days.seq)
LS <- do.call(diff,L)
attributes(LS) <- NULL
LS <- rbind(0,as.data.frame(LS))
str(LS)
start.time.B <- Sys.time()
difftime(start.time.B, start.time.A, units = "secs") # third positional arg is tz, so units must be named
pandas - iterate over rows and calculate - faster
IIUC, you can do:
import numpy as np

df['overlap_count'] = 0
for i in range(1, start_at_nr + 1):
    df['overlap_count'] += df['col1'].le(df['col2'].shift(i))
# mask the first few rows
df.iloc[:start_at_nr, -1] = np.nan
Output:
col1 col2 overlap_count
0 20 39 NaN
1 23 32 NaN
2 40 42 NaN
3 41 50 1.0
4 48 63 1.0
5 49 68 2.0
6 50 68 3.0
7 50 69 3.0
Takes about 11 ms for 800 rows and start_at_nr=3.
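For reference, here is a self-contained version of the approach above. The data is reconstructed from the output table shown earlier, not taken from the original question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [20, 23, 40, 41, 48, 49, 50, 50],
                   "col2": [39, 32, 42, 50, 63, 68, 68, 69]})
start_at_nr = 3

# Float column so the NaN mask below does not force a dtype upcast.
df["overlap_count"] = 0.0
for i in range(1, start_at_nr + 1):
    # True where col1 <= col2 from i rows earlier; booleans sum as 0/1.
    # shift() pads with NaN, and comparisons against NaN count as False.
    df["overlap_count"] += df["col1"].le(df["col2"].shift(i))

# The first start_at_nr rows have no full window to compare against.
df.iloc[:start_at_nr, -1] = np.nan
```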
Pandas - Iterate over rows and compare previous values - faster
I think you can use numba to improve performance. It only works with numeric values, so instead of x the value -1 is added, and the new column is filled with 0 instead of an empty string:
from numba import njit

df["overlap_count"] = 0  # create new column
n = 3                    # if x >= n, then value = 0
a = df[['col1','col2','overlap_count']].values

@njit
def custom_sum(arr, n):
    for row in range(arr.shape[0]):
        x = (arr[0:row, 1] > arr[row, 0]).sum()
        arr[row, 2] = x
        if x >= n:
            arr[row, 1] = 0
            arr[row, 2] = -1
    return arr

df1 = pd.DataFrame(custom_sum(a, n), columns=df.columns)
print (df1)
col1 col2 overlap_count
0 20 39 0
1 23 32 1
2 40 42 0
3 41 50 1
4 46 63 1
5 47 67 2
6 48 0 -1
7 49 0 -1
8 50 68 2
9 50 0 -1
10 52 0 -1
11 55 0 -1
12 56 0 -1
13 69 71 0
14 70 66 1
Performance:
d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)
# 4500 rows
df = pd.concat([df] * 300, ignore_index=True)
print (df)
In [115]: %%timeit
...: pd.DataFrame(custom_sum(a, n), columns=df.columns)
...:
8.11 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [116]: %%timeit
...: for row in range(len(df)):
...: x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
...: df["overlap_count"].loc[row] = x
...:
...: if x >= n:
...: df["col2"].loc[row] = 0
...: df["overlap_count"].loc[row] = 'x'
...:
...:
7.84 s ± 442 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Efficient and faster way to calculate the problem below without heavy memory usage in Python
Two changes can be applied to speed up this code:
- combination.append is very slow because it recreates a new dataframe for each appended line. Append the rows to a Python list instead, then create the final dataframe from that list. This should be much, much faster.
- The inner m-based loop can be vectorized using NumPy. You can compute calc_val by working directly on columns rather than individual values, and use NumPy's where to filter the elements.
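As a rough sketch of both ideas (the names combination and calc_val come from the question, but the loop body and data here are made up for illustration):

```python
import numpy as np
import pandas as pd

# 1) Collect rows in a plain Python list, then build the frame once.
rows = []
for i in range(1000):
    rows.append({"a": i, "b": i * 2})
combination = pd.DataFrame(rows)  # one allocation instead of one per append

# 2) Vectorize the inner loop with NumPy: operate on whole columns,
#    and use np.where instead of a per-element if/else.
m = combination["a"].to_numpy()
calc_val = np.where(m % 2 == 0, m * 2, m + 1)  # even -> double, odd -> +1
```

The same pattern applies whatever the per-element calculation is, as long as it can be expressed with column-wise arithmetic and boolean masks.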
A fast, efficient way to calculate time differences between groups of rows in pandas?
Using native pandas methods on a df.groupby should give a significant performance boost over a "native Python" loop:
df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
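To make the effect concrete, here is a tiny sketch with two cars and made-up refill dates:

```python
import pandas as pd

df = pd.DataFrame({
    "carId": [1, 1, 1, 2, 2],
    "refill_date": pd.to_datetime(
        ["2020-03-01", "2020-03-04", "2020-03-10", "2020-03-02", "2020-03-05"]),
})

# diff() runs within each carId group: the first row of every
# group is NaT, the rest are per-car time gaps.
df["time_elapsed"] = df.groupby("carId")["refill_date"].diff()
print(df)
```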
Here's a small benchmark (on my laptop, YMMV...) using 100 cars with 31 days each,
showing an almost 10x performance boost:
import pandas as pd
import timeit
data = [{"carId": carId, "refill_date": "2020-3-"+str(day)} for carId in range(1,100) for day in range(1,32)]
df = pd.DataFrame(data)
df['refill_date'] = pd.to_datetime(df['refill_date'])
def original_method():
for c in df['carId'].unique():
df.loc[df['carId'] == c, 'time_elapsed'] = df.loc[df['carId'] == c,
'refill_date'].diff()
def using_groupby():
df['time_elapsed'] = df.groupby('carId')['refill_date'].diff()
time1 = timeit.timeit('original_method()', globals=globals(), number=100)
time2 = timeit.timeit('using_groupby()', globals=globals(), number=100)
print(time1)
print(time2)
print(time1/time2)
Output:
16.6183732
1.7910263000000022
9.278687420726307
More efficient way to extract and subtract rows R in different dataframes
Here's an approach using tidyverse packages that I expect to be much faster than the loop solution in the OP. The speed comes from relying more on database join operations (base merge or dplyr's left_join, for example) to connect the two tables.
library(tidyverse)
# First, use the first few columns from the `games` table, and convert to long format with
# a row for each team, and a label column `team_cat` telling us if it's a teamA or teamB.
stat_differences <- games %>%
select(row, Season, teamA, teamB) %>%
gather(team_cat, teamID, teamA:teamB) %>%
# Join to the teamStats table to bring in the team's total stats for that year
left_join(teamStats %>% select(-row), # We don't care about this "row"
by = c("teamID", "Season" = "Year")) %>%
# Now I want to reverse the stats' sign if it's a teamB. To make this simpler, I gather
# all the stats into long format so that we can do the reversal on all of them, and
# then spread back out.
gather(stat, value, G:L) %>%
mutate(value = if_else(team_cat == "teamB", value * -1, value * 1)) %>%
spread(stat, value) %>%
# Get the difference in stats for each row in the original games table.
group_by(row) %>%
summarise_at(vars(G:W), sum)
# Finally, add the output to the original table
output <- games %>%
left_join(stat_differences)
To test this, I altered the given sample data so that the two tables would relate to each other:
games <- read.table(header = T, stringsAsFactors = F,
text = "row Season teamA teamB winner scoreA scoreB
108123 2010 1143 1293 A 75 70
108124 2010 1198 1314 B 72 88
108125 2010 1108 1326 B 60 100")
teamStats <- read.table(header = T, stringsAsFactors = F,
text = "row School Year teamID G W L
1 abilene_christian 2010 1143 32 16 16
2 air_force 2010 1293 31 12 19
3 akron 2010 1314 32 14 18
4 alabama_a&m 2010 1198 31 3 28
5 alabama-birmingham 2010 1108 33 20 13
6 made_up_team 2018 1326 160 150 10 # To confirm getting right season
7 made_up_team 2010 1326 60 50 10"
)
Then I get the following output, which seems to make sense.
(I just realized that the gather/mutate/spread I applied changed the order of the columns; if I have time I might try to use a mutate_if to preserve the order.)
> output
row Season teamA teamB winner scoreA scoreB G L W
1 108123 2010 1143 1293 A 75 70 1 -3 4
2 108124 2010 1198 1314 B 72 88 -1 10 -11
3 108125 2010 1108 1326 B 60 100 -27 3 -30
Speed up the processing time of for loop for big data in R
This should speed things up considerably. On my system, the speed gain is about a factor of 5.
#import data
id1 <- "199TNlYFwqzzWpi1iY5qX1-M11UoC51Cp"
id2 <- "1TeFCkqLDtEBz0JMBHh8goNWEjYol4O2z"
library(data.table)
# use fread for reading, fast and get a nice progress bar as bonus
bdd_cases <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id1))
bdd_control <- fread(sprintf("https://docs.google.com/uc?id=%s&export=download", id2))
#Put everything in a list
L <- lapply(unique(bdd_cases$cluster_case), function(x){
temp <- rbind(bdd_cases[cluster_case == x, ],
bdd_control[subset == bdd_cases[cluster_case == x, ]$subset])
temp[, cluster_case := x]
temp[, `:=`(age_diff = abs(age - age[case_control=="case"]),
fup_diff = foll_up - foll_up[case_control=="case"])]
temp[age_diff <= 2 & fup_diff == 0, ]
})
#Rowbind the list
final <- rbindlist(L, use.names = TRUE, fill = TRUE)