Merge Records Over Time Interval
Set up data
First set up the input data frames. We create two versions of the data frames: A and B just use character columns for the times, while At and Bt use the chron package's "times" class (which has the advantage over "character" that times can be added and subtracted):
LinesA <- "OBS ID StartTime Duration Outcome
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"
LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"
A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))
# in At and Bt convert times columns to "times" class
library(chron)
At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)
sqldf with times class
Now we can perform the calculation using the sqldf package. We use method = "raw" (which does not assign classes to the output), so we must assign the "times" class to the output "Time" column ourselves:
library(sqldf)
out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")
out$Time <- times(as.numeric(out$Time))
The result is:
> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird
With the development version of sqldf this can be done without method = "raw", and the "Time" column will automatically be set to the "times" class by sqldf's class assignment heuristic:
library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")
sqldf with character class
It's actually possible to avoid the "times" class by performing all time calculations in SQLite on character strings, using SQLite's strftime function. The SQL statement is unfortunately a bit more involved:
sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
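For readers who want the same interval join in Python, here is a minimal pandas sketch (pandas is an assumption on my part; the original answer is in R). Holding the times as timedeltas makes StartTime + Duration work much like the chron "times" class:

```python
import io
import pandas as pd

A_csv = """OBS,ID,StartTime,Duration,Outcome
1,01,10:12:06,00:00:10,Normal
2,02,10:12:30,00:00:30,Weird
3,01,10:15:12,00:01:15,Normal
4,02,10:45:00,00:00:02,Normal"""
B_csv = """OBS,ID,Time
1,01,10:12:10
2,01,10:12:17
3,02,10:12:45
4,01,10:13:00"""

A = pd.read_csv(io.StringIO(A_csv), dtype={"ID": str})
B = pd.read_csv(io.StringIO(B_csv), dtype={"ID": str})

# Convert the character times to timedeltas so they can be compared and added.
for col in ("StartTime", "Duration"):
    A[col] = pd.to_timedelta(A[col])
B["Time"] = pd.to_timedelta(B["Time"])

# Inner join on ID, then keep rows where Time falls inside
# [StartTime, StartTime + Duration], mirroring the SQL WHERE clause.
merged = B.merge(A, on="ID", suffixes=("_B", "_A"))
out = merged[(merged["Time"] >= merged["StartTime"]) &
             (merged["Time"] <= merged["StartTime"] + merged["Duration"])]
out = out[["OBS_B", "ID", "Time", "Outcome"]]
```

This yields the same two matches as the sqldf output above.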
merge data rows if they have sequential time intervals within a specified time period
First convert your dates to Date class, so you can calculate intervals:
df$Start.Date <- as.Date(df$Start.Date, '%m/%d/%Y')
df$Stop.Date <- as.Date(df$Stop.Date, '%m/%d/%Y')
> df$Stop.Date - df$Start.Date
Time differences in days
[1] 3 11 12 14 6 16
To calculate the difference between a stop date and the next start date:
c(Inf, df[-1,'Start.Date'] - df[-nrow(df),'Stop.Date'])
The use of c(SOMETHING, ...) keeps the result the same length as the data, because the first date has no preceding interval. Order your data first to ensure the dates are in sequence:
df <- df[order(df$ID, df$Medication, df$Start.Date), ]
Now calculate the intervals for each patient and medication. It can be handily done using dplyr or data.table:
# Using dplyr (the %<>% pipe comes from magrittr):
library(dplyr)
library(magrittr)
df %<>% group_by(ID, Medication) %>% mutate(interval = c(Inf, Start.Date[-1] - Stop.Date[-n()]))
# Using data.table:
library(data.table)
df <- as.data.table(df)
df[, interval := c(Inf, Start.Date[-1] - Stop.Date[-.N]), by = .(ID, Medication)]
Update Stop.Date for the row before each interval of 30 days or less. Iterate backwards so that chains of three or more mergeable rows propagate the final stop date all the way up:
for(i in nrow(df):2) if(df$interval[i] <= 30) df$Stop.Date[i-1] <- df$Stop.Date[i]
Lastly, exclude the rows with interval <= 30 and the column interval:
# If you're using dplyr:
df %<>% filter(!interval<=30) %>% select(-interval)
# If you're using data.table:
df <- df[!interval<=30, ]; df[, interval := NULL]
> df
ID Medication Start.Date Stop.Date
1: 2 aspirin 2017-05-01 2017-06-10
2: 2 aspirin 2017-07-15 2017-07-27
3: 2 tylenol 2017-05-01 2017-05-15
4: 3 lipitor 2017-05-06 2017-05-12
5: 5 advil 2017-05-28 2017-06-13
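For comparison, the whole 30-day merge can be sketched in pandas (pandas is an assumption; the input rows below are hypothetical but consistent with the durations and the merged output shown above):

```python
import io
import pandas as pd

csv = """ID,Medication,Start.Date,Stop.Date
2,aspirin,05/01/2017,05/04/2017
2,aspirin,05/30/2017,06/10/2017
2,aspirin,07/15/2017,07/27/2017
2,tylenol,05/01/2017,05/15/2017
3,lipitor,05/06/2017,05/12/2017
5,advil,05/28/2017,06/13/2017"""

df = pd.read_csv(io.StringIO(csv))
for col in ("Start.Date", "Stop.Date"):
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")
df = df.sort_values(["ID", "Medication", "Start.Date"])

# Gap (in days) between each start and the previous stop within a group;
# the first row of a group has no previous stop, hence NaT.
gap = df["Start.Date"] - df.groupby(["ID", "Medication"])["Stop.Date"].shift()

# A new run starts when there is no gap or the gap exceeds 30 days;
# each run is then collapsed to its earliest start and latest stop.
new_run = gap.isna() | (gap > pd.Timedelta(days=30))
df["run"] = new_run.groupby([df["ID"], df["Medication"]]).cumsum()
out = (df.groupby(["ID", "Medication", "run"], as_index=False)
         .agg({"Start.Date": "min", "Stop.Date": "max"})
         .drop(columns="run"))
```

The run-counter approach handles chains of any length in one pass, without the explicit row-by-row loop.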
combine multiple rows into one time interval
Possibly you can do it this way, binning into 10 s windows:
library(tidyverse)
dat %>%
group_by(ID) %>%
mutate(
events = cut(date_time, '10 s', labels = F)
) %>%
group_by(events, add = T) %>%
summarise(
start_date_time = min(date_time),
end_date_time = max(date_time)
) %>%
ungroup() %>%
select(-events)
# # A tibble: 4 x 3
# ID start_date_time end_date_time
# <chr> <dttm> <dttm>
# 1 A 2019-11-02 08:07:47 2019-11-02 08:07:50
# 2 A 2019-11-02 08:09:12 2019-11-02 08:09:13
# 3 B 2019-11-02 08:07:48 2019-11-02 08:07:48
# 4 B 2019-11-02 08:09:17 2019-11-02 08:09:17
Data:
structure(list(
ID = c("A", "B", "A", "A", "A", "A", "B"),
date_time = structure(
c(
1572678467,
1572678468,
1572678469,
1572678470,
1572678552,
1572678553,
1572678557
),
class = c("POSIXct", "POSIXt"),
tzone = ""
)
),
row.names = c(NA,-7L),
class = "data.frame")
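The same bucketing can be sketched in pandas (an assumption; the original is R). Note that R's cut() bins relative to each group's first timestamp, which is reproduced below, and that the timestamps here print in UTC whereas the tibble above used the local timezone:

```python
import pandas as pd

dat = pd.DataFrame({
    "ID": ["A", "B", "A", "A", "A", "A", "B"],
    "date_time": pd.to_datetime(
        [1572678467, 1572678468, 1572678469, 1572678470,
         1572678552, 1572678553, 1572678557], unit="s"),
})

# Bin each timestamp into 10 s windows counted from its group's minimum,
# then summarise each window by its first and last timestamp.
rel = dat["date_time"] - dat.groupby("ID")["date_time"].transform("min")
dat["events"] = rel // pd.Timedelta(seconds=10)
out = (dat.groupby(["ID", "events"], as_index=False)
          .agg(start_date_time=("date_time", "min"),
               end_date_time=("date_time", "max"))
          .drop(columns="events"))
```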
Efficient way to merge dataframes on time intervals
This is a good case for merge_asof, which only works if all [start, end] intervals are non-overlapping.
# match the `time` to the largest `start` earlier than it
tmp = pd.merge_asof(df1,df2, left_on='time',right_on='start')
# check if `time` <= `end`
df1['event'] = tmp['event'].where(tmp['end']>=tmp['time']).values
Output:
time value event
42 2020-07-01 00:57:16 0.356310 NaN
43 2020-07-01 01:00:20 0.472270 NaN
44 2020-07-01 03:33:40 0.356310 1.0
45 2020-07-01 03:45:37 0.356310 NaN
46 2020-07-01 03:57:51 0.425053 NaN
47 2020-07-01 05:29:20 0.340471 NaN
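Since df1 and df2 are not shown, here is a small self-contained sketch of the same merge_asof-plus-end-check pattern with hypothetical data:

```python
import pandas as pd

# Hypothetical inputs: measurements (df1) and non-overlapping event windows
# (df2), both sorted on their time columns as merge_asof requires.
df1 = pd.DataFrame({
    "time": pd.to_datetime(["2020-07-01 00:57:16",
                            "2020-07-01 03:33:40",
                            "2020-07-01 03:45:37"]),
    "value": [0.36, 0.36, 0.36],
})
df2 = pd.DataFrame({
    "start": pd.to_datetime(["2020-07-01 03:30:00"]),
    "end": pd.to_datetime(["2020-07-01 03:40:00"]),
    "event": [1.0],
})

# match each `time` to the largest `start` earlier than it
tmp = pd.merge_asof(df1, df2, left_on="time", right_on="start")
# keep the event only if `time` is also on or before `end`
df1["event"] = tmp["event"].where(tmp["end"] >= tmp["time"]).values
```

Only the middle row falls inside the event window, so only it receives a non-NaN event.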
For the general case, the only thing I can think of is a cross-merge and filter:
(df1.assign(key=1)
.merge(df2.assign(key=1), on='key')
.query('start <= time <= end')
)
which should be doable with your data sizes, albeit a bit slow.
How to merge consecutive rows of date and time using MySQL
With the LAG() window function:
select Data, Date, Time
from (
select *,
concat(Date, ' ', Time) - interval 30 minute <=
lag(concat(Date, ' ', Time)) over (partition by Data order by Date, Time) flag
from tablename
) t
where coalesce(flag, 0) = 0
order by Data, Date, Time
Results:
> Data | Date | Time
> :---- | :--------- | :-------
> Data1 | 2020-08-19 | 13:00:00
> Data1 | 2020-08-21 | 07:00:00
> Data2 | 2020-08-20 | 08:00:00
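The same LAG-based logic can be sketched in pandas (pandas and the sample rows are assumptions; the original is MySQL): keep a row only when it starts more than 30 minutes after the previous row of the same Data.

```python
import pandas as pd

df = pd.DataFrame({
    "Data": ["Data1", "Data1", "Data1", "Data2"],
    "ts": pd.to_datetime(["2020-08-19 13:00:00", "2020-08-19 13:20:00",
                          "2020-08-21 07:00:00", "2020-08-20 08:00:00"]),
})

df = df.sort_values(["Data", "ts"])
prev = df.groupby("Data")["ts"].shift()  # the LAG() equivalent
# Keep group starters: no previous row, or more than 30 minutes after it.
keep = prev.isna() | (df["ts"] - prev > pd.Timedelta(minutes=30))
out = df[keep]
```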
Merging records based on a time difference?
The code below manages to show both merged rows (rows 1-2 and 4-5) and unique rows (row 3):
SELECT DISTINCT a.id,a.name,a.startdate,a.enddate
FROM temp a
LEFT JOIN temp b ON a.name = b.name AND a.id < b.id AND DATEDIFF(s,a.startdate,b.startdate)<=60
LEFT JOIN temp c ON c.name = a.name AND c.id < a.id AND DATEDIFF(s,c.startdate,a.startdate)<=60
WHERE (b.id IS NOT NULL OR c.id IS NULL) AND a.id <= COALESCE(c.id,a.id)
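The same intent, keeping only the first row of each 60-second cluster per name, can be sketched in pandas (hypothetical data; the original is T-SQL):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["a", "a", "b", "c", "c"],
    "startdate": pd.to_datetime([
        "2021-01-01 10:00:00", "2021-01-01 10:00:30",  # rows 1-2 merge
        "2021-01-01 11:00:00",                         # row 3 is unique
        "2021-01-01 12:00:00", "2021-01-01 12:00:45",  # rows 4-5 merge
    ]),
})

df = df.sort_values(["name", "startdate"])
# Keep a row when it has no earlier neighbour within 60 seconds.
gap = df["startdate"] - df.groupby("name")["startdate"].shift()
out = df[gap.isna() | (gap > pd.Timedelta(seconds=60))]
```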
pandas merge two dataframe by key + datetime and an interval
It rather looks like you want to merge the data. As your intervals are disjoint, this is a perfect use case for merge_asof.
First, make sure the columns have datetime type and that the data is sorted on the merging dates:
df1['date_start'] = pd.to_datetime(df1['date_start'])
df1['date_expiry'] = pd.to_datetime(df1['date_expiry'])
df2['transaction_datetime'] = pd.to_datetime(df2['transaction_datetime'])
df1 = df1.sort_values(by='date_start')
df2 = df2.sort_values(by='transaction_datetime')
Then perform the merge:
df3 = (
pd.merge_asof(df2, df1, by='code',
left_on='transaction_datetime',
right_on='date_start',
)
.sort_values(by='code')
.drop(['date_start', 'date_expiry'], axis=1)
)
Output:
code transaction_datetime amount tier
0 1 2020-01-02 13:45:05 20 A
1 1 2020-06-22 12:34:41 230 C
3 2 2020-11-12 15:47:35 50 C
2 3 2020-09-03 18:20:34 10 B
How to merge time intervals in SQL Server
You can use a recursive CTE to build a list of dates and then count the distinct dates.
declare @T table
(
startDate date,
endDate date
);
insert into @T values
('2011-01-01', '2011-01-05'),
('2011-01-04', '2011-01-08'),
('2011-01-11', '2011-01-15');
with C as
(
select startDate,
endDate
from @T
union all
select dateadd(day, 1, startDate),
endDate
from C
where dateadd(day, 1, startDate) < endDate
)
select count(distinct startDate) as DayCount
from C
option (MAXRECURSION 0)
Result:
DayCount
-----------
11
Or you can use a numbers table. Here I use master..spt_values:
declare @MinStartDate date
select @MinStartDate = min(startDate)
from @T
select count(distinct N.number)
from @T as T
inner join master..spt_values as N
on dateadd(day, N.Number, @MinStartDate) between T.startDate and dateadd(day, -1, T.endDate)
where N.type = 'P'
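The same count can be sketched in Python (pandas assumed) by expanding each interval to its days, end-exclusive to match the CTE's `< endDate` bound, and counting the distinct ones:

```python
import pandas as pd

intervals = [("2011-01-01", "2011-01-05"),
             ("2011-01-04", "2011-01-08"),
             ("2011-01-11", "2011-01-15")]

# Union of all covered days; days in overlapping intervals count once.
days = set()
for start, end in intervals:
    days.update(pd.date_range(start, end, inclusive="left"))

day_count = len(days)
```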