Merge Records Over Time Interval

Set up data

First set up the input data frames. We create two versions: A and B keep the times as character columns, while At and Bt use the chron package's "times" class, which has the advantage over "character" that times can be added and subtracted:

LinesA <- "OBS ID StartTime Duration Outcome 
1 01 10:12:06 00:00:10 Normal
2 02 10:12:30 00:00:30 Weird
3 01 10:15:12 00:01:15 Normal
4 02 10:45:00 00:00:02 Normal"

LinesB <- "OBS ID Time
1 01 10:12:10
2 01 10:12:17
3 02 10:12:45
4 01 10:13:00"

A <- At <- read.table(textConnection(LinesA), header = TRUE,
colClasses = c("numeric", rep("character", 4)))
B <- Bt <- read.table(textConnection(LinesB), header = TRUE,
colClasses = c("numeric", rep("character", 2)))

# in At and Bt convert times columns to "times" class

library(chron)

At$StartTime <- times(At$StartTime)
At$Duration <- times(At$Duration)
Bt$Time <- times(Bt$Time)

sqldf with times class

Now we can perform the calculation using the sqldf package. We use method = "raw", which does not assign classes to the output, so we must assign the "times" class to the output "Time" column ourselves:

library(sqldf)

out <- sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration",
method = "raw")

out$Time <- times(as.numeric(out$Time))

The result is:

> out
OBS ID Time Outcome
1 1 01 10:12:10 Normal
2 3 02 10:12:45 Weird

With the development version of sqldf this can be done without using method="raw" and the "Time" column will automatically be set to "times" class by the sqldf class assignment heuristic:

library(sqldf)
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") # grab devel ver
sqldf("select Bt.OBS, ID, Time, Outcome from At join Bt using(ID)
where Time between StartTime and StartTime + Duration")

sqldf with character class

It's actually possible to avoid the "times" class altogether by doing all time calculations in SQLite directly on the character strings, using SQLite's strftime function. The SQL statement is unfortunately a bit more involved:

sqldf("select B.OBS, ID, Time, Outcome from A join B using(ID)
where strftime('%s', Time) - strftime('%s', StartTime)
between 0 and strftime('%s', Duration) - strftime('%s', '00:00:00')")
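
Since sqldf hands the statement to SQLite, the same query can be checked outside R. The following is a minimal sketch using Python's standard sqlite3 module with the sample rows from above; the CREATE TABLE column types and the ORDER BY clause are assumptions added here to make the example self-contained and deterministic.

```python
import sqlite3

# Sample rows copied from LinesA and LinesB above.
rows_a = [
    (1, "01", "10:12:06", "00:00:10", "Normal"),
    (2, "02", "10:12:30", "00:00:30", "Weird"),
    (3, "01", "10:15:12", "00:01:15", "Normal"),
    (4, "02", "10:45:00", "00:00:02", "Normal"),
]
rows_b = [
    (1, "01", "10:12:10"),
    (2, "01", "10:12:17"),
    (3, "02", "10:12:45"),
    (4, "01", "10:13:00"),
]

con = sqlite3.connect(":memory:")
# Column types are an assumption; times are stored as plain text.
con.execute("CREATE TABLE A (OBS INTEGER, ID TEXT, StartTime TEXT, Duration TEXT, Outcome TEXT)")
con.execute("CREATE TABLE B (OBS INTEGER, ID TEXT, Time TEXT)")
con.executemany("INSERT INTO A VALUES (?, ?, ?, ?, ?)", rows_a)
con.executemany("INSERT INTO B VALUES (?, ?, ?)", rows_b)

# strftime('%s', ...) turns an HH:MM:SS string into seconds, so the
# differences below are plain integer arithmetic.
query = """
SELECT B.OBS, ID, Time, Outcome
FROM A JOIN B USING (ID)
WHERE strftime('%s', Time) - strftime('%s', StartTime)
      BETWEEN 0 AND strftime('%s', Duration) - strftime('%s', '00:00:00')
ORDER BY B.OBS
"""
rows = con.execute(query).fetchall()
con.close()

for row in rows:
    print(row)
```

With the sample data this returns the same two observations (OBS 1 and OBS 3) as the "times" version.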

EDIT:

A series of edits which fixed grammar, added additional approaches and fixed/improved the read.table statements.

EDIT:

Simplified/improved final sqldf statement.

merge data rows if they have sequential time intervals within a specified time period

First convert your days to date format, so you can calculate intervals:

df$Start.Date <- as.Date(df$Start.Date, '%m/%d/%Y')
df$Stop.Date <- as.Date(df$Stop.Date, '%m/%d/%Y')

> df$Stop.Date - df$Start.Date
Time differences in days
[1] 3 11 12 14 6 16

To calculate the difference between a stop date and the next start date:

c(Inf, df[-1,'Start.Date'] - df[-nrow(df),'Stop.Date'])

The leading element in c(Inf, ...) keeps the result the same length as the data, because the first row has no preceding interval. Order your data first to ensure that the dates are in sequence:

df <- df[order(df$ID, df$Medication, df$Start.Date), ]

Now calculate the intervals for each patient and medication. It can be handily done using dplyr or data.table:

# Using dplyr:
library(dplyr)
library(magrittr)  # provides the %<>% assignment pipe, which dplyr does not export
df %<>% group_by(ID, Medication) %>% mutate(interval = c(Inf, Start.Date[-1] - Stop.Date[-n()]))

# Using data.table:
library(data.table)
df <- as.data.table(df)
df[, interval := c(Inf, Start.Date[-1] - Stop.Date[-.N]), by = .(ID, Medication)]

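Update Stop.Date on the row preceding each interval of 30 days or less. Iterating from the last row backwards lets a chain of several consecutive short gaps carry the final stop date all the way to the first row of the chain:

for (i in rev(seq_len(nrow(df)))) if (df$interval[i] <= 30) df$Stop.Date[i - 1] <- df$Stop.Date[i]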

Lastly, exclude the rows with interval <= 30 and the column interval:

# If you're using dplyr:
df %<>% filter(!interval<=30) %>% select(-interval)

# If you're using data.table:
df <- df[!interval<=30, ]; df[, interval := NULL]

> df
ID Medication Start.Date Stop.Date
1: 2 aspirin 2017-05-01 2017-06-10
2: 2 aspirin 2017-07-15 2017-07-27
3: 2 tylenol 2017-05-01 2017-05-15
4: 3 lipitor 2017-05-06 2017-05-12
5: 5 advil 2017-05-28 2017-06-13
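
For comparison, the same merge can be written as a single pass over the sorted rows. This is only a sketch in plain Python, not the method used above: the input rows are hypothetical, chosen to be consistent with the durations and the result shown, and the names MAX_GAP, rows and merged are illustrative.

```python
from datetime import date, timedelta

MAX_GAP = timedelta(days=30)  # gaps of this size or less are merged

# Hypothetical input rows: (ID, Medication, Start.Date, Stop.Date).
rows = [
    (2, "aspirin", date(2017, 5, 1), date(2017, 5, 4)),
    (2, "aspirin", date(2017, 5, 30), date(2017, 6, 10)),
    (2, "aspirin", date(2017, 7, 15), date(2017, 7, 27)),
    (2, "tylenol", date(2017, 5, 1), date(2017, 5, 15)),
    (3, "lipitor", date(2017, 5, 6), date(2017, 5, 12)),
    (5, "advil", date(2017, 5, 28), date(2017, 6, 13)),
]

merged = []
# Sorting tuples orders by ID, then medication, then start date.
for pid, med, start, stop in sorted(rows):
    last = merged[-1] if merged else None
    same_group = last is not None and last[0] == pid and last[1] == med
    if same_group and start - last[3] <= MAX_GAP:
        # Extend the previous interval instead of starting a new one.
        last[3] = max(last[3], stop)
    else:
        merged.append([pid, med, start, stop])

for row in merged:
    print(row)
```

Because each row is compared with the already merged interval before it, a chain of several short gaps collapses into one row without a separate update step.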

combine multiple rows into one time interval

Possibly you can do it this way (N = 10s):

library(tidyverse)

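dat %>%
  group_by(ID) %>%
  mutate(
    # assign each timestamp to a 10-second bin within its ID
    events = cut(date_time, '10 s', labels = FALSE)
  ) %>%
  # dplyr >= 1.0.0 spells this argument .add (formerly add)
  group_by(events, .add = TRUE) %>%
  summarise(
    start_date_time = min(date_time),
    end_date_time = max(date_time)
  ) %>%
  ungroup() %>%
  select(-events)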

# # A tibble: 4 x 3
# ID start_date_time end_date_time
# <chr> <dttm> <dttm>
# 1 A 2019-11-02 08:07:47 2019-11-02 08:07:50
# 2 A 2019-11-02 08:09:12 2019-11-02 08:09:13
# 3 B 2019-11-02 08:07:48 2019-11-02 08:07:48
# 4 B 2019-11-02 08:09:17 2019-11-02 08:09:17

Data:

structure(list(
ID = c("A", "B", "A", "A", "A", "A", "B"),
date_time = structure(
c(
1572678467,
1572678468,
1572678469,
1572678470,
1572678552,
1572678553,
1572678557
),
class = c("POSIXct", "POSIXt"),
tzone = ""
)
),
row.names = c(NA,-7L),
class = "data.frame")
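
The binning idea can also be sketched without any packages. The Python below works on the raw epoch seconds from the data above and assumes that bins are counted from each ID's earliest timestamp, which is how the cut() call appears to behave here; BIN_SECONDS, records and events are illustrative names.

```python
BIN_SECONDS = 10  # width of each bin (N = 10s)

# (ID, epoch seconds), taken from the structure() call above.
records = [
    ("A", 1572678467), ("B", 1572678468), ("A", 1572678469),
    ("A", 1572678470), ("A", 1572678552), ("A", 1572678553),
    ("B", 1572678557),
]

# Earliest timestamp per ID; bins are counted from there (assumption).
first_seen = {}
for key, ts in records:
    first_seen[key] = min(ts, first_seen.get(key, ts))

# Track the min and max timestamp inside every (ID, bin) pair.
bins = {}
for key, ts in records:
    bin_index = (ts - first_seen[key]) // BIN_SECONDS
    lo, hi = bins.get((key, bin_index), (ts, ts))
    bins[(key, bin_index)] = (min(lo, ts), max(hi, ts))

events = [(key, lo, hi) for (key, _), (lo, hi) in sorted(bins.items())]
for event in events:
    print(event)
```

Note that fixed bins can split two timestamps that are only a second apart if they fall on either side of a bin boundary; comparing each row with the previous one avoids that if it matters for your data.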

Efficient way to merge dataframes on time intervals

This is a good case for merge_asof, which works only if the [start, end] intervals are non-overlapping (both frames must also be sorted by their merge keys):

# match the `time` to the largest `start` earlier than it
tmp = pd.merge_asof(df1,df2, left_on='time',right_on='start')

# check if `time` <= `end`
df1['event'] = tmp['event'].where(tmp['end']>=tmp['time']).values

Output:

                  time     value  event
42 2020-07-01 00:57:16 0.356310 NaN
43 2020-07-01 01:00:20 0.472270 NaN
44 2020-07-01 03:33:40 0.356310 1.0
45 2020-07-01 03:45:37 0.356310 NaN
46 2020-07-01 03:57:51 0.425053 NaN
47 2020-07-01 05:29:20 0.340471 NaN
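
The two steps (match each time to the latest start at or before it, then check the end) can be illustrated without pandas. This is a minimal sketch using the standard bisect module on hypothetical integer timestamps; intervals, times and event_at are illustrative names, and the intervals are assumed to be sorted and non-overlapping.

```python
from bisect import bisect_right

# Hypothetical (start, end, event) intervals, sorted and non-overlapping.
intervals = [(10, 20, "e1"), (30, 40, "e2"), (50, 60, "e3")]
starts = [start for start, _, _ in intervals]

def event_at(t):
    """Return the event whose interval contains t, or None."""
    # Step 1: index of the largest start that is <= t.
    i = bisect_right(starts, t) - 1
    if i < 0:
        return None
    # Step 2: keep the match only if t has not passed the end.
    start, end, event = intervals[i]
    return event if t <= end else None

times = [5, 10, 25, 40, 55, 61]
matched = [event_at(t) for t in times]
print(matched)
```

If the intervals overlapped, step 1 would only ever see the interval with the latest start, which is why this approach needs disjoint intervals.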

For the general case, the only approach I can think of is a cross merge followed by a filter:

(df1.assign(key=1)
.merge(df2.assign(key=1), on='key')
.query('start <= time <= end')
)

which should be doable with your data sizes, albeit a bit slow.
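
As a small illustration of why the cross merge handles the general case, here is a plain Python sketch with hypothetical overlapping intervals: every time is paired with every interval and only the pairs that satisfy start <= time <= end are kept.

```python
from itertools import product

# Hypothetical overlapping (start, end, event) intervals.
intervals = [(10, 30, "e1"), (20, 40, "e2")]
times = [5, 15, 25, 35, 45]

# Pair every time with every interval, then filter.
pairs = [
    (t, event)
    for t, (start, end, event) in product(times, intervals)
    if start <= t <= end
]
print(pairs)
```

A time that falls inside two intervals (25 here) yields two rows, and the work grows with len(times) * len(intervals), which is the slowness mentioned above.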

How to merge consecutive rows of date and time using MySQL

With LAG() window function:

select Data, Date, Time
from (
select *,
concat(Date, ' ', Time) - interval 30 minute <=
lag(concat(Date, ' ', Time)) over (partition by Data order by Date, Time) flag
from tablename
) t
where coalesce(flag, 0) = 0
order by Data, Date, Time

Results:

> Data  | Date       | Time    
> :---- | :--------- | :-------
> Data1 | 2020-08-19 | 13:00:00
> Data1 | 2020-08-21 | 07:00:00
> Data2 | 2020-08-20 | 08:00:00
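
The logic of the query (keep a row only when the previous row of the same Data value is more than 30 minutes earlier, or when there is no previous row) can be sketched in plain Python. The input rows below are hypothetical, chosen to be consistent with the results shown; WINDOW, rows and kept are illustrative names.

```python
from datetime import datetime, timedelta
from itertools import groupby

WINDOW = timedelta(minutes=30)
FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical input rows: (Data, "Date Time").
rows = [
    ("Data1", "2020-08-19 13:00:00"),
    ("Data1", "2020-08-19 13:20:00"),
    ("Data1", "2020-08-19 13:40:00"),
    ("Data1", "2020-08-21 07:00:00"),
    ("Data2", "2020-08-20 08:00:00"),
    ("Data2", "2020-08-20 08:10:00"),
]

# Equivalent of PARTITION BY Data ORDER BY Date, Time.
parsed = sorted((name, datetime.strptime(ts, FMT)) for name, ts in rows)

kept = []
for name, group in groupby(parsed, key=lambda r: r[0]):
    previous = None  # plays the role of LAG(), NULL for the first row
    for _, ts in group:
        if previous is None or ts - previous > WINDOW:
            kept.append((name, ts.strftime(FMT)))
        previous = ts

for row in kept:
    print(row)
```

Each row is compared with the row immediately before it, not with the last row that was kept, so a run of rows spaced 20 minutes apart collapses into its first row however long the run is.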

Merging records based on a time difference?

The code below manages to show both the merged rows (rows 1-2, 4-5) and the unique rows (row 3):

SELECT DISTINCT a.id,a.name,a.startdate,a.enddate
FROM temp a
LEFT JOIN temp b ON a.name = b.name AND a.id < b.id AND DATEDIFF(s,a.startdate,b.startdate)<=60
LEFT JOIN temp c ON c.name = a.name AND c.id < a.id AND DATEDIFF(s,c.startdate,a.startdate)<=60
WHERE (b.id IS NOT NULL OR c.id IS NULL) AND a.id <= COALESCE(c.id,a.id)

pandas merge two dataframe by key + datetime and an interval

It rather looks like you want to merge the data. As your intervals are disjoint, this is a perfect use case for merge_asof.

First make sure the columns have datetime type and that the data is sorted on the merging dates:

df1['date_start'] = pd.to_datetime(df1['date_start'])
df1['date_expiry'] = pd.to_datetime(df1['date_expiry'])
df2['transaction_datetime'] = pd.to_datetime(df2['transaction_datetime'])

df1 = df1.sort_values(by='date_start')
df2 = df2.sort_values(by='transaction_datetime')

Then perform the merge:

df3 = (
pd.merge_asof(df2, df1, by='code',
left_on='transaction_datetime',
right_on='date_start',
)
.sort_values(by='code')
.drop(['date_start', 'date_expiry'], axis=1)
)

Output:

   code transaction_datetime  amount tier
0 1 2020-01-02 13:45:05 20 A
1 1 2020-06-22 12:34:41 230 C
3 2 2020-11-12 15:47:35 50 C
2 3 2020-09-03 18:20:34 10 B

How to merge time intervals in SQL Server

You can use a recursive CTE to build a list of dates and then count the distinct dates.

declare @T table
(
startDate date,
endDate date
);

insert into @T values
('2011-01-01', '2011-01-05'),
('2011-01-04', '2011-01-08'),
('2011-01-11', '2011-01-15');

with C as
(
select startDate,
endDate
from @T
union all
select dateadd(day, 1, startDate),
endDate
from C
where dateadd(day, 1, startDate) < endDate
)
select count(distinct startDate) as DayCount
from C
option (MAXRECURSION 0)

Result:

DayCount
-----------
11

Or you can use a numbers table. Here I use master..spt_values:

declare @MinStartDate date
select @MinStartDate = min(startDate)
from @T

select count(distinct N.number)
from @T as T
inner join master..spt_values as N
on dateadd(day, N.Number, @MinStartDate) between T.startDate and dateadd(day, -1, T.endDate)
where N.type = 'P'
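
The day count can be cross-checked outside SQL Server. The sketch below uses a different technique from the two queries above: it sorts the intervals, merges the overlapping ones, and sums their lengths. It uses the same three date ranges and, like the queries, treats endDate as exclusive.

```python
from datetime import date

# Same ranges as inserted into @T above; endDate is treated as exclusive.
intervals = [
    (date(2011, 1, 1), date(2011, 1, 5)),
    (date(2011, 1, 4), date(2011, 1, 8)),
    (date(2011, 1, 11), date(2011, 1, 15)),
]

merged = []
for start, end in sorted(intervals):
    if merged and start <= merged[-1][1]:
        # Overlaps or touches the previous interval: extend it.
        merged[-1][1] = max(merged[-1][1], end)
    else:
        merged.append([start, end])

day_count = sum((end - start).days for start, end in merged)
print(day_count)
```

The first two ranges merge into 2011-01-01 to 2011-01-08 (7 days) and the third contributes 4 days, giving the same total of 11.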

