Count of Unique Values in a Rolling Date Range for R


Here's something that works, taking advantage of the non-equi join feature of data.table.

dt[dt[, .(date3 = date, date2 = date - 2, email)],
   on = .(date >= date2, date <= date3),
   allow.cartesian = TRUE
][, .(count = uniqueN(email)),
  by = .(date = date + 2)]
#          date count
# 1: 2011-12-30     3
# 2: 2011-12-31     3
# 3: 2012-01-01     3
# 4: 2012-01-02     3
# 5: 2012-01-03     1
# 6: 2012-01-04     2

To be honest I'm a bit fuzzy on how this works exactly, but the idea is to join dt to itself on date, matching any date that falls between two days ago and today. The reason we have to clean up with date = date + 2 afterwards is that a non-equi join replaces the join columns with the values from i; here the date column comes back holding date2 (i.e. date - 2), so adding 2 restores the original date.
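
For reference, a minimal sketch of the kind of input these snippets assume: dt is a data.table with a Date column date and an email column. The values below are hypothetical stand-ins (the question's original data isn't shown, so they won't reproduce the exact counts above):

library(data.table)
# hypothetical sample data: one email observation per row
dt <- data.table(
  date  = as.Date(c("2011-12-30", "2011-12-31", "2012-01-01",
                    "2012-01-02", "2012-01-03", "2012-01-04")),
  email = c("a@x.com", "b@x.com", "c@x.com", "a@x.com", "b@x.com", "b@x.com")
)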


Here's an approach using keys:

# key dt by date so the inner lookup is a fast binary search
setkey(dt, date)

# for each observed date, look up the 3-day window ending on that date;
# nomatch = 0L drops window dates with no rows
dt[, .(count = dt[.(seq.Date(.BY$date - 2L, .BY$date, "day")),
                  uniqueN(email), nomatch = 0L]),
   by = date]
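
This evaluates one group per date that actually occurs in dt. If you want a count for every calendar day in the range, the same keyed lookup can be run over a full date sequence; a sketch (my variation, not from the original answer):

# one row per calendar day, reusing the keyed window lookup on dt
all_days <- seq(min(dt$date), max(dt$date), by = "day")
data.table(date = all_days)[
  , .(count = dt[.(seq.Date(.BY$date - 2L, .BY$date, "day")),
                 uniqueN(email), nomatch = 0L]),
  by = date]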

Rolling count of distinct users

Using functions from dplyr and tidyr, for the 1-day window case:

library(dplyr)
library(tidyr)

have %>%
  group_by(when) %>%
  summarise(twoDayCount = n_distinct(user))

For larger windows:

window <- 2

have %>%
  rowwise() %>%
  mutate(when = list(when + lubridate::days(0:(window - 1)))) %>%
  unnest(cols = when) %>%
  group_by(when) %>%
  summarise(twoDayCount = n_distinct(user))

Note that this method will also give you rows for a few dates past the end of the data (in this case Jan 08), which you might want to remove.
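
One way to drop those trailing rows is to cap the result at the last observed date; a small sketch (result stands for the output of the pipeline above):

result %>%
  filter(when <= max(have$when))  # keep only dates that occur in the data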

If performance is an issue for larger datasets, here is a much faster (but slightly less elegant) solution:

window <- 2

library(lubridate)  # for days()

seq.Date(min(have$when), max(have$when), by = "day") %>%
  purrr::map(function(date) {
    have %>%
      filter(when <= date, when >= date - days(window - 1)) %>%
      summarise(userCount = n_distinct(user)) %>%
      mutate(when = date)
  }) %>%
  bind_rows()
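
As a side note, purrr::map_dfr() combines the map and bind_rows() steps into a single call (equivalent here):

seq.Date(min(have$when), max(have$when), by = "day") %>%
  purrr::map_dfr(function(date) {
    have %>%
      filter(when <= date, when >= date - days(window - 1)) %>%
      summarise(userCount = n_distinct(user)) %>%
      mutate(when = date)
  })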

Count unique IDs for each group within x days

An option using a non-equi join:

DT[, onemthago := date - 30L]

DT[, count :=
     DT[.SD, on = .(group, date >= onemthago, date <= date),
        by = .EACHI, length(unique(ID))]$V1
]

output:

    group       date ID  onemthago count
 1:     G 2014-04-01  2 2014-03-02     1
 2:     G 2014-04-12  3 2014-03-13     2
 3:     F 2014-04-07  4 2014-03-08     1
 4:     G 2014-05-03  2 2014-04-03     2
 5:     E 2014-04-14  3 2014-03-15     1
 6:     E 2014-05-04  1 2014-04-04     2
 7:     H 2014-03-31  2 2014-03-01     1
 8:     H 2014-04-18  4 2014-03-19     2
 9:     H 2014-04-23  2 2014-03-24     2
10:     A 2014-04-01  1 2014-03-02     1
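
Side note: data.table also provides uniqueN(), which can stand in for length(unique(ID)) here:

DT[, count :=
     DT[.SD, on = .(group, date >= onemthago, date <= date),
        by = .EACHI, uniqueN(ID)]$V1
]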

data:

library(data.table)
date <- as.Date(c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03",
                  "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18",
                  "2014-04-23", "2014-04-01"))
group <- c("G", "G", "F", "G", "E", "E", "H", "H", "H", "A")
ID <- c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
DT <- data.table(group, date, ID)

Edit: to address the comment on multiple lookback periods, you can try something like:

for (x in c(30L, 90L)) {
  DT[, daysago := date - x]

  DT[, paste0("count", x) :=
       .SD[.SD, on = .(group, date >= daysago, date <= date),
           by = .EACHI, length(unique(ID))]$V1
  ][]
}
DT
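
If the daysago helper column isn't wanted afterwards, it can be dropped in the usual data.table way:

DT[, daysago := NULL]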

Count observations over rolling 30 day window

With sapply and between(), count for each row the number of observations up to and including the current one whose dates fall within the previous 30 days.

library(lubridate)
library(dplyr)

dat %>%
  group_by(id) %>%
  mutate(newvar = sapply(seq_along(date),
                         function(x) sum(between(date[1:x],
                                                 date[x] - days(30),
                                                 date[x]))))

# A tibble: 16 x 4
# Groups:   id [2]
   id        q date       newvar
   <chr> <dbl> <date>      <int>
 1 a         1 2021-01-01      1
 2 a         1 2021-01-01      2
 3 a         1 2021-01-21      3
 4 a         1 2021-01-21      4
 5 a         1 2021-02-12      3
 6 a         1 2021-02-12      4
 7 a         1 2021-02-12      5
 8 a         1 2021-02-12      6
 9 b         1 2021-02-02      1
10 b         1 2021-02-02      2
11 b         1 2021-02-22      3
12 b         1 2021-02-22      4
13 b         1 2021-03-13      3
14 b         1 2021-03-13      4
15 b         1 2021-03-13      5
16 b         1 2021-03-13      6
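
For larger data, the quadratic sapply over rows can be avoided with the slider package's index-based rolling windows. A sketch of the same count (slider is an extra dependency; for a Date index, a numeric .before is interpreted in days; note it also counts later rows sharing the same date, so ties differ slightly from the sapply version):

library(slider)

dat %>%
  group_by(id) %>%
  mutate(newvar = slide_index_int(date, date, length, .before = 30))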

dplyr Running count of unique entries

This seems to give the result you are after:

library(dplyr)

df %>%
  group_by(subjectID) %>%
  mutate(
    n_tot = row_number(),
    n_case = cumsum(!duplicated(caseID))
  )

We use duplicated() to see whether each caseID is new, and then cumsum() to get a running count of new cases.
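
As a quick illustration of that idiom on a toy vector:

x <- c("a", "b", "a", "c")
!duplicated(x)          # TRUE TRUE FALSE TRUE
cumsum(!duplicated(x))  # 1 2 2 3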

Count occurrence of IDs within the last x days in R

A data.table option

dt[, date := as.Date(date)][, count := cumsum(date <= first(date) + 30) - 1, group]

gives

> dt
    group       date count
 1:     G 2014-04-01     0
 2:     G 2014-04-12     1
 3:     F 2014-04-07     0
 4:     G 2014-05-03     1
 5:     E 2014-04-14     0
 6:     E 2014-05-04     1
 7:     H 2014-03-31     0
 8:     H 2014-04-18     1
 9:     H 2014-04-23     2
10:     A 2014-04-01     0

A dplyr option following a similar idea:

dt %>%
  mutate(date = as.Date(date)) %>%
  group_by(group) %>%
  mutate(count = cumsum(date <= first(date) + 30) - 1) %>%
  ungroup()

R: Find the count of each unique value of a variable that occurs within timeframe of each observation

This is only a partial answer, since I did not fully understand your 2nd and 3rd problems...

# create a data.table with the correct names, based on your sample data (I think)
DT <- dt[, .(person = Species, date, value = Sepal.Width)]
# set keys
setkey(DT, person, date)
# create unique values of `value` in the last year before the observation, for each `person`
DT[ DT,
    # get the unique values for the last year, suppress immediate output with {}
    unique_values_prev_year := {
      val = DT[ person == i.person & date %between% c( i.date - lubridate::years(1), i.date ) ]$value
      unique_val = sort( unique( val ) )
      list( paste0( unique_val, collapse = "-" ) )
    },
    # do the above for each row
    by = .EACHI ]

output

#         person       date value                           unique_values_prev_year
#   1:    setosa 1970-09-14   3.5                                               3.5
#   2:    setosa 1970-09-15   3.0                                             3-3.5
#   3:    setosa 1970-09-16   3.2                                         3-3.2-3.5
#   4:    setosa 1970-09-17   3.1                                     3-3.1-3.2-3.5
#   5:    setosa 1970-09-19   3.9                                 3-3.1-3.2-3.5-3.9
#  ---
# 133: virginica 1970-10-28   3.3 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 134: virginica 1970-10-29   3.0 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 135: virginica 1970-10-30   2.5 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 136: virginica 1970-10-31   3.0 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 137: virginica 1970-11-01   3.4 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
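
Note that the per-row subset inside by = .EACHI scans the whole table once per row, which gets slow on larger data. The same computation can be phrased as a non-equi self-join, mirroring the pattern used in the answers above; a sketch under the same DT (my variation, not from the original answer):

# start of each row's one-year lookback window
DT[, yearago := date - lubridate::years(1)]
DT[, unique_values_prev_year :=
     DT[.SD, on = .(person, date >= yearago, date <= date),
        by = .EACHI, paste0(sort(unique(value)), collapse = "-")]$V1]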

r - compute rolling sum by id within specific time frame

Not sure this will be helpful given the dimensions of your data.

First, create a running index to handle duplicate dates (the rolling sum must not include the current row or other rows sharing its date that come after it), and also create the date one year ago (I would argue that 365 is better, but it seems the OP wants 366).

Then, perform a non-equi self-join, using the running index to ensure only strictly earlier rows are counted and the date conditions to keep rows within a year.

df[, c("rn", "oneYrAgo") := .(.I, date - 366)]

df[df,
.(roll_sum=.N, flag_sum=sum(flag, na.rm=TRUE)),
on=.(date >= oneYrAgo, rn < rn, id, date <= date),
by=.EACHI][,
-seq_len(2L)]

result:

    id       date roll_sum flag_sum
 1:  1 2012-03-26        0        0
 2:  1 2012-04-26        1        1
 3:  1 2015-06-27        0        0
 4:  1 2016-06-07        1        0
 5:  2 2012-06-22        0        0
 6:  2 2012-06-22        1        0
 7:  2 2012-10-12        2        0
 8:  2 2012-10-22        3        1
 9:  2 2012-11-05        4        2
10:  2 2012-11-19        5        3
11:  2 2012-11-26        6        4
12:  2 2013-12-12        0        0
13:  2 2013-12-13        1        1

