Count of unique values in a rolling date range for R
Here's something that works, taking advantage of the new non-equi joins feature of data.table.
dt[dt[ , .(date3=date, date2 = date - 2, email)],
on = .(date >= date2, date<=date3),
allow.cartesian = TRUE
][ , .(count = uniqueN(email)),
by = .(date = date + 2)]
# date count
# 1: 2011-12-30 3
# 2: 2011-12-31 3
# 3: 2012-01-01 3
# 4: 2012-01-02 3
# 5: 2012-01-03 1
# 6: 2012-01-04 2
The idea is to join dt to itself on date, matching any date that is between 2 days ago and today. The cleanup step date = date + 2 is needed because, in a non-equi join, the join column in the result takes its values from the inner table (here date2, which is date - 2), so adding 2 restores the original date.
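A minimal sketch of that behaviour, using hypothetical toy data (the values and the names windows, raw and res are assumptions for illustration, not from the question). It shows the plain join, and an alternative that names the output columns with the i. and x. prefixes so that no + 2 correction is required:

```r
library(data.table)

# Hypothetical toy data: one row per (date, email) event.
dt <- data.table(
  date  = as.Date("2012-01-01") + c(0L, 0L, 1L, 2L, 4L),
  email = c("a", "b", "a", "c", "a")
)

# One lookback window [date2, date3] per row of dt.
windows <- dt[, .(date3 = date, date2 = date - 2L)]

# Plain non-equi join: the `date` column of the result carries date2,
# not the dates stored in dt.
raw <- dt[windows, on = .(date >= date2, date <= date3), allow.cartesian = TRUE]

# Same join, but choosing the columns explicitly: i.date3 is the window end,
# x.email is the email of the matching row in dt.
res <- dt[windows,
          .(date = i.date3, email = x.email),
          on = .(date >= date2, date <= date3),
          allow.cartesian = TRUE
][, .(count = uniqueN(email)), by = date]

print(res)
```

Because the window end is carried through as i.date3, the grouping column already holds the right date.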
Here's an approach using keys:
setkey(dt, date)
dt[ , .(count = dt[.(seq.Date(.BY$date - 2L, .BY$date, "day")),
uniqueN(email), nomatch = 0L]), by = date]
rolling count of distinct users
Using functions from dplyr and tidyr, for the 1-day window case:
have %>%
group_by(when) %>%
summarise(twoDayCount = n_distinct(user))
For larger windows:
window <- 2
have %>%
rowwise() %>%
mutate(when = list(when + lubridate::days(0:(window - 1)))) %>%
unnest(cols = when) %>%
group_by(when) %>%
summarise(twoDayCount = n_distinct(user))
Note that this method will give you rows for a few later dates (in this case Jan 08), which you might want to remove.
If performance is an issue for larger datasets, here is a much faster (but slightly less elegant) solution:
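window <- 2
# one summary row per calendar day in the observed range
seq.Date(min(have$when), max(have$when), by = "day") %>%
  purrr::map(function(date) {
    have %>%
      # keep only the rows that fall inside the window ending on `date`
      filter(when <= date, when >= date - lubridate::days(window - 1)) %>%
      summarise(userCount = n_distinct(user)) %>%
      mutate(when = date)
  }) %>%
  bind_rows()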
Count unique IDs for each group within x days
An option using non-equi join:
DT[, onemthago := date - 30L]
DT[, count :=
DT[.SD, on=.(group, date>=onemthago, date<=date),
by=.EACHI, length(unique(ID))]$V1
]
output:
group date ID onemthago count
1: G 2014-04-01 2 2014-03-02 1
2: G 2014-04-12 3 2014-03-13 2
3: F 2014-04-07 4 2014-03-08 1
4: G 2014-05-03 2 2014-04-03 2
5: E 2014-04-14 3 2014-03-15 1
6: E 2014-05-04 1 2014-04-04 2
7: H 2014-03-31 2 2014-03-01 1
8: H 2014-04-18 4 2014-03-19 2
9: H 2014-04-23 2 2014-03-24 2
10: A 2014-04-01 1 2014-03-02 1
data:
date = as.Date(c("2014-04-01", "2014-04-12", "2014-04-07", "2014-05-03", "2014-04-14", "2014-05-04", "2014-03-31", "2014-04-18", "2014-04-23", "2014-04-01"))
group = c("G","G","F","G","E","E","H","H","H","A")
ID = c(2, 3, 4, 2, 3, 1, 2, 4, 2, 1)
library(data.table)
DT <- data.table(group, date, ID)
Edit to address the comment on multiple lookback periods: you can try something like this.
for (x in c(30L, 90L)) {
DT[, daysago := date - x]
DT[, paste0("count", x) :=
.SD[.SD, on=.(group, date>=daysago, date<=date),
by=.EACHI, length(unique(ID))]$V1
][]
}
DT
Count observations over rolling 30 day window
With sapply and between, count the number of observations up to and including the current one that fall within the 30 days before it.
library(lubridate)
library(dplyr)
dat %>%
group_by(id) %>%
mutate(newvar = sapply(seq(length(date)),
function(x) sum(between(date[1:x], date[x] - days(30), date[x]))))
# A tibble: 16 x 4
# Groups: id [2]
id q date newvar
<chr> <dbl> <date> <int>
1 a 1 2021-01-01 1
2 a 1 2021-01-01 2
3 a 1 2021-01-21 3
4 a 1 2021-01-21 4
5 a 1 2021-02-12 3
6 a 1 2021-02-12 4
7 a 1 2021-02-12 5
8 a 1 2021-02-12 6
9 b 1 2021-02-02 1
10 b 1 2021-02-02 2
11 b 1 2021-02-22 3
12 b 1 2021-02-22 4
13 b 1 2021-03-13 3
14 b 1 2021-03-13 4
15 b 1 2021-03-13 5
16 b 1 2021-03-13 6
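The same window count can be checked in base R for a single id. This is a sketch only: the dates are those shown for id "a" in the output above, and the names d and newvar are chosen for illustration.

```r
# Dates for id "a", copied from the output above.
d <- as.Date(c("2021-01-01", "2021-01-01", "2021-01-21", "2021-01-21",
               "2021-02-12", "2021-02-12", "2021-02-12", "2021-02-12"))

# For each row i, look only at rows 1..i and count those whose date lies
# in the closed interval [d[i] - 30, d[i]].
newvar <- sapply(seq_along(d), function(i) {
  so_far <- d[seq_len(i)]
  sum(so_far >= d[i] - 30 & so_far <= d[i])
})

print(newvar)
# 1 2 3 4 3 4 5 6
```

The count drops from 4 to 3 at the fifth row because the two 2021-01-01 observations are more than 30 days before 2021-02-12.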
dplyr Running count of unique entries
This seems to give the result you are after:
df %>%
group_by(subjectID) %>%
mutate(
n_tot = row_number(),
n_case=cumsum(!duplicated(caseID))
)
We use duplicated() to see if the case ID is new or not, and then use cumsum() to get a running count of new cases.
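The idiom is easy to see on a small vector. A minimal sketch, with made-up case IDs:

```r
# Hypothetical case IDs in order of arrival.
case_id <- c("a", "b", "a", "c", "b")

# TRUE the first time an ID appears, FALSE on every repeat.
is_new <- !duplicated(case_id)

# Summing the TRUEs cumulatively gives the number of distinct IDs seen so far.
running_unique <- cumsum(is_new)

print(running_unique)
# 1 2 2 3 3
```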
Count occurrence of IDs within the last x days in R
A data.table option:
dt[, date := as.Date(date)][, count := cumsum(date <= first(date) + 30) - 1, group]
gives
> dt
group date count
1: G 2014-04-01 0
2: G 2014-04-12 1
3: F 2014-04-07 0
4: G 2014-05-03 1
5: E 2014-04-14 0
6: E 2014-05-04 1
7: H 2014-03-31 0
8: H 2014-04-18 1
9: H 2014-04-23 2
10: A 2014-04-01 0
A dplyr option following a similar idea:
dt %>%
mutate(date = as.Date(date)) %>%
group_by(group) %>%
mutate(count = cumsum(date <= first(date) + 30) - 1) %>%
ungroup()
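To see what the cumsum expression computes, here is a base R sketch for one group, using the three dates of group G from the output above (the variable names are illustrative). Note that the 30-day limit is measured from the first date of the group, not from each row's own date:

```r
# Dates of group G, copied from the output above.
d <- as.Date(c("2014-04-01", "2014-04-12", "2014-05-03"))

# Is each date within 30 days of the group's first date (2014-05-01)?
within_30_of_first <- d <= d[1] + 30

# Running count of such rows, minus 1 so the first row starts at 0.
count <- cumsum(within_30_of_first) - 1

print(count)
# 0 1 1
```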
R: Find the count of each unique value of a variable that occurs within timeframe of each observation
This is only a partial answer, since I did not fully understand your 2nd and 3rd problem...
#create data.table with the correct names, based on your sample data (I think)
DT <- dt[, .(person = Species, date, value = Sepal.Width)]
#set keys
setkey(DT, person, date)
#create unique values of `value` in the last year before the observation, for each `person`
DT[ DT,
#get the unique values for the last year, suppress immediate output with {}
unique_values_prev_year := {
val = DT[ person == i.person & date %between% c( i.date - lubridate::years(1), i.date) ]$value
unique_val = sort( unique( val ) )
list( paste0( unique_val, collapse = "-" ) )
},
#do the above for each row
by = .EACHI ]
output
# person date value unique_values_prev_year
# 1: setosa 1970-09-14 3.5 3.5
# 2: setosa 1970-09-15 3.0 3-3.5
# 3: setosa 1970-09-16 3.2 3-3.2-3.5
# 4: setosa 1970-09-17 3.1 3-3.1-3.2-3.5
# 5: setosa 1970-09-19 3.9 3-3.1-3.2-3.5-3.9
# ---
# 133: virginica 1970-10-28 3.3 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 134: virginica 1970-10-29 3.0 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 135: virginica 1970-10-30 2.5 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 136: virginica 1970-10-31 3.0 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
# 137: virginica 1970-11-01 3.4 2.2-2.5-2.6-2.7-2.8-2.9-3-3.1-3.2-3.3-3.4-3.6-3.8
r - compute rolling sum by id within specific time frame
I'm not sure this will be helpful given the size of your data.
First, create a running row index (rn) to handle duplicate dates, so that each row only counts the rows that come before it, and also create the date one year ago (I would argue that 365 is better, but it seems the OP wants 366).
Then, perform a non-equi self-join in which rn < rn keeps only the earlier rows and the two date conditions keep only dates within a year.
df[, c("rn", "oneYrAgo") := .(.I, date - 366)]
df[df,
.(roll_sum=.N, flag_sum=sum(flag, na.rm=TRUE)),
on=.(date >= oneYrAgo, rn < rn, id, date <= date),
by=.EACHI][,
-seq_len(2L)]
result:
id date roll_sum flag_sum
1: 1 2012-03-26 0 0
2: 1 2012-04-26 1 1
3: 1 2015-06-27 0 0
4: 1 2016-06-07 1 0
5: 2 2012-06-22 0 0
6: 2 2012-06-22 1 0
7: 2 2012-10-12 2 0
8: 2 2012-10-22 3 1
9: 2 2012-11-05 4 2
10: 2 2012-11-19 5 3
11: 2 2012-11-26 6 4
12: 2 2013-12-12 0 0
13: 2 2013-12-13 1 1