Given Start Date and End Date, Reshape/Expand Data for Each Day Between (Each Day on a Row)

Using data.table
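For reference, a hypothetical input x of the assumed shape (the real data evidently had more rows, since the printed row numbers run to 120; this sketch is reconstructed from the output below):

# illustration only: two rows inferred from the printed output
x <- data.frame(
  PER_COST  = c(3451380, 3575311),
  START_DAY = as.Date(c("2013-02-26", "2013-06-21")),
  END_DAY   = as.Date(c("2013-03-02", "2013-06-25"))
)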

library(data.table)
setDT(x)[, list(DATE=seq(START_DAY, END_DAY, by = 'day')), PER_COST]
#     PER_COST       DATE
#  1:  3451380 2013-02-26
#  2:  3451380 2013-02-27
#  3:  3451380 2013-02-28
#  4:  3451380 2013-03-01
#  5:  3451380 2013-03-02
# ---
#116:  3575311 2013-06-21
#117:  3575311 2013-06-22
#118:  3575311 2013-06-23
#119:  3575311 2013-06-24
#120:  3575311 2013-06-25

If there are duplicate PER_COST values, it may be better to use 1:nrow(x) as the grouping variable:

setDT(x)[, list(DATE = seq(START_DAY, END_DAY, by = 'day'),
                PER_COST = rep(PER_COST, END_DAY - START_DAY + 1)), 1:nrow(x)]
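With by = 1:nrow(x), every row forms its own group, so rows that share a PER_COST value no longer collapse into a single date sequence.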

Update

Using dplyr

library(dplyr)
x %>%
  rowwise() %>%
  do(data.frame(DATE = seq(.$START_DAY, .$END_DAY, by = 'day'),
                PER_COST = rep(.$PER_COST, .$END_DAY - .$START_DAY + 1)))
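rowwise()/do() is an older dplyr idiom; a rough modern equivalent (a sketch assuming tidyr >= 1.0 and purrr are available) expands via a list-column and unnest():

library(tidyr)
library(purrr)
x %>%
  mutate(DATE = map2(START_DAY, END_DAY, seq, by = 'day')) %>%
  unnest(DATE) %>%
  select(PER_COST, DATE)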

Expand rows by date range using start and end date

Using data.table:

require(data.table) ## 1.9.2+
setDT(df)[ , list(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]

# you may use dot notation as a shorthand alias of list in j:
setDT(df)[ , .(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]

setDT converts df to a data.table by reference. Then, for each row (by = 1:nrow(df)), we create idnum and month as required.
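For reference, a hypothetical df of the assumed shape (illustration only, not the asker's data):

df <- data.frame(
  idnum = c(1, 2),
  start = as.Date(c("2014-01-01", "2014-03-01")),
  end   = as.Date(c("2014-04-01", "2014-05-01"))
)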

Given only vector of dates, expand data in between (unequal) date points

A possible way:

First, create the sequence of dates you're interested in as a one-column dataframe:

v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))

Next, join with your original dataframe and fill the missing values, for instance using dplyr and zoo:

library(dplyr)
library(zoo)

v %>%
  left_join(dd, by = "date") %>%
  na.locf()

NB: I suppose that your dataframe dd actually contains dates (and not factors), e.g.:

dd <- data.frame(date = as.Date(c("1999-03-22", "1999-03-29", "1999-04-08")),
                 work = c(43, 95, 92),
                 cumwork = c(43, 138, 230))
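With this dd, the result contains one row per day from 1999-03-22 through 1999-04-10, with work and cumwork carried forward from the most recent observation.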

Expanding rows by date whilst keeping all other variables

I apologise wholeheartedly, but I have found a previous question that answers mine! I had searched Stack Overflow for a good hour before posting and couldn't find what I was looking for. The related question is: R -- Expand date range into panel data by group

For anyone who may be interested, I used the following code (note: this code is for my complex dataset, not the sample dataset I used in my question):

f <- function(x) with(x, data.frame(dyadid, extraterritorial, rebpolwing,
                                    rebpolwinglegal, rebestimate, rebstrength,
                                    centcontrol, strengthcent, mobcap, armsproc,
                                    fightcap, terrcont, terrname, effterrcont,
                                    conflicttype, transconstsupp, rebextpart,
                                    rebpresosts, presname, rebel.support,
                                    rtypesup, rsupname, gov.support, gtypesup,
                                    gsupname, govextpart,
                                    date = seq(start_year_month, end_year_month, by = "month")))

NSA2 <- do.call("rbind", by(NSA1, 1:nrow(NSA1), f))
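Here by(NSA1, 1:nrow(NSA1), f) applies f to each row separately; f repeats that row's variables once for every month from start_year_month to end_year_month, and do.call("rbind", ...) stacks the pieces back into a single data frame.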

Start and end dates of time periods defined by a column in a data frame

Maybe something like:

library(data.table)
setDT(Yrs)[, .(StartDate = Date[Var == 3L], EndDate = Date[Var == 7L]),
           by = .(c(0L, cumsum(diff(Var) < 1L)))][, -1L]
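The helper c(0L, cumsum(diff(Var) < 1L)) increments a group id each time Var fails to increase, i.e. each time a new 3-to-7 cycle begins, and the trailing [, -1L] drops that grouping column. For reference, a hypothetical Yrs that reproduces the output below (illustration only):

Yrs <- data.frame(
  Date = as.POSIXct("2019-02-04 03:00:00", tz = "UTC") + 3600 * c(0:4, 12:16),
  Var  = c(3:7, 3:7)
)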

output:

             StartDate             EndDate
1: 2019-02-04 03:00:00 2019-02-04 07:00:00
2: 2019-02-04 15:00:00 2019-02-04 19:00:00

What is wrong with this code to generate new date rows in my Pandas dataframe based on start and end dates?

Manipulations where you explode a date range into a single row per date tend to be slow. Here's a straightforward method that uses a slow apply to create the date range and then uses explode to turn that into a row for every date.

import pandas as pd
# if the columns are not already datetime:
# df[['StartDate', 'EndDate']] = df[['StartDate', 'EndDate']].apply(pd.to_datetime)

df['Date'] = df.apply(lambda r: pd.date_range(r.StartDate, r.EndDate, freq='D'), axis=1)

df = (df.assign(Active=1).drop(columns=['StartDate', 'EndDate']).explode('Date')
        .set_index(['ObjectID', 'WatchListID', 'Date']))

print(df)
#                                  Active
# ObjectID WatchListID Date
# 101      1           2017-01-01      1
#                      2017-01-02      1
#                      2017-01-03      1
#                      2017-01-04      1
#                      2017-01-05      1
# ...                                ...
# 103      3           2017-03-28      1
#                      2017-03-29      1
#                      2017-03-30      1
#                      2017-03-31      1
#                      2017-04-01      1
#
# [1249 rows x 1 columns]

The above is straightforward, but slow because of the apply. Instead, if performance is key, we can use some more clever methods to achieve the same result. Since we're keeping all the information static and just incrementing the day, we can achieve this using Series.repeat and then adding an increasing number of days to each row within the group. The addition can be achieved with an optimized groupby + cumsum and pd.to_timedelta.

# Repeat all static information
df = (df.set_index([*df]).assign(Active=1)['Active']
        .repeat((df['EndDate'] - df['StartDate']).dt.days + 1)
        .reset_index().drop(columns='EndDate'))

# Increment the day
df['Date'] = df['StartDate'] + pd.to_timedelta(
    df.groupby(['ObjectID', 'WatchListID', 'StartDate'])['Active'].cumsum() - 1,
    unit='D')

# Clean up columns, create MultiIndex
df = df.drop(columns='StartDate').set_index(['ObjectID', 'WatchListID', 'Date'])
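The trick here: within each repeated group, Active is all ones, so groupby(...).cumsum() - 1 yields the offsets 0, 1, 2, ..., which pd.to_timedelta converts into day increments added onto StartDate.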

Timings:

import perfplot
import pandas as pd
import numpy as np

def repeat_addtimedelta(df):
    df = (df.set_index([*df]).assign(Active=1)['Active']
            .repeat((df['EndDate'] - df['StartDate']).dt.days + 1)
            .reset_index().drop(columns='EndDate'))

    df['Date'] = df['StartDate'] + pd.to_timedelta(
        df.groupby(['ObjectID', 'WatchListID', 'StartDate'])['Active'].cumsum() - 1,
        unit='D')

    df = df.drop(columns='StartDate').set_index(['ObjectID', 'WatchListID', 'Date'])

    return df

def apply_explode(df):
    df['Date'] = df.apply(lambda r: pd.date_range(r.StartDate, r.EndDate, freq='D'), axis=1)

    df = (df.assign(Active=1).drop(columns=['StartDate', 'EndDate']).explode('Date')
            .set_index(['ObjectID', 'WatchListID', 'Date']))

    return df

perfplot.show(
    setup=lambda n: pd.DataFrame({'ObjectID': range(n), 'WatchListID': 100 + np.arange(n),
                                  'StartDate': pd.date_range('2010-01-01', freq='D', periods=n),
                                  'EndDate': pd.date_range('2010-06-01', freq='4D', periods=n)}),
    kernels=[
        lambda df: repeat_addtimedelta(df),
        lambda df: apply_explode(df),
    ],
    labels=['Repeat+Add', 'Apply+Explode'],
    n_range=[2 ** k for k in range(1, 12)],
    equality_check=lambda x, y: x.compare(y).empty,
    xlabel='len(df)'
)

(Plot: perfplot timing results for Repeat+Add vs Apply+Explode.)

Changing date range into series of dates (wide to long)

This way?

library(tidyverse)

dat %>%
  group_by(Subject, Period, Dose) %>%
  summarize(Day = list(seq(Start, End, by = 'day'))) %>%
  unnest(Day) %>%
  mutate(Dose = cumsum(Dose)) %>%
  ungroup()
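Because Dose is constant within each (Subject, Period, Dose) group, cumsum(Dose) turns the per-day dose into a running total over the expanded days.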

Output:

# A tibble: 392 x 4
   Subject Period  Dose Day
   <fct>   <fct>  <dbl> <date>
 1 13434   MAD      400 2017-04-18
 2 13434   MAD      800 2017-04-19
 3 13434   MAD     1200 2017-04-20
 4 13434   MAD     1600 2017-04-21
 5 13434   MAD     2000 2017-04-22
 6 13434   MAD     2400 2017-04-23
 7 13434   MAD     2800 2017-04-24
 8 13434   MAD     3200 2017-04-25
 9 13434   MAD     3600 2017-04-26
10 13434   MAD     4000 2017-04-27
# ... with 382 more rows

I assume that the tuples (Subject, Period, Dose) are unique. If not, you can additionally group by Start and End.

And the 'ideal-world' version might be approached this way:

dat %>%
  group_by(Subject, Period, Dose) %>%
  summarize(Day = list(seq(Start, End, by = 'day'))) %>%
  unnest(Day) %>%
  group_by(Subject) %>%
  arrange(Day) %>%
  mutate(Dose = cumsum(Dose)) %>%
  ungroup()
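Here the cumulative sum runs per Subject across all periods in date order, rather than within each (Subject, Period, Dose) group.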

If we add the following line to the code above:

... %>% filter(Day >= as.Date("2018-12-11"), Day <= as.Date("2018-12-12"),
               Subject == "22222")

It will output:

  Subject Period   Dose Day
  <fct>   <fct>   <dbl> <date>
1 22222   OSE    102000 2018-12-11
2 22222   OSE    103200 2018-12-12

So it seems that it correctly calculates the cumulative sum (adding 1200, which is the next dose for the next period) across periods that follow one after another.

Get all possible combinations in a time-series data with variable daily readings

We could filter the rows where 'record' is greater than 1, group_split by row_number() and 'date', then bind each piece with the filtered rows where 'record' is 1:

library(dplyr)
library(purrr)
out <- consumption %>%
  filter(n > 1) %>%
  group_split(date, rn = row_number()) %>%
  map(~ bind_rows(consumption %>% filter(n == 1),
                  .x %>% select(-rn)) %>%
        arrange(date))

Output:

> out
[[1]]
# A tibble: 4 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1

[[2]]
# A tibble: 4 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1

[[3]]
# A tibble: 4 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1

With the updated data, we create the row numbers, split them by the 'date' column (as in @ThomasIsCoding's solution), use crossing (from tidyr) to expand into all combinations, and then loop over the rows with pmap, slicing the rows of the original data based on the row indices:

library(tidyr)
library(tibble)
consumption %>%
  transmute(date, rn = row_number()) %>%
  deframe %>%
  split(names(.)) %>%
  invoke(crossing, .) %>%
  pmap(~ consumption %>%
         slice(c(...))) %>%
  unname
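With three candidate readings on 2020-06-03 and two on 2020-06-05, crossing yields 3 x 2 = 6 combinations, which matches the six data frames below.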

Output:

[[1]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[2]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

[[3]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[4]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

[[5]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[6]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

