Given Start Date and End Date, Reshape/Expand Data for Each Day Between (Each Day on a Row)

Using data.table
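For reference, a hypothetical input x of the assumed shape (the real data evidently had more rows, since the printed row numbers run to 120; this sketch is reconstructed from the output below):

# illustration only: two rows inferred from the printed output
x <- data.frame(
  PER_COST  = c(3451380, 3575311),
  START_DAY = as.Date(c("2013-02-26", "2013-06-21")),
  END_DAY   = as.Date(c("2013-03-02", "2013-06-25"))
)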

library(data.table)
setDT(x)[, list(DATE=seq(START_DAY, END_DAY, by = 'day')), PER_COST]
#     PER_COST       DATE
#  1:  3451380 2013-02-26
#  2:  3451380 2013-02-27
#  3:  3451380 2013-02-28
#  4:  3451380 2013-03-01
#  5:  3451380 2013-03-02
# ---
#116:  3575311 2013-06-21
#117:  3575311 2013-06-22
#118:  3575311 2013-06-23
#119:  3575311 2013-06-24
#120:  3575311 2013-06-25

If there are duplicate PER_COST values, it may be better to use 1:nrow(x) as the grouping variable:

setDT(x)[, list(DATE = seq(START_DAY, END_DAY, by = 'day'),
                PER_COST = rep(PER_COST, END_DAY - START_DAY + 1)), 1:nrow(x)]
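With by = 1:nrow(x), every row forms its own group, so rows that share a PER_COST value no longer collapse into a single date sequence.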

Update

Using dplyr

library(dplyr)
x %>%
  rowwise() %>%
  do(data.frame(DATE = seq(.$START_DAY, .$END_DAY, by = 'day'),
                PER_COST = rep(.$PER_COST, .$END_DAY - .$START_DAY + 1)))
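rowwise()/do() is an older dplyr idiom; a rough modern equivalent (a sketch assuming tidyr >= 1.0 and purrr are available) expands via a list-column and unnest():

library(tidyr)
library(purrr)
x %>%
  mutate(DATE = map2(START_DAY, END_DAY, seq, by = 'day')) %>%
  unnest(DATE) %>%
  select(PER_COST, DATE)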

Expand rows by date range using start and end date

Using data.table:

require(data.table) ## 1.9.2+
setDT(df)[ , list(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]

# you may use dot notation as a shorthand alias of list in j:
setDT(df)[ , .(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]

setDT converts df to a data.table by reference. Then, for each row (by = 1:nrow(df)), we create idnum and month as required.
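For reference, a hypothetical df of the assumed shape (illustration only, not the asker's data):

df <- data.frame(
  idnum = c(1, 2),
  start = as.Date(c("2014-01-01", "2014-03-01")),
  end   = as.Date(c("2014-04-01", "2014-05-01"))
)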

Given only vector of dates, expand data in between (unequal) date points

A possible way:

First, create the sequence of dates you're interested in as a one-column dataframe:

v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))

Next, join with your original dataframe and fill the missing values, for instance using dplyr and zoo:

library(dplyr)
library(zoo)

v %>%
  left_join(dd, by = "date") %>%
  na.locf()

NB: I suppose that your dataframe dd actually contains dates (and not factors), e.g.:

dd <- data.frame(date = as.Date(c("1999-03-22", "1999-03-29", "1999-04-08")),
                 work = c(43, 95, 92),
                 cumwork = c(43, 138, 230))
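With this dd, the result contains one row per day from 1999-03-22 through 1999-04-10, with work and cumwork carried forward from the most recent observation.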

Expanding rows by date whilst keeping all other variables

I apologise wholeheartedly, but I have found a previous question that answers mine! I had searched Stack Overflow for a good hour before posting and couldn't find what I was looking for. The related question is: R -- Expand date range into panel data by group

For anyone who may be interested, I used the following code (note: this code is for my complex dataset, not the sample dataset I used in my question):

f <- function(x) with(x, data.frame(dyadid, extraterritorial, rebpolwing,
                                    rebpolwinglegal, rebestimate, rebstrength,
                                    centcontrol, strengthcent, mobcap, armsproc,
                                    fightcap, terrcont, terrname, effterrcont,
                                    conflicttype, transconstsupp, rebextpart,
                                    rebpresosts, presname, rebel.support,
                                    rtypesup, rsupname, gov.support, gtypesup,
                                    gsupname, govextpart,
                                    date = seq(start_year_month, end_year_month, by = "month")))

NSA2 <- do.call("rbind", by(NSA1, 1:nrow(NSA1), f))
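Here by(NSA1, 1:nrow(NSA1), f) applies f to each row separately; f repeats that row's variables once for every month from start_year_month to end_year_month, and do.call("rbind", ...) stacks the pieces back into a single data frame.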

Start and end dates of time periods defined by a column in a data frame

Maybe something like:

library(data.table)
setDT(Yrs)[, .(StartDate = Date[Var == 3L], EndDate = Date[Var == 7L]),
           by = .(c(0L, cumsum(diff(Var) < 1L)))][, -1L]
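The helper c(0L, cumsum(diff(Var) < 1L)) increments a group id each time Var fails to increase, i.e. each time a new 3-to-7 cycle begins, and the trailing [, -1L] drops that grouping column. For reference, a hypothetical Yrs that reproduces the output below (illustration only):

Yrs <- data.frame(
  Date = as.POSIXct("2019-02-04 03:00:00", tz = "UTC") + 3600 * c(0:4, 12:16),
  Var  = c(3:7, 3:7)
)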

output:

             StartDate             EndDate
1: 2019-02-04 03:00:00 2019-02-04 07:00:00
2: 2019-02-04 15:00:00 2019-02-04 19:00:00

What is wrong with this code to generate new date rows in my Pandas dataframe based on start and end dates?

Manipulations where you explode a date range into a single row per date tend to be slow. Here's a straightforward method that uses a slow apply to create the date range and then uses explode to turn that into a row for every date.

import pandas as pd
# if the columns are not already datetime:
# df[['StartDate', 'EndDate']] = df[['StartDate', 'EndDate']].apply(pd.to_datetime)

df['Date'] = df.apply(lambda r: pd.date_range(r.StartDate, r.EndDate, freq='D'), axis=1)

df = (df.assign(Active=1).drop(columns=['StartDate', 'EndDate']).explode('Date')
        .set_index(['ObjectID', 'WatchListID', 'Date']))

print(df)
#                                  Active
# ObjectID WatchListID Date
# 101      1           2017-01-01      1
#                      2017-01-02      1
#                      2017-01-03      1
#                      2017-01-04      1
#                      2017-01-05      1
# ...                                ...
# 103      3           2017-03-28      1
#                      2017-03-29      1
#                      2017-03-30      1
#                      2017-03-31      1
#                      2017-04-01      1
#
# [1249 rows x 1 columns]

The above is straightforward, but slow because of the apply. Instead, if performance is key, we can use some more clever methods to achieve the same result. Since we're keeping all the information static and just incrementing the day, we can achieve this using Series.repeat and then adding an increasing number of days to each row within the group. The addition can be achieved with an optimized groupby + cumsum and pd.to_timedelta.

# Repeat all static information
df = (df.set_index([*df]).assign(Active=1)['Active']
        .repeat((df['EndDate'] - df['StartDate']).dt.days + 1)
        .reset_index().drop(columns='EndDate'))

# Increment the day
df['Date'] = df['StartDate'] + pd.to_timedelta(
    df.groupby(['ObjectID', 'WatchListID', 'StartDate'])['Active'].cumsum() - 1,
    unit='D')

# Clean up columns, create MultiIndex
df = df.drop(columns='StartDate').set_index(['ObjectID', 'WatchListID', 'Date'])
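The trick here: within each repeated group, Active is all ones, so groupby(...).cumsum() - 1 yields the offsets 0, 1, 2, ..., which pd.to_timedelta converts into day increments added onto StartDate.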

Timings:

import perfplot
import pandas as pd
import numpy as np

def repeat_addtimedelta(df):
    df = (df.set_index([*df]).assign(Active=1)['Active']
            .repeat((df['EndDate'] - df['StartDate']).dt.days + 1)
            .reset_index().drop(columns='EndDate'))

    df['Date'] = df['StartDate'] + pd.to_timedelta(
        df.groupby(['ObjectID', 'WatchListID', 'StartDate'])['Active'].cumsum() - 1,
        unit='D')

    df = df.drop(columns='StartDate').set_index(['ObjectID', 'WatchListID', 'Date'])

    return df

def apply_explode(df):
    df['Date'] = df.apply(lambda r: pd.date_range(r.StartDate, r.EndDate, freq='D'), axis=1)

    df = (df.assign(Active=1).drop(columns=['StartDate', 'EndDate']).explode('Date')
            .set_index(['ObjectID', 'WatchListID', 'Date']))

    return df

perfplot.show(
    setup=lambda n: pd.DataFrame({'ObjectID': range(n), 'WatchListID': 100 + np.arange(n),
                                  'StartDate': pd.date_range('2010-01-01', freq='D', periods=n),
                                  'EndDate': pd.date_range('2010-06-01', freq='4D', periods=n)}),
    kernels=[
        lambda df: repeat_addtimedelta(df),
        lambda df: apply_explode(df),
    ],
    labels=['Repeat+Add', 'Apply+Explode'],
    n_range=[2 ** k for k in range(1, 12)],
    equality_check=lambda x, y: x.compare(y).empty,
    xlabel='len(df)'
)

(Plot: perfplot timing results for Repeat+Add vs Apply+Explode.)

Changing date range into series of dates (wide to long)

This way?

library(tidyverse)

dat %>%
  group_by(Subject, Period, Dose) %>%
  summarize(Day = list(seq(Start, End, by = 'day'))) %>%
  unnest(Day) %>%
  mutate(Dose = cumsum(Dose)) %>%
  ungroup()
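Because Dose is constant within each (Subject, Period, Dose) group, cumsum(Dose) turns the per-day dose into a running total over the expanded days.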

Output:

# A tibble: 392 x 4
   Subject Period  Dose Day
   <fct>   <fct>  <dbl> <date>
 1 13434   MAD      400 2017-04-18
 2 13434   MAD      800 2017-04-19
 3 13434   MAD     1200 2017-04-20
 4 13434   MAD     1600 2017-04-21
 5 13434   MAD     2000 2017-04-22
 6 13434   MAD     2400 2017-04-23
 7 13434   MAD     2800 2017-04-24
 8 13434   MAD     3200 2017-04-25
 9 13434   MAD     3600 2017-04-26
10 13434   MAD     4000 2017-04-27
# ... with 382 more rows

I assume that the tuples (Subject, Period, Dose) are unique. If not, you can additionally group by Start and End.

And the 'ideal-world' version might be approached this way:

dat %>%
  group_by(Subject, Period, Dose) %>%
  summarize(Day = list(seq(Start, End, by = 'day'))) %>%
  unnest(Day) %>%
  group_by(Subject) %>%
  arrange(Day) %>%
  mutate(Dose = cumsum(Dose)) %>%
  ungroup()
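Here the cumulative sum runs per Subject across all periods in date order, rather than within each (Subject, Period, Dose) group.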

If we add the following line to the code above:

... %>% filter(Day >= as.Date("2018-12-11"), Day <= as.Date("2018-12-12"),
               Subject == "22222")

It will output:

  Subject Period   Dose Day
  <fct>   <fct>   <dbl> <date>
1 22222   OSE    102000 2018-12-11
2 22222   OSE    103200 2018-12-12

So it seems that it correctly calculates the cumulative sum (adding 1200, which is the next dose for the next period) across periods that follow one after another.

Get all possible combinations in a time-series data with variable daily readings

We could filter the rows where 'record' is greater than 1, group_split by row_number() and 'date', then bind each piece with the filtered rows where 'record' is 1:

library(dplyr)
library(purrr)
out <- consumption %>%
  filter(n > 1) %>%
  group_split(date, rn = row_number()) %>%
  map(~ bind_rows(consumption %>% filter(n == 1),
                  .x %>% select(-rn)) %>%
        arrange(date))

Output:

> out
[[1]]
# A tibble: 4 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1

[[2]]
# A tibble: 4 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1

[[3]]
# A tibble: 4 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1

With the updated data, we create the row numbers, split them by the 'date' column (as in @ThomasIsCoding's solution), use crossing (from tidyr) to expand into all combinations, and then loop over the rows with pmap, slicing the rows of the original data based on the row indices:

library(tidyr)
library(tibble)
consumption %>%
  transmute(date, rn = row_number()) %>%
  deframe %>%
  split(names(.)) %>%
  invoke(crossing, .) %>%
  pmap(~ consumption %>%
         slice(c(...))) %>%
  unname
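With three candidate readings on 2020-06-03 and two on 2020-06-05, crossing yields 3 x 2 = 6 combinations, which matches the six data frames below.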

Output:

[[1]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[2]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    31     3      1
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

[[3]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[4]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    32     3      2
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

[[5]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1
5 2020-06-05    51     2      1

[[6]]
# A tibble: 5 x 4
  date         val     n record
  <date>     <dbl> <int>  <int>
1 2020-06-01    10     1      1
2 2020-06-02    20     1      1
3 2020-06-03    33     3      3
4 2020-06-04    40     1      1
5 2020-06-05    52     2      2

