Given start date and end date, reshape/expand data for each day between (each day on a row)
Using data.table
library(data.table)
setDT(x)[, list(DATE=seq(START_DAY, END_DAY, by = 'day')), PER_COST]
# PER_COST DATE
# 1: 3451380 2013-02-26
# 2: 3451380 2013-02-27
# 3: 3451380 2013-02-28
# 4: 3451380 2013-03-01
# 5: 3451380 2013-03-02
#---
#116: 3575311 2013-06-21
#117: 3575311 2013-06-22
#118: 3575311 2013-06-23
#119: 3575311 2013-06-24
#120: 3575311 2013-06-25
If there are duplicate PER_COST
, then it may be better to use 1:nrow(x)
as the grouping variable
setDT(x)[, list(DATE=seq(START_DAY, END_DAY, by = 'day'),
PER_COST=rep(PER_COST, END_DAY-START_DAY+1)), 1:nrow(x)]
Update
Using dplyr
library(dplyr)
x %>%
rowwise() %>%
do(data.frame(DATE=seq(.$START_DAY, .$END_DAY, by='day'),
PER_COST= rep(.$PER_COST, .$END_DAY-.$START_DAY+1)))
Expand rows by date range using start and end date
Using data.table
:
require(data.table) ## 1.9.2+
setDT(df)[ , list(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]
# you may use dot notation as a shorthand alias of list in j:
setDT(df)[ , .(idnum = idnum, month = seq(start, end, by = "month")), by = 1:nrow(df)]
setDT
converts df
to a data.table
. Then for each row, by = 1:nrow(df)
, we create idnum
and month
as required.
Given only vector of dates, expand data in between (unequal) date points
A possible way:
First, create the sequence of dates you're interested in as a one-column dataframe:
v <- data.frame(date = seq(min(dd$date), as.Date("1999-04-10"), by="day"))
Next, join with your original dataframe and fill the missing values, for instance using dplyr
and zoo
:
library(dplyr)
library(zoo)
v %>%
left_join(dd, by = "date") %>%
na.locf
NB: I suppose that your dataframe dd
actually contains dates (and not factors).
dd <- data.frame(date=as.Date(c("1999-03-22","1999-03-29","1999-04-08")),work=c(43,95,92),cumwork=c(43,138,230))
Expanding rows by date whilst keeping all other variables
I apologise wholeheartedly, but I have found a previous question that answers my question! I had searched on stackoverflow for a good hour before posting my question and couldn't find what I was looking for. This link to the related question is: R -- Expand date range into panel data by group
For anyone that may be interested, I used the following code (note: this code is for my complex dataset, not the sample dataset I used in my question):
f <- function(x) with(x, data.frame(dyadid, extraterritorial, rebpolwing,
rebpolwinglegal, rebestimate, rebstrength,
centcontrol, strengthcent, mobcap, armsproc,
fightcap, terrcont, terrname, effterrcont,
conflicttype, transconstsupp, rebextpart,
rebpresosts, presname, rebel.support,
rtypesup, rsupname, gov.support, gtypesup,
gsupname, govextpart,
date = seq(start_year_month, end_year_month, by = "month")))
NSA2 <- do.call("rbind", by(NSA1, 1:nrow(NSA1), f))
Start and end dates of time periods defined by a column in a data frame
Maybe something like:
library(data.table)
setDT(Yrs)[, .(StartDate=Date[Var==3L], EndDate=Date[Var==7L]),
by=.(c(0L, cumsum(diff(Var) < 1L)))][, -1L]
output:
StartDate EndDate
1: 2019-02-04 03:00:00 2019-02-04 07:00:00
2: 2019-02-04 15:00:00 2019-02-04 19:00:00
What is wrong with this code to generate new date rows in my Pandas dataframe based on start and end dates?
Manipulations where you explode from a date range to a single row per date tend to be slow. Here's a straight-forward method that uses a slow apply to create the daterange
and then uses explode
to turn that into a row for every date.
import pandas a pd
#df[['StartDate', 'EndDate']] = df[['StartDate', 'EndDate']].apply(pd.to_datetime)
df['Date'] = df.apply(lambda r: pd.date_range(r.StartDate, r.EndDate, freq='D'), axis=1)
df = (df.assign(Active=1).drop(columns=['StartDate', 'EndDate']).explode('Date')
.set_index(['ObjectID', 'WatchListID', 'Date']))
print(df)
# Active
#ObjectID WatchListID Date
#101 1 2017-01-01 1
# 2017-01-02 1
# 2017-01-03 1
# 2017-01-04 1
# 2017-01-05 1
#... ...
#103 3 2017-03-28 1
# 2017-03-29 1
# 2017-03-30 1
# 2017-03-31 1
# 2017-04-01 1
#[1249 rows x 1 columns]
The above is straight-forward, but slow because of the apply
. Instead, if performance is key we can use some more clever methods to achieve the same. Since you're keeping all the information static and just incrementing the day we can achieve this using Series.repeat
and then adding 1 day to each row within the group. The addition can be achieved with an optimized gropuby
+ cumsum
and add with pd.to_timedelta
.
# Repeat all static information
df = (df.set_index([*df]).assign(Active=1)['Active']
.repeat((df['EndDate'] - df['StartDate']).dt.days + 1)
.reset_index().drop(columns='EndDate'))
# Increment the Day
df['Date'] = df['StartDate'] + pd.to_timedelta((df.groupby(['ObjectID', 'WatchListID', 'StartDate'])
['Active'].cumsum()-1), unit='D')
# Clean up columns, create MultiIndex
df = df.drop(columns='StartDate').set_index(['ObjectID', 'WatchListID', 'Date'])
Timings:
import perfplot
import pandas as pd
import numpy as np
def repeat_addtimedelta(df):
df = (df.set_index([*df]).assign(Active=1)['Active']
.repeat((df['EndDate'] - df['StartDate']).dt.days + 1)
.reset_index().drop(columns='EndDate'))
df['Date'] = df['StartDate'] + pd.to_timedelta((df.groupby(['ObjectID', 'WatchListID', 'StartDate'])
['Active'].cumsum()-1), unit='D')
df = df.drop(columns='StartDate').set_index(['ObjectID', 'WatchListID', 'Date'])
return df
def apply_explode(df):
df['Date'] = df.apply(lambda r: pd.date_range(r.StartDate, r.EndDate, freq='D'), axis=1)
df = (df.assign(Active=1).drop(columns=['StartDate', 'EndDate']).explode('Date')
.set_index(['ObjectID', 'WatchListID', 'Date']))
return df
perfplot.show(
setup=lambda n: pd.DataFrame({'ObjectID': range(n), 'WatchListID': 100+np.arange(n),
'StartDate': pd.date_range('2010-01-01', freq='D', periods=n),
'EndDate': pd.date_range('2010-06-01', freq='4D', periods=n)}),
kernels=[
lambda df: repeat_addtimedelta(df),
lambda df: apply_explode(df),
],
labels=['Repeat+Add', 'Apply+Explode'],
n_range=[2 ** k for k in range(1, 12)],
equality_check=lambda x,y: x.compare(y).empty,
xlabel='len(df)'
)
Changing date range into series of dates (wide to long)
This way?
library(tidyverse)
dat %>%
group_by(Subject, Period, Dose) %>%
summarize(Day = list(seq(Start, End, by = 'day'))) %>%
unnest(Day) %>%
mutate(Dose = cumsum(Dose)) %>%
ungroup()
Output:
# A tibble: 392 x 4
Subject Period Dose Day
<fct> <fct> <dbl> <date>
1 13434 MAD 400 2017-04-18
2 13434 MAD 800 2017-04-19
3 13434 MAD 1200 2017-04-20
4 13434 MAD 1600 2017-04-21
5 13434 MAD 2000 2017-04-22
6 13434 MAD 2400 2017-04-23
7 13434 MAD 2800 2017-04-24
8 13434 MAD 3200 2017-04-25
9 13434 MAD 3600 2017-04-26
10 13434 MAD 4000 2017-04-27
# ... with 382 more rows
I assume that tuples (Subject, Period, Dose)
are unique. If not you can add grouping by Start
End
.
And the 'ideal-world', might be approached this way:
dat %>%
group_by(Subject, Period, Dose) %>%
summarize(Day = list(seq(Start, End, by = 'day'))) %>%
unnest(Day) %>%
group_by(Subject) %>%
arrange(Day) %>%
mutate(Dose = cumsum(Dose)) %>%
ungroup()
If we add following line to code above:
... %>% filter(Day >= as.Date("2018-12-11"), Day <= as.Date("2018-12-12"),
Subject == "22222")
It will output:
Subject Period Dose Day
<fct> <fct> <dbl> <date>
1 22222 OSE 102000 2018-12-11
2 22222 OSE 103200 2018-12-12
So it seems that it correctly calculates the cumsum
(adding 1200, which is the next dose for the next period) for the periods which follows one aft another.
Get all possible combinations in a time-series data with variable daily readings
We could filter
where the 'record' is greater than 1, group_split
by 'row_number' and 'date', then bind the rows with the filter
ed data where the 'record' is 1
library(dplyr)
library(purrr)
out <- consumption %>%
filter(n > 1) %>%
group_split(date, rn = row_number()) %>%
map(~ bind_rows(consumption %>%
filter(n == 1), .x %>%
select(-rn)) %>%
arrange(date))
-output
> out
[[1]]
# A tibble: 4 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
[[2]]
# A tibble: 4 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
[[3]]
# A tibble: 4 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
With the updated data, we create the row_number()
, then split
it by 'date' column (as in @ThomasIsCoding solution), use crossing
(from purrr
) to expand the data, and loop over the rows with pmap
, slice
the rows of the original data based on the row index
library(tidyr)
library(tibble)
consumption %>%
transmute(date, rn = row_number()) %>%
deframe %>%
split(names(.)) %>%
invoke(crossing, .) %>%
pmap(~ consumption %>%
slice(c(...))) %>%
unname
-output
[[1]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[2]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 31 3 1
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[3]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[4]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 32 3 2
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
[[5]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 51 2 1
[[6]]
# A tibble: 5 x 4
date val n record
<date> <dbl> <int> <int>
1 2020-06-01 10 1 1
2 2020-06-02 20 1 1
3 2020-06-03 33 3 3
4 2020-06-04 40 1 1
5 2020-06-05 52 2 2
Related Topics
Why Does Is.Vector() Return True for List
Using R to Do a Regression with Multiple Dependent and Multiple Independent Variables
Max and Min Functions That Are Similar to Colmeans
"Unpacking" a Factor List from a Data.Frame
Multiple Y Axis for Bar Plot and Line Graph Using Ggplot
Purrr:Map and Glm - Issues with Call
Get Data Frame from Character Variable
Aggregate by Multiple Columns and Reshape from Long to Wide
Significance Level Added to Matrix Correlation Heatmap Using Ggplot2
Nas Are Not Allowed in Subscripted Assignments
How to Plot a List of Vectors with Different Lengths
R - How to Get a Value of a Multi-Dimensional Array by a Vector of Indices
Combine Multiple PDF Plots into One File
Create a Histogram for Weighted Values
Highlight a Line in Ggplot with Multiple Lines
Splitting String Between Capital and Lowercase Character in R