Identify the date a value changes and summarize the data with sum() and diff() in R
Using data.table:
library(data.table)
setDT(sampleData)
Some Preprocessing:
sampleData[, firstdate := as.Date(date, "%m/%d/%y")]
Based on how you calculate the date diff, we are better off creating a date range for each row:
sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
sampleData[is.na(lastdate), lastdate := firstdate]
# Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]
Then create a new ID for every change in price:
sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]
Then calculate your groupwise functions, by product and price run:
sampleData[,
.(
price = unique(price),
sum_qty = sum(qty_ordered),
date_diff = max(lastdate) - min(firstdate)
),
by = .(
product_id,
price_id
)
]
product_id price_id price sum_qty date_diff
1: 1000 0 2.490 4 21 days
2: 1000 1 1.743 1 61 days
3: 1000 2 2.490 2 33 days
4: 1002 0 2.093 3 28 days
5: 1002 1 2.110 4 31 days
6: 1002 2 2.970 1 0 days
I think the last price run for 1000 is only 33 days, and the preceding one is 61 (not 60). If you count the first day as well, they become 22, 62 and 34, and the line should read date_diff = max(lastdate) - min(firstdate) + 1
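As a side note, the cumsum(c(0, diff(price) != 0)) trick above assigns a run ID that increments at every price change; data.table::rleid() gives the same grouping in one call. A minimal sketch on a bare vector (hypothetical prices):

```r
# hypothetical price vector with two changes
price <- c(2.490, 2.490, 1.743, 2.490)

# run ID that starts at 0 and increments whenever the price changes
price_id <- cumsum(c(0, diff(price) != 0))
price_id
# [1] 0 0 1 2

# data.table::rleid(price) - 1 is an equivalent one-liner
```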
Finding how many times variable changes over multiple days in R
For starters, I assume your data is in the variable df.
You have to think carefully about what you want to count.
If you want the number of price changes regardless of the days, you can do this:
df %>% group_by(item) %>%
summarise(Price_changed_over_all_days =
sum((lead(price) - price)!=0, na.rm = TRUE))
# A tibble: 3 x 2
# item Price_changed_over_all_days
# <chr> <int>
#1 x 24
#2 y 30
#3 z 24
However, if you want to count the number of price changes in particular days, you will get something like this:
df %>% group_by(item, bought_date) %>%
summarise(Price_changed_in_one_day =
sum((lead(price) - price)!=0, na.rm = TRUE))
# A tibble: 6 x 3
# Groups: item [3]
# item bought_date Price_changed_in_one_day
# <chr> <dttm> <int>
#1 x 2020-09-21 00:00:00 3
#2 x 2020-09-22 00:00:00 20
#3 y 2020-09-21 00:00:00 4
#4 y 2020-09-22 00:00:00 26
#5 z 2020-09-21 00:00:00 5
#6 z 2020-09-22 00:00:00 18
It's just that in this case you get many more rows in the summary table.
If you want a single table, you have to combine the two somehow and decide on some statistic over the daily values. Maybe the mean is appropriate here? I do not know your use case.
df %>% group_by(item) %>%
summarise(Price_changed_over_all_days =
sum((lead(price) - price)!=0, na.rm = TRUE)) %>%
left_join(
df %>% group_by(item, bought_date) %>%
summarise(Price_changed_in_one_day =
sum((lead(price) - price)!=0, na.rm = TRUE)) %>%
group_by(item) %>%
summarise(Price_changed_in_one_day =
mean(Price_changed_in_one_day)
), by= "item")
# A tibble: 3 x 3
# item Price_changed_over_all_days Price_changed_in_one_day
# <chr> <int> <dbl>
#1 x 24 11.5
#2 y 30 15
#3 z 24 11.5
Also note that price changes can occur at the turn of a day, so the sum of within-day changes for a product does not have to equal the total number of changes for that product. In your data, this is the case for product "x".
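The same overall per-item count can also be sketched in base R with diff(), which sidesteps the lead()/NA handling (hypothetical df with item and price columns):

```r
# hypothetical data: four prices per item
df <- data.frame(
  item  = rep(c("x", "y"), each = 4),
  price = c(1, 1, 2, 2,   3, 4, 4, 5)
)

# diff(p) != 0 marks each consecutive change; sum() counts them per item
tapply(df$price, df$item, function(p) sum(diff(p) != 0))
# x: 1, y: 2
```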
Determine when columns of a data.frame change value and return indices of the change
In data.table version 1.8.10 (the stable version on CRAN), there's an (unexported) function called duplist that does exactly this. It's also written in C, so it's terribly fast.
require(data.table) # 1.8.10
data.table:::duplist(x[, 3:5])
# [1] 1 4 5
If you're using the development version of data.table (1.8.11), there's a more memory-efficient version renamed uniqlist that does exactly the same job. Probably this should be exported for the next release; it seems to have come up on SO more than once. Let's see.
require(data.table) # 1.8.11
data.table:::uniqlist(x[, 3:5])
# [1] 1 4 5
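Since duplist/uniqlist are unexported internals, a base-R sketch of the same idea (return the first row index of each run of identical consecutive rows over the chosen columns) can be written with a vectorised row comparison; the data here is hypothetical:

```r
# first row index of each run of identical consecutive rows
run_starts <- function(d) {
  n <- nrow(d)
  if (n == 0L) return(integer(0))
  # TRUE where row i+1 differs from row i in any column
  changed <- rowSums(d[-1, , drop = FALSE] != d[-n, , drop = FALSE]) > 0
  c(1L, unname(which(changed)) + 1L)
}

x <- data.frame(a = c(1, 1, 1, 2, 3), b = c("u", "u", "u", "u", "v"))
run_starts(x)
# [1] 1 4 5
```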
How to tell if a value changed over dimensions in R
You can use group_by and summarise from dplyr to get the date range and count of changes as columns in a new table:
library(dplyr)
df %>%
group_by(Customer) %>%
summarise(dates = sprintf("%s to %s", min(Date), max(Date)),
change.count = length(unique(Address)) - 1)
Result:
# A tibble: 2 × 3
Customer dates change.count
<chr> <chr> <dbl>
1 Cust1 12/31/14 to 12/31/16 1
2 Cust2 12/31/14 to 12/31/16 1
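One caveat worth noting (not in the original answer): length(unique(Address)) - 1 counts distinct addresses, not transitions, so a customer who moves away and then back would be under-counted. Counting consecutive changes instead is a small tweak, sketched here on a hypothetical vector:

```r
addr <- c("A", "B", "A")               # moved away and back

length(unique(addr)) - 1               # distinct-based count
# [1] 1

sum(head(addr, -1) != tail(addr, -1))  # actual consecutive changes
# [1] 2
```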
loop to identify a sum and then get the position of that sum
out <- lapply(EXAMPLE[,-1], cumsum)
names(out) <- paste0(names(out), "_cumulative")
options(width=123, length=99999)
cbind(EXAMPLE, out)
# date new_york berlin tokyo new_york_cumulative berlin_cumulative tokyo_cumulative
# 1 2010 10 0 2 10 0 2
# 2 2011 20 51 15 30 51 17
# 3 2012 22 45 20 52 96 37
# 4 2013 28 12 13 80 108 50
R - Summarize values between specific date range
A couple of alternatives to consider. I assume your dates are actual Date values and not character values.
You can try using fuzzyjoin to merge the two data.frames, keeping rows where dates falls between start_dates and end_dates.
library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(
date_df,
df,
by = c("start_dates" = "dates", "end_dates" = "dates"),
match_fun = list(`<=`, `>=`)
) %>%
group_by(start_dates, end_dates) %>%
summarise(new_goal_column = sum(x))
Output
start_dates end_dates new_goal_column
<date> <date> <dbl>
1 2021-01-01 2021-01-06 19
2 2021-01-07 2021-01-10 6
You can also try using data.table with a non-equi join.
library(data.table)
setDT(date_df)
setDT(df)
df[date_df, .(start_dates, end_dates, x), on = .(dates >= start_dates, dates <= end_dates)][
, .(new_goal_column = sum(x)), by = .(start_dates, end_dates)
]
Output
start_dates end_dates new_goal_column
1: 2021-01-01 2021-01-06 19
2: 2021-01-07 2021-01-10 6
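If you'd rather avoid extra packages, the same between-dates sum can be sketched in base R with mapply() over the interval endpoints (hypothetical date_df/df mirroring the question's columns):

```r
# hypothetical data: ten daily values and two date ranges
df <- data.frame(dates = as.Date("2021-01-01") + 0:9, x = 1:10)
date_df <- data.frame(
  start_dates = as.Date(c("2021-01-01", "2021-01-07")),
  end_dates   = as.Date(c("2021-01-06", "2021-01-10"))
)

# sum x over each [start, end] window
date_df$new_goal_column <- mapply(
  function(s, e) sum(df$x[df$dates >= s & df$dates <= e]),
  date_df$start_dates, date_df$end_dates
)
date_df$new_goal_column
# [1] 21 34
```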
Summarizing a dataframe by date and group
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If more than one value is returned for a given expenditure type in a given household/year/month combination, cast will sum those values; if there is only one value, it returns that value. The sum argument is not strictly required and is only there to handle such duplicates; if your data is clean you shouldn't need it.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add date and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
Aggregate Data by date
Here is a solution using the reshape2 package (tidyr or reshape could also have been used) to reshape your data frame, and the dplyr library to summarize your results:
df <- data.frame(VAL1, D01012016, D02012016, D03022016,D05022016,D03032016,D01042016,D02042016,D03042016,D05042016,D23062016,D05072016,D03082016,D01092016,D12092016)
library(reshape2)
ndf<-melt(df)
ndf$date<-as.Date(ndf$variable, format="D%d%m%Y")
library(dplyr)
summarize(group_by(ndf, VAL1, cut(ndf$date, breaks ="1 month")), sum(value))
It is difficult to work with your one-column-per-date format, so it is easier to convert from the wide format to a long format. VAL1 is carried through by the melt command. If you are interested in quarterly results, just change the breaks from 1 month to 3 months.
R group by date, and summarize the values
Use as.Date(), then aggregate().
energy$Date <- as.Date(energy$Datetime)
aggregate(energy$value, by=list(energy$Date), sum)
EDIT: Emma made a good point about column names. You can preserve column names in aggregate by using the following instead:
aggregate(energy["value"], by=energy["Date"], sum)