Identify a Value Change's Date and Summarize the Data with sum() and diff() in R

Identify a value change's date and summarize the data with sum() and diff() in R
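
The question's sampleData is not reproduced here; for a runnable sketch, here is a hypothetical version with the columns the answer relies on (product_id, date, price, qty_ordered):

sampleData <- data.frame(
  product_id = c(1000, 1000, 1000, 1002),
  date = c("1/1/17", "1/22/17", "3/24/17", "2/1/17"),  # month/day/year strings
  price = c(2.49, 2.49, 1.743, 2.093),
  qty_ordered = c(2, 2, 1, 3)
)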

Using data.table:

library(data.table)
setDT(sampleData)

Some Preprocessing:

sampleData[, firstdate := as.Date(date, "%m/%d/%y")]

Given how you calculate the date difference, it is better to create a range of dates for each row:

sampleData[, lastdate := shift(firstdate,type = "lead"), by = product_id]
sampleData[is.na(lastdate), lastdate := firstdate]
# Arun's one step: sampleData[, lastdate := shift(firstdate, type="lead", fill=firstdate[.N]), by = product_id]

Then create a new ID for every change in price:

sampleData[, price_id := cumsum(c(0,diff(price) != 0)), by = product_id]
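
To see why this works: diff(price) != 0 flags every row whose price differs from the previous row, and cumsum() turns those flags into a run counter. A toy illustration with a made-up price vector:

price <- c(2.49, 2.49, 1.743, 2.49, 2.49)
cumsum(c(0, diff(price) != 0))
# [1] 0 0 1 2 2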

Then calculate your groupwise functions, by product and price run:

sampleData[,
  .(
    price = unique(price),
    sum_qty = sum(qty_ordered),
    date_diff = max(lastdate) - min(firstdate)
  ),
  by = .(product_id, price_id)
]

   product_id price_id price sum_qty date_diff
1:       1000        0 2.490       4   21 days
2:       1000        1 1.743       1   61 days
3:       1000        2 2.490       2   33 days
4:       1002        0 2.093       3   28 days
5:       1002        1 2.110       4   31 days
6:       1002        2 2.970       1    0 days

I think the last price run for 1000 spans only 33 days, and the preceding one 61 (not 60). If you want to include the first day, those become 22, 62, and 34 days, and the line should read date_diff = max(lastdate) - min(firstdate) + 1.
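
With that inclusive count, the full call would become:

sampleData[,
  .(
    price = unique(price),
    sum_qty = sum(qty_ordered),
    date_diff = max(lastdate) - min(firstdate) + 1
  ),
  by = .(product_id, price_id)
]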

Finding how many times a variable changes over multiple days in R

For starters, I assume your data is in a variable df.
You have to think carefully about what you want to count.
If you want the number of price changes regardless of the day, you can do this:

library(dplyr)

df %>%
  group_by(item) %>%
  summarise(Price_changed_over_all_days = sum((lead(price) - price) != 0, na.rm = TRUE))

# A tibble: 3 x 2
# item Price_changed_over_all_days
# <chr> <int>
#1 x 24
#2 y 30
#3 z 24
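
For reference, a hypothetical df with the columns this code assumes (item, bought_date, price) could be built like this; it is only a sketch, not the questioner's data:

df <- data.frame(
  item = rep(c("x", "y", "z"), each = 4),
  bought_date = rep(as.POSIXct(c("2020-09-21", "2020-09-21",
                                 "2020-09-22", "2020-09-22")), 3),
  price = c(1.0, 1.2, 1.2, 1.5, 2.0, 2.1, 2.3, 2.3, 3.0, 3.0, 3.1, 3.2)
)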

However, if you want to count the number of price changes within particular days, you will get something like this:

df %>%
  group_by(item, bought_date) %>%
  summarise(Price_changed_in_one_day = sum((lead(price) - price) != 0, na.rm = TRUE))
# A tibble: 6 x 3
# Groups: item [3]
# item bought_date Price_changed_in_one_day
# <chr> <dttm> <int>
#1 x 2020-09-21 00:00:00 3
#2 x 2020-09-22 00:00:00 20
#3 y 2020-09-21 00:00:00 4
#4 y 2020-09-22 00:00:00 26
#5 z 2020-09-21 00:00:00 5
#6 z 2020-09-22 00:00:00 18

In this case you simply get many more rows in the summary table.
If you want a single table, you have to combine the two summaries and pick a statistic for the per-day counts. Perhaps the mean is appropriate here; that depends on your goal.

df %>%
  group_by(item) %>%
  summarise(Price_changed_over_all_days = sum((lead(price) - price) != 0, na.rm = TRUE)) %>%
  left_join(
    df %>%
      group_by(item, bought_date) %>%
      summarise(Price_changed_in_one_day = sum((lead(price) - price) != 0, na.rm = TRUE)) %>%
      group_by(item) %>%
      summarise(Price_changed_in_one_day = mean(Price_changed_in_one_day)),
    by = "item"
  )
# A tibble: 3 x 3
# item Price_changed_over_all_days Price_changed_in_one_day
# <chr> <int> <dbl>
#1 x 24 11.5
#2 y 30 15
#3 z 24 11.5

Also note that a price change can occur at the turn of the day, so the sum of within-day changes for a product does not have to equal the total number of price changes for that product. In your data, this happens for item "x".
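
A minimal illustration of that edge case, using made-up values where the only price change crosses the day boundary:

demo <- data.frame(
  item = "x",
  bought_date = as.Date(c("2020-09-21", "2020-09-21", "2020-09-22", "2020-09-22")),
  price = c(1, 1, 2, 2)  # the only change happens between day 1 and day 2
)

# Overall: one change
demo %>%
  group_by(item) %>%
  summarise(changes = sum((lead(price) - price) != 0, na.rm = TRUE))

# Per day: zero changes on either day, so the per-day counts miss the change
demo %>%
  group_by(item, bought_date) %>%
  summarise(changes = sum((lead(price) - price) != 0, na.rm = TRUE))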

Determine when columns of a data.frame change value and return indices of the change

In data.table version 1.8.10 (the stable version on CRAN), there's an (unexported) function called duplist that does exactly this. It's also written in C and is therefore terribly fast.

require(data.table) # 1.8.10
data.table:::duplist(x[, 3:5])
# [1] 1 4 5

If you're using the development version of data.table (1.8.11), then there's a more memory-efficient version, renamed uniqlist, that does exactly the same job. It should probably be exported for the next release; this seems to have come up on SO more than once. Let's see.

require(data.table) # 1.8.11
data.table:::uniqlist(x[, 3:5])
# [1] 1 4 5
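
In current data.table releases these helpers are still internal, but the exported rleidv() carries the same information: it assigns a run id to consecutive identical rows, so the first index of each run reproduces the result. A sketch, assuming the same x:

library(data.table)
which(!duplicated(rleidv(x[, 3:5])))
# should likewise return 1 4 5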

How to tell if a value changed over dimensions in R

You can use group_by and summarise in dplyr to get the date range and count of changes as columns in a new table:

library(dplyr)
df %>%
  group_by(Customer) %>%
  summarise(dates = sprintf("%s to %s", min(Date), max(Date)),
            change.count = length(unique(Address)) - 1)

Result:

# A tibble: 2 × 3
  Customer dates                change.count
  <chr>    <chr>                       <dbl>
1 Cust1    12/31/14 to 12/31/16            1
2 Cust2    12/31/14 to 12/31/16            1
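
For a runnable sketch, a hypothetical df consistent with that result might be:

df <- data.frame(
  Customer = c("Cust1", "Cust1", "Cust1", "Cust2", "Cust2", "Cust2"),
  Date = c("12/31/14", "12/31/15", "12/31/16", "12/31/14", "12/31/15", "12/31/16"),
  Address = c("A", "A", "B", "C", "D", "D")  # each customer moves once
)

Note that Date is a character column here, so min() and max() compare lexically; that happens to work for these values.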

loop to identify a sum and then get the position of that sum

Assuming EXAMPLE is a data frame whose first column is the date and whose remaining columns are numeric, you can take the cumulative sum of each column and bind the results back on:

out <- lapply(EXAMPLE[, -1], cumsum)  # cumulative sum of every column except the date
names(out) <- paste0(names(out), "_cumulative")
options(width = 123, length = 99999)  # widen the console so the print fits on one line
cbind(EXAMPLE, out)
# date new_york berlin tokyo new_york_cumulative berlin_cumulative tokyo_cumulative
# 1 2010 10 0 2 10 0 2
# 2 2011 20 51 15 30 51 17
# 3 2012 22 45 20 52 96 37
# 4 2013 28 12 13 80 108 50

R - Summarize values between specific date range

A couple of alternatives to consider. I assume your dates are actual Date values and not character strings.

You can try using fuzzyjoin to merge the two data.frames, including rows where the dates fall between start_dates and end_dates.

library(tidyverse)
library(fuzzyjoin)

fuzzy_left_join(
  date_df,
  df,
  by = c("start_dates" = "dates", "end_dates" = "dates"),
  match_fun = list(`<=`, `>=`)
) %>%
  group_by(start_dates, end_dates) %>%
  summarise(new_goal_column = sum(x))

Output

  start_dates end_dates  new_goal_column
  <date>      <date>               <dbl>
1 2021-01-01  2021-01-06              19
2 2021-01-07  2021-01-10               6
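
For a runnable sketch, here is one hypothetical pair of data.frames consistent with the output above (the question's actual data may differ):

df <- data.frame(
  dates = seq(as.Date("2021-01-01"), as.Date("2021-01-10"), by = "day"),
  x = c(4, 3, 2, 5, 1, 4, 1, 2, 1, 2)  # first six sum to 19, last four to 6
)

date_df <- data.frame(
  start_dates = as.Date(c("2021-01-01", "2021-01-07")),
  end_dates = as.Date(c("2021-01-06", "2021-01-10"))
)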

You can also try data.table with a range (non-equi) join.

library(data.table)

setDT(date_df)
setDT(df)

df[date_df, .(start_dates, end_dates, x), on = .(dates >= start_dates, dates <= end_dates)][
, .(new_goal_column = sum(x)), by = .(start_dates, end_dates)
]

Output

   start_dates  end_dates new_goal_column
1:  2021-01-01 2021-01-06              19
2:  2021-01-07 2021-01-10               6

Summarizing a dataframe by date and group

It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value for a given expenditure type in a given household/year/month combination, this will sum those values; if there is only one value, it returns that value. The "sum" argument is not strictly required, it is only there to handle such exceptions; if your data is clean you shouldn't need it.

hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)

# Load lubridate library, add month and year columns
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)

# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)

> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8

Aggregate Data by date

Here is a solution using the reshape2 package (tidyr or reshape could also have been used) to reshape your data frame, and the dplyr library to summarize your results:

df <- data.frame(VAL1, D01012016, D02012016, D03022016,D05022016,D03032016,D01042016,D02042016,D03042016,D05042016,D23062016,D05072016,D03082016,D01092016,D12092016)

library(reshape2)
ndf <- melt(df)
ndf$date <- as.Date(ndf$variable, format = "D%d%m%Y")

library(dplyr)
summarize(group_by(ndf, VAL1, cut(ndf$date, breaks = "1 month")), sum(value))

It is difficult to work with your column-name date format, so it is easier to convert from the wide format to a long one. VAL1 is carried through by the melt command. If you are interested in quarterly results, just change the breaks from "1 month" to "3 months".
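
For example, the quarterly version would just be (cut.Date also accepts "quarter" as a break specification):

summarize(group_by(ndf, VAL1, cut(ndf$date, breaks = "3 months")), sum(value))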

R group by date, and summarize the values

Use as.Date() then aggregate().

energy$Date <- as.Date(energy$Datetime)
aggregate(energy$value, by=list(energy$Date), sum)

EDIT

Emma made a good point about column names. You can preserve column names in aggregate by using the following instead:

aggregate(energy["value"], by=energy["Date"], sum)
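
Alternatively, the formula interface preserves both column names as well:

aggregate(value ~ Date, data = energy, sum)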

