Aggregate by specific year in R
You can use the subset
argument in aggregate
aggregate(cbind(temp,pressure,wind) ~ day + month + year, df,
subset=year %in% 2005:2014, mean)
Grouping by year in R
Here is one way using base R :
#Get day of the year
dat$day_in_year <- as.integer(format(dat$date, "%j"))
#Get year from date
dat$year <- as.integer(format(dat$date, "%Y"))
#Index where day in year is less than days
inds <- dat$day_in_year < dat$days
#Create a new dataframe with adjusted values
other_df <- data.frame(days = dat$days[inds] - dat$day_in_year[inds] + 1,
year = dat$year[inds] - 1)
#Update the original data
dat$days[inds] <- dat$day_in_year[inds] - 1
#Combine the two dataframe then aggregate
aggregate(days~year, rbind(dat[c('days', 'year')], other_df), sum)
# year days
#1 2018 334
#2 2019 90
How to aggregate data from years to decades and plot them?
First, strsplit
, make a proper year matrix, combine back with famines divided by number of years and reshape
to long format (lines 1:6). Next, aggregate
sum
s by decade and barplot
it.
r <- strsplit(data1$Year, '-|–|, ') |>
rapply(\(y) unlist(lapply(y, \(x) f(max(as.numeric(y)), x))), how='r') |>
{\(.) t(sapply(., \(x) `length<-`(x, max(lengths(.)))))}() |>
{\(.) cbind(`colnames<-`(., paste0('year.', seq_len(dim(.)[2]))),
n=dim(.)[2] - rowSums(is.na(.)))}() |>
{\(.) data.frame(., f=as.numeric(gsub('\\D', '',
data1$`Excess Mortality midpoint`))/
.[, 'n'])}()|>
reshape(1:3, direction='long') |>
stats:::aggregate.formula(formula=f ~ as.integer(substr(year, 1, 3)),
FUN=sum) |>
t()
## plot
op <- par(mar=c(5, 5, 4, 2)+.1) ## set/store old pars
b <- barplot(r, axes=FALSE, ylim=c(0, max(r[2, ])*1.05),
main='Famine victims', )
abline(h=asq, col='lightgrey', lty=3)
barplot(r, names.arg=paste0(r[1, ], '0s'), col='#20254c',
cex.names=.8, axes=FALSE, add=TRUE)
asq <- seq(0, max(axTicks(2)), 2e6)
axis(2, asq, labels=FALSE)
mtext(paste(asq/1e6, 'Million'), 2, 1, at=asq, las=2)
text(b, r[2, ] + 5e5, labels=formatC(r[2, ], format='d', big.mark=','), cex=.7)
box()
par(op) ## restore old pars
In line 2, I used this helper function f()
to fill up the pseudo-years:
f <- \(x1, x2, n1=nchar(x1)) {
u <- lapply(list(x1, x2), as.character)
s <- c(n1 - nchar(u[[2]]) + 1L, n1)
as.integer(`substr<-`(u[[1]], s[1], s[2], u[[2]]))
}
You can refine the aggregation method yourself to make the result exactly look like the original, but maybe this is better :)
R: aggregate date variable into years
First thing, you need to make proper date object from string:
(a <- as.Date("9/21/2000", "%m/%d/%Y"))
## [1] "2000-09-21"
Then you can extract year with:
format(a, "%Y")
## [1] "2000"
Which combines into one-liner, given you have vector with date:
format(as.Date(df$date, "%m/%d/%Y"), "%Y")
Aggregate Daily Data to Month/Year intervals
There is probably a more elegant solution, but splitting into months and years with strftime()
and then aggregate()
ing should do it. Then reassemble the date for plotting.
x <- as.POSIXct(c("2011-02-01", "2011-02-01", "2011-02-01"))
mo <- strftime(x, "%m")
yr <- strftime(x, "%Y")
amt <- runif(3)
dd <- data.frame(mo, yr, amt)
dd.agg <- aggregate(amt ~ mo + yr, dd, FUN = sum)
dd.agg$date <- as.POSIXct(paste(dd.agg$yr, dd.agg$mo, "01", sep = "-"))
Summary of data for each year in R
Base R
Here are two methods from base R.
The first uses cut
, split
and lapply
along with summary
.
creekFlowSummary <- lapply(split(creek, cut(creek$date, "1 year")),
function(x) summary(x[2]))
This creates a list
. You can view the summaries of different years by accessing the corresponding list index or name.
creekFlowSummary[1]
# $`1999-01-01`
# flow
# Min. :0.3187
# 1st Qu.:0.3965
# Median :0.4769
# Mean :0.6366
# 3rd Qu.:0.5885
# Max. :7.2560
#
creekFlowSummary["2000-01-01"]
# $`2000-01-01`
# flow
# Min. :0.1370
# 1st Qu.:0.1675
# Median :0.2081
# Mean :0.2819
# 3rd Qu.:0.2837
# Max. :2.3800
The second uses aggregate
:
aggregate(flow ~ cut(date, "1 year"), creek, summary)
# cut(date, "1 year") flow.Min. flow.1st Qu. flow.Median flow.Mean flow.3rd Qu. flow.Max.
# 1 1999-01-01 0.3187 0.3965 0.4770 0.6366 0.5885 7.2560
# 2 2000-01-01 0.1370 0.1675 0.2081 0.2819 0.2837 2.3800
# 3 2001-01-01 0.1769 0.2062 0.2226 0.2950 0.2574 2.9220
# 4 2002-01-01 0.1279 0.1781 0.2119 0.5346 0.4966 14.3900
# 5 2003-01-01 0.3492 0.4761 0.7173 1.0350 1.0840 10.1500
# 6 2004-01-01 0.4178 0.5379 0.6524 0.9691 0.9020 11.7100
# 7 2005-01-01 0.4722 0.6094 0.7279 1.2340 1.0900 17.7200
# 8 2006-01-01 0.2651 0.3275 0.4282 0.5459 0.5758 3.3510
# 9 2007-01-01 0.2784 0.3557 0.4041 0.6331 0.6125 9.6290
# 10 2008-01-01 0.4131 0.5430 0.6477 0.8792 0.9540 4.5960
# 11 2009-01-01 0.3877 0.4572 0.5945 0.8465 0.8309 6.3830
Be careful with the aggregate
solution though: All of the summary information is a single matrix. View str
on the output to see what I mean.
xts
There are, of course other ways to do this. One way is to use the xts
package.
First, convert your data to xts
:
library(xts)
creekx <- xts(creek$flow, order.by=creek$date)
Then, use apply.yearly
and whatever functions you are interested in.
Here is the yearly mean:
apply.yearly(creekx, mean)
# [,1]
# 1999-12-31 0.6365604
# 2000-12-31 0.2819057
# 2001-12-31 0.2950348
# 2002-12-31 0.5345666
# 2003-12-31 1.0351742
# 2004-12-31 0.9691180
# 2005-12-31 1.2338066
# 2006-12-31 0.5458652
# 2007-12-31 0.6331271
# 2008-12-31 0.8792396
# 2009-09-30 0.8465300
And the yearly maximum:
apply.yearly(creekx, max)
# [,1]
# 1999-12-31 7.256
# 2000-12-31 2.380
# 2001-12-31 2.922
# 2002-12-31 14.390
# 2003-12-31 10.150
# 2004-12-31 11.710
# 2005-12-31 17.720
# 2006-12-31 3.351
# 2007-12-31 9.629
# 2008-12-31 4.596
# 2009-09-30 6.383
Or, put them together like this: apply.yearly(creekx, function(x) cbind(mean(x), sum(x), max(x)))
data.table
The data.table
package may also be of interest for you, particularly if you are dealing with a lot of data. Here's a data.table
approach. The key is to use as.IDate
on your "date" column while you are reading your data in:
library(data.table)
DT <- data.table(date = as.IDate(creek$date), creek[-1])
DT[, list(mean = mean(flow),
tot = sum(flow),
max = max(flow)),
by = year(date)]
# year mean tot max
# 1: 1999 0.6365604 104.3959 7.256
# 2: 2000 0.2819057 103.1775 2.380
# 3: 2001 0.2950348 107.6877 2.922
# 4: 2002 0.5345666 195.1168 14.390
# 5: 2003 1.0351742 377.8386 10.150
# 6: 2004 0.9691180 354.6972 11.710
# 7: 2005 1.2338066 450.3394 17.720
# 8: 2006 0.5458652 199.2408 3.351
# 9: 2007 0.6331271 231.0914 9.629
# 10: 2008 0.8792396 321.8017 4.596
# 11: 2009 0.8465300 231.1027 6.383
How to sum a column based on specific year?
We can use substr
to get the 'year' part from 'Date' column, use that in subset
to extract the rows that have '2010' as year, select the 'Var1' column, and get the sum
sum(subset(df1, substr(Date,5,8)==2010, select=Var1))
Or a dplyr/lubridate
option would be using filter
and summarise
to get similar result.
library(lubridate)
library(dplyr)
df1 %>%
filter(Date=year(mdy(Date))==2010) %>%
summarise(Var1= sum(Var1))
Aggregate week and date in R by some specific rules
First we can convert the dates in df2
into year-month-date format, then join the two tables:
library(dplyr);library(lubridate)
df2$dt = ymd(df2$date)
df2$wk = day(df2$dt) %/% 7 + 1
df2$year_month_week = as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))
df1 %>%
left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
R group by date, and summarize the values
Use as.Date()
then aggregate()
.
energy$Date <- as.Date(energy$Datetime)
aggregate(energy$value, by=list(energy$Date), sum)
EDIT
Emma made a good point about column names. You can preserve column names in aggregate
by using the following instead:
aggregate(energy["value"], by=energy["Date"], sum)
Related Topics
Replacing the Duplicate Values Except 1 Row in R Dataframe
How to Learn How to Write C Code to Speed Up Slow R Functions
How to Combine Row and Column Layout in Flexdashboard
R Sequence of Dates with Lubridate
How to Check If a Sequence of Numbers Is Monotonically Increasing (Or Decreasing)
Plotly as Png in Knitr/Rmarkdown
How to Plot a Classification Graph of a Svm in R
How to Create a Time-Spiral Graph Using R
Stat_Contour with Data Labels on Lines
Dplyr Join Warning: Joining Factors with Different Levels
How to Produce Time Series for Each Row of a Data Frame with an Unnamed First Column
Adding Custom Image to Geom_Polygon Fill in Ggplot
How to Ignore Case When Using Str_Detect
Ggplot2 - Shade Area Above Line