Aggregate Data in R

How do I aggregate data in R in a way that returns the entire row that satisfies the aggregation condition? [no dplyr]

If you want to keep all the columns, use ave instead:

subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
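
For example, with a hypothetical data frame df that has an ID column and an integer column INT_VAR (names assumed here for illustration, not taken from the original question), this keeps every full row where INT_VAR is the group maximum:

# hypothetical example data
df <- data.frame(ID = c(1, 1, 2, 2, 2),
                 INT_VAR = c(3, 7, 5, 9, 9),
                 other = letters[1:5])

# ave() returns, for each row, whether its INT_VAR equals the maximum within its ID group;
# subset() then keeps those rows with all of their columns
subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
#   ID INT_VAR other
# 2  1       7     b
# 4  2       9     d
# 5  2       9     e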

Error for NA using group_by or aggregate function [aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate]

Here is a way to create the wanted data.frame. I think your expected output has one error in row 2 (Sheep), where mean(c(NA, 10), na.rm = TRUE) is equal to 10 and not 5.
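
For reference, this is how na.rm = TRUE treats that cell:

mean(c(NA, 10))                # NA
mean(c(NA, 10), na.rm = TRUE)  # 10 -- the NA is dropped, not averaged as 0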

library(dplyr)

Using aggregate

Data %>% 
  aggregate(. ~ Year + Farms, ., FUN = mean, na.rm = TRUE, na.action = NULL) %>%
  arrange(Farms, desc(Year)) %>%
  as.data.frame() %>%
  mutate_at(names(.), ~ replace(., is.nan(.), NA))
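
Because . is supplied as the data argument, the pipe passes Data there rather than as the first argument, and na.action = NULL keeps rows containing NA from being dropped before aggregation, so na.rm = TRUE can handle them inside mean(). Without the pipe, the call is equivalent to:

aggregate(. ~ Year + Farms, data = Data, FUN = mean, na.rm = TRUE, na.action = NULL)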

Using summarize

Data %>% 
  group_by(Year, Farms) %>%
  summarize(MeanCow = mean(Cow, na.rm = TRUE),
            MeanDuck = mean(Duck, na.rm = TRUE),
            MeanChicken = mean(Chicken, na.rm = TRUE),
            MeanSheep = mean(Sheep, na.rm = TRUE),
            MeanHorse = mean(Horse, na.rm = TRUE)) %>%
  arrange(Farms, desc(Year)) %>%
  as.data.frame() %>%
  mutate_at(names(.), ~ replace(., is.nan(.), NA))
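
In both versions the final mutate_at() step is needed because mean() returns NaN, not NA, when every value in a group is missing:

mean(c(NA, NA), na.rm = TRUE)
# [1] NaN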

Result for both approaches

  Year  Farms  Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0     110  25.0  22.5
2 2019 Farm 1 14.0  6.0      65  10.0  13.5
3 2018 Farm 1  8.0   NA      10  14.5  12.0
4 2020 Farm 2 31.0 20.5      29  15.0  14.0
5 2019 Farm 2 11.5 40.5      43  18.5  42.5
6 2018 Farm 2 36.5 26.5      28  30.0  11.0
7 2020 Farm 3 38.5  9.0      37  30.0  42.0
8 2019 Farm 3   NA 10.5      NA  20.0  11.5
9 2018 Farm 3   NA  7.0      24  38.0  42.0

How to aggregate data based on dates in R?

Here is one option using dplyr.

Take the difference between the current Start_Date and the previous End_Date; whenever that difference is greater than 1 day a new group starts, and consecutive rows whose gap is at most 1 day are merged into a single date range.

library(dplyr)

df %>%
  mutate(across(-Name, lubridate::dmy)) %>%
  group_by(Name) %>%
  group_by(grp = cumsum(Start_Date - lag(End_Date, default = first(Start_Date)) > 1),
           .add = TRUE) %>%
  summarise(DOB = first(DOB),
            Start_Date = min(Start_Date),
            End_Date = max(End_Date), .groups = 'drop') %>%
  select(-grp)

#   Name    DOB        Start_Date End_Date
#   <chr>   <date>     <date>     <date>
# 1 JaneDoe 1985-01-01 2018-06-20 2018-07-02
# 2 JaneDoe 1985-01-01 2018-07-30 2018-07-31
# 3 JohnDoe 2000-01-01 2015-05-22 2015-06-20
# 4 JohnDoe 2000-01-01 2015-07-07 2015-07-08
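
To see how the grp column drives the merging, here is the intermediate computation for the JohnDoe rows (using the df defined under data below):

library(dplyr)
library(lubridate)

john <- df %>%
  mutate(across(-Name, dmy)) %>%
  filter(Name == "JohnDoe")

# gap between each row's Start_Date and the previous row's End_Date
gap <- john$Start_Date - lag(john$End_Date, default = first(john$Start_Date))
gap              # 0, 1, 1, 17 days
cumsum(gap > 1)  # 0 0 0 1 -> rows 1-3 merge into one period, row 4 starts a new one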

data

It is easier to help if you provide data in a reproducible format:

df <- structure(list(Name = c("JohnDoe", "JohnDoe", "JohnDoe", "JohnDoe", 
"JaneDoe", "JaneDoe", "JaneDoe", "JaneDoe"), DOB = c("1/01/2000",
"1/01/2000", "1/01/2000", "1/01/2000", "1/01/1985", "1/01/1985",
"1/01/1985", "1/01/1985"), Start_Date = c("22/05/2015", "1/06/2015",
"16/06/2015", "7/07/2015", "20/06/2018", "22/06/2018", "1/07/2018",
"30/07/2018"), End_Date = c("31/05/2015", "15/06/2015", "20/06/2015",
"8/07/2015", "21/06/2018", "30/06/2018", "2/07/2018", "31/07/2018"
)), class = "data.frame", row.names = c(NA, -8L))

De-aggregate a data frame

Here's a tidyverse solution.

As you say, it's easy to repeat a row an arbitrary number of times. If you know that row_number() counts rows within groups when a data frame is grouped, then it's easy to convert grouped counts into presence/absence flags, and across() lets you apply that conversion succinctly to several count columns at once.

library(tidyverse)

tibble(group = c("A", "B"), total_N = c(4, 5), measure_A = c(1, 4), measure_B = c(2, 3)) %>%
  uncount(total_N) %>%
  group_by(group) %>%
  mutate(
    across(
      starts_with("measure"),
      function(x) as.numeric(row_number() <= x)
    )
  ) %>%
  ungroup()
# A tibble: 9 × 3
  group measure_A measure_B
  <chr>     <dbl>     <dbl>
1 A             1         1
2 A             0         1
3 A             0         0
4 A             0         0
5 B             1         1
6 B             1         1
7 B             1         1
8 B             1         0
9 B             0         0

As you say, this approach takes no account of correlations between the outcome columns, as this cannot be deduced from the grouped data.
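
The key step in isolation: within a group, row_number() <= x turns a count x into that many leading 1s. A minimal sketch (with the tidyverse loaded as above), using a made-up count of 3 over 5 rows:

tibble(x = rep(3, 5)) %>%
  mutate(flag = as.numeric(row_number() <= x))
# flag is 1 for the first 3 rows and 0 for the remaining 2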

How to aggregate data one after another in R?

We can use rollmean from zoo.

library(zoo)
rollmean(h_1, 2)
#[1] 3.0 5.0 6.5 8.0 10.0
rollmean(h_1, 3)
#[1] 4.000000 5.666667 7.333333 9.000000
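
The question's input vector isn't shown here; the output above is reproduced by, for example, assuming

h_1 <- c(2, 4, 6, 7, 9, 11)  # assumed input, not from the original question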

