How do I aggregate data in R in a way that returns the entire row that satisfies the aggregation condition? [no dplyr]
If you want to keep all the columns, use ave instead:
subset(df, as.logical(ave(INT_VAR, ID, FUN = function(x) x == max(x))))
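To see what the ave call is doing, here is a minimal sketch on a made-up data frame. The values and the OTHER column are hypothetical; only the ID and INT_VAR names come from the answer above.

```r
# Hypothetical toy data; ID and INT_VAR mirror the column names used above.
df <- data.frame(
  ID      = c("a", "a", "b", "b", "b"),
  INT_VAR = c(1, 5, 7, 3, 7),
  OTHER   = c("p", "q", "r", "s", "t")
)

# ave() returns a vector as long as INT_VAR. Within each ID group it holds 1
# where the value equals the group maximum and 0 elsewhere.
is_max <- as.logical(ave(df$INT_VAR, df$ID, FUN = function(x) x == max(x)))

# Keep whole rows; rows tied for the group maximum are all retained.
result <- df[is_max, ]
print(result)
```

Note that ties are kept, so group "b" contributes two rows here. If only one row per group is wanted, the result would need a further de-duplication step.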
Error for NA using group_by or aggregate function [aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) : no rows to aggregate]
Here is a way to create the desired data.frame. I think your solution has one error in row 2 (Sheep): with na.rm = TRUE, the mean of NA and 10 is 10, not 5.
library(dplyr)
Using aggregate
Data %>%
  aggregate(. ~ Year + Farms, ., FUN = mean, na.rm = TRUE, na.action = NULL) %>%
  arrange(Farms, desc(Year)) %>%
  as.data.frame() %>%
  mutate_at(names(.), ~replace(., is.nan(.), NA))
Using summarize
Data %>%
  group_by(Year, Farms) %>%
  summarize(MeanCow = mean(Cow, na.rm = TRUE),
            MeanDuck = mean(Duck, na.rm = TRUE),
            MeanChicken = mean(Chicken, na.rm = TRUE),
            MeanSheep = mean(Sheep, na.rm = TRUE),
            MeanHorse = mean(Horse, na.rm = TRUE)) %>%
  arrange(Farms, desc(Year)) %>%
  as.data.frame() %>%
  mutate_at(names(.), ~replace(., is.nan(.), NA))
Result for both
Year Farms Cow Duck Chicken Sheep Horse
1 2020 Farm 1 22.0 12.0 110 25.0 22.5
2 2019 Farm 1 14.0 6.0 65 10.0 13.5
3 2018 Farm 1 8.0 NA 10 14.5 12.0
4 2020 Farm 2 31.0 20.5 29 15.0 14.0
5 2019 Farm 2 11.5 40.5 43 18.5 42.5
6 2018 Farm 2 36.5 26.5 28 30.0 11.0
7 2020 Farm 3 38.5 9.0 37 30.0 42.0
8 2019 Farm 3 NA 10.5 NA 20.0 11.5
9 2018 Farm 3 NA 7.0 24 38.0 42.0
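The NA cells in this result come from groups where every value is missing. A small base R sketch of that behaviour, using made-up vectors rather than the original Data, may help:

```r
# A group with one missing value: the NA is dropped before averaging.
m1 <- mean(c(NA, 10), na.rm = TRUE)

# A group where every value is missing: nothing is left to average,
# so the result is NaN rather than NA.
m2 <- mean(c(NA_real_, NA_real_), na.rm = TRUE)

# Same replacement idea as the mutate_at() step, applied to a plain vector.
means <- c(m1, m2)
means <- replace(means, is.nan(means), NA)
print(means)
```

This is why both pipelines end with a replace step: without it, the all-missing groups would show NaN instead of NA.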
How to aggregate data based on dates in R?
Here is one option using dplyr. Take the difference between the current Start_Date and the previous End_Date within each Name. If the difference is greater than 1 day, a new group starts; otherwise the row is merged into the current date range.
library(dplyr)
df %>%
  mutate(across(-Name, lubridate::dmy)) %>%
  group_by(Name) %>%
  group_by(grp = cumsum(Start_Date - lag(End_Date, default = first(Start_Date)) > 1), .add = TRUE) %>%
  summarise(DOB = first(DOB),
            Start_Date = min(Start_Date),
            End_Date = max(End_Date), .groups = 'drop') %>%
  select(-grp)
# Name DOB Start_Date End_Date
# <chr> <date> <date> <date>
#1 JaneDoe 1985-01-01 2018-06-20 2018-07-02
#2 JaneDoe 1985-01-01 2018-07-30 2018-07-31
#3 JohnDoe 2000-01-01 2015-05-22 2015-06-20
#4 JohnDoe 2000-01-01 2015-07-07 2015-07-08
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(Name = c("JohnDoe", "JohnDoe", "JohnDoe", "JohnDoe",
"JaneDoe", "JaneDoe", "JaneDoe", "JaneDoe"), DOB = c("1/01/2000",
"1/01/2000", "1/01/2000", "1/01/2000", "1/01/1985", "1/01/1985",
"1/01/1985", "1/01/1985"), Start_Date = c("22/05/2015", "1/06/2015",
"16/06/2015", "7/07/2015", "20/06/2018", "22/06/2018", "1/07/2018",
"30/07/2018"), End_Date = c("31/05/2015", "15/06/2015", "20/06/2015",
"8/07/2015", "21/06/2018", "30/06/2018", "2/07/2018", "31/07/2018"
)), class = "data.frame", row.names = c(NA, -8L))
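The grouping step can also be traced by hand. The following base R sketch uses only the JohnDoe rows from the data above and assumes the rows are already sorted by start date; the variable names are illustrative, not part of the original answer.

```r
# Dates for one person, taken from the JohnDoe rows in the data above.
start <- as.Date(c("2015-05-22", "2015-06-01", "2015-06-16", "2015-07-07"))
end   <- as.Date(c("2015-05-31", "2015-06-15", "2015-06-20", "2015-07-08"))

# Previous End_Date; the first row is compared with its own start (gap 0),
# which plays the role of lag(..., default = first(Start_Date)).
prev_end <- c(start[1], head(end, -1))

# Gap in days between each start and the preceding end.
gap <- as.numeric(start - prev_end)

# A gap above 1 day opens a new group; cumsum() numbers the groups.
grp <- cumsum(gap > 1)

# Assuming sorted, non-nested ranges: the first start and the last end
# of each group give the merged range.
merged <- data.frame(
  Start_Date = start[!duplicated(grp)],
  End_Date   = end[!duplicated(grp, fromLast = TRUE)]
)
print(merged)
```

The first three ranges follow each other with a gap of at most one day, so they collapse into one range; the fourth starts 17 days later and stays separate.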
De-aggregate a data frame
Here's a tidyverse solution.
As you say, it's easy to repeat a row an arbitrary number of times. If you know that row_number() counts rows within groups when a data frame is grouped, then it's easy to convert grouped counts to presence/absence flags. across gives you a way to succinctly convert multiple count columns.
library(tidyverse)
tibble(group=c("A", "B"), total_N=c(4,5), measure_A=c(1,4), measure_B=c(2,3)) %>%
  uncount(total_N) %>%
  group_by(group) %>%
  mutate(
    across(
      starts_with("measure"),
      function(x) as.numeric(row_number() <= x)
    )
  ) %>%
  ungroup()
# A tibble: 9 × 3
group measure_A measure_B
<chr> <dbl> <dbl>
1 A 1 1
2 A 0 1
3 A 0 0
4 A 0 0
5 B 1 1
6 B 1 1
7 B 1 1
8 B 1 0
9 B 0 0
As you say, this approach takes no account of correlations between the outcome columns, since these cannot be deduced from the grouped data.
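For comparison, the same de-aggregation can be sketched in base R. This is an alternative under the same assumptions, not the tidyverse code above; the names agg, long and pos are made up for the illustration.

```r
# Same grouped counts as in the tibble above, as a base data frame.
agg <- data.frame(group = c("A", "B"), total_N = c(4, 5),
                  measure_A = c(1, 4), measure_B = c(2, 3))

# Repeat each row total_N times (the job uncount() does).
long <- agg[rep(seq_len(nrow(agg)), agg$total_N), ]

# Position of each row within its group (what row_number() gives when grouped).
pos <- ave(seq_len(nrow(long)), long$group, FUN = seq_along)

# A row is flagged 1 while its position is within the group's count.
long$measure_A <- as.numeric(pos <= long$measure_A)
long$measure_B <- as.numeric(pos <= long$measure_B)

# Drop the helper count column and reset the row names.
long$total_N <- NULL
rownames(long) <- NULL
print(long)
```

Summing the flag columns per group recovers the original counts, which is a quick way to check the expansion.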
How to aggregate data one after another in R?
We can use rollmean from zoo
library(zoo)
rollmean(h_1, 2)
#[1] 3.0 5.0 6.5 8.0 10.0
rollmean(h_1, 3)
#[1] 4.000000 5.666667 7.333333 9.000000
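If installing zoo is not an option, a rolling mean can be sketched in base R with embed. Note that h_1 is not defined in the answer above; the vector below is an assumption, chosen because it reproduces both outputs shown, and roll_mean is a hypothetical helper name.

```r
# Assumed input: these values reproduce the outputs shown above, but h_1
# itself is not given in the original answer.
h_1 <- c(2, 4, 6, 7, 9, 11)

# embed(x, k) builds a matrix whose rows are the length-k sliding windows
# of x; the mean of each row is the rolling mean for that window.
roll_mean <- function(x, k) rowMeans(embed(x, k))

print(roll_mean(h_1, 2))
print(roll_mean(h_1, 3))
```

Like rollmean with its defaults, this returns length(x) - k + 1 values and does no padding at the ends.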