summary statistics when group is not NA
You could try
library(dplyr)
library(tidyr)
dt %>%
group_by(group = coalesce(group1, group2, group3)) %>%
summarize(
mean_weight = mean(weight, na.rm = TRUE),
mean_height = mean(height, na.rm = TRUE)
)
which returns
# A tibble: 3 x 3
group mean_weight mean_height
<dbl> <dbl> <dbl>
1 1 2.5 10
2 2 4 14.5
3 3 7 15.5
If there are multiple groups per row, you could use
dt %>%
pivot_longer(
starts_with("group"),
values_drop_na = TRUE,
values_to = "group"
) %>%
group_by(group) %>%
summarize(
mean_weight = mean(weight, na.rm = TRUE),
mean_height = mean(height, na.rm = TRUE)
)
which returns basically the same
# A tibble: 3 x 3
group mean_weight mean_height
<dbl> <dbl> <dbl>
1 1 2.5 10
2 2 4 14.5
3 3 7 15.5
How does this work?
- First we bring the data into a "long" format using
pivot_longer
. We take every column starting with "group" (starts_with("group")
). Those column names go into the new columnname
(default name, you can change it withnames_to = "YourNewColumnNameHere"
). The values are put into the new columngroup
usingvalues_to = "group"
. If you don't use this argument, the values are stored in the columnvalue
by default. Thevalues_drop_na = TRUE
takes care of every cell containing anNA
value. These are removed.
So after usingpivot_longer
the transformed data looks like
# A tibble: 6 x 4
weight height name group
<dbl> <dbl> <chr> <dbl>
1 3 10 group1 1
2 2 NA group1 1
3 3 14 group2 2
4 5 15 group2 2
5 NA 11 group3 3
6 7 20 group3 3
- Next we
group_by(group)
so the next transformations are applied to each independent group and don't affect each other. summarize
takes the grouping and calculates the new columns for each group. All other columns are dropped. Thena.rm = TRUE
argument ofmean()
takes care ofNA
values: those are ignored. Without this argument,mean_weight
for group3
would beNA
.
Edit
Thanks to akruns comment this can be generalized to multiple columns without reshaping by using the big-bang-operator:
dt %>%
group_by(group = coalesce(!!! select(., starts_with('group')))) %>%
summarise(across(c(weight, height), mean, na.rm = TRUE))
akrun uses a more advanced approach:
- The core of this is the complex
group_by
statement. coalesce()
is a function used to get the first non-NA
-element in order of the arguments. For example:coalesce(NA_real_, 1, 2)
returns1
since this is the first non-NA
-value. Sincecoalesce()
is vectorized,coalesce(group1, group2, group3)
and your group columns contain only one value per row
dt %>%
group_by(group = coalesce(group1, group2, group3))
returns an already grouped tibble.
# A tibble: 6 x 6
# Groups: group [3]
group1 group2 group3 weight height group
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA NA 3 10 1
2 1 NA NA 2 NA 1
3 NA 2 NA 3 14 2
4 NA 2 NA 5 15 2
5 NA NA 3 NA 11 3
6 NA NA 3 7 20 3
- If there are many/multiple columns named
"group"
, we don't want to typecoalesce(group1, ... , group100)
. So we are using a function, that selects all those columns. Here!!! select(., starts_with('group'))
is the best way: Take every column of your data.frame starting with "group". Unfortunately,select
returns a data.frame (which is a special version of a list of vectors). We need to supplycoalesce
with multiple vectors as arguments. A list of vectors won't do the job:
dt %>%
group_by(group = coalesce(select(., starts_with('group'))))
returns
# Groups: group [3]
group1 group2 group3 weight height group$group1 $group2 $group3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA NA 3 10 1 NA NA
2 1 NA NA 2 NA 1 NA NA
3 NA 2 NA 3 14 NA 2 NA
4 NA 2 NA 5 15 NA 2 NA
5 NA NA 3 NA 11 NA NA 3
6 NA NA 3 7 20 NA NA 3
which isn't what we are looking for. The big-bang-operator !!!
separates this list of vectors into multiple single vectors that are given to coalesce
as arguments. So
dt %>%
group_by(group = coalesce(!!! select(., starts_with('group'))))
returns
# A tibble: 6 x 6
# Groups: group [3]
group1 group2 group3 weight height group
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA NA 3 10 1
2 1 NA NA 2 NA 1
3 NA 2 NA 3 14 2
4 NA 2 NA 5 15 2
5 NA NA 3 NA 11 3
6 NA NA 3 7 20 3
- So now we have a grouped data.frame/tibble and we can apply the
summarise
function:summarise(across(c(weight, height), mean, na.rm = TRUE))
.across()
tellssummarise
to apply the functionmean
with argumentna.rm = TRUE
to the columnsweight
andheight
. That's a more elegant version of
summarize(
mean_weight = mean(weight, na.rm = TRUE),
mean_height = mean(height, na.rm = TRUE)
)
without the renaming (which also could have been done with across
).
How to get summary statistics for multiple variables by multiple groups?
Here's an illustration of reshaping your data first. I've written a custom function to improve readability:
mysummary <- function(x,na.rm=F){
res <- list(mean=mean(x, na.rm=na.rm),
sd=sd(x,na.rm=na.rm),
N=length(x))
res$se <- res$sd/sqrt(res$N)
res
}
library(data.table)
res <- melt(setDT(my.dat),id.vars=c("group1","group2","group3"))[,mysummary(value),
by=.(group1,group2,group3,variable)]
> head(res)
group1 group2 group3 variable mean sd N se
1: A E I dv1 9.75 6.994045 4 3.497023
2: B E I dv1 9.50 7.778175 2 5.500000
3: B F I dv1 16.00 4.082483 4 2.041241
4: C G I dv1 14.50 10.606602 2 7.500000
5: C G J dv1 10.75 10.372239 4 5.186119
6: D G J dv1 13.00 4.242641 2 3.000000
Or without the custom function, thanks to @Jaap
melt(setDT(my.dat),
id=c("group1","group2","group3"))[, .(mean = mean(value),
sd = sd(value),
n = .N,
se = sd(value)/sqrt(.N)),
.(group1, group2, group3, variable)]
Descriptive Statistics By Group - R
Using the sample data frame
dd<-read.csv(text="group_id,company,revenue
1,Company A,200
1,Company B,150
1,Company C,300
2,Company D,600
2,Company E,800
2,Company F,1000
3,Company G,50
3,Company H,80
3,Company H,60", header=T)
You could do something fancy like use ave()
to create all the values per row for your different functions and then just combine that with the original data.frame.
ext <- with(dd, Map(function(x) ave(revenue, group_id, FUN=x),
list(avg=mean, min=min, max=max, SD=sd)))
cbind(dd, ext)
# group_id company revenue avg min max SD
# 1 1 Company A 200 216.66667 150 300 76.37626
# 2 1 Company B 150 216.66667 150 300 76.37626
# 3 1 Company C 300 216.66667 150 300 76.37626
# 4 2 Company D 600 800.00000 600 1000 200.00000
# 5 2 Company E 800 800.00000 600 1000 200.00000
# 6 2 Company F 1000 800.00000 600 1000 200.00000
# 7 3 Company G 50 63.33333 50 80 15.27525
# 8 3 Company H 80 63.33333 50 80 15.27525
# 9 3 Company H 60 63.33333 50 80 15.27525
but really a simple dplyr command would be easier.
dd %>% group_by(group_id) %>%
mutate(
avg=mean(revenue),
min=min(revenue),
max=max(revenue),
SD=sd(revenue))
Dplyr: How to Rearrange and Split a Dataframe by A Categorical Group Within a Pivot Table Showing Summary Statistics in R
Does this give you what you want? (If you scroll to the bottom :)). It combines the answer from the previous issue with group_by
, excluding Country
and n
from the pivot_longer
and also renaming Center_Freq
so it names correctly when pivoting.
library(tidyverse)
Dummy <- structure(list(ID = 1:100, Low.Freq = c(435L, 94103292L, 1L,
2688L, 8471L, 28818L, 654755585L, 468628164L, 342491L, 2288474L,
3915L, 411L, 267864894L, 3312618L, 5383L, 8989443L, 1894L, 534981L,
9544861L, 3437614L, 475386L, 7550764L, 48744L, 2317845L, 5126197L,
2445L, 8L, 557450L, 450259742L, 21006647L, 9L, 7234027L, 59L,
9L, 605L, 9199L, 3022L, 30218156L, 46423L, 38L, 88L, 396396244L,
28934316L, 7723L, 95688045L, 679354L, 716352L, 76289L, 332826763L,
6L, 90975L, 83103577L, 9529L, 229093L, 42810L, 5L, 18175302L,
1443751L, 5831L, 8303661L, 86L, 778L, 23947L, 8L, 9829740L, 2075838L,
7434328L, 82174987L, 2L, 94037071L, 9638653L, 5L, 3L, 65972L,
0L, 936779338L, 4885076L, 745L, 8L, 56456L, 125140L, 73043989L,
516476L, 7L, 4440739L, 612L, 3966L, 8L, 9255L, 84127L, 96218L,
5690L, 56L, 3561L, 78738L, 1803363L, 809369L, 7131L, 0L, 35502443L
), High.Freq = c(6071L, 3210L, 6L, 7306092L, 6919054L, 666399L,
78L, 523880161L, 4700783L, 4173830L, 30L, 811L, 341014L, 780L,
44749L, 91L, 201620707L, 74L, 1L, 65422L, 595L, 89093186L, 946520L,
6940919L, 655350L, 4L, 6L, 618L, 2006697L, 889L, 1398L, 28769L,
90519642L, 984L, 0L, 296209525L, 487088392L, 5L, 894L, 529L,
5L, 99106L, 2L, 926017L, 9078L, 1L, 21L, 88601017L, 575770L,
48L, 8431L, 194L, 62324996L, 5L, 81L, 40634727L, 806901520L,
6818173L, 3501L, 91780L, 36106039L, 5834347L, 58388837L, 34L,
3280L, 6507606L, 19L, 402L, 584L, 76L, 4078684L, 199L, 6881L,
92251L, 81715L, 40L, 327L, 57764L, 97668898L, 2676483L, 76L,
4694L, 817120L, 51L, 116712L, 666L, 3L, 42841L, 9724L, 21L, 4L,
359L, 2604L, 22L, 30490L, 5640L, 34L, 51923625L, 35544L, 59644L
), Peak.Freq = c(87005561L, 9102L, 994839015L, 42745869L, 32840L,
62737133L, 2722L, 24L, 67404881L, 999242982L, 3048L, 85315406L,
703037627L, 331264L, 8403609L, 3934064L, 50578953L, 370110665L,
3414L, 12657L, 40L, 432L, 7707L, 214L, 68588962L, 69467L, 75L,
500297L, 704L, 1L, 102659072L, 60896923L, 4481230L, 94124925L,
60164619L, 447L, 580L, 8L, 172L, 9478521L, 20L, 53L, 3072127L,
2160L, 27301893L, 8L, 4263L, 508L, 712409L, 50677L, 522433683L,
112844L, 193385L, 458269L, 93578705L, 22093131L, 6L, 9L, 1690461L,
0L, 4L, 652847L, 44767L, 21408L, 5384L, 304L, 721L, 651147L,
2426L, 586L, 498289375L, 945L, 6L, 816L, 46207L, 39135L, 6621028L,
66905L, 26905085L, 4098L, 0L, 14L, 88L, 530L, 97809006L, 90L,
6L, 260792844L, 9L, 833205723L, 99467321L, 5L, 8455640L, 54090L,
2L, 309L, 299161148L, 4952L, 454824L, 729805154L), Delta.Freq = c(5L,
78L, 88553L, 794L, 5L, 3859122L, 782L, 36L, 8756801L, 243169338L,
817789L, 8792384L, 7431L, 626921743L, 9206L, 95789L, 7916L, 8143453L,
6L, 4L, 6363L, 181125L, 259618L, 6751L, 33L, 37960L, 0L, 2L,
599582228L, 565585L, 19L, 48L, 269450424L, 70676581L, 7830566L,
4L, 86484313L, 21L, 90899794L, 2L, 72356L, 574280L, 869544L,
73418L, 6468164L, 2259L, 5938505L, 31329L, 1249L, 354L, 8817L,
3L, 2568L, 82809L, 29836269L, 5230L, 37L, 33752014L, 79307L,
1736L, 8522076L, 40L, 2289135L, 862L, 801448L, 8026L, 5L, 15L,
4393771L, 405914L, 71098L, 950288L, 8319L, 1396973L, 832L, 70L,
1746L, 61907L, 8709547L, 300750537L, 45862L, 91417085L, 79892L,
47765L, 5477L, 18L, 4186L, 2860L, 754038591L, 375L, 53809223L,
72L, 136L, 509L, 232325L, 13128104L, 1692L, 8581L, 23L, 7L),
Delta.Time = c(1361082L, 7926L, 499L, 5004L, 3494530L, 213L,
64551179L, 70L, 797L, 5L, 72588L, 86976L, 5163L, 635080L,
3L, 91L, 919806257L, 81443L, 3135427L, 4410972L, 5810L, 8L,
46603718L, 422L, 1083626L, 48L, 15699890L, 7L, 90167635L,
446459879L, 2332071L, 761660L, 49218442L, 381L, 46L, 493197L,
46L, 798597155L, 45342274L, 6265842L, 6L, 3445819L, 351L,
1761227L, 214L, 959L, 908996387L, 6L, 3855L, 9096604L, 152664L,
7970052L, 32366926L, 31L, 5201618L, 114L, 7806411L, 70L,
239L, 5065L, 2L, 1L, 14472831L, 122042249L, 8L, 495604L,
29L, 8965478L, 2875L, 959L, 39L, 9L, 690L, 933626665L, 85294L,
580093L, 95934L, 982058L, 65244056L, 137508L, 29L, 7621L,
7527L, 72L, 2L, 315L, 6L, 2413L, 8625150L, 51298109L, 851L,
890460L, 160736L, 6L, 850842734L, 2L, 7L, 76969113L, 190536L,
7855L), Peak.Time = c(1465265L, 452894L, 545076172L, 8226275L,
5040875L, 700530L, 1L, 3639L, 20141L, 71712131L, 686L, 923L,
770569738L, 69961L, 737458636L, 122403L, 199502046L, 6108L,
907L, 108078263L, 7817L, 4L, 6L, 69L, 721L, 786353L, 87486L,
1563L, 876L, 47599535L, 79295722L, 53L, 7378L, 591L, 6607935L,
954L, 6295L, 75514344L, 5742050L, 25647276L, 449L, 328566184L,
4L, 2L, 2703L, 21367543L, 63429043L, 708L, 782L, 909820L,
478L, 50L, 922L, 579882L, 7850L, 534L, 2157492L, 96L, 6L,
716L, 5L, 653290336L, 447854237L, 2L, 31972263L, 645L, 7L,
609909L, 4054695L, 455631L, 4919894L, 9L, 72713L, 9997L,
84090765L, 89742L, 5L, 5028L, 4126L, 23091L, 81L, 239635020L,
3576L, 898597785L, 6822L, 3798L, 201999L, 19624L, 20432923L,
18944093L, 930720236L, 1492302L, 300122L, 143633L, 5152743L,
417344L, 813L, 55792L, 78L, 14203776L), Center_Freq = c(61907L,
8709547L, 300750537L, 45862L, 91417085L, 79892L, 47765L,
5477L, 18L, 4186L, 2860L, 754038591L, 375L, 53809223L, 72L,
136L, 4700783L, 4173830L, 30L, 811L, 341014L, 780L, 44749L,
91L, 201620707L, 74L, 1L, 65422L, 595L, 89093186L, 946520L,
6940919L, 48744L, 2317845L, 5126197L, 2445L, 8L, 557450L,
450259742L, 21006647L, 9L, 7234027L, 59L, 9L, 651547554L,
45554L, 38493L, 91055218L, 38L, 1116474L, 2295482L, 3001L,
9L, 3270L, 141L, 53644L, 667983L, 565598L, 84L, 971L, 555498297L,
60431L, 6597L, 856943893L, 607815536L, 4406L, 79L, 4885076L,
745L, 8L, 56456L, 125140L, 73043989L, 516476L, 7L, 4440739L,
754038591L, 375L, 53809223L, 72L, 136L, 509L, 232325L, 13128104L,
1692L, 8581L, 23L, 5874213L, 4550L, 644668065L, 3712371L,
5928L, 8833L, 7L, 2186023L, 61627221L, 37297L, 716427989L,
21387L, 26639L), Start.Freq = c(426355L, 22073538L, 680374L,
41771L, 54L, 6762844L, 599171L, 108L, 257451851L, 438814L,
343045L, 4702L, 967787L, 1937L, 18L, 89301735L, 366L, 90L,
954L, 7337732L, 70891703L, 4139L, 10397931L, 940000382L,
7L, 38376L, 878528819L, 6287L, 738366L, 31L, 47L, 5L, 6L,
77848L, 2366508L, 45L, 3665842L, 7252260L, 6L, 61L, 3247L,
448348L, 1L, 705132L, 144L, 7423637L, 2L, 497L, 844927639L,
78978L, 914L, 131L, 7089563L, 927L, 9595581L, 2774463L, 1651L,
73509280L, 7L, 35L, 18L, 96L, 1L, 92545512L, 27354947L, 7556L,
65019L, 7480L, 71835L, 8249L, 64792L, 71537L, 349389666L,
280244484L, 82L, 6L, 40L, 353872L, 0L, 103L, 1255L, 4752L,
29L, 76L, 81185L, 14L, 9L, 470775630L, 818361265L, 57947209L,
44L, 24L, 41295L, 4L, 261449L, 9931404L, 773556640L, 930717L,
65007421L, 341175L), End.Freq = c(71000996L, 11613579L, 71377155L,
1942738L, 8760748L, 79L, 455L, 374L, 8L, 5L, 2266932L, 597833L,
155488L, 3020L, 4L, 554L, 4L, 16472L, 1945649L, 668181101L,
649780L, 22394365L, 93060602L, 172146L, 20472L, 23558847L,
190513L, 22759044L, 44L, 78450L, 205621181L, 218L, 69916344L,
23884L, 66L, 312148L, 7710564L, 4L, 422L, 744572L, 651547554L,
45554L, 38493L, 91055218L, 38L, 1116474L, 2295482L, 3001L,
9L, 3270L, 141L, 55595L, 38451L, 8660867L, 14L, 96L, 345L,
6L, 44L, 8235824L, 910517L, 1424326L, 87102566L, 53644L,
667983L, 565598L, 84L, 971L, 555498297L, 60431L, 6597L, 856943893L,
607815536L, 4406L, 79L, 7L, 28978746L, 7537295L, 6L, 633L,
345860066L, 802L, 1035131L, 602L, 2740L, 8065L, 61370968L,
429953765L, 981507L, 8105L, 343787257L, 44782L, 64184L, 12981359L,
123367978L, 818775L, 123745614L, 25345654L, 3L, 800889L),
Species = c("Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Grampus_griseus", "Grampus_griseus",
"Grampus_griseus", "Grampus_griseus", "Grampus_griseus",
"Grampus_griseus", "Grampus_griseus", "Grampus_griseus",
"Grampus_griseus", "Grampus_griseus", "Grampus_griseus",
"Grampus_griseus", "Grampus_griseus", "Grampus_griseus",
"Grampus_griseus", "Grampus_griseus", "Grampus_griseus",
"Grampus_griseus", "Grampus_griseus", "Grampus_griseus",
"Grampus_griseus", "Grampus_griseus", "Grampus_griseus",
"Grampus_griseus", "Grampus_griseus", "Grampus_griseus",
"Grampus_griseus", "Grampus_griseus", "Grampus_griseus",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Truncatus_Tursiops",
"Truncatus_Tursiops", "Truncatus_Tursiops", "Delphinus_Delphinus",
"Delphinus_Delphinus", "Delphinus_Delphinus", "Delphinus_Delphinus",
"Delphinus_Delphinus"), Country = c("Holland", "Holland",
"Holland", "Holland", "Holland", "Holland", "Holland", "Holland",
"Holland", "Holland", "Holland", "Holland", "Holland", "Holland",
"Holland", "Holland", "Holland", "Holland", "Holland", "Holland",
"Holland", "Holland", "Holland", "Holland", "France", "France",
"France", "France", "France", "France", "France", "France",
"France", "France", "France", "France", "France", "France",
"France", "France", "France", "France", "France", "France",
"France", "France", "France", "Holland", "Holland", "Holland",
"Holland", "Holland", "Holland", "Holland", "Holland", "Holland",
"Holland", "Holland", "Holland", "Holland", "Holland", "Holland",
"Holland", "Holland", "Holland", "Holland", "Holland", "Holland",
"Holland", "Holland", "Holland", "France", "France", "France",
"France", "France", "France", "France", "France", "France",
"France", "France", "France", "France&
Related Topics
Delete Rows That Exist in Another Data Frame
Selecting Multiple Odd or Even Columns/Rows for Dataframe
R Collapse Multiple Rows into 1 Row - Same Columns
Divide All Columns by the Value from the 2Nd Column - Apply for All Rows
R: Pulling Data from One Column to Create New Columns
If Else Statements to Check If a String Contains a Substring in R
Concatenating Two Text Columns in Dplyr
Adding Value from One Data.Frame to Another Data.Frame by Matching a Variable
How to Convert Only Some Positive Numbers to Negative Numbers (Conditional Recoding)
Dplyr Conditional Summarise Function
How to Add a Suffix (Or Prefix) Elements of an Existing List
Converting Data Frame into a List of Lists in R
Conditionally Replace Values of Subset of Rows With Column Name in R Using Only Tidy