How to Summarize Data Statistics Using R

Writing a function for summary statistics in R

@akrun's suggestion for how to immediately solve your question is right on.

An alternative is to use the nesting functionality of tidyr by returning a single-element list that contains a data.frame of your results.

library(dplyr)
library(tidyr)

summary_function <- function(x) {
  # Wrap a one-row tibble in a list so summarize() stores it as a list-column
  list(tibble(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = sum(!is.na(x))))
}

Then you can use across to apply the same function to multiple columns:

dataSet %>%
  group_by(Secretor, Timepoint) %>%
  summarize(across(Gene1:Gene4, summary_function))
# A tibble: 8 x 6
# Groups: Secretor [2]
# Secretor Timepoint Gene1 Gene2 Gene3 Gene4
# <dbl> <dbl> <list> <list> <list> <list>
#1 0 1 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#2 0 6 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#3 0 12 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#4 0 18 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#5 1 1 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#6 1 6 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#7 1 12 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#8 1 18 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>

Now we can unnest those same columns using unnest with names_sep =:

dataSet %>%
  group_by(Secretor, Timepoint) %>%
  summarize(across(Gene1:Gene4, summary_function)) %>%
  unnest(Gene1:Gene4, names_sep = "_")
# A tibble: 8 x 14
# Groups: Secretor [2]
# Secretor Timepoint Gene1_mean Gene1_sd Gene1_n Gene2_mean Gene2_sd Gene2_n Gene3_mean Gene3_sd Gene3_n
# <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl> <int>
#1 0 1 71.2 28.6 2 62.3 27.0 2 28.4 33.3 2
#2 0 6 5.40 7.43 2 58.6 29.1 2 37.0 33.9 2
#3 0 12 91.8 11.4 2 53.9 31.0 2 33.2 46.0 2
#4 0 18 51.5 65.0 2 65.3 40.2 2 63.8 32.7 2
#5 1 1 30.8 18.0 2 50.0 19.9 2 22.8 6.71 2
#6 1 6 63.9 49.2 2 59.9 41.8 2 30.9 39.5 2
#7 1 12 85.3 6.74 2 51.0 41.1 2 28.5 22.9 2
#8 1 18 41.7 44.8 2 80.2 24.0 2 64.7 17.4 2
## … with 3 more variables: Gene4_mean <dbl>, Gene4_sd <dbl>, Gene4_n <int>

These features are recent additions to tidyr and dplyr (version >= 1.0.0), but they can come in handy.

How to summarize a variable based on another column in R?

Although there are reasonable suggestions by harre, I prefer to do it this way:

library(dplyr)

data |>
  group_by(gender) |>
  mutate(weight = as.numeric(weight)) |>
  summarise(
    across(weight, list(mean = mean, median = median))
  )
# # A tibble: 2 x 3
# gender weight_mean weight_median
# <chr> <dbl> <dbl>
# 1 Female 70.3 65
# 2 Male 64.8 64.5

The advantage of across() is that if you had 2 columns, or 5, you could easily extend it, e.g. across(weight:height). There are more examples of this in the docs.
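As a sketch of that extension, assume a hypothetical data frame `people` in which both `weight` and `height` arrive as character columns. The data frame name and all values are made up for illustration; only the column-range idea comes from the answer above.

```r
library(dplyr)

# Hypothetical data: numeric values stored as character strings
people <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  weight = c("60", "70", "80", "90"),
  height = c("160", "170", "180", "190")
)

result <- people |>
  group_by(gender) |>
  # Convert every column in the range, then summarise the same range
  mutate(across(weight:height, as.numeric)) |>
  summarise(across(weight:height, list(mean = mean, median = median)))

names(result)
```

Each output column is named after the input column and the function name in the list, so adding a column to the range or a function to the list needs no other changes.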

R data.table way to create summary statistics table with self-defined function

library(data.table)
library(mlbench)
data(BostonHousing)
dt <- as.data.table(BostonHousing)

fun_stats <- function(x) {
  # Return a named list so rbindlist() can turn each result into one row
  list(
    min = min(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE)
  )
}

dt[, rbindlist(lapply(.SD, fun_stats), idcol = "var"),
   .SDcols = is.numeric]
#> var min max mean
#> <char> <num> <num> <num>
#> 1: crim 0.00632 88.9762 3.6135236
#> 2: zn 0.00000 100.0000 11.3636364
#> 3: indus 0.46000 27.7400 11.1367787
#> 4: nox 0.38500 0.8710 0.5546951
#> 5: rm 3.56100 8.7800 6.2846344
#> 6: age 2.90000 100.0000 68.5749012
#> 7: dis 1.12960 12.1265 3.7950427
#> 8: rad 1.00000 24.0000 9.5494071
#> 9: tax 187.00000 711.0000 408.2371542
#> 10: ptratio 12.60000 22.0000 18.4555336
#> 11: b 0.32000 396.9000 356.6740316
#> 12: lstat 1.73000 37.9700 12.6530632
#> 13: medv 5.00000 50.0000 22.5328063

Created on 2022-06-24 by the reprex package (v2.0.1)

Summary of data for each year in R

Base R

Here are two methods from base R.

The first uses cut, split and lapply along with summary.

creekFlowSummary <- lapply(split(creek, cut(creek$date, "1 year")),
                           function(x) summary(x[2]))

This creates a list. You can view the summaries of different years by accessing the corresponding list index or name.

creekFlowSummary[1]
# $`1999-01-01`
# flow
# Min. :0.3187
# 1st Qu.:0.3965
# Median :0.4769
# Mean :0.6366
# 3rd Qu.:0.5885
# Max. :7.2560
#
creekFlowSummary["2000-01-01"]
# $`2000-01-01`
# flow
# Min. :0.1370
# 1st Qu.:0.1675
# Median :0.2081
# Mean :0.2819
# 3rd Qu.:0.2837
# Max. :2.3800

The second uses aggregate:

aggregate(flow ~ cut(date, "1 year"), creek, summary)
# cut(date, "1 year") flow.Min. flow.1st Qu. flow.Median flow.Mean flow.3rd Qu. flow.Max.
# 1 1999-01-01 0.3187 0.3965 0.4770 0.6366 0.5885 7.2560
# 2 2000-01-01 0.1370 0.1675 0.2081 0.2819 0.2837 2.3800
# 3 2001-01-01 0.1769 0.2062 0.2226 0.2950 0.2574 2.9220
# 4 2002-01-01 0.1279 0.1781 0.2119 0.5346 0.4966 14.3900
# 5 2003-01-01 0.3492 0.4761 0.7173 1.0350 1.0840 10.1500
# 6 2004-01-01 0.4178 0.5379 0.6524 0.9691 0.9020 11.7100
# 7 2005-01-01 0.4722 0.6094 0.7279 1.2340 1.0900 17.7200
# 8 2006-01-01 0.2651 0.3275 0.4282 0.5459 0.5758 3.3510
# 9 2007-01-01 0.2784 0.3557 0.4041 0.6331 0.6125 9.6290
# 10 2008-01-01 0.4131 0.5430 0.6477 0.8792 0.9540 4.5960
# 11 2009-01-01 0.3877 0.4572 0.5945 0.8465 0.8309 6.3830

Be careful with the aggregate solution, though: all of the summary information is stored in a single matrix column. Run str() on the output to see what I mean.
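To make the caveat concrete, here is a minimal sketch on a small made-up data frame. The names `toy`, `agg` and `flat` and the values are assumptions for illustration, not the creek data. It shows that the statistics sit in one matrix column, and one common way to flatten them into ordinary columns with do.call(data.frame, ...).

```r
# Hypothetical toy data standing in for `creek`
toy <- data.frame(
  year = rep(c(1999, 2000), each = 4),
  flow = c(0.3, 0.4, 0.5, 0.6, 0.1, 0.2, 0.3, 0.4)
)

agg <- aggregate(flow ~ year, toy, summary)

# Only two columns: `year` and a matrix column `flow` holding six statistics
ncol(agg)
is.matrix(agg$flow)

# Flatten the matrix column into regular data frame columns
flat <- do.call(data.frame, agg)
ncol(flat)
```

After flattening, each statistic can be selected or sorted like any other column.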

xts

There are, of course, other ways to do this. One way is to use the xts package.

First, convert your data to xts:

library(xts)
creekx <- xts(creek$flow, order.by=creek$date)

Then, use apply.yearly and whatever functions you are interested in.

Here is the yearly mean:

apply.yearly(creekx, mean)
# [,1]
# 1999-12-31 0.6365604
# 2000-12-31 0.2819057
# 2001-12-31 0.2950348
# 2002-12-31 0.5345666
# 2003-12-31 1.0351742
# 2004-12-31 0.9691180
# 2005-12-31 1.2338066
# 2006-12-31 0.5458652
# 2007-12-31 0.6331271
# 2008-12-31 0.8792396
# 2009-09-30 0.8465300

And the yearly maximum:

apply.yearly(creekx, max)
# [,1]
# 1999-12-31 7.256
# 2000-12-31 2.380
# 2001-12-31 2.922
# 2002-12-31 14.390
# 2003-12-31 10.150
# 2004-12-31 11.710
# 2005-12-31 17.720
# 2006-12-31 3.351
# 2007-12-31 9.629
# 2008-12-31 4.596
# 2009-09-30 6.383

Or, put them together like this: apply.yearly(creekx, function(x) cbind(mean(x), sum(x), max(x)))

data.table

The data.table package may also be of interest to you, particularly if you are dealing with a lot of data. Here's a data.table approach. The key is to use as.IDate on your "date" column while you are reading your data in:

library(data.table)
DT <- data.table(date = as.IDate(creek$date), creek[-1])
DT[, list(mean = mean(flow),
          tot = sum(flow),
          max = max(flow)),
   by = year(date)]
# year mean tot max
# 1: 1999 0.6365604 104.3959 7.256
# 2: 2000 0.2819057 103.1775 2.380
# 3: 2001 0.2950348 107.6877 2.922
# 4: 2002 0.5345666 195.1168 14.390
# 5: 2003 1.0351742 377.8386 10.150
# 6: 2004 0.9691180 354.6972 11.710
# 7: 2005 1.2338066 450.3394 17.720
# 8: 2006 0.5458652 199.2408 3.351
# 9: 2007 0.6331271 231.0914 9.629
# 10: 2008 0.8792396 321.8017 4.596
# 11: 2009 0.8465300 231.1027 6.383

Summary statistics for multiple variables with statistics as rows and variables as columns?

Here is a way using purrr to iterate over a list of functions. This is effectively what you were doing with bind_rows(), but in less code.

library(dplyr)
library(purrr)

funs <- lst(min, median, mean, max, sd)

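# Passing extra arguments through across()'s `...` is deprecated as of dplyr 1.1.0,
# so na.rm = TRUE is supplied inside an anonymous function instead
map_dfr(funs,
        ~ summarize(starwars, across(where(is.numeric), function(col) .x(col, na.rm = TRUE))),
        .id = "statistic")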

# # A tibble: 5 x 4
# statistic height mass birth_year
# <chr> <dbl> <dbl> <dbl>
# 1 min 66 15 8
# 2 median 180 79 52
# 3 mean 174. 97.3 87.6
# 4 max 264 1358 896
# 5 sd 34.8 169. 155.
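For comparison, roughly the same table shape can be sketched in base R with nested sapply() calls. This is not the purrr approach above, just an alternative shown on the built-in mtcars data; the names `stat_funs`, `vars` and `stat_table` are made up for illustration.

```r
stat_funs <- list(min = min, median = median, mean = mean, max = max, sd = sd)
vars <- mtcars[c("hp", "drat", "wt")]

# Outer sapply loops over columns, inner sapply over statistics,
# giving a matrix with statistics as rows and variables as columns
stat_table <- sapply(vars, function(col) {
  sapply(stat_funs, function(f) f(col, na.rm = TRUE))
})

stat_table
```

The result is a numeric matrix rather than a tibble, so wrap it in as.data.frame() if you need data frame semantics.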

How to create a summary statistics table for transformed variables in r

You can just call log on the data you want to summarise first.

For example:

Untransformed summary:

> mtcars |> select(hp, drat, wt) |> summary()
hp drat wt
Min. : 52.0 Min. :2.760 Min. :1.513
1st Qu.: 96.5 1st Qu.:3.080 1st Qu.:2.581
Median :123.0 Median :3.695 Median :3.325
Mean :146.7 Mean :3.597 Mean :3.217
3rd Qu.:180.0 3rd Qu.:3.920 3rd Qu.:3.610
Max. :335.0 Max. :4.930 Max. :5.424

Transformed:

> mtcars |> select(hp, drat, wt) |> log() |> summary()
hp drat wt
Min. :3.951 Min. :1.015 Min. :0.4141
1st Qu.:4.570 1st Qu.:1.125 1st Qu.:0.9479
Median :4.812 Median :1.307 Median :1.2009
Mean :4.882 Mean :1.269 Mean :1.1217
3rd Qu.:5.193 3rd Qu.:1.366 3rd Qu.:1.2835
Max. :5.814 Max. :1.595 Max. :1.6908

Is there an R function to summarize individual level data by country and year?

Thank you!

I actually just changed something and added your suggested code to the front, and it worked! Here is the code that worked:

library(dplyr)
# summarise_each() and funs() are deprecated; summarise(across()) is the current equivalent
country_summary <- finaldata.gini %>%
  group_by(iso3c, date) %>%
  summarise(across(Ladder.Life.Present, mean))

