How to Summarize Data Statistics Using R

Writing a function for summary statistics in R

@akrun's suggestion for how to immediately solve your question is right on.

An alternative is to use the nesting functionality of tidyr: have your function return a single-element list that contains a data frame (here a tibble) of your results.

library(dplyr)
library(tidyr)

# Wrap a one-row tibble in a list so that each summary becomes a single list-column cell
summary_function <- function(x) {
  list(tibble(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE), n = sum(!is.na(x))))
}

Then you can use across() to apply the same function to multiple columns:

dataSet %>%
  group_by(Secretor, Timepoint) %>% 
  summarize(across(Gene1:Gene4, summary_function))
# A tibble: 8 x 6
# Groups:   Secretor [2]
#  Secretor Timepoint Gene1            Gene2            Gene3            Gene4           
#     <dbl>     <dbl> <list>           <list>           <list>           <list>          
#1        0         1 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#2        0         6 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#3        0        12 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#4        0        18 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#5        1         1 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#6        1         6 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#7        1        12 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>
#8        1        18 <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]> <tibble [1 × 3]>

Now we can expand those list-columns using unnest() with names_sep = "_", which prefixes each new column with the name of the column it came from:

dataSet %>%
  group_by(Secretor, Timepoint) %>% 
  summarize(across(Gene1:Gene4, summary_function)) %>%
  unnest(Gene1:Gene4, names_sep = "_")
# A tibble: 8 x 14
# Groups:   Secretor [2]
#  Secretor Timepoint Gene1_mean Gene1_sd Gene1_n Gene2_mean Gene2_sd Gene2_n Gene3_mean Gene3_sd Gene3_n
#     <dbl>     <dbl>      <dbl>    <dbl>   <int>      <dbl>    <dbl>   <int>      <dbl>    <dbl>   <int>
#1        0         1      71.2     28.6        2       62.3     27.0       2       28.4    33.3        2
#2        0         6       5.40     7.43       2       58.6     29.1       2       37.0    33.9        2
#3        0        12      91.8     11.4        2       53.9     31.0       2       33.2    46.0        2
#4        0        18      51.5     65.0        2       65.3     40.2       2       63.8    32.7        2
#5        1         1      30.8     18.0        2       50.0     19.9       2       22.8     6.71       2
#6        1         6      63.9     49.2        2       59.9     41.8       2       30.9    39.5        2
#7        1        12      85.3      6.74       2       51.0     41.1       2       28.5    22.9        2
#8        1        18      41.7     44.8        2       80.2     24.0       2       64.7    17.4        2
## … with 3 more variables: Gene4_mean <dbl>, Gene4_sd <dbl>, Gene4_n <int>

This relies on recent additions to tidyr and dplyr (version >= 1.0.0), but it can come in handy.
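For anyone who wants to try the nest-then-unnest pattern without the original dataSet, here is a minimal sketch on made-up data. The tibble `toy`, its `group` column and its values are assumptions for illustration only; the function is the same one defined above.

```r
library(dplyr)
library(tidyr)

# Same idea as above: a one-row tibble wrapped in a list
summary_function <- function(x) {
  list(tibble(mean = mean(x, na.rm = TRUE),
              sd = sd(x, na.rm = TRUE),
              n = sum(!is.na(x))))
}

# Hypothetical data, made up for illustration
toy <- tibble(
  group = c("a", "a", "b", "b"),
  Gene1 = c(1, 3, 10, NA),
  Gene2 = c(2, 4, 6, 8)
)

result <- toy %>%
  group_by(group) %>%
  summarize(across(Gene1:Gene2, summary_function)) %>%  # list-columns of 1 x 3 tibbles
  unnest(Gene1:Gene2, names_sep = "_")                  # spread into Gene1_mean, Gene1_sd, ...

names(result)
```

Because n counts only non-missing values, the group with a single non-missing Gene1 value reports n = 1 and an sd of NA.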

How to summarize a variable based on another column in R?

Although there are reasonable suggestions by harre, I prefer to do it this way:

library(dplyr)

data  |>
    group_by(gender)  |>
    mutate(weight = as.numeric(weight))  |>
    summarise(
        across(weight, list(mean = mean, median = median))
    )
# # A tibble: 2 x 3
#   gender weight_mean weight_median
#   <chr>        <dbl>         <dbl>
# 1 Female        70.3          65
# 2 Male          64.8          64.5

The advantage of summarise(across()) is that if you had 2 columns, or 5, you could easily extend it, e.g. summarise(across(weight:height, list(mean = mean, median = median))). There are more examples of this in the docs.
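To illustrate the point about extending to several columns, here is a small sketch. The data frame `people`, the `height` column and all values are assumptions; weight is stored as text here only to mirror the as.numeric() step in the answer.

```r
library(dplyr)

# Hypothetical data; column names and values are made up for illustration
people <- data.frame(
  gender = c("Female", "Female", "Male", "Male"),
  weight = c("60", "70", "80", "90"),  # text, so it needs as.numeric()
  height = c(160, 170, 175, 185)
)

result <- people |>
  group_by(gender) |>
  mutate(weight = as.numeric(weight)) |>
  summarise(across(weight:height, list(mean = mean, median = median)))

names(result)
```

Each selected column gets one output column per function, named column_function.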

R data.table way to create summary statistics table with self-defined function

library(data.table)
library(mlbench)
data(BostonHousing)
dt <- as.data.table(BostonHousing)

fun_stats <- function(x) {
  # Return a named list so that rbindlist() can turn each variable into one row
  list(min = min(x, na.rm = TRUE),
       max = max(x, na.rm = TRUE),
       mean = mean(x, na.rm = TRUE))
}

dt[, rbindlist(lapply(.SD, fun_stats), idcol = "var"), 
   .SDcols = is.numeric]
#>         var       min      max        mean
#>      <char>     <num>    <num>       <num>
#>  1:    crim   0.00632  88.9762   3.6135236
#>  2:      zn   0.00000 100.0000  11.3636364
#>  3:   indus   0.46000  27.7400  11.1367787
#>  4:     nox   0.38500   0.8710   0.5546951
#>  5:      rm   3.56100   8.7800   6.2846344
#>  6:     age   2.90000 100.0000  68.5749012
#>  7:     dis   1.12960  12.1265   3.7950427
#>  8:     rad   1.00000  24.0000   9.5494071
#>  9:     tax 187.00000 711.0000 408.2371542
#> 10: ptratio  12.60000  22.0000  18.4555336
#> 11:       b   0.32000 396.9000 356.6740316
#> 12:   lstat   1.73000  37.9700  12.6530632
#> 13:    medv   5.00000  50.0000  22.5328063

Created on 2022-06-24 by the reprex package (v2.0.1)
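The same pattern should extend to grouped summaries by adding by =. The sketch below is an assumption about how one might do that, not part of the original answer; it uses the built-in mtcars data instead of BostonHousing so that it runs without mlbench.

```r
library(data.table)

# Same shape of helper as above: a named list of statistics
fun_stats <- function(x) {
  list(min = min(x, na.rm = TRUE),
       max = max(x, na.rm = TRUE),
       mean = mean(x, na.rm = TRUE))
}

dt <- as.data.table(mtcars)

# One row per (cyl, variable) pair
by_cyl <- dt[, rbindlist(lapply(.SD, fun_stats), idcol = "var"),
             by = cyl, .SDcols = c("mpg", "hp")]
by_cyl
```

Here .SDcols is given as explicit column names so that the grouping column is not summarised as well.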

Summary of data for each year in R

Base R

Here are two methods from base R.

The first uses cut, split and lapply along with summary.

creekFlowSummary <- lapply(split(creek, cut(creek$date, "1 year")), 
                           function(x) summary(x[2]))

This creates a list. You can view the summaries of different years by accessing the corresponding list index or name.

creekFlowSummary[1]
# $`1999-01-01`
#       flow       
#  Min.   :0.3187  
#  1st Qu.:0.3965  
#  Median :0.4769  
#  Mean   :0.6366  
#  3rd Qu.:0.5885  
#  Max.   :7.2560  
# 
creekFlowSummary["2000-01-01"]
# $`2000-01-01`
#       flow       
#  Min.   :0.1370  
#  1st Qu.:0.1675  
#  Median :0.2081  
#  Mean   :0.2819  
#  3rd Qu.:0.2837  
#  Max.   :2.3800
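Since the creek data is not shown, here is a minimal sketch of the cut/split/lapply approach on a made-up stand-in. The five dates and flow values are assumptions for illustration only.

```r
# Hypothetical stand-in for the creek data; values are made up
creek <- data.frame(
  date = as.Date(c("1999-03-01", "1999-07-15", "1999-11-30",
                   "2000-02-10", "2000-05-20")),
  flow = c(0.4, 0.6, 0.5, 0.2, 0.3)
)

# cut() labels each date with the first day of its year; split() groups the rows
by_year <- split(creek, cut(creek$date, "1 year"))
creekFlowSummary <- lapply(by_year, function(x) summary(x[2]))

names(creekFlowSummary)           # "1999-01-01" "2000-01-01"
creekFlowSummary[["2000-01-01"]]  # [[ ]] returns the summary table itself

# The split pieces can be reused for any other statistic
yearly_mean <- sapply(by_year, function(x) mean(x$flow))
```

Single brackets, as used above, return a one-element list; double brackets return the summary table directly.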

The second uses aggregate:

aggregate(flow ~ cut(date, "1 year"), creek, summary)
#    cut(date, "1 year") flow.Min. flow.1st Qu. flow.Median flow.Mean flow.3rd Qu. flow.Max.
# 1           1999-01-01    0.3187       0.3965      0.4770    0.6366       0.5885    7.2560
# 2           2000-01-01    0.1370       0.1675      0.2081    0.2819       0.2837    2.3800
# 3           2001-01-01    0.1769       0.2062      0.2226    0.2950       0.2574    2.9220
# 4           2002-01-01    0.1279       0.1781      0.2119    0.5346       0.4966   14.3900
# 5           2003-01-01    0.3492       0.4761      0.7173    1.0350       1.0840   10.1500
# 6           2004-01-01    0.4178       0.5379      0.6524    0.9691       0.9020   11.7100
# 7           2005-01-01    0.4722       0.6094      0.7279    1.2340       1.0900   17.7200
# 8           2006-01-01    0.2651       0.3275      0.4282    0.5459       0.5758    3.3510
# 9           2007-01-01    0.2784       0.3557      0.4041    0.6331       0.6125    9.6290
# 10          2008-01-01    0.4131       0.5430      0.6477    0.8792       0.9540    4.5960
# 11          2009-01-01    0.3877       0.4572      0.5945    0.8465       0.8309    6.3830

Be careful with the aggregate solution though: all of the summary information is stored in a single matrix column, so the result has only two columns. Run str() on the output to see what I mean.
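A quick way to see the matrix column, and one common way to flatten it, is sketched below on made-up data. The stand-in creek values are assumptions, and the do.call(data.frame, ...) step is a general base R idiom rather than something from the original answer.

```r
# Hypothetical stand-in for the creek data; values are made up
creek <- data.frame(
  date = as.Date(c("1999-03-01", "1999-07-15", "1999-11-30",
                   "2000-02-10", "2000-05-20")),
  flow = c(0.4, 0.6, 0.5, 0.2, 0.3)
)

agg <- aggregate(flow ~ cut(date, "1 year"), creek, summary)

str(agg)             # two columns; the second is a numeric matrix
is.matrix(agg$flow)  # TRUE
ncol(agg)            # 2

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, agg)
ncol(flat)           # 7: the year plus six summary statistics
```

Note that data.frame() rewrites the column names into syntactically valid ones, so they will differ slightly from the printed aggregate output.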

`xts`

There are, of course, other ways to do this. One way is to use the xts package.

First, convert your data to xts:

library(xts)
creekx <- xts(creek$flow, order.by=creek$date)

Then, use apply.yearly and whatever functions you are interested in.

Here is the yearly mean:

apply.yearly(creekx, mean)
#                 [,1]
# 1999-12-31 0.6365604
# 2000-12-31 0.2819057
# 2001-12-31 0.2950348
# 2002-12-31 0.5345666
# 2003-12-31 1.0351742
# 2004-12-31 0.9691180
# 2005-12-31 1.2338066
# 2006-12-31 0.5458652
# 2007-12-31 0.6331271
# 2008-12-31 0.8792396
# 2009-09-30 0.8465300

And the yearly maximum:

apply.yearly(creekx, max)
#              [,1]
# 1999-12-31  7.256
# 2000-12-31  2.380
# 2001-12-31  2.922
# 2002-12-31 14.390
# 2003-12-31 10.150
# 2004-12-31 11.710
# 2005-12-31 17.720
# 2006-12-31  3.351
# 2007-12-31  9.629
# 2008-12-31  4.596
# 2009-09-30  6.383

Or, put them together like this: apply.yearly(creekx, function(x) cbind(mean(x), sum(x), max(x)))

`data.table`

The data.table package may also be of interest to you, particularly if you are dealing with a lot of data. Here's a data.table approach. The key is to convert your "date" column with as.IDate when you create the data.table:

library(data.table)
DT <- data.table(date = as.IDate(creek$date), creek[-1])
DT[, list(mean = mean(flow),
          tot = sum(flow),
          max = max(flow)), 
   by = year(date)]
#     year      mean      tot    max
#  1: 1999 0.6365604 104.3959  7.256
#  2: 2000 0.2819057 103.1775  2.380
#  3: 2001 0.2950348 107.6877  2.922
#  4: 2002 0.5345666 195.1168 14.390
#  5: 2003 1.0351742 377.8386 10.150
#  6: 2004 0.9691180 354.6972 11.710
#  7: 2005 1.2338066 450.3394 17.720
#  8: 2006 0.5458652 199.2408  3.351
#  9: 2007 0.6331271 231.0914  9.629
# 10: 2008 0.8792396 321.8017  4.596
# 11: 2009 0.8465300 231.1027  6.383

Summary statistics for multiple variables with statistics as rows and variables as columns?

Here is a way using purrr to iterate over a list of functions. This is effectively what you were doing with bind_rows(), but in less code.

library(dplyr)
library(purrr)

funs <- lst(min, median, mean, max, sd)

map_dfr(funs,
        ~ summarize(starwars, across(where(is.numeric), .x, na.rm = TRUE)),
        .id = "statistic")

# # A tibble: 5 x 4
#   statistic height   mass birth_year
#   <chr>      <dbl>  <dbl>      <dbl>
# 1 min         66     15          8  
# 2 median     180     79         52  
# 3 mean       174.    97.3       87.6
# 4 max        264   1358        896  
# 5 sd          34.8  169.       155.
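If you would rather avoid the purrr dependency, the same statistics-as-rows layout can be sketched in base R with nested sapply() calls. This is a different technique from the one above and returns a matrix rather than a tibble; the choice of three mtcars columns is just for illustration.

```r
# Named list of summary functions; the names become the row names
funs <- list(min = min, median = median, mean = mean, max = max, sd = sd)

# Any all-numeric data frame works; three mtcars columns are used here
num <- mtcars[c("hp", "drat", "wt")]

# Inner sapply: one value per statistic; outer sapply: one column per variable
stats <- sapply(num, function(col) sapply(funs, function(f) f(col, na.rm = TRUE)))
stats
```

The result has one row per statistic and one column per variable, so stats["median", "hp"] picks out a single value.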

How to create a summary statistics table for transformed variables in r

You can just call log on the data you want to summarise first.

For example:

Untransformed summary:

> mtcars |> select(hp, drat, wt) |> summary()
       hp             drat             wt       
 Min.   : 52.0   Min.   :2.760   Min.   :1.513  
 1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581  
 Median :123.0   Median :3.695   Median :3.325  
 Mean   :146.7   Mean   :3.597   Mean   :3.217  
 3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610  
 Max.   :335.0   Max.   :4.930   Max.   :5.424

Transformed summary:

> mtcars |> select(hp, drat, wt) |> log() |> summary()
       hp             drat             wt        
 Min.   :3.951   Min.   :1.015   Min.   :0.4141  
 1st Qu.:4.570   1st Qu.:1.125   1st Qu.:0.9479  
 Median :4.812   Median :1.307   Median :1.2009  
 Mean   :4.882   Mean   :1.269   Mean   :1.1217  
 3rd Qu.:5.193   3rd Qu.:1.366   3rd Qu.:1.2835  
 Max.   :5.814   Max.   :1.595   Max.   :1.6908

Is there an R function to summarize individual level data by country and year?

Thank you!

I changed one thing and added your suggested code to the front, and it worked! Here is the code that worked:

library(dplyr)
country_summary <- finaldata.gini %>% 
                     group_by(iso3c, date) %>% 
                     select(Ladder.Life.Present) %>% 
                     summarise_each(funs(mean))
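Note that summarise_each() and funs() are deprecated in current dplyr releases. A rough equivalent with across() might look like the sketch below. The data frame `toy` and its values are made up as a stand-in for finaldata.gini, and na.rm = TRUE is an added assumption that the original code did not include.

```r
library(dplyr)

# Hypothetical stand-in for finaldata.gini; values are made up for illustration
toy <- data.frame(
  iso3c = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  date = c(2010, 2010, 2011, 2010, 2010),
  Ladder.Life.Present = c(4, 6, 7, 5, NA)
)

country_summary <- toy %>%
  group_by(iso3c, date) %>%
  # na.rm = TRUE is an assumption; drop it to keep NA for groups with missing values
  summarise(across(Ladder.Life.Present, ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
country_summary
```

This gives one row per country and year, with the select() step folded into the across() column selection.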