Calculate Summary Statistics (e.g. Mean) on All Numeric Columns Using data.table

Calculate summary statistics (e.g. mean) on all numeric columns using data.table

Searching SO for .SDcols led me to this answer, which I think explains quite nicely how to use it.

cols = sapply(mydt, is.numeric)
cols = names(cols)[cols]
mydt[, lapply(.SD, mean), .SDcols = cols]
#        vnum1 vint1
# 1: -0.046491   4.5

Doing mydt[, sapply(mydt, is.numeric), with = FALSE] (note: the "modern" way to do that is mydt[ , .SD, .SDcols = is.numeric]) is not that efficient because it subsets your data.table to those columns, which makes a (deep) copy and uses more memory unnecessarily.

And using colMeans coerces the data.table into a matrix, which again is not so memory efficient.
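
As a minimal sketch of the function-valued .SDcols form mentioned above, the column means can be computed without first building the vector of names. The table and its values here are hypothetical, chosen only so the result is easy to check by hand.

```r
library(data.table)

# Hypothetical example table: one character column, two numeric columns
mydt <- data.table(
  id    = c("a", "b", "c", "d"),
  vnum1 = c(0.5, 1.5, 2.5, 3.5),
  vint1 = 1:4
)

# .SDcols accepts a predicate; only columns for which it returns TRUE enter .SD
num_means <- mydt[, lapply(.SD, mean), .SDcols = is.numeric]
num_means
```

The character column id is skipped, so the result has one row with only vnum1 and vint1.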

Summary of all numeric variables with collapse R package

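I am not sure if this is what you are looking for, but collapse can summarise all numeric columns by group; here fmean with w = POP gives population-weighted means: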

library(collapse)
library(magrittr)
wlddev %>% 
  fgroup_by(region, income) %>%
  fsummarise(across(is.numeric, fmean, w = POP))
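
To make the role of the w argument concrete, here is a small sketch on hypothetical vectors (x, w and grp are made up for illustration) comparing fmean with base R's weighted.mean, overall and by group.

```r
library(collapse)

# Hypothetical values, weights, and grouping
x   <- c(1, 2, 3, 4)
w   <- c(1, 3, 1, 1)
grp <- c("a", "a", "b", "b")

# Overall weighted mean: should agree with base R's weighted.mean()
overall <- fmean(x, w = w)
base_overall <- weighted.mean(x, w)

# Weighted mean within each group, returned as a vector named by group
by_group <- fmean(x, g = grp, w = w)
by_group
```

Inside fsummarise(across(...)), the same weighting is applied to each selected column within each group.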

Select columns by class (e.g. numeric) from a data.table

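data.table needs with=FALSE to select columns by the numbers held in a variable; without it, j is evaluated as an ordinary expression and tokeep itself is returned.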

tokeep <- which(sapply(x,is.numeric))
x[ , tokeep, with=FALSE]
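
For comparison, the sketch below selects the same columns three ways on a hypothetical table: with=FALSE as above, the .. prefix, and a predicate passed to .SDcols. The table x and its columns are assumptions made for illustration.

```r
library(data.table)

# Hypothetical table with one non-numeric column
x <- data.table(a = 1:3, b = letters[1:3], c = c(1.5, 2.5, 3.5))

# Positions of the numeric columns
tokeep <- which(sapply(x, is.numeric))

by_with   <- x[, tokeep, with = FALSE]        # treat tokeep as column numbers
by_prefix <- x[, ..tokeep]                    # .. looks tokeep up in the calling scope
by_sdcols <- x[, .SD, .SDcols = is.numeric]   # predicate applied to each column
```

All three return the columns a and c.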

R Summary stats by column of data table

For the single-column stats, both of the other proposed solutions work well. For the two-column stats, this may not be the most elegant solution, but it works:

vsCols <- colnames(dtTest)
dtColDesc <- data.table()
for (lasCol in vsCols) {
  # Rows where this column is observed (not NA)
  lvbObs <- !is.na(dtTest[[lasCol]])
  # One-row table: column name plus earliest and latest observation date.
  # Built directly, because := cannot add columns to an empty data.table()
  ldtVar <- dtTest[lvbObs, list(sColName = lasCol,
                                dEarliest = min(dObsDt),
                                dLatest = max(dObsDt))]
  dtColDesc <- rbind(dtColDesc, ldtVar, fill = TRUE)
}
dtColDesc
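
An alternative to the loop, not taken from the answer above, is to reshape to long format and aggregate once. The sketch below assumes a hypothetical dtTest with a date column dObsDt and two numeric measures; melt() with na.rm = TRUE drops the missing cells, so the grouped min and max see only the observed dates.

```r
library(data.table)

# Hypothetical data: an observation date plus two measures with gaps
dtTest <- data.table(
  dObsDt = as.Date("2020-01-01") + 0:3,
  x = c(NA, 1, 2, NA),
  y = c(5, NA, NA, 8)
)

# Long format: one row per (date, column) pair that has a non-NA value
dtLong <- melt(dtTest, id.vars = "dObsDt", na.rm = TRUE,
               variable.name = "sColName")

# Earliest and latest observed date for each original column
dtColDesc <- dtLong[, .(dEarliest = min(dObsDt), dLatest = max(dObsDt)),
                    by = sColName]
dtColDesc
```

Unlike the loop, this version does not report on dObsDt itself, and it works best when the measure columns share a type, since melt() stacks them into a single value column.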

Summary statistics for multiple variables with statistics as rows and variables as columns?

Here is a way using purrr to iterate over a list of functions. This is effectively what you were doing with bind_rows(), but in less code.

library(dplyr)
library(purrr)

funs <- lst(min, median, mean, max, sd)

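# Wrap each function in a lambda: passing extra arguments such as na.rm
# through across()'s ... is deprecated in recent dplyr versions
map_dfr(funs,
        \(f) summarize(starwars, across(where(is.numeric), \(x) f(x, na.rm = TRUE))),
        .id = "statistic")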

# # A tibble: 5 x 4
#   statistic height   mass birth_year
#   <chr>      <dbl>  <dbl>      <dbl>
# 1 min         66     15          8  
# 2 median     180     79         52  
# 3 mean       174.    97.3       87.6
# 4 max        264   1358        896  
# 5 sd          34.8  169.       155.
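
The statistic column gets its labels from the names of the function list, which lst() fills in automatically from the expressions it is given. A small sketch on a hypothetical tibble (df and its columns are made up) shows both the auto-naming and the same pattern with an explicitly named list:

```r
library(dplyr)
library(purrr)

# lst() names each element after the expression that produced it
auto_names <- names(tibble::lst(min, max))

# Hypothetical data with missing values and one non-numeric column
df <- tibble(a = c(1, 2, NA, 4), b = c(10, 20, 30, NA), g = c("x", "y", "x", "y"))

# One summary row per function; .id stores the list names in "statistic"
funs <- list(min = min, max = max)
stats <- map_dfr(
  funs,
  \(f) summarize(df, across(where(is.numeric), \(x) f(x, na.rm = TRUE))),
  .id = "statistic"
)
stats
```

With a plain list() the names must be written out by hand, otherwise .id falls back to the element positions.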

Creating summary statistics (summarise_all) for a large factor dataset, retaining factor info

By combining several other answers, I managed to solve my problem as follows:

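# 1. Helper: convert a factor to numeric via its labels, not its integer codes
as.numeric.factor <- function(x) {as.numeric(as.character(x))}
# 2. Apply the helper to every column
df[] = lapply(df, as.numeric.factor)
# 3. Names of the numeric columns
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
# 4. Group means of those columns (df must be a data.table here)
dfsummary = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=countryyear]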
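
The as.character() step in the helper matters because as.numeric() on a factor returns the internal level codes rather than the labels. A short base R illustration on a made-up factor:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10"))

# Direct conversion returns the level codes
codes <- as.numeric(f)

# Going through character recovers the labelled values
values <- as.numeric(as.character(f))
```

Here codes is 1, 2, 1 while values is 10, 20, 10, so means computed on the codes would be silently wrong.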


How to summarize a variable based on another column in R?

Although there are reasonable suggestions by harre, I prefer to do it this way:

library(dplyr)

data |>
    group_by(gender) |>
    mutate(weight = as.numeric(weight)) |>
    summarise(
        across(weight, list(mean = mean, median = median))
    )
# # A tibble: 2 x 3
#   gender weight_mean weight_median
#   <chr>        <dbl>         <dbl>
# 1 Female        70.3          65
# 2 Male          64.8          64.5

The advantage of across() is that it extends easily if you had 2 columns, or 5: for example, across(weight:height, as.numeric) inside mutate(), with the same column range inside summarise(). There are more examples of this in the docs.
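
A sketch of that extension, assuming a hypothetical table people in which weight and height are both stored as character:

```r
library(dplyr)

# Hypothetical data: numeric values stored as text
people <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  weight = c("60", "70", "80", "90"),
  height = c("160", "170", "180", "190")
)

res <- people |>
  # Convert every column in the range weight:height in one step
  mutate(across(weight:height, as.numeric)) |>
  group_by(gender) |>
  # Each function in the list yields one <column>_<name> output per column
  summarise(across(weight:height, list(mean = mean, median = median)))
res
```

The result has one row per gender and the columns weight_mean, weight_median, height_mean and height_median.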
