Calculate summary statistics (e.g. mean) on all numeric columns using data.table

Searching SO for .SDcols, I landed on this answer, which I think explains quite nicely how to use it.

cols = sapply(mydt, is.numeric)
cols = names(cols)[cols]
mydt[, lapply(.SD, mean), .SDcols = cols]
# vnum1 vint1
# 1: -0.046491 4.5

Doing mydt[, sapply(mydt, is.numeric), with = FALSE] (note: the "modern" way to do that is mydt[, .SD, .SDcols = is.numeric]) is not that efficient, because it subsets your data.table to those columns, and that makes a (deep) copy, so more memory is used unnecessarily.

And using colMeans coerces the data.table into a matrix, which again is not so memory efficient.
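
For what it's worth, here is a minimal sketch of doing the whole thing in one step by passing the predicate straight to .SDcols, which skips the intermediate subset. The table and its column names are made up for illustration, and it assumes a data.table version recent enough to accept a function in .SDcols.

```r
library(data.table)

# Hypothetical data: two numeric columns and one character column
mydt <- data.table(
  vnum1  = c(1.5, 2.5, 3.5),
  vint1  = 1:3,
  vchar1 = c("a", "b", "c")
)

# .SDcols = is.numeric restricts .SD to the numeric columns,
# so mean() is never applied to vchar1
num_means <- mydt[, lapply(.SD, mean), .SDcols = is.numeric]
num_means
```

Adding by = some_group_column to the same call would give the means per group.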

Summary of all numeric variables with collapse R package

I am not sure if you are looking for this.

library(collapse)
library(magrittr)
wlddev %>%
  fgroup_by(region, income) %>%
  fsummarise(across(is.numeric, fmean, w = POP))

Select columns by class (e.g. numeric) from a data.table

data.table needs with = FALSE to select columns by the column numbers stored in a variable.

tokeep <- which(sapply(x,is.numeric))
x[ , tokeep, with=FALSE]
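
As a rough sketch, these are three spellings that should select the same columns. The table x here is invented for the example; the .. prefix and a function in .SDcols both assume a reasonably recent data.table.

```r
library(data.table)

# Hypothetical data: one character column and two numeric columns
x <- data.table(id = c("a", "b", "c"), n = 1:3, v = c(0.1, 0.2, 0.3))

# Positions of the numeric columns
tokeep <- which(sapply(x, is.numeric))

by_with   <- x[, tokeep, with = FALSE]         # the approach shown above
by_dots   <- x[, ..tokeep]                     # .. looks tokeep up outside the table
by_sdcols <- x[, .SD, .SDcols = is.numeric]    # select by predicate directly
```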

R Summary stats by column of data table

For the single-column stats, both of the other proposed solutions work well. For the two-column stats, this may not be the most elegant solution, but it works:

vsCols <- colnames(dtTest)
dtColDesc <- data.table()
for (lasCol in vsCols) {
  # Rows where this column has a non-missing value
  lvbObs <- !is.na(dtTest[[lasCol]])
  # Build the one-row summary directly rather than using := on an empty data.table
  ldtVar <- data.table(
    sColName  = lasCol,
    dEarliest = dtTest[lvbObs, min(dObsDt)],
    dLatest   = dtTest[lvbObs, max(dObsDt)]
  )
  dtColDesc <- rbind(dtColDesc, ldtVar, fill = TRUE)
}
dtColDesc
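
If you want to try the idea without the loop, here is a small self-contained sketch using rbindlist() over lapply(). The data (dtTest with columns a and b) is made up for the example; only dObsDt follows the naming used above.

```r
library(data.table)

# Hypothetical data: dObsDt is the observation date, NA marks a missing value
dtTest <- data.table(
  dObsDt = as.Date("2020-01-01") + 0:3,
  a      = c(NA, 1, 2, NA),
  b      = c(5, NA, NA, 6)
)

# One row per variable: first and last date with a non-missing value
dtColDesc <- rbindlist(lapply(c("a", "b"), function(lasCol) {
  lvbObs <- !is.na(dtTest[[lasCol]])
  data.table(
    sColName  = lasCol,
    dEarliest = min(dtTest$dObsDt[lvbObs]),
    dLatest   = max(dtTest$dObsDt[lvbObs])
  )
}))
dtColDesc
```

Collecting the rows in a list and binding once also avoids growing dtColDesc inside the loop.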

Summary statistics for multiple variables with statistics as rows and variables as columns?

Here is a way using purrr to iterate over a list of functions. This is effectively what you were doing with bind_rows(), but in less code.

library(dplyr)
library(purrr)

funs <- lst(min, median, mean, max, sd)

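# Passing na.rm through across(...) is deprecated in dplyr >= 1.1.0,
# so wrap each statistic in an anonymous function instead
map_dfr(funs,
        function(f) summarize(starwars, across(where(is.numeric), function(x) f(x, na.rm = TRUE))),
        .id = "statistic")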

# # A tibble: 5 x 4
# statistic height mass birth_year
# <chr> <dbl> <dbl> <dbl>
# 1 min 66 15 8
# 2 median 180 79 52
# 3 mean 174. 97.3 87.6
# 4 max 264 1358 896
# 5 sd 34.8 169. 155.
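
Since the rest of this page is about data.table, here is a rough equivalent of the same layout (statistics as rows, variables as columns) without dplyr or purrr. The table dt is invented for the example, and .SDcols = is.numeric assumes a recent data.table.

```r
library(data.table)

# Hypothetical data with one missing value
dt <- data.table(g = c("a", "b", "c"), x = c(1, 2, 6), y = c(10, NA, 30))

# Named list of statistics; the names become the "statistic" column
funs <- list(min = min, median = median, mean = mean, max = max)

# One single-row table per statistic, stacked with the list names as an id column
res <- rbindlist(
  lapply(funs, function(f) dt[, lapply(.SD, f, na.rm = TRUE), .SDcols = is.numeric]),
  idcol = "statistic"
)
res
```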

Creating summary statistics (summarise_all) for a large factor dataset, retaining factor info

By combining several other answers (please see the corresponding links), I managed to solve my problem as follows:

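#1 Convert a factor to numeric via its labels, not its internal integer codes
as.numeric.factor <- function(x) {as.numeric(as.character(x))}
#2 Apply the conversion to every column, keeping the table structure
df[] = lapply(df, as.numeric.factor)
#3 Names of the numeric columns
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
#4 Group means of those columns (df must be a data.table for this syntax)
dfsummary = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=countryyear]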

1, 2, 3, 4
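
In case it is not obvious why step 1 goes through as.character(), here is a tiny illustration with a made-up factor: calling as.numeric() directly on a factor returns the level codes, not the numbers written in the labels.

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10", "5"))

codes  <- as.numeric(f)                 # positions of each value among the levels
values <- as.numeric(as.character(f))   # the numbers actually written in the labels

codes
values
```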

How to summarize a variable based on another column in R?

Although there are reasonable suggestions by harre, I prefer to do it this way:

library(dplyr)

data |>
  group_by(gender) |>
  mutate(weight = as.numeric(weight)) |>
  summarise(
    across(weight, list(mean = mean, median = median))
  )
# # A tibble: 2 x 3
# gender weight_mean weight_median
# <chr> <dbl> <dbl>
# 1 Female 70.3 65
# 2 Male 64.8 64.5

The advantage of across() is that if you had 2 columns, or 5, you could easily extend it, e.g. across(weight:height, ...) in either mutate() or summarise(). There are more examples of this in the docs.
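
To make that concrete, here is a small sketch with two columns. The data is invented (the height column in particular is an assumption, it is not in the question), and both columns start out as character, as weight did above.

```r
library(dplyr)

# Hypothetical data: weight and height stored as character
data <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  weight = c("60", "70", "80", "90"),
  height = c("160", "170", "180", "190")
)

res <- data |>
  # Convert every column in the range weight:height in one go
  mutate(across(weight:height, as.numeric)) |>
  group_by(gender) |>
  # Each column gets a <column>_mean and a <column>_median
  summarise(across(weight:height, list(mean = mean, median = median)))
res
```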


