Calculate summary statistics (e.g. mean) on all numeric columns using data.table
By searching on SO for .SDcols, I landed on this answer, which I think explains quite nicely how to use it.
cols = sapply(mydt, is.numeric)
cols = names(cols)[cols]
mydt[, lapply(.SD, mean), .SDcols = cols]
# vnum1 vint1
# 1: -0.046491 4.5
Doing mydt[, sapply(mydt, is.numeric), with = FALSE] (note: the "modern" way to do that is mydt[ , .SD, .SDcols = is.numeric]) is not that efficient because it subsets your data.table to those columns, and that makes a (deep) copy, so more memory is used unnecessarily.
And using colMeans coerces the data.table into a matrix, which again is not so memory efficient.
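As a small illustration of the function form of .SDcols mentioned above, here is a sketch on a made-up table. The table dt and its column names are hypothetical, and it assumes a data.table version recent enough to accept a function in .SDcols.

```r
library(data.table)

# Hypothetical toy table; the column names are made up for illustration
dt <- data.table(
  id   = c("a", "b", "c", "d"),
  vnum = c(1.5, 2.5, 3.5, 4.5),
  vint = 1:4
)

# Mean of every numeric column; the non-numeric id column is skipped
# because .SDcols keeps only columns for which is.numeric() is TRUE
means <- dt[, lapply(.SD, mean), .SDcols = is.numeric]
print(means)
```

This gives the same result as building the character vector of numeric column names first and passing that to .SDcols.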
Summary of all numeric variables with collapse R package
I am not sure if you are looking for this.
library(collapse)
library(magrittr)
wlddev %>%
fgroup_by(region, income) %>%
fsummarise(across(is.numeric, fmean, w = POP))
Select columns by class (e.g. numeric) from a data.table
data.table needs with=FALSE to select columns by number.
tokeep <- which(sapply(x,is.numeric))
x[ , tokeep, with=FALSE]
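For comparison, a minimal sketch of the same selection done two ways, on a hypothetical table x: once with with=FALSE as above, and once with the .. prefix, which tells data.table to look up tokeep in the calling scope. Here tokeep holds column names rather than positions, which is an assumption made for readability.

```r
library(data.table)

# Hypothetical toy table for illustration only
x <- data.table(
  name   = c("a", "b"),
  height = c(1.7, 1.8),
  age    = c(30L, 40L)
)

# Names of the numeric columns
tokeep <- names(x)[sapply(x, is.numeric)]

by_with <- x[, tokeep, with = FALSE]  # classic form
by_dots <- x[, ..tokeep]              # .. prefix form

print(names(by_with))
print(names(by_dots))
```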
R Summary stats by column of data table
For the single-column stats, both of the other proposed solutions work well. For the two-column stats, this may not be the most elegant solution, but it works:
vsCols <- colnames(dtTest)
dtColDesc <- data.table()
for (lasCol in vsCols) {
  # Earliest and latest observation dates among rows where this column is not NA
  ladEarliest <- dtTest[!is.na(dtTest[[lasCol]]), list(dEarliest = min(dObsDt))][[1]]
  ladLatest <- dtTest[!is.na(dtTest[[lasCol]]), list(dLatest = max(dObsDt))][[1]]
  # Build the one-row description directly instead of using := on an empty data.table
  ldtVar <- data.table(sColName = lasCol,
                       dEarliest = ladEarliest,
                       dLatest = ladLatest)
  dtColDesc <- rbind(dtColDesc, ldtVar, fill = TRUE)
}
dtColDesc
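The same idea can be sketched without growing a table inside a loop, by building one row per column and stacking the rows with rbindlist(). The toy data below is hypothetical. Only the names dtTest and dObsDt come from the code above, and unlike that loop this sketch skips the date column itself.

```r
library(data.table)

# Hypothetical toy data: dObsDt is the observation date, a and b are measurements
dtTest <- data.table(
  dObsDt = as.Date("2020-01-01") + 0:4,
  a = c(NA, 1, 2, NA, NA),
  b = c(NA, NA, 3, 4, 5)
)

# Describe every column except the date column
vsCols <- setdiff(colnames(dtTest), "dObsDt")

# One row per column: first and last date on which that column is not NA
dtColDesc <- rbindlist(lapply(vsCols, function(sCol) {
  dtTest[!is.na(dtTest[[sCol]]),
         list(sColName = sCol, dEarliest = min(dObsDt), dLatest = max(dObsDt))]
}))
print(dtColDesc)
```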
Summary statistics for multiple variables with statistics as rows and variables as columns?
Here is a way using purrr to iterate over a list of functions. This is effectively what you were doing with bind_rows(), but in less code.
library(dplyr)
library(purrr)
funs <- lst(min, median, mean, max, sd)
map_dfr(funs,
~ summarize(starwars, across(where(is.numeric), .x, na.rm = TRUE)),
.id = "statistic")
# # A tibble: 5 x 4
# statistic height mass birth_year
# <chr> <dbl> <dbl> <dbl>
# 1 min 66 15 8
# 2 median 180 79 52
# 3 mean 174. 97.3 87.6
# 4 max 264 1358 896
# 5 sd 34.8 169. 155.
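In dplyr 1.1.0 and later, passing extra arguments such as na.rm through across() is deprecated, so the call above may emit a warning. A possible variant, offered as a sketch rather than the original answer's code, wraps each statistic in an anonymous function instead:

```r
library(dplyr)
library(purrr)

funs <- lst(min, median, mean, max, sd)

# For each statistic f, summarise every numeric column of starwars.
# na.rm = TRUE is supplied inside the inner function rather than via across()'s dots.
stats <- map_dfr(
  funs,
  function(f) {
    summarize(starwars,
              across(where(is.numeric), function(col) f(col, na.rm = TRUE)))
  },
  .id = "statistic"
)
print(stats)
```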
Creating summary statistics (summarise_all) for a large factor dataset, retaining factor info
By combining several other answers (please see the appropriate links), I managed to solve my problem as follows:
#1
as.numeric.factor <- function(x) {as.numeric(as.character(x))}
#2
df[] = lapply(df, as.numeric.factor)
#3
cols = sapply(df, is.numeric)
cols = names(cols)[cols]
#4
dfsummary = df[, lapply(.SD, mean, na.rm=TRUE), .SDcols = cols, by=countryyear]
1, 2, 3, 4
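Step 1 goes through as.character() because calling as.numeric() directly on a factor returns the internal level codes, not the values shown. A small sketch with a made-up factor:

```r
# Hypothetical factor whose labels look like numbers
f <- factor(c("10", "20", "10", "30"))

codes  <- as.numeric(f)                # internal level codes
values <- as.numeric(as.character(f))  # the numbers the labels represent

print(codes)   # 1 2 1 3
print(values)  # 10 20 10 30
```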
How to summarize a variable based on another column in R?
Although there are reasonable suggestions by harre, I prefer to do it this way:
library(dplyr)
data |>
group_by(gender) |>
mutate(weight = as.numeric(weight)) |>
summarise(
across(weight, list(mean = mean, median = median))
)
# # A tibble: 2 x 3
# gender weight_mean weight_median
# <chr> <dbl> <dbl>
# 1 Female 70.3 65
# 2 Male 64.8 64.5
The advantage of mutate(across()) is that if you had 2 columns, or 5, you could easily extend it, e.g. mutate(across(weight:height)). There are more examples of this in the docs.
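To make the extension to several columns concrete, here is a sketch on made-up data. The tibble below and its height column are hypothetical, and both columns are stored as character on the assumption that this is why the answer converts weight with as.numeric().

```r
library(dplyr)

# Hypothetical data; weight and height are stored as character
data <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  weight = c("60", "70", "80", "90"),
  height = c("160", "170", "180", "190")
)

res <- data |>
  group_by(gender) |>
  # Convert every column from weight through height in one step
  mutate(across(weight:height, as.numeric)) |>
  # Apply each named function to each selected column
  summarise(across(weight:height, list(mean = mean, median = median)))
print(res)
```

The result has one row per gender and one column per column/function pair, named like weight_mean and height_median.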