Is There Such "Colsd" in R

Here is a fourth approach (very similar to @Thomas's), along with some benchmarks:

library("microbenchmark")
library("matrixStats")

colSdApply <- function(x, ...) apply(X = x, MARGIN = 2, FUN = sd, ...)
colSdMatrixStats <- colSds

colSdColMeans <- function(x, na.rm = TRUE) {
  if (na.rm) {
    n <- colSums(!is.na(x)) # thanks @flodel
  } else {
    n <- nrow(x)
  }
  colVar <- colMeans(x * x, na.rm = na.rm) - (colMeans(x, na.rm = na.rm))^2
  return(sqrt(colVar * n / (n - 1)))
}

colSdThomas <- function(x) {
  sqrt(rowMeans((t(x) - colMeans(x))^2) * ((dim(x)[1]) / (dim(x)[1] - 1)))
}

m <- matrix(runif(1e7), nrow=1e3)

microbenchmark(colSdApply(m), colSdMatrixStats(m), colSdColMeans(m), colSdThomas(m))

# Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#        colSdApply(m) 435.7346 448.8673 456.6176 476.8373 512.9783   100
#  colSdMatrixStats(m) 344.6416 357.5439 383.8736 389.0258 465.5715   100
#     colSdColMeans(m) 124.2028 128.9016 132.9446 137.6254 172.6407   100
#       colSdThomas(m) 231.5567 240.3824 245.4072 274.6611 307.3806   100


all.equal(colSdApply(m), colSdMatrixStats(m))
# [1] TRUE
all.equal(colSdApply(m), colSdColMeans(m))
# [1] TRUE
all.equal(colSdApply(m), colSdThomas(m))
# [1] TRUE
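
One caveat with colSdColMeans: it relies on the identity Var(x) = E[x^2] - (E[x])^2, which is fast but can lose precision when a column's mean is very large relative to its spread. For runif data as in the benchmark this does not matter. A minimal sketch with made-up values (the names `colSdShortcut` and `big` are illustrative only) shows where it can go wrong:

```r
# Same shortcut as colSdColMeans, without the NA handling, for brevity.
colSdShortcut <- function(x) {
  n <- nrow(x)
  colVar <- colMeans(x * x) - colMeans(x)^2
  sqrt(colVar * n / (n - 1))
}

# Illustrative data: a tiny spread around a very large mean.
big <- matrix(c(1e9 + 1, 1e9 + 2, 1e9 + 3), ncol = 1)

reference <- apply(big, 2, sd)   # sd() centers the data before squaring
shortcut  <- colSdShortcut(big)  # squares first, then subtracts

reference
shortcut

# The squares are around 1e18, which needs more digits than a double holds,
# so the subtraction cancels the information that matters.
isTRUE(all.equal(reference, shortcut))
```

If your columns can look like this, centering first (as colSdThomas and sd do) is the safer choice.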

colSds does not work since the last matrixStats update in R

It looks like colSds only works on matrices, so convert the data frame with as.matrix first. This works for me:

colSds(as.matrix(mtcars))

        mpg         cyl        disp          hp        drat          wt
  6.0269481   1.7859216 123.9386938  68.5628685   0.5346787   0.9784574
       qsec          vs          am        gear        carb
  1.7869432   0.5040161   0.4989909   0.7378041   1.6152000

R: Calculate the standard deviation of columns in a data.frame despite NA values

You can try:

apply(df, 2, sd, na.rm = TRUE)

However, apply first coerces the data frame to a matrix, which can silently change column types when not every column is numeric. A more direct and safer option is to use lapply or sapply, as noted by @docendodiscimus:

sapply(df, sd, na.rm = TRUE)
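
To see how na.rm = TRUE behaves column by column, here is a small sketch on a made-up data frame (the names `df_demo`, `a`, `b` and `grp` are illustrative, not from the question). It filters to numeric columns first, since sd is not meaningful for a character column:

```r
# Made-up example data with NAs and one non-numeric column.
df_demo <- data.frame(
  a   = c(1, 2, NA, 4),
  b   = c(10, NA, 30, 50),
  grp = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns, then take the sd of each, ignoring NAs.
is_num <- vapply(df_demo, is.numeric, logical(1))
col_sd <- sapply(df_demo[is_num], sd, na.rm = TRUE)
col_sd
```

The result is a named numeric vector with one entry per numeric column.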

How to calculate standard deviation per row?

apply lets you apply a function to all rows of your data:

apply(values_for_all, 1, sd, na.rm = TRUE)

To compute the standard deviation for each column instead, replace the 1 (the MARGIN argument) with 2.
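
As a small illustration, assuming values_for_all is a numeric matrix (the values below are made up), the second argument switches between rows and columns:

```r
# Made-up 2 x 3 matrix with one missing value.
values_for_all <- matrix(c(1, 2, 3,
                           4, NA, 8),
                         nrow = 2, byrow = TRUE)

row_sd <- apply(values_for_all, 1, sd, na.rm = TRUE)  # one value per row
col_sd <- apply(values_for_all, 2, sd, na.rm = TRUE)  # one value per column

row_sd
col_sd  # a column with a single non-missing value gives NA
```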

Finding sum of all possible column combinations without repetition


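library(dplyr)
library(tidyr)

# Note: in dplyr >= 1.1.0, cur_data() is deprecated in favour of pick(),
# and multi-row summarise() in favour of reframe().
DT %>%
  group_by(Sample) %>%
  summarise(s = combn(cur_data(), 3, \(x) c(nms = names(x), Sum = rowSums(x)),
                      simplify = FALSE), .groups = 'drop') %>%
  unnest_wider(s) %>%
  type.convert(as.is = TRUE)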

# A tibble: 12 x 5
   Sample nms1  nms2  nms3    Sum
   <chr>  <chr> <chr> <chr> <int>
 1 A      ColA  ColB  ColC      5
 2 A      ColA  ColB  ColD      6
 3 A      ColA  ColC  ColD      4
 4 A      ColB  ColC  ColD      6
 5 B      ColA  ColB  ColC      5
 6 B      ColA  ColB  ColD      6
 7 B      ColA  ColC  ColD      5
 8 B      ColB  ColC  ColD      5
 9 C      ColA  ColB  ColC      5
10 C      ColA  ColB  ColD      6
11 C      ColA  ColC  ColD      6
12 C      ColB  ColC  ColD      4
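
The same idea can be sketched in base R. This is only an illustration under assumptions: the data frame `DT_demo` and its values are made up (the original DT is not shown here), and the layout of one Sample column plus four numeric columns is inferred from the output above.

```r
# Made-up data: one row per sample, four numeric columns.
DT_demo <- data.frame(
  Sample = c("A", "B"),
  ColA = c(1, 2), ColB = c(2, 1), ColC = c(2, 2), ColD = c(3, 3)
)

# All 3-column combinations of the value columns, without repetition.
value_cols <- setdiff(names(DT_demo), "Sample")
combos <- combn(value_cols, 3, simplify = FALSE)

# For each combination, sum the chosen columns row by row.
res <- do.call(rbind, lapply(combos, function(nms) {
  data.frame(
    Sample = DT_demo$Sample,
    cols   = paste(nms, collapse = "+"),
    Sum    = unname(rowSums(DT_demo[nms]))
  )
}))
res <- res[order(res$Sample), ]
res
```

With four value columns there are choose(4, 3) = 4 combinations, so the result has four rows per sample.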

How to plot multiple columns of a data frame to see where data exists in each column?

Here is the plot using ggplot:

Data

df <- structure(list(Index = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                     ColA = c(NA, NA, NA, 1, NA, NA, 0, NA, NA, 2),
                     ColB = c(NA, 0, NA, 0, NA, 1, 1, 2, 0, 1),
                     ColC = c(0, 1, 2, 2, 2, 1, 0, 0, NA, 0),
                     ColD = c(NA, 0, 1, 2, NA, 1, 2, 2, 1, 0)),
                .Names = c("Index", "ColA", "ColB", "ColC", "ColD"),
                row.names = c(NA, -10L), class = "data.frame")

Plot

library(ggplot2)
library(reshape2)
ggplot(melt(df, "Index"), aes(x = as.factor(Index), y = variable, alpha = !is.na(value))) +
  geom_point() +
  labs(x = "Index", y = "Variable") +
  scale_alpha_discrete("", breaks = c(TRUE, FALSE), labels = c("Not NA", "NA"))
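
If a numeric summary is enough, or you want to double-check the plot, the same presence/absence information can be tabulated in base R. A minimal sketch, rebuilding the example data with data.frame() so that it runs on its own (the names `df_demo` and `present` are illustrative):

```r
# Same values as the example data above.
df_demo <- data.frame(
  Index = 1:10,
  ColA = c(NA, NA, NA, 1, NA, NA, 0, NA, NA, 2),
  ColB = c(NA, 0, NA, 0, NA, 1, 1, 2, 0, 1),
  ColC = c(0, 1, 2, 2, 2, 1, 0, 0, NA, 0),
  ColD = c(NA, 0, 1, 2, NA, 1, 2, 2, 1, 0)
)

# Logical matrix: TRUE where a value exists, FALSE where it is NA.
present <- !is.na(df_demo[, -1])

n_present <- colSums(present)                 # non-missing count per column
rows_colA <- unname(which(present[, "ColA"])) # rows where ColA has data

n_present
rows_colA
```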
