Calculating Standard Deviation of Each Row

Calculating standard deviation of each row

You can combine the apply and transform functions:

set.seed(007)
X <- data.frame(matrix(sample(c(10:20, NA), 100, replace=TRUE), ncol=10))
transform(X, SD=apply(X,1, sd, na.rm = TRUE))
   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10       SD
1  NA 12 17 18 19 16 12 13 20  14 3.041381
2  14 12 13 13 14 18 16 17 20  10 3.020302
3  11 19 NA 12 19 19 19 20 12  20 3.865805
4  10 11 20 12 15 17 18 17 18  12 3.496029
5  12 15 NA 14 20 18 16 11 14  18 2.958040
6  19 11 10 20 13 14 17 16 10  16 3.596294
7  14 16 17 15 10 11 15 15 11  16 2.449490
8  NA 10 15 19 19 12 15 15 19  14 3.201562
9  11 NA NA 20 20 14 14 17 14  19 3.356763
10 15 13 14 15 NA 13 15 NA 15  12 1.195229

From ?apply you can see that the ... argument passes optional arguments through to FUN; in this case, na.rm = TRUE makes sd omit NA values.

rowSds from the matrixStats package also needs na.rm=TRUE to omit NA values. It expects a numeric matrix, so convert the data frame with as.matrix first:

library(matrixStats)
transform(X, SD=rowSds(as.matrix(X), na.rm=TRUE)) # same result as before.
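
As a cross-check of what na.rm = TRUE does, here is a minimal Python sketch using only the standard library. It recomputes the SD column for rows 1 and 10 of the output above by dropping the missing values first and then taking the sample standard deviation. The helper name row_sd is hypothetical, and None stands in for NA.

```python
import math
import statistics

def row_sd(row):
    """Sample standard deviation of a row, skipping missing (None) entries."""
    present = [v for v in row if v is not None]
    # With fewer than two values the sample SD is undefined.
    return statistics.stdev(present) if len(present) >= 2 else math.nan

# Rows 1 and 10 copied from the printed data frame (NA written as None).
rows = [
    [None, 12, 17, 18, 19, 16, 12, 13, 20, 14],
    [15, 13, 14, 15, None, 13, 15, None, 15, 12],
]

sds = [round(row_sd(r), 6) for r in rows]
print(sds)  # [3.041381, 1.195229]
```

Both values agree with the SD column shown above.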

How do you recalculate Standard Deviation at each row in a Dataframe?

This would do the trick:

df["Standard Deviation"] = df.groupby("Client ID")["Cost"].expanding(2).std(ddof=0).reset_index()["Cost"]
   Client ID  Session  Cost  Standard Deviation
0          1        0    10                 NaN
1          1        1    11            0.500000
2          1        2    14            1.699673
3          2        0    15                 NaN
4          2        1    16            0.500000
5          2        2    14            0.816497
6          2        3    22            3.112475


Explanation

You can rephrase your problem as:

Finding the cumulative standard deviation of the "Cost" column grouped by the "Client ID" column.

Pandas conveniently has built-in functions that handle both cumulative and group by computations.

Group By

A group by to compute the standard deviation looks like this:

df.groupby("Client ID")["Cost"].std()
Client ID
1 2.081666
2 3.593976

Cumulative

The cumulative standard deviation can be computed like this. Note that we use ddof=0 to get the population standard deviation, which is what we want here. We also use min_periods=2; otherwise the first row would have a value of 0.0 instead of NaN:

df.expanding(min_periods=2)["Cost"].std(ddof=0)
0         NaN
1    0.500000
2    1.699673
3    2.061553
4    2.315167
5    2.134375
6    3.619674
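
To make the effect of ddof concrete, here is a small sketch using Python's standard statistics module as a stand-in for pandas: pstdev divides by n (like ddof=0) and stdev divides by n - 1 (like ddof=1, the pandas default). The costs list is copied from the "Cost" column above.

```python
import statistics

costs = [10, 11, 14, 15, 16, 14, 22]

# Population SD (divide by n), the analogue of std(ddof=0).
pop_first_two = statistics.pstdev(costs[:2])

# Sample SD (divide by n - 1), the analogue of std(ddof=1).
sample_first_two = statistics.stdev(costs[:2])

# A single observation has a population SD of exactly 0.0, which is
# why min_periods=2 is needed to get NaN in the first row instead.
pop_single = statistics.pstdev(costs[:1])

print(pop_first_two, round(sample_first_two, 6), pop_single)
# 0.5 0.707107 0.0
```

So a value of 0.5 for the first two costs indicates the population formula, while 0.707107 indicates the sample formula.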

Group By + Cumulative

Combining the two, we get our result (note, we need to reset the index to lose the group by indexing and use the original index):

df.groupby("Client ID")["Cost"].expanding(2).std(ddof=0).reset_index()["Cost"]
0         NaN
1    0.500000
2    1.699673
3         NaN
4    0.500000
5    0.816497
6    3.112475
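
For a version you can run end to end, here is a sketch that rebuilds the DataFrame from the table shown in the answer. One deliberate difference from the one-liner above: it uses reset_index(level=0, drop=True), which drops only the group level and keeps the original row index, so the assignment still aligns if the rows are not sorted by "Client ID". Treat that as an alternative I am assuming you may prefer, not as the original answer's method.

```python
import pandas as pd

# Data reconstructed from the table in the answer.
df = pd.DataFrame({
    "Client ID": [1, 1, 1, 2, 2, 2, 2],
    "Session": [0, 1, 2, 0, 1, 2, 3],
    "Cost": [10, 11, 14, 15, 16, 14, 22],
})

expanding_sd = (
    df.groupby("Client ID")["Cost"]
    .expanding(min_periods=2)         # NaN until a group has two rows
    .std(ddof=0)                      # population standard deviation
    .reset_index(level=0, drop=True)  # drop the group level, keep row index
)

# Assignment aligns on the original row index.
df["Standard Deviation"] = expanding_sd
print(df)
```

The first row of each client is NaN, and the remaining values match the output shown above.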

Calculating standard deviation across rows

Try this, using dplyr together with rowSds from the matrixStats package:

library(dplyr)
library(matrixStats)

columns <- c('colB', 'colC', 'colD')

df %>%
  mutate(Mean = rowMeans(.[columns]), stdev = rowSds(as.matrix(.[columns])))

Returns

   colA colB colC colD     Mean    stdev
1 SampA   21   15   10 15.33333 5.507571
2 SampB   20   14   22 18.66667 4.163332
3 SampC   30   12   18 20.00000 9.165151

Your data

colA <- c("SampA", "SampB", "SampC")
colB <- c(21, 20, 30)
colC <- c(15, 14, 12)
colD <- c(10, 22, 18)
df <- data.frame(colA, colB, colC, colD)
df

How to calculate standard deviation per row?

apply lets you apply a function to all rows of your data:

apply(values_for_all, 1, sd, na.rm = TRUE)

To compute the standard deviation for each column instead, replace the 1 with 2.

How to calculate standard deviation with pandas for each row?

You can use .std(axis=1) [pandas-doc] instead. This returns a Series that shares the index of your DataFrame and whose values are the standard deviations of each row (here, of the two values in the corresponding columns):

>>> df.std(axis=1)
0    1.414214
1    2.687006
2    1.626346
3    1.223295
4    1.025305
5    1.732412
6    1.965757
dtype: float64
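
Since the question's DataFrame is not shown here, the following is a minimal sketch on hypothetical data (the column names a and b and their values are made up for illustration). It also shows that .std uses the sample formula (ddof=1) by default and accepts ddof=0 for the population formula.

```python
import pandas as pd

# Hypothetical two-column frame, for illustration only.
frame = pd.DataFrame({"a": [1.0, 4.0, 2.0], "b": [3.0, 4.0, 6.0]})

# One standard deviation per row, computed across the columns.
sample_sd = frame.std(axis=1)              # default ddof=1
population_sd = frame.std(axis=1, ddof=0)  # divide by n instead of n - 1

print(sample_sd.round(6).tolist())      # [1.414214, 0.0, 2.828427]
print(population_sd.round(6).tolist())  # [1.0, 0.0, 2.0]
```

With only two values per row, the sample standard deviation is the absolute difference divided by the square root of 2, which is why a difference of 2 gives 1.414214.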

