Quick/Elegant Way to Construct Mean/Variance Summary Table

I'm a bit puzzled. Does this not work:

mvtab2 <- ddply(d, .(f1, f2, f3),
                summarise, y.mean = mean(y), y.var = var(y))

This gives me something like this:

  f1 f2  f3    y.mean       y.var
1  A  a   I 0.6502307 0.095379578
2  A  a  II 0.4876630 0.110796695
3  A  a III 0.3102926 0.202805677
4  A  b   I 0.3914084 0.058693103
5  A  b  II 0.5257355 0.218631264

This is in the right form, but the values look different from what you specified.

Edit

Here's how to make your version with numcolwise work:

mvtab2 <- ddply(subset(d, select = -c(z, rep)), .(f1, f2, f3), summarise,
                y.mean = numcolwise(mean)(piece),
                y.var  = numcolwise(var)(piece))

You forgot to pass the actual data to numcolwise: numcolwise(mean) returns a function, which you still have to call on something. The other ingredient is the little ddply trick that each group's data frame is called piece internally. (As Hadley points out in the comments, this shouldn't be relied upon, as it may change in future versions of plyr.)
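
If you'd rather not depend on the internal piece name at all, a minimal sketch of an alternative (assuming the same d, grouping factors f1/f2/f3, and numeric column y as above) is to pass numcolwise(mean) directly as the function ddply applies to each group:

library(plyr)

# numcolwise(mean) itself becomes the per-group function, so no reference
# to the internal `piece` object is needed; every numeric column (here
# just y) is averaged within each f1/f2/f3 group.
mean.tab <- ddply(subset(d, select = -c(z, rep)), .(f1, f2, f3),
                  numcolwise(mean))

# Variances the same way:
var.tab <- ddply(subset(d, select = -c(z, rep)), .(f1, f2, f3),
                 numcolwise(var))

The two tables can then be merged on f1/f2/f3 if you want them side by side.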

Summarize a dataframe by groups

using data.table

library(data.table)
groups <- data.table(groups, key="Group")
DT <- data.table(df)

groups[, rowMeans(DT[, Class, with = FALSE]), by = Group][
  , setnames(as.data.table(matrix(V1, ncol = length(unique(Group)))),
             unique(Group))]

             G1         G2
 1: -0.13052091 -0.3667552
 2:  1.17178729 -0.5496347
 3:  0.23115841  0.8317714
 4:  0.45209516 -1.2180895
 5: -0.01861638 -0.4174929
 6: -0.43156831  0.9008427
 7: -0.64026238  0.1854066
 8:  0.56225108 -0.3563087
 9: -2.00405840 -0.4680040
10:  0.57608055 -0.6177605

# Also, make sure Class and Group are characters, not factors:
groups[, Class := as.character(Class)]
groups[, Group := as.character(Group)]

simple base R:

tapply(groups$Class, groups$Group, function(X) rowMeans(df[, X]))

using sapply:

sapply(unique(groups$Group), function(X)
  rowMeans(df[, groups[groups$Group == X, "Class"]]))

Two-way table with mean of a third variable in R

It seems you want an Excel-like pivot table. The pivottabler package helps a lot here; it also generates nice HTML tables (apart from displaying results in the console).

library(pivottabler)
qpvt(df, "Country", "Stars", "mean(Price)")

             2                 3      4    5     Total
Canada     453               786  499.5  687       585
China    445.5               234   1200  987     662.4
Russia                     560.5    673            598
Total      448  543.666666666667    709  837  614.0625

For formatting, use the format argument (an sprintf-style string):

qpvt(df, "Country", "Stars", "mean(Price)", format = "%.2f")
2 3 4 5 Total
Canada 453.00 786.00 499.50 687.00 585.00
China 445.50 234.00 1200.00 987.00 662.40
Russia 560.50 673.00 598.00
Total 448.00 543.67 709.00 837.00 614.06

For HTML output, use qhpvt instead.


qhpvt(df, "Country", "Stars", "mean(Price)")

Output: a rendered HTML pivot table (screenshot in the original answer).

Note: tidyverse and base R methods are also possible and just as easy; see the sketch below.
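
For instance, a minimal base R version (assuming the same df with Country, Stars, and Price columns) and a tidyverse equivalent might look like this:

# base R: two-way table of mean Price by Country and Stars
tapply(df$Price, list(df$Country, df$Stars), mean)

# tidyverse: group, summarise, then spread Stars into columns
library(dplyr)
library(tidyr)
df %>%
  group_by(Country, Stars) %>%
  summarise(mean_price = mean(Price), .groups = "drop") %>%
  pivot_wider(names_from = Stars, values_from = mean_price)

Neither produces the Total row and column that pivottabler adds; those would have to be computed separately.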

dplyr: calculate mean and variance without all the data

You can calculate the mean and variance manually from their formulas, once count_plot is available via the join.

The variance is computed as sum((x - mean(x))^2) / (length(x) - 1). The plots that never appear in the data have observed = 0, so each of them contributes (0 - mcount)^2 = mcount^2 to the sum of squared deviations; that is what the (N - n()) * mcount^2 term supplies (see the check after the output below).

df3 %>%
  left_join(site_count) %>%
  group_by(strata) %>%
  summarise(N = unique(count_plot),
            mcount = sum(observed) / N,
            varcount = sum((observed - mcount)^2, (N - n()) * mcount^2) / (N - 1)) %>%
  select(-N)

# A tibble: 3 x 3
#   strata mcount varcount
#    <dbl>  <dbl>    <dbl>
# 1   10.0   1.89    0.861
# 2   20.0   1.33    1.07
# 3   30.0   2.40    2.30

This matches df2:

df2

# A tibble: 3 x 3
  strata mcount varcount
   <dbl>  <dbl>    <dbl>
1   10.0   1.89    0.861
2   20.0   1.33    1.07
3   30.0   2.40    2.30
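
To see why the (N - n()) * mcount^2 term is right, here is a tiny hypothetical check: pad the observed counts with the missing plots' zeros and apply var() directly; it agrees with the manual formula.

# hypothetical example: 3 observed plots out of N = 5 total plots
x <- c(2, 1, 3)                        # observed counts
N <- 5                                 # total number of plots
x_full <- c(x, rep(0, N - length(x)))  # unobserved plots contribute zeros

m <- sum(x) / N
manual_var <- sum((x - m)^2, (N - length(x)) * m^2) / (N - 1)

all.equal(var(x_full), manual_var)  # TRUE: both give 1.7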

How can I calculate the variance of a list in Python?

You can use numpy's built-in function var:

import numpy as np

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

print(np.var(results))

This gives you 28.822364260579157

If, for whatever reason, you cannot use numpy or you don't want to use a built-in function, you can also calculate it "by hand" using e.g. a list comprehension:

# calculate mean
m = sum(results) / len(results)

# calculate variance using a list comprehension
var_res = sum((xi - m) ** 2 for xi in results) / len(results)

which gives you the identical result.

If you are interested in the standard deviation, you can use numpy.std:

print(np.std(results))
5.36864640860051

@Serge Ballesta explained the difference between the n and n - 1 variants of the variance (population vs. sample variance) very well. In numpy you can easily choose between them with the ddof parameter; its default is 0, so for the n - 1 case you can simply do:

np.var(results, ddof=1)

The "by hand" solution is given in @Serge Ballesta's answer.

Both approaches yield 32.024849178421285.

You can also set this parameter for std:

np.std(results, ddof=1)
5.659050201086865

Filter out data.table columns based on summary statistics

These work:

dt[, .SD, .SDcols=unlist(mask)] 

dt[, .SD, .SDcols=which(unlist(mask))]

All together now:

variance.filter = function(df) {
  mask = df[, lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]
  df[, .SD, .SDcols = unlist(mask)]
}

EDIT: in the current development version of data.table (1.12.9), .SDcols accepts a function to filter the columns, so this works:

variance.filter = function(df) {
  df[, .SD, .SDcols = function(x) sd(x, na.rm = TRUE) > 1]
}
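
A quick illustrative check (hypothetical data; the generating standard deviations are chosen so that column b almost surely falls below the threshold):

library(data.table)

set.seed(1)
dt <- data.table(a = rnorm(100, sd = 5),    # sd well above 1: kept
                 b = rnorm(100, sd = 0.1),  # sd well below 1: dropped
                 c = rnorm(100, sd = 2))    # kept

variance.filter(dt)  # returns a data.table with only columns a and c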

Pandas Data Frame Summary Table

It seems you could get some use out of DataFrame.agg(), with which you can essentially build a customized .describe() output. Here's an example to get you started:

import pandas as pd
import numpy as np

df = pd.DataFrame({'object': ['a', 'b', 'c'],
                   'numeric': [1, 2, 3],
                   'numeric2': [1.1, 2.5, 50.],
                   'categorical': pd.Categorical(['d', 'e', 'f'])})

def nullcounts(ser):
    return ser.isnull().sum()

def custom_describe(frame, func=[nullcounts, 'sum', 'mean', 'median', 'max'],
                    numeric_only=True, **kwargs):
    if numeric_only:
        frame = frame.select_dtypes(include=np.number)
    return frame.agg(func, **kwargs)

custom_describe(df)

            numeric   numeric2
nullcounts      0.0   0.000000
sum             6.0  53.600000
mean            2.0  17.866667
median          2.0   2.500000
max             3.0  50.000000

