Quick/elegant way to construct a mean/variance summary table
I'm a bit puzzled. Does this not work:
mvtab2 <- ddply(d,.(f1,f2,f3),
summarise,y.mean = mean(y),y.var = var(y))
This gives me something like this:
f1 f2 f3 y.mean y.var
1 A a I 0.6502307 0.095379578
2 A a II 0.4876630 0.110796695
3 A a III 0.3102926 0.202805677
4 A b I 0.3914084 0.058693103
5 A b II 0.5257355 0.218631264
Which is in the right form, but it looks like the values are different from what you specified.
Edit
Here's how to make your version with numcolwise work:
mvtab2 <- ddply(subset(d,select=-c(z,rep)),.(f1,f2,f3),summarise,
y.mean = numcolwise(mean)(piece),
y.var = numcolwise(var)(piece))
You forgot to pass the actual data to numcolwise. And then there's the little ddply trick that each piece of the data is called piece internally. (Hadley points out in the comments that this shouldn't be relied upon, as it may change in future versions of plyr.)
Summarize a dataframe by groups using data.table
library(data.table)
groups <- data.table(groups, key="Group")
DT <- data.table(df)
groups[, rowMeans(DT[, Class, with=FALSE]), by=Group][, setnames(as.data.table(matrix(V1, ncol=length(unique(Group)))), unique(Group))]
G1 G2
1: -0.13052091 -0.3667552
2: 1.17178729 -0.5496347
3: 0.23115841 0.8317714
4: 0.45209516 -1.2180895
5: -0.01861638 -0.4174929
6: -0.43156831 0.9008427
7: -0.64026238 0.1854066
8: 0.56225108 -0.3563087
9: -2.00405840 -0.4680040
10: 0.57608055 -0.6177605
# Also, make sure you have characters, not factors:
groups[, Class := as.character(Class)]
groups[, Group := as.character(Group)]
Simple base R:
tapply(groups$Class, groups$Group, function(X) rowMeans(df[, X]))
Using sapply:
sapply(unique(groups$Group), function(X)
rowMeans(df[, groups[groups$Group==X, "Class"]]) )
Two-way table with mean of a third variable in R
It seems you want an Excel-like pivot table. The package pivottabler helps a lot here. It generates nice HTML tables too (apart from displaying results).
library(pivottabler)
qpvt(df, "Country", "Stars", "mean(Price)")
2 3 4 5 Total
Canada 453 786 499.5 687 585
China 445.5 234 1200 987 662.4
Russia 560.5 673 598
Total 448 543.666666666667 709 837 614.0625
For formatting, use the format argument:
qpvt(df, "Country", "Stars", "mean(Price)", format = "%.2f")
2 3 4 5 Total
Canada 453.00 786.00 499.50 687.00 585.00
China 445.50 234.00 1200.00 987.00 662.40
Russia 560.50 673.00 598.00
Total 448.00 543.67 709.00 837.00 614.06
For HTML output, use qhpvt instead.
qhpvt(df, "Country", "Stars", "mean(Price)")
Output: (rendered HTML table not shown here)
Note: tidyverse and base R methods are also possible and are easy too.
Dplyr calculate mean and variance without all the data
You can manually calculate the mean and variance from the formulas, once you have count_plot computed. Variance is computed as sum((x - mean(x))^2) / (length(x) - 1):
df3 %>%
left_join(site_count) %>%
group_by(strata) %>%
summarise(N = unique(count_plot),
mcount = sum(observed)/N,
varcount = sum((observed - mcount)^2, (N - n())*mcount^2)/(N - 1)) %>%
select(-N)
# # A tibble: 3 x 3
# strata mcount varcount
# <dbl> <dbl> <dbl>
# 1 10.0 1.89 0.861
# 2 20.0 1.33 1.07
# 3 30.0 2.40 2.30
Which matches df2
df2
# A tibble: 3 x 3
strata mcount varcount
<dbl> <dbl> <dbl>
1 10.0 1.89 0.861
2 20.0 1.33 1.07
3 30.0 2.40 2.30
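The varcount line folds the unobserved zero-count plots back into the variance without materialising them: each omitted plot contributes (0 - mcount)^2 = mcount^2. A quick numerical check (shown in Python with made-up numbers, since the identity is language-agnostic) confirms the shortcut matches the ordinary n - 1 variance of the full vector with the zeros written out:

```python
# Hypothetical data: 5 plots had nonzero counts, N = 9 plots were surveyed in total.
x = [3, 1, 2, 4, 2]  # observed nonzero counts (the rows present in the data)
N = 9                # count_plot: total plots, so N - len(x) plots had count 0

n = len(x)
m = sum(x) / N  # mean over all N plots, zeros included

# The shortcut from the dplyr code: add (N - n) * m^2 for the omitted zeros
var_short = (sum((xi - m) ** 2 for xi in x) + (N - n) * m ** 2) / (N - 1)

# The same variance computed from the full vector with the zeros materialised
full = x + [0] * (N - n)
var_full = sum((v - sum(full) / N) ** 2 for v in full) / (N - 1)

print(var_short, var_full)  # identical
```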
How can I calculate the variance of a list in python?
You can use numpy's built-in function var:
import numpy as np
results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]
print(np.var(results))
This gives you 28.822364260579157
If, for whatever reason, you cannot use numpy and/or you don't want to use a built-in function for it, you can also calculate it "by hand" using e.g. a list comprehension:
# calculate mean
m = sum(results) / len(results)
# calculate variance using a list comprehension
var_res = sum((xi - m) ** 2 for xi in results) / len(results)
which gives you the identical result.
If you are interested in the standard deviation, you can use numpy.std:
print(np.std(results))
5.36864640860051
@Serge Ballesta explained very well the difference between dividing the variance by n and by n - 1. In numpy you can easily set this parameter using the option ddof; its default is 0, so for the n - 1 case you can simply do:
np.var(results, ddof=1)
The "by hand" solution is given in @Serge Ballesta's answer. Both approaches yield 32.024849178421285.
You can also set this parameter for std:
np.std(results, ddof=1)
5.659050201086865
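If you'd rather stay in the standard library, Python's statistics module exposes both conventions directly; a small sketch (pvariance/pstdev divide by n, variance/stdev by n - 1):

```python
import statistics

results = [-14.82381293, -0.29423447, -13.56067979, -1.6288903, -0.31632439,
           0.53459687, -1.34069996, -1.61042692, -4.03220519, -0.24332097]

print(statistics.pvariance(results))  # population variance (n), like np.var's default
print(statistics.variance(results))   # sample variance (n - 1), like np.var(..., ddof=1)
print(statistics.stdev(results))      # sample std (n - 1), like np.std(..., ddof=1)
```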
Filter out data.table columns based on summary statistics
These work:
dt[, .SD, .SDcols=unlist(mask)]
dt[, .SD, .SDcols=which(unlist(mask))]
All together now:
variance.filter = function(df) {
  mask = df[, lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]
  df[, .SD, .SDcols = unlist(mask)]
}
EDIT: In the current development version of data.table (1.12.9), .SDcols accepts a function to filter columns, so this would work:
variance.filter = function(df) {
df[ , .SD, .SDcols = function(x) sd(x, na.rm = TRUE) > 1]
}
Pandas Data Frame Summary Table
Seems like you may get use out of DataFrame.agg(), with which you can essentially build a customized .describe() output. Here's an example to get you started:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'object': ['a', 'b', 'c'],
'numeric': [1, 2, 3],
'numeric2': [1.1, 2.5, 50.],
'categorical': pd.Categorical(['d','e','f'])
})
def nullcounts(ser):
return ser.isnull().sum()
def custom_describe(frame, func=[nullcounts, 'sum', 'mean', 'median', 'max'],
numeric_only=True, **kwargs):
if numeric_only:
frame = frame.select_dtypes(include=np.number)
return frame.agg(func, **kwargs)
custom_describe(df)
numeric numeric2
nullcounts 0.0 0.000000
sum 6.0 53.600000
mean 2.0 17.866667
median 2.0 2.500000
max 3.0 50.000000
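If you need different statistics per column, agg also accepts a dict mapping column names to lists of functions. A minimal sketch on a hypothetical two-column frame (cells without a requested stat come back as NaN):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'numeric': [1, 2, 3], 'numeric2': [1.1, 2.5, 50.]})

# Different stats per column; the result's index is the union of all requested stats
out = df.agg({'numeric': ['mean', 'max'], 'numeric2': ['mean', 'var']})
print(out)
```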