Calculating Statistics on Subsets of Data

Calculating statistics on subsets of data

Use the base R function ave(), which despite its confusing name, can calculate a variety of statistics, including the mean:

within(mydata, mean<-ave(measure, subject, FUN=mean))

subject time measure mean
1 1 0 10 10.000000
2 1 1 12 10.000000
3 1 2 8 10.000000
4 2 0 7 2.333333
5 2 1 0 2.333333
6 2 2 0 2.333333

Note that I use within just for the sake of shorter code. Here is the equivalent without within():

mydata$mean <- ave(mydata$measure, mydata$subject, FUN=mean)
mydata
subject time measure mean
1 1 0 10 10.000000
2 1 1 12 10.000000
3 1 2 8 10.000000
4 2 0 7 2.333333
5 2 1 0 2.333333
6 2 2 0 2.333333

Subset data.frame in R and calculate summary statistics, recombine with zero row subsets

If I understood correctly, here is a tidyverse solution

library(tidyverse)

data.test <-
data.frame(
age = c( 1, 1.4, 1, 1.3, 4, 3.2, 3.3, 3.5, 4.3, 3.4, 4.6, 4.7, 0.4, 0.1, 0.3 ),
class.type = c( "A","B","B","A","B","A","B","B","B","A","B","B","B","B","A" ))

data.test %>%
#age.group = age as integer
mutate(age.group = as.integer(age)) %>%
#count = the number of observations in each age.group
add_count(age.group,name = "count") %>%
#Grouping by age.group
group_by(age.group) %>%
mutate(
#a.count = the number of observations with type A in each age.group
a.count = sum(class.type == "A"),
#b.count = the number of observations with type B in each age.group
b.count = sum(class.type == "B"),
#a.ratio = a.count/count
a.ratio = a.count/count
)

Sample Image

calculate statistics and/or a function on multiple columns in subsets of a dataframe in R

I am sure the dplyr solutions are coming, but the doBy package works very well for this kind of thing, unless you have many (millions+) rows, in which case it will be slow.

library(doBy)
df_avg <- summaryBy(. ~ Date + Shift, FUN=c(mean, median, sd), data=df, na.rm=TRUE)

Will give a dataframe with V1.mean, V1.median, and so on.
The . ~ means "summarize all numeric variables". If you want to keep information from some factors in the dataframe, use the argument id.vars = ~somefac+somefac2, for example.

Calculate statistics on subset of a dataframe based on values in dataframe (latitude and longitude)

  • Cartesian product to itself to get all combinations. This will be expensive on larger datasets. This generates N^2 rows, so in this case 25 rows
  • calculate distance on each of these combinations
  • filter query() to distances required
  • groupby() to get total number of people. Also generate a list of indexes included in total for helping with transparency
  • finally join() this back together and you have what you want
import geopy.distance as gd

df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
'people': [1,2,3,4,5]})

df = df.join((df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
.assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_x,r.longitude_x),
(r.latitude_y,r.longitude_y)).miles, axis=1))
.query("distance<=0.05")
.rename(columns={"people_y":"nearby"})
.groupby("index_x").agg({"nearby":"sum","index_y":lambda x: list(x)})
))

print(df.to_markdown())






















































latitudelongitudepeoplenearbyindex_y
040.9919-106.04916[0, 1, 2]
140.992-106.04926[0, 1, 2]
240.9916-106.04936[0, 1, 2]
340.9899-106.0544[3]
440.9878-106.04955[4]

statistics on subsets of a pandas dataframe

Like this? It calculated the mean and standard deviation on each ID grouping from A.

df.groupby('A').agg({'B': ['mean', 'std']})
B
mean std
A
1 3.0 NaN
2 3.5 3.535534
4 6.0 NaN
6 5.5 2.121320
7 1.5 0.707107
8 2.5 0.707107

R: T-statistics with subsets

You may easily loop over a subset=.

t(with(mtcars, sapply(unique(cyl), \(i) t.test(am, subset=cyl == i))))
# statistic parameter p.value conf.int estimate null.value stderr alternative method data.name
# [1,] 4.605489 31 6.632258e-05 numeric,2 0.40625 0 0.08820997 "two.sided" "One Sample t-test" "am"
# [2,] 4.605489 31 6.632258e-05 numeric,2 0.40625 0 0.08820997 "two.sided" "One Sample t-test" "am"
# [3,] 4.605489 31 6.632258e-05 numeric,2 0.40625 0 0.08820997 "two.sided" "One Sample t-test" "am"

More specific for your data you could do this:

tcols <- c('Return', 'Age', 'Size')
r <- t(with(subset(dat, Rating %in% c(1, 5)),
sapply(setNames(tcols, tcols), \(i) unlist(
t.test(reformulate('Rating', i))[
c('estimate', 'statistic', 'p.value')]
))))
cbind(as.data.frame(r),
' '=c(" ", "* ", "** ", "***")[
rowSums(outer(r[, 'p.value'], c(Inf, 0.05, 0.01, 0.001), `<`))])
# estimate.mean in group 1 estimate.mean in group 5 statistic.t p.value
# Return 4.666667e-02 0.05 -0.1552301 0.8883096
# Age 5.333333e+00 5.60 -0.2198599 0.8353634
# Size 7.188400e+04 19724.60 4.0457818 0.0109848 *

Note R >= 4.1 used.

Edit

as.data.frame(t(with(subset(dat, Rating %in% c(1, 5)), 
sapply(setNames(tcols, tcols), \(i) unlist(
t.test(reformulate('Rating', i))[
c('estimate', 'statistic', 'p.value')]
))))) |>
{\(.) cbind(mean.diff.5.1=apply(.[1:2], 1, diff), .[3:4])}() |>
cbind(' '=c(" ", "* ", "** ", "***")[
rowSums(outer(r[, 'p.value'], c(Inf, 0.05, 0.01, 0.001), `<`))],
`colnames<-`(t(aggregate(. ~ Rating, dat[3:6], mean)[-1]),
paste0('mean.rating.', 1:5))) |>
{\(.) .[c(5:9, 1:4)]}()
# mean.rating.1 mean.rating.2 mean.rating.3 mean.rating.4 mean.rating.5 mean.diff.5.1 statistic.t p.value
# Return 4.666667e-02 0.07 0.01 0.0525 0.05 3.333333e-03 -0.1552301 0.8883096
# Age 5.333333e+00 7.00 9.00 3.2500 5.60 2.666667e-01 -0.2198599 0.8353634
# Size 7.188400e+04 94567.00 50243.00 5201.7500 19724.60 -5.215940e+04 4.0457818 0.0109848 *


Related Topics



Leave a reply



Submit