Calculating Statistics on Subsets of Data

Calculating statistics on subsets of data

Use the base R function ave(), which despite its confusing name, can calculate a variety of statistics, including the mean:

within(mydata, mean<-ave(measure, subject, FUN=mean))

  subject time measure      mean
1       1    0      10 10.000000
2       1    1      12 10.000000
3       1    2       8 10.000000
4       2    0       7  2.333333
5       2    1       0  2.333333
6       2    2       0  2.333333

Note that I use within just for the sake of shorter code. Here is the equivalent without within():

mydata$mean <- ave(mydata$measure, mydata$subject, FUN=mean)
mydata
  subject time measure      mean
1       1    0      10 10.000000
2       1    1      12 10.000000
3       1    2       8 10.000000
4       2    0       7  2.333333
5       2    1       0  2.333333
6       2    2       0  2.333333

Subset data.frame in R and calculate summary statistics, recombine with zero row subsets

If I understood correctly, here is a tidyverse solution

library(tidyverse)

data.test <-
  data.frame(
    age = c( 1, 1.4, 1, 1.3, 4, 3.2, 3.3, 3.5, 4.3, 3.4, 4.6, 4.7, 0.4, 0.1, 0.3 ), 
    class.type = c( "A","B","B","A","B","A","B","B","B","A","B","B","B","B","A" ))

data.test %>% 
  #age.group  = age as integer
  mutate(age.group = as.integer(age)) %>% 
  #count = the number of observations in each age.group
  add_count(age.group,name = "count") %>% 
  #Grouping by age.group
  group_by(age.group) %>% 
  mutate(
    #a.count = the number of observations with type A in each age.group
    a.count = sum(class.type == "A"),
    #b.count = the number of observations with type B in each age.group
    b.count = sum(class.type == "B"),
    #a.ratio = a.count/count
    a.ratio = a.count/count
    )

Sample Image

calculate statistics and/or a function on multiple columns in subsets of a dataframe in R

I am sure the dplyr solutions are coming, but the doBy package works very well for this kind of thing, unless you have many (millions+) rows, in which case it will be slow.

library(doBy)
df_avg <- summaryBy(. ~ Date + Shift, FUN=c(mean, median, sd), data=df, na.rm=TRUE)

Will give a dataframe with V1.mean, V1.median, and so on.
The . ~ means "summarize all numeric variables". If you want to keep information from some factors in the dataframe, use the argument id.vars = ~somefac+somefac2, for example.

Calculate statistics on subset of a dataframe based on values in dataframe (latitude and longitude)

Cartesian product to itself to get all combinations. This will be expensive on larger datasets. This generates N^2 rows, so in this case 25 rows
calculate distance on each of these combinations
filter query() to distances required
groupby() to get total number of people. Also generate a list of indexes included in total for helping with transparency
finally join() this back together and you have what you want

import geopy.distance as gd

df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
                   'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
                   'people': [1,2,3,4,5]})

df = df.join((df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
 .assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_x,r.longitude_x),
                                                                 (r.latitude_y,r.longitude_y)).miles, axis=1))
 .query("distance<=0.05")
 .rename(columns={"people_y":"nearby"})
 .groupby("index_x").agg({"nearby":"sum","index_y":lambda x: list(x)})
))

print(df.to_markdown())

	latitude	longitude	people	nearby	index_y
0	40.9919	-106.049	1	6	[0, 1, 2]
1	40.992	-106.049	2	6	[0, 1, 2]
2	40.9916	-106.049	3	6	[0, 1, 2]
3	40.9899	-106.05	4	4	[3]
4	40.9878	-106.049	5	5	[4]

statistics on subsets of a pandas dataframe

Like this? It calculated the mean and standard deviation on each ID grouping from A.

df.groupby('A').agg({'B': ['mean', 'std']})
      B          
   mean       std
A                
1   3.0       NaN
2   3.5  3.535534
4   6.0       NaN
6   5.5  2.121320
7   1.5  0.707107
8   2.5  0.707107

R: T-statistics with subsets

You may easily loop over a subset=.

t(with(mtcars, sapply(unique(cyl), \(i) t.test(am, subset=cyl == i))))
#      statistic parameter p.value      conf.int  estimate null.value stderr     alternative method              data.name
# [1,] 4.605489  31        6.632258e-05 numeric,2 0.40625  0          0.08820997 "two.sided" "One Sample t-test" "am"     
# [2,] 4.605489  31        6.632258e-05 numeric,2 0.40625  0          0.08820997 "two.sided" "One Sample t-test" "am"     
# [3,] 4.605489  31        6.632258e-05 numeric,2 0.40625  0          0.08820997 "two.sided" "One Sample t-test" "am"

More specific for your data you could do this:

tcols <- c('Return', 'Age', 'Size')
r <- t(with(subset(dat, Rating %in% c(1, 5)), 
     sapply(setNames(tcols, tcols), \(i) unlist(
       t.test(reformulate('Rating', i))[
         c('estimate', 'statistic', 'p.value')]
       ))))
cbind(as.data.frame(r),
      ' '=c("   ", "*  ", "** ", "***")[
        rowSums(outer(r[, 'p.value'], c(Inf, 0.05, 0.01, 0.001), `<`))])
#        estimate.mean in group 1 estimate.mean in group 5 statistic.t   p.value    
# Return             4.666667e-02                     0.05  -0.1552301 0.8883096    
# Age                5.333333e+00                     5.60  -0.2198599 0.8353634    
# Size               7.188400e+04                 19724.60   4.0457818 0.0109848 *

Note R >= 4.1 used.

Edit

as.data.frame(t(with(subset(dat, Rating %in% c(1, 5)), 
       sapply(setNames(tcols, tcols), \(i) unlist(
         t.test(reformulate('Rating', i))[
           c('estimate', 'statistic', 'p.value')]
       ))))) |>
  {\(.) cbind(mean.diff.5.1=apply(.[1:2], 1, diff), .[3:4])}() |> 
  cbind(' '=c("   ", "*  ", "** ", "***")[
          rowSums(outer(r[, 'p.value'], c(Inf, 0.05, 0.01, 0.001), `<`))],
        `colnames<-`(t(aggregate(. ~ Rating, dat[3:6], mean)[-1]), 
                     paste0('mean.rating.', 1:5))) |>
  {\(.) .[c(5:9, 1:4)]}()
#        mean.rating.1 mean.rating.2 mean.rating.3 mean.rating.4 mean.rating.5 mean.diff.5.1 statistic.t   p.value    
# Return  4.666667e-02          0.07          0.01        0.0525          0.05  3.333333e-03  -0.1552301 0.8883096    
# Age     5.333333e+00          7.00          9.00        3.2500          5.60  2.666667e-01  -0.2198599 0.8353634    
# Size    7.188400e+04      94567.00      50243.00     5201.7500      19724.60 -5.215940e+04   4.0457818 0.0109848 *

Calculating Statistics on Subsets of Data