Calculating statistics on subsets of data
Use the base R function ave()
, which despite its confusing name, can calculate a variety of statistics, including the mean
:
within(mydata, mean<-ave(measure, subject, FUN=mean))
subject time measure mean
1 1 0 10 10.000000
2 1 1 12 10.000000
3 1 2 8 10.000000
4 2 0 7 2.333333
5 2 1 0 2.333333
6 2 2 0 2.333333
Note that I use within
just for the sake of shorter code. Here is the equivalent without within()
:
mydata$mean <- ave(mydata$measure, mydata$subject, FUN=mean)
mydata
subject time measure mean
1 1 0 10 10.000000
2 1 1 12 10.000000
3 1 2 8 10.000000
4 2 0 7 2.333333
5 2 1 0 2.333333
6 2 2 0 2.333333
Subset data.frame in R and calculate summary statistics, recombine with zero row subsets
If I understood correctly, here is a tidyverse solution
library(tidyverse)
data.test <-
data.frame(
age = c( 1, 1.4, 1, 1.3, 4, 3.2, 3.3, 3.5, 4.3, 3.4, 4.6, 4.7, 0.4, 0.1, 0.3 ),
class.type = c( "A","B","B","A","B","A","B","B","B","A","B","B","B","B","A" ))
data.test %>%
#age.group = age as integer
mutate(age.group = as.integer(age)) %>%
#count = the number of observations in each age.group
add_count(age.group,name = "count") %>%
#Grouping by age.group
group_by(age.group) %>%
mutate(
#a.count = the number of observations with type A in each age.group
a.count = sum(class.type == "A"),
#b.count = the number of observations with type B in each age.group
b.count = sum(class.type == "B"),
#a.ratio = a.count/count
a.ratio = a.count/count
)
calculate statistics and/or a function on multiple columns in subsets of a dataframe in R
I am sure the dplyr
solutions are coming, but the doBy
package works very well for this kind of thing, unless you have many (millions+) rows, in which case it will be slow.
library(doBy)
df_avg <- summaryBy(. ~ Date + Shift, FUN=c(mean, median, sd), data=df, na.rm=TRUE)
Will give a dataframe with V1.mean
, V1.median
, and so on.
The . ~
means "summarize all numeric variables". If you want to keep information from some factors in the dataframe, use the argument id.vars = ~somefac+somefac2
, for example.
Calculate statistics on subset of a dataframe based on values in dataframe (latitude and longitude)
- Cartesian product to itself to get all combinations. This will be expensive on larger datasets. This generates N^2 rows, so in this case 25 rows
- calculate distance on each of these combinations
- filter
query()
to distances required groupby()
to get total number of people. Also generate alist
of indexes included in total for helping with transparency- finally
join()
this back together and you have what you want
import geopy.distance as gd
df = pd.DataFrame({'latitude': [40.991919 , 40.992001 , 40.991602, 40.989903, 40.987759],
'longitude': [-106.049469, -106.048812, -106.048904, -106.049907, -106.048840],
'people': [1,2,3,4,5]})
df = df.join((df.reset_index().assign(foo=1).merge(df.reset_index().assign(foo=1), on="foo")
.assign(distance=lambda dfa: dfa.apply(lambda r: gd.distance((r.latitude_x,r.longitude_x),
(r.latitude_y,r.longitude_y)).miles, axis=1))
.query("distance<=0.05")
.rename(columns={"people_y":"nearby"})
.groupby("index_x").agg({"nearby":"sum","index_y":lambda x: list(x)})
))
print(df.to_markdown())
latitude | longitude | people | nearby | index_y | |
---|---|---|---|---|---|
0 | 40.9919 | -106.049 | 1 | 6 | [0, 1, 2] |
1 | 40.992 | -106.049 | 2 | 6 | [0, 1, 2] |
2 | 40.9916 | -106.049 | 3 | 6 | [0, 1, 2] |
3 | 40.9899 | -106.05 | 4 | 4 | [3] |
4 | 40.9878 | -106.049 | 5 | 5 | [4] |
statistics on subsets of a pandas dataframe
Like this? It calculated the mean and standard deviation on each ID grouping from A.
df.groupby('A').agg({'B': ['mean', 'std']})
B
mean std
A
1 3.0 NaN
2 3.5 3.535534
4 6.0 NaN
6 5.5 2.121320
7 1.5 0.707107
8 2.5 0.707107
R: T-statistics with subsets
You may easily loop over a subset=
.
t(with(mtcars, sapply(unique(cyl), \(i) t.test(am, subset=cyl == i))))
# statistic parameter p.value conf.int estimate null.value stderr alternative method data.name
# [1,] 4.605489 31 6.632258e-05 numeric,2 0.40625 0 0.08820997 "two.sided" "One Sample t-test" "am"
# [2,] 4.605489 31 6.632258e-05 numeric,2 0.40625 0 0.08820997 "two.sided" "One Sample t-test" "am"
# [3,] 4.605489 31 6.632258e-05 numeric,2 0.40625 0 0.08820997 "two.sided" "One Sample t-test" "am"
More specific for your data you could do this:
tcols <- c('Return', 'Age', 'Size')
r <- t(with(subset(dat, Rating %in% c(1, 5)),
sapply(setNames(tcols, tcols), \(i) unlist(
t.test(reformulate('Rating', i))[
c('estimate', 'statistic', 'p.value')]
))))
cbind(as.data.frame(r),
' '=c(" ", "* ", "** ", "***")[
rowSums(outer(r[, 'p.value'], c(Inf, 0.05, 0.01, 0.001), `<`))])
# estimate.mean in group 1 estimate.mean in group 5 statistic.t p.value
# Return 4.666667e-02 0.05 -0.1552301 0.8883096
# Age 5.333333e+00 5.60 -0.2198599 0.8353634
# Size 7.188400e+04 19724.60 4.0457818 0.0109848 *
Note R >= 4.1 used.
Edit
as.data.frame(t(with(subset(dat, Rating %in% c(1, 5)),
sapply(setNames(tcols, tcols), \(i) unlist(
t.test(reformulate('Rating', i))[
c('estimate', 'statistic', 'p.value')]
))))) |>
{\(.) cbind(mean.diff.5.1=apply(.[1:2], 1, diff), .[3:4])}() |>
cbind(' '=c(" ", "* ", "** ", "***")[
rowSums(outer(r[, 'p.value'], c(Inf, 0.05, 0.01, 0.001), `<`))],
`colnames<-`(t(aggregate(. ~ Rating, dat[3:6], mean)[-1]),
paste0('mean.rating.', 1:5))) |>
{\(.) .[c(5:9, 1:4)]}()
# mean.rating.1 mean.rating.2 mean.rating.3 mean.rating.4 mean.rating.5 mean.diff.5.1 statistic.t p.value
# Return 4.666667e-02 0.07 0.01 0.0525 0.05 3.333333e-03 -0.1552301 0.8883096
# Age 5.333333e+00 7.00 9.00 3.2500 5.60 2.666667e-01 -0.2198599 0.8353634
# Size 7.188400e+04 94567.00 50243.00 5201.7500 19724.60 -5.215940e+04 4.0457818 0.0109848 *
Related Topics
Create Group Number For Contiguous Runs of Equal Values
How to Unload a Package Without Restarting R
What Does %≫% Function Mean in R
Dplyr Mutate/Replace Several Columns on a Subset of Rows
Subscript Out of Bounds - General Definition and Solution
Ggplot2 - Jitter and Position Dodge Together
Simplest Way to Do Grouped Barplot
How to Subtract Months from a Date in R
How to Convert Dataframe into Time Series
Remove Parentheses and Text Within from Strings in R
How to Subtract/Add Days From/To a Date
Why Do I Get "Warning Longer Object Length Is Not a Multiple of Shorter Object Length"
Format Numbers With Million (M) and Billion (B) Suffixes
Index Values from a Matrix Using Row, Col Indices
Count Nas Per Row in Dataframe