Group by Using Base R

Group by using base R

Here's another base R solution, using by:

do.call(rbind, by(df, df[, 1:3],
  function(x) cbind(x[1, 1:3], sum(x$sales), mean(x$units))))

Or using the "split/apply/combine" strategy:

t(sapply(split(df, df[, 1:3], drop = TRUE),
  function(x) c(sumSales = sum(x$sales), meanUnits = mean(x$units))))

Or, similarly:

do.call(rbind, lapply(split(df, df[, 1:3], drop = TRUE),
  function(x) c(sumSales = sum(x$sales), meanUnits = mean(x$units))))

Edit: it seems df is actually of class data.table (although you asked for a base R solution only); here's how you would do it with your data.table object:

df[, .(sumSales = sum(sales), meanUnits = mean(units)), keyby = .(year, quarter, Channel)]
#     year quarter Channel sumSales meanUnits
#  1: 2013      Q1     AAA     4855      15.0
#  2: 2013      Q1     BBB     2231      12.0
#  3: 2013      Q2     AAA     4004      17.5
#  4: 2013      Q2     BBB     2057      23.0
#  5: 2013      Q3     AAA     2558      21.0
#  6: 2013      Q3     BBB     4807      21.0
#  7: 2013      Q4     AAA     4291      12.0
#  8: 2013      Q4     BBB     1128      25.0
#  9: 2014      Q1     AAA     2169      23.0
# 10: 2014      Q1     CCC     3912      16.5
# 11: 2014      Q2     AAA     2613      21.0
# 12: 2014      Q2     BBB     1533      11.0
# 13: 2014      Q2     CCC     2114      23.0
# 14: 2014      Q3     BBB     5219      13.0
# 15: 2014      Q3     CCC     1614      15.0
# 16: 2014      Q4     AAA     2695      14.0
# 17: 2014      Q4     BBB     4177      15.0
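For completeness, the same two summaries can also be produced in base R by merging two aggregate calls (a sketch over the same df; the renames just match the data.table column names above):

sums <- aggregate(sales ~ year + quarter + Channel, df, sum)
means <- aggregate(units ~ year + quarter + Channel, df, mean)
names(sums)[4] <- "sumSales"    # aggregate names the column after the LHS variable
names(means)[4] <- "meanUnits"
merge(sums, means)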

Base R instead of dplyr: group and summarise the data?

One way to do this is with aggregate, which I think is the most straightforward base method. You can use other functions as well, but this one is the easiest to follow:

aggregate(Sport ~ Sex + Season, data = data,
  FUN = function(x) length(unique(x)))
  Sex Season Sport
1   F Summer    40
2   M Summer    49
3   F Winter    14
4   M Winter    17
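The same count of distinct sports per Sex/Season cell can also be had from tapply, which returns a Sex-by-Season matrix rather than a long data.frame (a sketch over the same data object):

tapply(data$Sport, list(data$Sex, data$Season),
  FUN = function(x) length(unique(x)))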

What is the Base R equivalent of this dplyr group_by code?

We could use proportions on the table output after subsetting to remove the NA rows (complete.cases) and selecting the relevant columns.

The data is from the forcats package, so load the package and get the data:

library(forcats)
data(gss_cat)

Use table/proportions as mentioned above; the second argument (margin = 1) computes the proportions within each age, i.e. row-wise:

by_age2_base <- proportions(table(subset(gss_cat, complete.cases(age),
  select = c(age, marital))), 1)

-output

head(by_age2_base, 3)
    marital
age    No answer Never married   Separated    Divorced     Widowed     Married
  18 0.000000000   0.978021978 0.000000000 0.000000000 0.000000000 0.021978022
  19 0.000000000   0.939759036 0.000000000 0.012048193 0.004016064 0.044176707
  20 0.000000000   0.904382470 0.003984064 0.007968127 0.000000000 0.083665339

-compare with the OP's output

head(by_age2, 3)
# A tibble: 3 x 4
# Groups:   age [2]
    age marital           n   prop
  <int> <fct>         <int>  <dbl>
1    18 Never married    89 0.978
2    18 Married           2 0.0220
3    19 Never married   234 0.940
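To see what the margin argument of proportions does in isolation, here is a tiny self-contained illustration with toy values (not the gss_cat data); margin = 1 normalises each row to sum to 1, which is what makes the result comparable to the OP's per-age prop column:

m <- table(c("a", "a", "b"), c("x", "y", "y"))
proportions(m, margin = 1)
#     x   y
# a 0.5 0.5
# b 0.0 1.0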

If we need the output in 'long' format, convert the table to a data.frame with as.data.frame:

by_age2_base_long <- subset(as.data.frame(by_age2_base), Freq > 0)

Or another option is aggregate/ave (requires R >= 4.1.0 for the native pipe |> and the lambda shorthand \(x)):

subset(gss_cat, complete.cases(age), select = c(age, marital)) |>
  {\(dat) aggregate(cbind(n = age) ~ age + marital,
    data = dat, FUN = length)}() |>
  transform(prop = ave(n, age, FUN = \(x) x/sum(x)))
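For R versions earlier than 4.1.0 (no native pipe |> or lambda shorthand), the same pipeline can be written step by step:

dat <- subset(gss_cat, complete.cases(age), select = c(age, marital))
out <- aggregate(cbind(n = age) ~ age + marital, data = dat, FUN = length)
out$prop <- ave(out$n, out$age, FUN = function(x) x/sum(x))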

How can I group variables in R when dplyr and base R functions don't work?

If you are just looking for the unique rows of MUN_RESID and V16, you can use the duplicated function:

months0606[!duplicated(months0606[, c("MUN_RESID", "V16")]), ]
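A closely related idiom is unique, though note the difference: it returns only the two key columns (one row per combination), whereas the duplicated subset above keeps every column of the first row in each group.

unique(months0606[, c("MUN_RESID", "V16")])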

Since you are dealing with a large data set, you could also consider data.table, but you need to decide what operation to perform within each group. I took the means; in your example that matches the duplicated result, but it wouldn't if there were differences in either of the _P variables.

library(data.table)
months0606 <- data.table(months0606)
months0606[, .(
  X08.2005_P = mean(X08.2005_P),
  X09.2005_P = mean(X09.2005_P)
), by = c("MUN_RESID", "V16")]
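If there are many value columns, data.table's .SD avoids typing each one out; a sketch, assuming a reasonably recent data.table (>= 1.12) and that every column ending in _P should be averaged:

months0606[, lapply(.SD, mean),
  by = c("MUN_RESID", "V16"),
  .SDcols = patterns("_P$")]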

Using group_by/summarise or group_by/mutate in Base R

In base R, an option is by (ineq here comes from the ineq package):

library(ineq)
by(test, test[c('ctr_n', 'yr', 'mn', 'pty')],
   FUN = function(x) ineq(x$vote.shares, NULL, type = "Gini", na.rm = TRUE))

Or another option is split:

out <- do.call(rbind, lapply(split(test, test[c('ctr_n', 'yr', 'mn', 'pty')],
  drop = TRUE), function(x) data.frame(x[1, ],
  giniI = ineq(x$vote.shares, NULL, type = "Gini", na.rm = TRUE))))
row.names(out) <- NULL
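As a quick sanity check of the ineq() call itself (the ineq package must be installed), the Gini coefficient of a small toy vector:

library(ineq)
ineq(c(10, 20, 70), type = "Gini")
# [1] 0.4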

[base R] Add T/F column for whether it's the minimum value for each group


With ave, compute each item's minimum weight and compare it elementwise against the weight column:

df1 <- transform(df, cheapest = ave(weight, item, FUN = min) == weight)
df1
    item weight cheapest
1  apple    700    FALSE
2  apple    500     TRUE
3 orange    500    FALSE
4  peach    200     TRUE
5  apple    900    FALSE
6 orange    200     TRUE
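The trick is that ave() returns a vector as long as its input, with each group's statistic repeated on every row of that group, so it can be compared elementwise against weight. A minimal sketch of that broadcast:

ave(c(700, 500, 900), c("a", "a", "a"), FUN = min)
# [1] 500 500 500

Note that if an item's minimum weight is tied, every tied row is flagged TRUE.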

Running multiple T-Test on variables with groupings in R (not using rstatix)

The error relates to the number of observations in 'Grouping': at least one group has only a single observation, and t.test needs at least two per group. With base R, we can guard against that case as follows:

lapply(split(df, df$Item), function(x) if(any(table(x$Grouping) < 2))
  NA else t.test(Cost ~ Grouping, data = x))

-output

$`Book A`

Welch Two Sample t-test

data: Cost by Grouping
t = -1.3416, df = 1.4706, p-value = 0.3499
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-8.418523 5.418523
sample estimates:
mean in group A mean in group B
            6.5             8.0


$`Book B`
[1] NA

$`Book C`

Welch Two Sample t-test

data: Cost by Grouping
t = 1.3868, df = 1.8989, p-value = 0.3059
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-5.666332 10.666332
sample estimates:
mean in group A mean in group B
            5.5             3.0


$`Book D`

Welch Two Sample t-test

data: Cost by Grouping
t = -0.42857, df = 1, p-value = 0.7422
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-45.97172 42.97172
sample estimates:
mean in group A mean in group B
            4.0             5.5

Or, extracting just the p-value:

stack(lapply(split(df, df$Item), function(x) if(any(table(x$Grouping) < 2))
  NA else t.test(Cost ~ Grouping, data = x)$p.value))[2:1]
     ind    values
1 Book A 0.3498856
2 Book B        NA
3 Book C 0.3058987
4 Book D 0.7422379
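vapply produces the same p-values as a named vector with an explicit type check, if you prefer that over stack (a sketch equivalent to the above):

vapply(split(df, df$Item), function(x) if(any(table(x$Grouping) < 2))
  NA_real_ else t.test(Cost ~ Grouping, data = x)$p.value,
  numeric(1))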

The same approach can be done with dplyr:

library(dplyr)
df %>%
  add_count(Item, Grouping) %>%
  group_by(Item) %>%
  summarise(out = list(if(any(n < 2)) NA else t.test(Cost ~ Grouping)))

-output

# A tibble: 4 × 2
  Item   out
  <fct>  <list>
1 Book A <htest>
2 Book B <lgl [1]>
3 Book C <htest>
4 Book D <htest>

If only the p-value is needed:

df %>%
  add_count(Item, Grouping) %>%
  group_by(Item) %>%
  summarise(out = if(any(n < 2)) NA_real_ else t.test(Cost ~ Grouping)$p.value)
# A tibble: 4 × 2
  Item      out
  <fct>   <dbl>
1 Book A  0.350
2 Book B NA
3 Book C  0.306
4 Book D  0.742

Remove groups with only one individual in R without using dplyr package

One option is with tidyverse: after grouping by 'group', filter the rows where the number of distinct (n_distinct) elements in 'individualID' is greater than 1.

library(dplyr)
df1 %>%
  group_by(group) %>%
  filter(n_distinct(individualID) > 1) %>%
  ungroup
# A tibble: 8 × 3
  group individualID     X
  <dbl>        <dbl> <int>
1     1            1     0
2     1            1     0
3     1            2     1
4     1            2     1
5     3            5     0
6     3            5     0
7     3            6     1
8     3            6     0

Or with subset and ave from base R:

subset(df1, ave(individualID, group, FUN = function(x) length(unique(x))) > 1)
   group individualID X
1      1            1 0
2      1            1 0
3      1            2 1
4      1            2 1
7      3            5 0
8      3            5 0
9      3            6 1
10     3            6 0
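A closely related base idiom computes the distinct counts per group first and then subsets (a sketch over the same columns):

keep <- tapply(df1$individualID, df1$group,
  function(x) length(unique(x)) > 1)
df1[df1$group %in% names(keep)[keep], ]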

How to sum a variable by group

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

Or (embedding @thelatemail's comment), aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum)

Or, if you want to aggregate multiple columns, you could use the . notation (it works for one column too):

aggregate(. ~ Category, x, sum)

Or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third
    30      5     34
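tapply returns a named vector; if a data.frame is needed instead, the names can be folded back in:

res <- tapply(x$Frequency, x$Category, FUN = sum)
data.frame(Category = names(res), Frequency = as.vector(res))
#   Category Frequency
# 1    First        30
# 2   Second         5
# 3    Third        34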

Using this data:

x <- data.frame(Category = factor(c("First", "First", "First", "Second",
                                    "Third", "Third", "Second")),
                Frequency = c(10, 15, 5, 2, 14, 20, 3))

