Group by Using Base R

Group by using base R

Here's another base R solution, using by:

do.call(rbind, by(df, df[, 1:3],
  function(x) cbind(x[1, 1:3], sum(x$sales), mean(x$units))))

Or using the "split/apply/combine" strategy:

t(sapply(split(df, df[, 1:3], drop = TRUE),
  function(x) c(sumSales = sum(x$sales), meanUnits = mean(x$units))))

Or, similarly:

do.call(rbind, lapply(split(df, df[, 1:3], drop = TRUE),
  function(x) c(sumSales = sum(x$sales), meanUnits = mean(x$units))))

Edit: it seems df is actually of class data.table (although you asked for a base R solution only); here's how you would do it with your data.table object:

df[, .(sumSales = sum(sales), meanUnits = mean(units)), keyby = .(year, quarter, Channel)]
#     year quarter Channel sumSales meanUnits
#  1: 2013      Q1     AAA     4855      15.0
#  2: 2013      Q1     BBB     2231      12.0
#  3: 2013      Q2     AAA     4004      17.5
#  4: 2013      Q2     BBB     2057      23.0
#  5: 2013      Q3     AAA     2558      21.0
#  6: 2013      Q3     BBB     4807      21.0
#  7: 2013      Q4     AAA     4291      12.0
#  8: 2013      Q4     BBB     1128      25.0
#  9: 2014      Q1     AAA     2169      23.0
# 10: 2014      Q1     CCC     3912      16.5
# 11: 2014      Q2     AAA     2613      21.0
# 12: 2014      Q2     BBB     1533      11.0
# 13: 2014      Q2     CCC     2114      23.0
# 14: 2014      Q3     BBB     5219      13.0
# 15: 2014      Q3     CCC     1614      15.0
# 16: 2014      Q4     AAA     2695      14.0
# 17: 2014      Q4     BBB     4177      15.0
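For completeness, the same two summaries can also be produced in base R by merging two aggregate calls (a sketch over the same df; the renames just match the data.table column names above):

sums <- aggregate(sales ~ year + quarter + Channel, df, sum)
means <- aggregate(units ~ year + quarter + Channel, df, mean)
names(sums)[4] <- "sumSales"    # aggregate names the column after the LHS variable
names(means)[4] <- "meanUnits"
merge(sums, means)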

Base R instead of dplyr: group and summarise the data?

One way to do this is with aggregate, which I think is the most straightforward base method. You can use other functions as well, but this one is the easiest to follow:

aggregate(Sport ~ Sex + Season, data = data,
  FUN = function(x) length(unique(x)))
  Sex Season Sport
1   F Summer    40
2   M Summer    49
3   F Winter    14
4   M Winter    17
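The same count of distinct sports per Sex/Season cell can also be had from tapply, which returns a Sex-by-Season matrix rather than a long data.frame (a sketch over the same data object):

tapply(data$Sport, list(data$Sex, data$Season),
  FUN = function(x) length(unique(x)))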

What is the Base R equivalent of this dplyr group_by code?

We could use proportions on the table output after subsetting to remove the NA rows (complete.cases) and selecting the relevant columns.

The data is from the forcats package, so load the package and get the data:

library(forcats)
data(gss_cat)

Use table/proportions as mentioned above; the second argument (margin = 1) computes the proportions within each age, i.e. row-wise:

by_age2_base <- proportions(table(subset(gss_cat, complete.cases(age),
  select = c(age, marital))), 1)

-output

head(by_age2_base, 3)
    marital
age    No answer Never married   Separated    Divorced     Widowed     Married
  18 0.000000000   0.978021978 0.000000000 0.000000000 0.000000000 0.021978022
  19 0.000000000   0.939759036 0.000000000 0.012048193 0.004016064 0.044176707
  20 0.000000000   0.904382470 0.003984064 0.007968127 0.000000000 0.083665339

-compare with the OP's output

head(by_age2, 3)
# A tibble: 3 x 4
# Groups:   age [2]
    age marital           n   prop
  <int> <fct>         <int>  <dbl>
1    18 Never married    89 0.978
2    18 Married           2 0.0220
3    19 Never married   234 0.940
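To see what the margin argument of proportions does in isolation, here is a tiny self-contained illustration with toy values (not the gss_cat data); margin = 1 normalises each row to sum to 1, which is what makes the result comparable to the OP's per-age prop column:

m <- table(c("a", "a", "b"), c("x", "y", "y"))
proportions(m, margin = 1)
#     x   y
# a 0.5 0.5
# b 0.0 1.0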

If we need the output in 'long' format, convert the table to a data.frame with as.data.frame:

by_age2_base_long <- subset(as.data.frame(by_age2_base), Freq > 0)

Or another option is aggregate/ave (requires R >= 4.1.0 for the native pipe |> and the lambda shorthand \(x)):

subset(gss_cat, complete.cases(age), select = c(age, marital)) |>
  {\(dat) aggregate(cbind(n = age) ~ age + marital,
    data = dat, FUN = length)}() |>
  transform(prop = ave(n, age, FUN = \(x) x/sum(x)))
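For R versions earlier than 4.1.0 (no native pipe |> or lambda shorthand), the same pipeline can be written step by step:

dat <- subset(gss_cat, complete.cases(age), select = c(age, marital))
out <- aggregate(cbind(n = age) ~ age + marital, data = dat, FUN = length)
out$prop <- ave(out$n, out$age, FUN = function(x) x/sum(x))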

How can I group variables in R when dplyr and base R functions don't work?

If you are just looking for the unique rows of MUN_RESID and V16, you can use the duplicated function:

months0606[!duplicated(months0606[, c("MUN_RESID", "V16")]), ]
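A closely related idiom is unique, though note the difference: it returns only the two key columns (one row per combination), whereas the duplicated subset above keeps every column of the first row in each group.

unique(months0606[, c("MUN_RESID", "V16")])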

Since you are dealing with a large data set, you could also consider data.table, but you need to decide what operation to perform within each group. I took the means; in your example that matches the duplicated result, but it wouldn't if there were differences in either of the _P variables.

library(data.table)
months0606 <- data.table(months0606)
months0606[, .(
  X08.2005_P = mean(X08.2005_P),
  X09.2005_P = mean(X09.2005_P)
), by = c("MUN_RESID", "V16")]
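If there are many value columns, data.table's .SD avoids typing each one out; a sketch, assuming a reasonably recent data.table (>= 1.12) and that every column ending in _P should be averaged:

months0606[, lapply(.SD, mean),
  by = c("MUN_RESID", "V16"),
  .SDcols = patterns("_P$")]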

Using group_by/summarise or group_by/mutate in Base R

In base R, an option is by (ineq here comes from the ineq package):

library(ineq)
by(test, test[c('ctr_n', 'yr', 'mn', 'pty')],
   FUN = function(x) ineq(x$vote.shares, NULL, type = "Gini", na.rm = TRUE))

Or another option is split:

out <- do.call(rbind, lapply(split(test, test[c('ctr_n', 'yr', 'mn', 'pty')],
  drop = TRUE), function(x) data.frame(x[1, ],
  giniI = ineq(x$vote.shares, NULL, type = "Gini", na.rm = TRUE))))
row.names(out) <- NULL
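As a quick sanity check of the ineq() call itself (the ineq package must be installed), the Gini coefficient of a small toy vector:

library(ineq)
ineq(c(10, 20, 70), type = "Gini")
# [1] 0.4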

[base R] Add T/F column for whether it's the minimum value for each group


With ave, compute each item's minimum weight and compare it elementwise against the weight column:

df1 <- transform(df, cheapest = ave(weight, item, FUN = min) == weight)
df1
    item weight cheapest
1  apple    700    FALSE
2  apple    500     TRUE
3 orange    500    FALSE
4  peach    200     TRUE
5  apple    900    FALSE
6 orange    200     TRUE
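The trick is that ave() returns a vector as long as its input, with each group's statistic repeated on every row of that group, so it can be compared elementwise against weight. A minimal sketch of that broadcast:

ave(c(700, 500, 900), c("a", "a", "a"), FUN = min)
# [1] 500 500 500

Note that if an item's minimum weight is tied, every tied row is flagged TRUE.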

Running multiple T-Test on variables with groupings in R (not using rstatix)

The error relates to the number of observations in 'Grouping': at least one group has only a single observation, and t.test needs at least two per group. With base R, we can guard against that case as follows:

lapply(split(df, df$Item), function(x) if(any(table(x$Grouping) < 2))
  NA else t.test(Cost ~ Grouping, data = x))

-output

$`Book A`

Welch Two Sample t-test

data: Cost by Grouping
t = -1.3416, df = 1.4706, p-value = 0.3499
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-8.418523 5.418523
sample estimates:
mean in group A mean in group B
            6.5             8.0


$`Book B`
[1] NA

$`Book C`

Welch Two Sample t-test

data: Cost by Grouping
t = 1.3868, df = 1.8989, p-value = 0.3059
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-5.666332 10.666332
sample estimates:
mean in group A mean in group B
            5.5             3.0


$`Book D`

Welch Two Sample t-test

data: Cost by Grouping
t = -0.42857, df = 1, p-value = 0.7422
alternative hypothesis: true difference in means between group A and group B is not equal to 0
95 percent confidence interval:
-45.97172 42.97172
sample estimates:
mean in group A mean in group B
            4.0             5.5

Or, extracting just the p-value:

stack(lapply(split(df, df$Item), function(x) if(any(table(x$Grouping) < 2))
  NA else t.test(Cost ~ Grouping, data = x)$p.value))[2:1]
     ind    values
1 Book A 0.3498856
2 Book B        NA
3 Book C 0.3058987
4 Book D 0.7422379
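vapply produces the same p-values as a named vector with an explicit type check, if you prefer that over stack (a sketch equivalent to the above):

vapply(split(df, df$Item), function(x) if(any(table(x$Grouping) < 2))
  NA_real_ else t.test(Cost ~ Grouping, data = x)$p.value,
  numeric(1))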

The same approach can be done with dplyr:

library(dplyr)
df %>%
  add_count(Item, Grouping) %>%
  group_by(Item) %>%
  summarise(out = list(if(any(n < 2)) NA else t.test(Cost ~ Grouping)))

-output

# A tibble: 4 × 2
  Item   out
  <fct>  <list>
1 Book A <htest>
2 Book B <lgl [1]>
3 Book C <htest>
4 Book D <htest>

If only the p-value is needed:

df %>%
  add_count(Item, Grouping) %>%
  group_by(Item) %>%
  summarise(out = if(any(n < 2)) NA_real_ else t.test(Cost ~ Grouping)$p.value)
# A tibble: 4 × 2
  Item      out
  <fct>   <dbl>
1 Book A  0.350
2 Book B NA
3 Book C  0.306
4 Book D  0.742

Remove groups with only one individual in R without using dplyr package

One option is with tidyverse: after grouping by 'group', filter the rows where the number of distinct (n_distinct) elements in 'individualID' is greater than 1.

library(dplyr)
df1 %>%
  group_by(group) %>%
  filter(n_distinct(individualID) > 1) %>%
  ungroup
# A tibble: 8 × 3
  group individualID     X
  <dbl>        <dbl> <int>
1     1            1     0
2     1            1     0
3     1            2     1
4     1            2     1
5     3            5     0
6     3            5     0
7     3            6     1
8     3            6     0

Or with subset and ave from base R:

subset(df1, ave(individualID, group, FUN = function(x) length(unique(x))) > 1)
   group individualID X
1      1            1 0
2      1            1 0
3      1            2 1
4      1            2 1
7      3            5 0
8      3            5 0
9      3            6 1
10     3            6 0
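A closely related base idiom computes the distinct counts per group first and then subsets (a sketch over the same columns):

keep <- tapply(df1$individualID, df1$group,
  function(x) length(unique(x)) > 1)
df1[df1$group %in% names(keep)[keep], ]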

How to sum a variable by group

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be incorporated via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...

Or (embedding @thelatemail's comment), aggregate has a formula interface too:

aggregate(Frequency ~ Category, x, sum)

Or, if you want to aggregate multiple columns, you could use the . notation (it works for one column too):

aggregate(. ~ Category, x, sum)

Or tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third
    30      5     34
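tapply returns a named vector; if a data.frame is needed instead, the names can be folded back in:

res <- tapply(x$Frequency, x$Category, FUN = sum)
data.frame(Category = names(res), Frequency = as.vector(res))
#   Category Frequency
# 1    First        30
# 2   Second         5
# 3    Third        34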

Using this data:

x <- data.frame(Category = factor(c("First", "First", "First", "Second",
                                    "Third", "Third", "Second")),
                Frequency = c(10, 15, 5, 2, 14, 20, 3))

