Summarize Different Columns with Different Functions

We can name each summary explicitly in a single summarise() call:

 df %>%
    group_by(ID) %>%
    summarise(Cash = sum(Cash), Price = sum(Price), Weight = max(Weight))

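If there are many columns, one way is to summarise each group of columns separately and then join the outputs. (summarise_each() and funs() have since been deprecated in dplyr; newer code would use across().)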

 df1 <- df %>% 
          group_by(ID) %>% 
          summarise_each(funs(sum), Cash:Price)
 df2 <- df %>%
          group_by(ID) %>% 
          summarise_each(funs(max), Weight)
 inner_join(df1, df2, by = "ID")
 #      ID  Cash Price Weight
 #   (int) (dbl) (dbl)  (int)
 #1     1   0.6   4.2     82
 #2     2   0.3   1.0     70

Summarize different Columns with different Functions with dplyr in r

You can use across() twice in the same summarise() call:

library(dplyr)

mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(.cols = c(mpg, disp, hp), .fns = sum), 
            across(.cols = c(drat:qsec), .fns = mean))

#    cyl   mpg  disp    hp  drat    wt  qsec
#  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1     4  293. 1156.   909  4.07  2.29  19.1
#2     6  138. 1283.   856  3.59  3.12  18.0
#3     8  211. 4943.  2929  3.23  4.00  16.8
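
When the same columns need more than one summary, across() also accepts a named list of functions, and its .names glue specification controls the resulting column names. A minimal sketch on mtcars, assuming dplyr 1.0.0 or later; the choice of columns and functions here is illustrative and not taken from the answer above:

```r
library(dplyr)

# Named list of functions: each one is applied to every selected column.
# "{.col}_{.fn}" builds names such as mpg_mean from the column and function name.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    across(c(mpg, hp), list(mean = mean, max = max), .names = "{.col}_{.fn}"),
    across(wt, median),
    .groups = "drop"
  )

names(res)
```

The result has one row per cyl value, with columns ordered column by column (mpg_mean, mpg_max, hp_mean, hp_max) followed by wt.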

Efficient summarise of different columns with different functions with dplyr

Here is the tidyverse approach, adapted from @MichaelDewar's answer. It is tidier, but I don't think it brings any real improvement in efficiency. Besides, a data frame with 100k rows is not that big a deal in my opinion, so I think a tidyverse solution is just fine.

library(dplyr)

data %>% 
  group_by(id, date) %>% 
  summarise(
    across(c(var1, var2), ~ paste(unique(.), collapse = " AND ")), 
    across(var3, ~ max(., na.rm = TRUE)), # lambda form; passing na.rm through across()'s ... is deprecated since dplyr 1.1.0
    count = n(), .groups = "keep"
  )
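
The data from the question is not shown here, so the following sketch runs the same pattern on a small made-up data frame. The column names follow the answer, but every value is hypothetical, and .groups = "drop" is used only to return an ungrouped result:

```r
library(dplyr)

# Hypothetical toy data; the real `data` from the question is not available.
data <- tibble(
  id   = c(1, 1, 1, 2),
  date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01")),
  var1 = c("a", "b", "a", "c"),
  var2 = c("x", "x", "y", "z"),
  var3 = c(3, NA, 5, 7)
)

res <- data %>%
  group_by(id, date) %>%
  summarise(
    # collapse the distinct values of each text column into one string
    across(c(var1, var2), ~ paste(unique(.x), collapse = " AND ")),
    # group maximum, ignoring missing values
    across(var3, ~ max(.x, na.rm = TRUE)),
    # number of rows in the group
    count = n(),
    .groups = "drop"
  )

res
```

For the first group (id 1 on 2020-01-01) this gives var1 = "a AND b", var2 = "x", var3 = 3 and count = 2.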

But if you really want to improve the efficiency, perhaps try this data.table solution:

library(data.table)

setDT(data)[, c(
  lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")), 
  list(var3 = max(var3, na.rm = T), count = .N)
), by = c("id", "date")]

Benchmark

set.seed(2020)
data2 <- data[sample.int(nrow(data), 1e5, T), ]
data22 <- data.table::copy(data2)

f1 <- 
  . %>% 
  group_by(id, date) %>%
  {data.frame(
    summarise_at(., vars(var1, var2), list(~ paste(unique(.), collapse = " AND "))), # return character string
    summarise_at(., vars(var3), list(~ max(., na.rm = T))), # return max in group
    summarise(., count = n(), .groups = "keep") # return count of cases in group
  )} %>% 
  select(-matches("[.]1$|[.]2$|[.]3$")) %>% # remove unwanted columns
  as_tibble()

f2 <- 
  . %>% 
  group_by(id, date) %>% 
  summarise(
    across(c(var1, var2), ~paste(unique(.), collapse = " AND ")), 
    across(var3, max, na.rm = T), 
    count = n(), 
    .groups = "keep"
  )

f3 <- function(dt) {
  setDT(dt)[, c(
    lapply(c(var1 = "var1", var2 = "var2"), function(x) paste(unique(.SD[[x]]), collapse = " AND ")), 
    list(var3 = max(var3, na.rm = T), count = .N)
  ), by = c("id", "date")]
}

microbenchmark::microbenchmark(f1(data2), f2(data2), f3(data22))

Result

Unit: milliseconds
       expr     min       lq      mean   median       uq     max neval cld
  f1(data2) 19.6730 20.27990 20.841344 20.50850 20.85045 29.2799   100   c
  f2(data2) 13.5455 14.09240 14.705967 14.34585 14.64625 20.5914   100  b 
 f3(data22)  6.9186  7.80615  8.598227  8.32035  8.68040 15.8358   100 a

Succinct way to summarize different columns with different functions

I propose two tricks to solve this issue; the code and some details for both solutions are at the bottom:

A function .at that returns results for groups of variables (here only one variable per group), which we can then splice in with !!!, so we get the best of both summarize and summarize_at:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), mean),
  !!!.at(vars(another_annoyingly_long_name), sum))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <dbl>
#   1                                             5.5                          255

An adverb applied to summarize, with a dollar-notation shorthand:

df %>%
  ..flx$summarize(potentially_long_name_i_dont_want_to_type_twice = ~mean(.),
                  another_annoyingly_long_name = ~sum(.))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <int>
#   1                                             5.5                          255

code for .at

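It has to be used in a pipe because it looks up the . in the parent environment. It is messy, but it works. Note that it relies on the unexported dplyr:::manip_at, so it may break between dplyr versions.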

.at <- function(.vars, .funs, ...) {
  in_a_piped_fun <- exists(".",parent.frame()) &&
    length(ls(envir=parent.frame(), all.names = TRUE)) == 1
  if (!in_a_piped_fun)
    stop(".at() must be called as an argument to a piped function")
  .tbl <- try(eval.parent(quote(.)))
  dplyr:::manip_at(
    .tbl, .vars, .funs, rlang::enquo(.funs), rlang:::caller_env(),
    .include_group_vars = TRUE, ...)
}

I designed it to combine summarize and summarize_at:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), list(foo=min, bar = max)),
  !!!.at(vars(another_annoyingly_long_name), median))

# # A tibble: 1 x 3
#       foo   bar another_annoyingly_long_name
#     <dbl> <dbl>                        <dbl>
#   1     1    10                         25.5
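
For comparison, here is a sketch of the same summary written with across(), which avoids internal dplyr functions. It assumes dplyr 1.0.0 or later. The original df is not shown, so the data below is an assumption, chosen only because it reproduces the numbers in the outputs above:

```r
library(dplyr)

# Assumed data: picked so that the summaries match the outputs shown above.
df <- tibble(
  potentially_long_name_i_dont_want_to_type_twice = 1:10,
  another_annoyingly_long_name = 21:30
)

res <- df %>%
  summarize(
    # two functions on one column; .names = "{.fn}" keeps just foo / bar
    across(potentially_long_name_i_dont_want_to_type_twice,
           list(foo = min, bar = max), .names = "{.fn}"),
    # one function on the other column; the column keeps its own name
    across(another_annoyingly_long_name, median)
  )

res
```

Each long column name is still typed only once, which was the point of the .at helper.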

code for ..flx

..flx outputs a function that, before running, replaces each formula argument such as a = ~mean(.) with a call of the form a = purrr::as_mapper(~mean(.))(a). This is convenient with summarize and mutate because a column cannot be a formula, so there cannot be any conflict.

I like to use the dollar notation as a shorthand and to have names starting with .. so I can name those "tags" (and give them a class "tag") and see them as different objects (still experimenting with this). ..flx(summarize)(...) will work as well though.

..flx <- function(fun){
  function(...){
    mc <- match.call()
    mc[[1]] <- tail(mc[[1]],1)[[1]]
    mc[] <- purrr::imap(mc,~if(is.call(.) && identical(.[[1]],quote(`~`))) {
      rlang::expr(purrr::as_mapper(!!.)(!!rlang::sym(.y))) 
    } else .)
    eval.parent(mc)
  }
}

class(..flx) <- "tag"

`$.tag` <- function(e1, e2){
  # change original call so x$y, which is `$.tag`(e1 = x, e2 = y), becomes x(y)
  mc <- match.call()
  mc[[1]] <- mc[[2]]
  mc[[2]] <- NULL
  names(mc) <- NULL
  # evaluate it in parent env
  eval.parent(mc)
}

Using dplyr summarize with different operations for multiple columns

As other people have mentioned, this is normally done by calling summarize_each / summarize_at / summarize_if once for every group of columns that you want to apply a summarizing function to. As far as I know, you would have to create a custom function that performs the summarization on each subset. You can, for example, set the column names in such a way that the select helpers (e.g. contains()) pick out just the columns you want to apply each function to. If not, you can give the specific column numbers that you want to summarize.

For the example you mentioned, you could try the following:

summarizer <- function(tb, colsone, colstwo, colsthree, 
                       funsone, funstwo, funsthree, group_name) {

return(bind_cols(
    summarize_all(select(tb, colsone), .funs = funsone),
    summarize_all(select(tb, colstwo), .funs = funstwo) %>% 
      ungroup() %>% select(-matches(group_name)),
    summarize_all(select(tb, colsthree), .funs = funsthree) %>% 
      ungroup() %>% select(-matches(group_name)) 
))

}

# With column names (as_tibble() replaces the deprecated as.tibble())
iris %>% as_tibble() %>% 
  group_by(Species) %>% 
  summarizer(colsone = contains("Sepal"), 
         colstwo = matches("Petal.Length"), 
         colsthree = c(-contains("Sepal"), -matches("Petal.Length")),
         funsone = "sum", 
         funstwo = "mean",
         funsthree = "first",
         group_name = "Species")

# With column indexes
iris %>% as_tibble() %>% 
 group_by(Species) %>% 
 summarizer(colsone = 1:2, 
         colstwo = 3, 
         colsthree = 4,
         funsone = "sum", 
         funstwo = "mean",
         funsthree = "first",
         group_name = "Species")
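
With dplyr 1.0.0 or later, the same three-way split can also be written without a helper function. The following is a sketch using across(), not the original answer's method; it selects Petal.Width by name where the answer selects it by exclusion:

```r
library(dplyr)

res <- iris %>%
  as_tibble() %>%
  group_by(Species) %>%
  summarise(
    across(contains("Sepal"), sum),   # sum of both Sepal columns
    across(Petal.Length, mean),       # mean petal length
    across(Petal.Width, first),       # first petal width in each group
    .groups = "drop"
  )

res
```

Because everything happens in one summarise() call, there is no need to bind columns together or to drop duplicated grouping columns afterwards.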

summarise_at using different functions for different variables

Here is one idea.

library(tidyverse)

df_mean <- df %>%
  group_by(category) %>%
  summarize_at(vars(x), funs(mean(.)))

df_median <- df %>%
  group_by(category) %>%
  summarize_at(vars(y), funs(median(.)))

df_first <- df %>%
  group_by(category) %>%
  summarize_at(vars(z), funs(first(.)))

df_summary <- reduce(list(df_mean, df_median, df_first), 
                     left_join, by = "category")

Like you said, there is no need to use summarise_at for this example. However, if you have a lot of columns that need to be summarized by different functions, this strategy may work. You will need to specify the columns in vars(...) for each summarize_at call; column selection follows the same rules as dplyr::select.

Update

Here is another idea. Define a function that wraps summarise_at, and then use map2 to apply it over a look-up list that pairs each function name with the variables it should be applied to. In this example, I applied mean to the x and y columns and median to z.

# Define a function
summarise_at_fun <- function(variable, func, data){
  data2 <- data %>%
    summarise_at(vars(variable), funs(get(func)(.)))
  return(data2)
}

# Group the data
df2 <- df %>% group_by(category)

# Create a look-up list with function names and variable to apply
look_list <- list(mean = c("x", "y"),
                  median = "z")

# Apply the summarise_at_fun
map2(look_list, names(look_list), summarise_at_fun, data = df2) %>%
  reduce(left_join, by = "category")

# A tibble: 3 x 4
  category     x     y     z
     <chr> <dbl> <dbl> <dbl>
1        a     6     6     0
2        b     5     3     8
3        c     2     6     1
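
The same look-up idea can be sketched with across() in place of summarise_at() and funs(). This assumes dplyr 1.0.0 or later and purrr; the data frame is hypothetical, since the original df is not shown:

```r
library(dplyr)
library(purrr)

# Hypothetical data; the original df is not available.
df <- tibble(
  category = c("a", "a", "b", "b"),
  x = c(1, 3, 5, 7),
  y = c(2, 4, 6, 8),
  z = c(10, 20, 30, 50)
)

# Function name -> columns to summarise with that function.
look_list <- list(mean = c("x", "y"), median = "z")

res <- look_list %>%
  imap(function(cols, fn_name) {
    fn <- match.fun(fn_name)  # resolve the function once, outside summarise()
    df %>%
      group_by(category) %>%
      summarise(across(all_of(cols), fn), .groups = "drop")
  }) %>%
  reduce(left_join, by = "category")

res
```

imap() passes each list element together with its name, so the list names double as the function names, as in the look-up list above.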

How to apply different functions to different columns after groupby like sum and .apply(list)? (Python)

Use groupby.aggregate

df.groupby('id').agg({k: sum for k in ['a', 'b', 'c', 'd', 'group']})

A one-liner alternative is to use the numeric_only flag, but be careful about which columns you feed in.

df.groupby('id').sum(numeric_only=False)

Output

           a         b         c         d                     group
id
1   1.488778  0.802794  0.949768  0.952676  [0, 1, 2, 3, 1, 1, 1, 1]
2   0.488390  0.512301  0.064922  0.233875              [0, 2, 3, 4]
4   0.649945  0.267125  0.229313  0.156696                         1