Apply a Function to Groups Within a Data.Frame in R

Run a custom function on a data frame in R, by group

Using dplyr

library(dplyr)
df %>%
  group_by(tm) %>%
  do(data.frame(val = calc(.)))
# tm val
#1 1 1.665882
#2 2 1.504545
#3 3 1.838000

If we change the function slightly so that it takes multiple arguments, this also works with summarise:

calc1 <- function(d1, t1, h1, p1) {
  (1.27 * sum(d1) + 1.62 * sum(t1) + 2.10 * sum(h1)) / sum(p1)
}
df %>%
  group_by(tm) %>%
  summarise(val = calc1(d, t, h, p))
# tm val
#1 1 1.665882
#2 2 1.504545
#3 3 1.838000
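
Neither df nor calc is shown above, so here is a minimal sketch that makes both variants reproducible. The data frame is made up, and calc is only an assumed whole-data-frame counterpart of calc1. The sketch also runs the same calculation through group_modify(), another dplyr verb that hands each group's data frame to a function, and through base R's split() and sapply():

```r
library(dplyr)

# Hypothetical data: tm is the grouping column, d/t/h/p are numeric inputs
df <- data.frame(
  tm = c(1, 1, 2, 2),
  d  = c(1, 2, 3, 4),
  t  = c(1, 0, 2, 1),
  h  = c(0, 1, 1, 0),
  p  = c(5, 5, 10, 10)
)

# calc1 as defined above
calc1 <- function(d1, t1, h1, p1) {
  (1.27 * sum(d1) + 1.62 * sum(t1) + 2.10 * sum(h1)) / sum(p1)
}

# Assumed form of calc: same formula, but taking one group's data frame
calc <- function(x) calc1(x$d, x$t, x$h, x$p)

# Column-wise version: summarise() passes the grouped columns to calc1
by_summarise <- df %>%
  group_by(tm) %>%
  summarise(val = calc1(d, t, h, p))

# Data-frame-wise version: group_modify() passes each group's rows as .x
by_modify <- df %>%
  group_by(tm) %>%
  group_modify(~ data.frame(val = calc(.x))) %>%
  ungroup()

# Base R: split the data frame by tm, then apply calc to each piece
by_split <- sapply(split(df, df$tm), calc)
```

All three objects should hold the same value per group; which one to prefer mostly depends on whether the function is written for columns or for a whole data frame.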

Applying functions to each group in a dataframe in R

You could make great use of dplyr's group_by syntax here after creating some bespoke functions to do the required tasks:

# Replace the first NA element of a vector with 'ok'
replace_first_na <- function(x) {

  # Coerce to character to catch potential issues
  x <- as.character(x)

  # Get the position of the first NA
  first_na <- which(is.na(x))[1]

  # Replace the element in that position with 'ok'
  x[first_na] <- "ok"

  x

}

# Flag the last element containing the word 'bought'
last_bought_flag <- function(x) {

  # Prepare the output
  out <- rep(0, length(x))

  # Get the positions of all strings that contain 'bought'
  bought <- which(grepl("bought", x))

  # Flag the last such position with `1` (skipped if nothing matches)
  if (length(bought) > 0) {
    out[max(bought)] <- 1
  }

  # Return the output
  out

}

df %>%
  as_tibble() %>%

  # Apply grouping by `number`
  group_by(number) %>%

  # Replace the first `NA` with 'ok' in the `status` column
  mutate(status = replace_first_na(status)) %>%

  # Get a flag column indicating the last 'bought' item for each group
  mutate(last_bought = last_bought_flag(info)) %>%

  # Remove grouping
  ungroup()
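
The input data is not shown above, so here is a small sketch on made-up data to illustrate what the pipeline does per group. The helpers are condensed copies of the two functions above; the column values are assumptions chosen only to exercise them:

```r
library(dplyr)

# Condensed copies of the two helpers defined above
replace_first_na <- function(x) {
  x <- as.character(x)
  x[which(is.na(x))[1]] <- "ok"
  x
}
last_bought_flag <- function(x) {
  out <- rep(0, length(x))
  out[max(which(grepl("bought", x)))] <- 1
  out
}

# Hypothetical data with the three columns the pipeline expects
df <- data.frame(
  number = c(1, 1, 1, 2, 2),
  status = c("done", NA, NA, NA, "done"),
  info   = c("bought a", "bought b", "sold a", "bought c", "sold c")
)

res <- df %>%
  group_by(number) %>%
  mutate(status = replace_first_na(status),
         last_bought = last_bought_flag(info)) %>%
  ungroup()

res$status       # "done" "ok" NA "ok" "done"
res$last_bought  # 0 1 0 1 0
```

Only the first NA in each group becomes 'ok', and the flag is set once per group because both helpers are called separately on each group's rows.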

R: Group by and Apply a general function to two columns

Update based on real-life example:

You can do a direct approach like this:

library(tidyverse)
library(InfoTrad)
dat %>%
  group_by(ticker, date) %>%
  # Note: cur_data() is deprecated as of dplyr 1.1.0 in favour of pick()
  summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date      PIN
  <chr>  <dbl>    <dbl>
1 A          1 1.05e-17
2 A          2 1.56e- 1
3 B          1 1.12e- 8
4 B          2 7.07e- 9

The difficulty here was that the YZ function accepts only true data frames, not tibbles, and that it returns several values, not just PIN.

You could theoretically wrap this up into your own function and then run it like I've shown in the example below, but maybe this way already does the trick.

I also don't expect this to run much faster than a for loop. The YZ function seems to have worse-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small data set, then increase its size by a factor of about 10 each time and check how fast it runs.
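
A rough sketch of such a scaling check, using system.time(). slow_fun is a hypothetical stand-in for the expensive call (it is not YZ), and the sizes are arbitrary:

```r
# Hypothetical stand-in for an expensive function; its cost grows
# quadratically with the input length because outer() builds an n x n matrix
slow_fun <- function(x) sum(outer(x, x))

# Grow the input by a factor of 10 at each step
sizes <- c(20, 200, 2000)

# Measure the elapsed time of one call per input size
timings <- sapply(sizes, function(n) {
  x <- runif(n)
  system.time(slow_fun(x))[["elapsed"]]
})

data.frame(n = sizes, elapsed = timings)
```

If the elapsed time grows by much more than a factor of 10 per step, the function scales worse than linearly, and grouping alone will not make the real data run fast.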


In your example, you can do:

my_function <- function(data) {
  data %>%
    summarize(rv = sum(ret, vol))
}

library(tidyverse)
df %>%
  group_by(ticker, date) %>%
  my_function()

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date    rv
  <chr>  <dbl> <dbl>
1 A          1     7
2 A          2     5
3 B          1    10
4 B          2    11

But as mentioned in my comment, I'm not sure if this general example would help in your real-life use case.

It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off summarizing directly instead of wrapping it into a function.
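
To make that comparison concrete, here is a small sketch on made-up data (the values are assumptions, not the data from the question). It runs the wrapper and the direct summarize() call side by side:

```r
library(dplyr)

# Hypothetical data with one row per ticker/date combination
df <- data.frame(
  ticker = c("A", "A", "B", "B"),
  date   = c(1, 2, 1, 2),
  ret    = c(1, 2, 3, 4),
  vol    = c(10, 20, 30, 40)
)

# Wrapper from above: works on whatever grouping the input already carries
my_function <- function(data) {
  data %>%
    summarize(rv = sum(ret, vol), .groups = "drop")
}

wrapped <- df %>%
  group_by(ticker, date) %>%
  my_function()

# Direct version without a wrapper
direct <- df %>%
  group_by(ticker, date) %>%
  summarize(rv = sum(ret, vol), .groups = "drop")

wrapped$rv  # 11 22 33 44
```

Both calls give the same result, so the wrapper only pays off when the grouped step is reused in several places.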

Apply a function to groups within a data.frame in R

I would use ave. If you look at the source of ave, you'll see it essentially wraps Martin Morgan's solution.

R> g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
R> v <- c(1,4,1,4,1,4,2,8,2,8,2,8)
R> d <- data.frame(g,v)
R> d$cs <- ave(v, g, FUN=cumsum)
R> d
g v cs
1 a 1 1
2 b 4 4
3 a 1 2
4 b 4 8
5 a 1 3
6 b 4 12
7 a 2 5
8 b 8 20
9 a 2 7
10 b 8 28
11 a 2 9
12 b 8 36
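
For illustration, the split-apply-combine steps can also be written out by hand with split(), lapply() and unsplit(). This is only a sketch of the idea, not the actual source of ave(); cs_manual is a name introduced here:

```r
g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
v <- c(1,4,1,4,1,4,2,8,2,8,2,8)

# Split v by group, apply cumsum to each piece, then put the pieces
# back into the original row order
cs_manual <- unsplit(lapply(split(v, g), cumsum), g)

# Same values as the cs column computed with ave() above
cs_ave <- ave(v, g, FUN = cumsum)
all.equal(cs_manual, cs_ave)
```

ave() saves you the unsplit() step and always returns a vector of the same length as its input, which is why it fits so well for adding a column.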

R apply function to groups within data frame adding result as additional column

You can try this:

library(dplyr)
df %>% group_by(group) %>%
mutate(pnew=ifelse(is.na(time),time,time+5))

# A tibble: 18 x 4
# Groups: group [2]
group time p pnew
<dbl> <int> <int> <dbl>
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 6
7 1 2 7 7
8 1 3 8 8
9 1 4 9 9
10 1 5 10 10
11 1 6 11 11
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 6
17 2 2 17 7
18 2 3 18 8

Update

You can use this function:

increase <- function(data, n) {
  data %>%
    group_by(group) %>%
    mutate(pnew = ifelse(is.na(time), time, time + n))
}

increase(df,n = 10)

# A tibble: 18 x 4
# Groups: group [2]
group time p pnew
<dbl> <int> <int> <dbl>
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 11
7 1 2 7 12
8 1 3 8 13
9 1 4 9 14
10 1 5 10 15
11 1 6 11 16
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 11
17 2 2 17 12
18 2 3 18 13

Update 2

I hope this helps:

df %>% group_by(group) %>% rowwise() %>% mutate(pnew=ifelse(is.na(time),NA,pfunc(time)))

# A tibble: 18 x 4
# Rowwise: group
group time p pnew
<dbl> <int> <int> <dbl>
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 6
7 1 2 7 7
8 1 3 8 8
9 1 4 9 9
10 1 5 10 10
11 1 6 11 11
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 6
17 2 2 17 7
18 2 3 18 8
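
pfunc is not defined above. The sketch below assumes a simple pfunc that adds 5 (which would be consistent with the printed output, but is only a guess) and rebuilds df from the printed rows. It shows a base R way to call a function one value at a time while skipping NA, which is what the rowwise() call is there for:

```r
# Data rebuilt from the output printed above
df <- data.frame(
  group = rep(c(1, 2), times = c(11, 7)),
  time  = c(rep(NA, 5), 1:6, rep(NA, 4), 1:3),
  p     = 1:18
)

# Hypothetical pfunc: assumed to work on a single, non-missing value
pfunc <- function(t) t + 5

# Call pfunc once per element and keep NA where time is missing
df$pnew <- vapply(
  df$time,
  function(t) if (is.na(t)) NA_real_ else pfunc(t),
  numeric(1)
)

df$pnew[6:11]  # 6 7 8 9 10 11
```

If pfunc is already vectorized, neither rowwise() nor vapply() is needed and a plain mutate() call is enough.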

Applying function to each group and column of R dataframe

You can use the dplyr package. Use group_by() to work per Category and mutate_if() to apply the function to all numeric columns (in dplyr 1.0.0 and later, mutate_if() is superseded by mutate() with across()):

library(dplyr)
df <- read.table(header = TRUE, text =
" Category a b c
a 2.0 5.0 -5.0
a 1.5 10.0 10.0
b 3.2 14.5 100.2")
replace_outliers <- function(column) {
  qnt <- quantile(column, probs = c(.25, .75))
  upper_whisker <- 1.5 * IQR(column)
  clean_data <- column
  clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
  clean_data
}

df %>%
  group_by(Category) %>%
  mutate_if(is.numeric, replace_outliers)
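
The three-row sample above has no group large enough to contain an outlier, so here is a sketch on made-up data where one value does get replaced. It uses mutate() with across(), the newer dplyr spelling of the same idea; the data and the name res are assumptions for illustration:

```r
library(dplyr)

# Hypothetical data: group "a" contains one obvious upper outlier (100)
df <- data.frame(
  Category = rep(c("a", "b"), each = 5),
  x        = c(1, 2, 3, 4, 100, 10, 11, 12, 13, 14)
)

# Same rule as above: values above Q3 + 1.5 * IQR become the group median
replace_outliers <- function(column) {
  upper_limit <- quantile(column, 0.75) + 1.5 * IQR(column)
  column[column > upper_limit] <- median(column)
  column
}

res <- df %>%
  group_by(Category) %>%
  mutate(across(where(is.numeric), replace_outliers)) %>%
  ungroup()

res$x  # 1 2 3 4 3 10 11 12 13 14
```

In group "a" the upper limit is 4 + 1.5 * 2 = 7, so 100 is replaced by the group median 3, while group "b" is left unchanged.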

Apply a function across groups and columns in data.table and/or dplyr

The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.

Here are three approaches which should work for an arbitrary number of arbitrarily named pairs of columns:

  1. Working in long format
  2. EDIT: Update joins using get()
  3. EDIT 2: Computing on the language

1. Working in long format

The information on corresponding columns can be provided in a look-up table or translation table:

library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))

lut
   vars1 vars2
1: a a2
2: b b2
3: c c2

In cases where column names are treated as data and the columns are all of the same data type, my first approach is to reshape to long format.

# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape back to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
    groupVar  a  b  c
1: 1 11 22 33
2: 1 12 23 34
3: 1 13 24 35
4: 2 24 35 46
5: 2 25 36 47
6: 2 26 37 48
7: 3 37 48 59
8: 3 38 49 60
9: 3 39 50 61
10: 3 40 51 62

2. Update joins using get()

On second thought, here is an approach which is similar to the OP's proposed for loop and requires much less code:

vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]

for (iv in seq_along(vars1)) {
  dt1[dt2, on = .(groupVar),
      (vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}

dt1[]
     a  b  c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3

Note that dt1 is updated by reference, i.e., without copying.
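
A minimal sketch of what "by reference" means here, on made-up tables with a single pair of columns. It compares the memory address of dt1 before and after the update join using data.table's address() helper; the values and the names before and after are assumptions for illustration:

```r
library(data.table)

# Hypothetical tables: dt1 holds three rows in two groups,
# dt2 holds one row per group (its row number is the group id)
dt1 <- data.table(a = 1:3, groupVar = c(1L, 1L, 2L))
dt2 <- data.table(a2 = c(10L, 20L))
dt2[, groupVar := .I]

before <- address(dt1)

# Update join: add dt2's a2 to dt1's a for matching groups, in place
dt1[dt2, on = .(groupVar), a := x.a + i.a2]

after <- address(dt1)

identical(before, after)
dt1$a  # 11 12 23
```

Because no copy is made, there is no need to assign the result back to dt1.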

Prefixing the variable names vars1[iv] with "x." and vars2[iv] with "i." on the right-hand side of := ensures that the right columns from dt1 and dt2, respectively, are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").

3. Computing on the language

This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.

library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability

EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
  glue_collapse(sep = ", ") %>%
  {glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>%
  EVAL()
     a  b  c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3

It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement

dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]

as a character string. This string is then evaluated and executed in one go; no for loops required.
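
To see the string-building step in isolation, here is a sketch that assembles the same statement with base R's sprintf() and paste() instead of glue, and parses it without evaluating it. The names assignments, stmt and expr are introduced here for illustration:

```r
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")

# One "lhs = x.lhs + i.rhs" piece per column pair, joined by commas
assignments <- paste(sprintf("%s = x.%s + i.%s", vars1, vars1, vars2),
                     collapse = ", ")

# Wrap the pieces into the complete data.table statement
stmt <- paste0("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(",
               assignments, ")][]")
cat(stmt, "\n")

# Parsing turns the string into an unevaluated call;
# eval(expr) would run it once dt1 and dt2 exist
expr <- parse(text = stmt)[[1]]
is.call(expr)
```

Printing stmt before evaluating it is a cheap way to check that the generated code is what you intended.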

As the helper function EVAL() already uses paste0() the call to glue() can be omitted:

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
  glue_collapse(sep = ", ") %>%
  {EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}

Note that dot . and curly brackets {} are used with different meaning in different contexts which may appear somewhat confusing.

Grouped application of a function that returns a data.frame (without a for loop)

One option is to use dplyr::group_split() and purrr::map_dfr().

How this works: group_split() will divide your data.frame df into a list of data.frames based on the grouping variables you supply (e.g., g). Next, map_dfr() can be used to apply a function to each element of that list. Because your custom function ff() returns a data.frame without your grouping variable g, you'll want to add that information back to ff() output - this can be accomplished with mutate() as in the example below:

library(dplyr)
library(purrr)

# set seed so that example is reproducible
set.seed(1)

# your example data and function
df <- data.frame(start=1:10,end=21:30,g=sample(LETTERS[1:2],10,replace=TRUE))

ff <- function(start, end, ...) {
  out <- data.frame(T1 = c(start, rev(start)), T2 = c(end, rev(end)))
  return(out)
}

# use group_split & map_dfr
df %>%
  # divide df into a list of data.frames based on supplied grouping variables
  group_split(g) %>%
  # for each element in the list, apply this function
  map_dfr(function(df.x) {
    with(df.x,
         # get the data.frame your function returns
         ff(start, end) %>%
           # add your grouping variables back-in (stripped by ff)
           mutate(g = g[1]))
  })

# a short-hand version of the above can be written as:
df %>%
  group_split(g) %>%
  map_dfr(~ff(.x$start, .x$end) %>% mutate(g = .x$g[1]))
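
The same split, apply and re-attach steps can also be sketched in base R with split(), lapply() and rbind(). The data here is a smaller, fixed version of the example (an assumption made so the result is easy to follow by eye), and pieces and res are names introduced for illustration:

```r
# Small fixed example instead of the random one above
df <- data.frame(start = 1:4, end = 21:24, g = c("A", "B", "A", "B"))

ff <- function(start, end, ...) {
  data.frame(T1 = c(start, rev(start)), T2 = c(end, rev(end)))
}

# Split by g, call ff on each piece, and add the group label back
pieces <- lapply(split(df, df$g), function(d) {
  cbind(ff(d$start, d$end), g = d$g[1])
})

# Stack the per-group results into one data.frame
res <- do.call(rbind, pieces)
rownames(res) <- NULL

res$T1  # 1 3 3 1 2 4 4 2
```

This avoids the extra package dependencies, at the cost of a little more bookkeeping for the group label and row names.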

