Run a custom function on a data frame in R, by group
Using dplyr
library(dplyr)
df %>%
group_by(tm) %>%
do(data.frame(val=calc(.)))
# tm val
#1 1 1.665882
#2 2 1.504545
#3 3 1.838000
If we change the function slightly to include multiple arguments, this could also work with summarise
calc1 <- function(d1, t1, h1, p1){
(1.27*sum(d1) + 1.62*sum(t1) + 2.10*sum(h1) )/sum(p1) }
df %>%
group_by(tm) %>%
summarise(val=calc1(d, t, h, p))
# tm val
#1 1 1.665882
#2 2 1.504545
#3 3 1.838000
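Neither `df` nor `calc()` is shown above, so the following is only a sketch under assumptions: the toy data are made up, and `calc()` is assumed to be the data-frame version of `calc1()`. It uses `group_modify()`, which may be a reasonable substitute for `do()` in recent dplyr versions.

```r
library(dplyr)

# Hypothetical toy data; the original `df` is not shown in the text
df <- data.frame(
  tm = c(1, 1, 2, 2),
  d  = c(1, 2, 3, 4),
  t  = c(0, 1, 0, 1),
  h  = c(1, 0, 2, 1),
  p  = c(4, 5, 6, 7)
)

# Assumed data-frame version of calc1(): receives the rows of one group
calc <- function(dat) {
  (1.27 * sum(dat$d) + 1.62 * sum(dat$t) + 2.10 * sum(dat$h)) / sum(dat$p)
}

res <- df %>%
  group_by(tm) %>%
  # .x holds the rows of the current group (without the grouping column)
  group_modify(~ data.frame(val = calc(.x))) %>%
  ungroup()

res
```

The result has one row per value of `tm`, with the grouping column added back automatically.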
Applying functions to each group in a dataframe in R
You could make great use of dplyr's group_by syntax here after creating some bespoke functions to do the required tasks:
# Replace the first NA element of a vector with 'ok'
replace_first_na <- function(x) {
# Coerce to character to catch potential issues
x <- as.character(x)
# Get the position of the first NA
first_na <- which(is.na(x))[1]
# Replace the element in that position with 'ok'
x[first_na] <- "ok"
x
}
# Flag the last element containing the word 'bought'
last_bought_flag <- function(x) {
# Prepare the output
out <- rep(0, length(x))
# Get the positions of all strings that contain 'bought'
bought <- which(grepl("bought", x))
# Replace the element in the last such position with `1`
# (skipped when nothing matches, where max() would return -Inf)
if (length(bought) > 0) out[max(bought)] <- 1
# Return the output
out
}
df %>%
as_tibble() %>%
# Apply grouping by `number`
group_by(number) %>%
# Replace the first `NA` with 'ok' in the `status` column
mutate(status = replace_first_na(status)) %>%
# Get a flag column indicating the last 'bought' item for each group
mutate(last_bought = last_bought_flag(info)) %>%
# Remove grouping
ungroup()
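Since the original `df` is not shown, here is a small self-contained run on made-up data. The two helpers are condensed copies of the functions above; the guard for groups without any 'bought' entry is an assumption of this sketch.

```r
library(dplyr)

# Condensed copies of the helpers described above
replace_first_na <- function(x) {
  x <- as.character(x)
  x[which(is.na(x))[1]] <- "ok"   # no-op when the group has no NA
  x
}

last_bought_flag <- function(x) {
  out <- rep(0, length(x))
  hits <- grep("bought", x)
  if (length(hits) > 0) out[max(hits)] <- 1   # guard: group without a match
  out
}

# Hypothetical toy data with the column names used in the pipeline
df <- data.frame(
  number = c(1, 1, 1, 2, 2),
  status = c("a", NA, NA, NA, "b"),
  info   = c("bought x", "bought y", "returned", "viewed", "viewed")
)

res <- df %>%
  group_by(number) %>%
  mutate(status = replace_first_na(status),
         last_bought = last_bought_flag(info)) %>%
  ungroup()

res
```

Only the first `NA` within each group is replaced, and group 2, which has no 'bought' entry, gets all zeros in `last_bought`.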
R: Group by and Apply a general function to two columns
Update based on real-life example:
You can do a direct approach like this:
library(tidyverse)
library(InfoTrad)
dat %>%
group_by(ticker, date) %>%
summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date PIN
<chr> <dbl> <dbl>
1 A 1 1.05e-17
2 A 2 1.56e- 1
3 B 1 1.12e- 8
4 B 2 7.07e- 9
The difficulty here was that the YZ function only accepts true data frames, not tibbles, and that it returns several values, not just PIN.
You could theoretically wrap this up into your own function and then run your own function like I've shown in the example below, but maybe this way already does the trick.
I also don't expect this to run much faster than a for loop. The YZ function seems to have more-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, then repeat with the data size increased by a factor of maybe 10, and check how fast it runs.
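The idea of growing the input by a factor of 10 and timing each run could be sketched as follows. `slow_fun()` is a hypothetical stand-in with roughly quadratic runtime, not the real YZ function, and the sizes are arbitrary.

```r
# Hypothetical stand-in for an expensive function (roughly quadratic runtime)
slow_fun <- function(x) {
  s <- 0
  for (i in seq_along(x)) s <- s + sum(x[i] > x)
  s
}

# Increase the input size by a factor of 10 each time
sizes <- c(100, 1000, 10000)

timings <- data.frame(
  n = sizes,
  seconds = vapply(sizes, function(n) {
    x <- runif(n)
    # elapsed wall-clock time for one call
    system.time(slow_fun(x))[["elapsed"]]
  }, numeric(1))
)

timings
```

If the elapsed time grows by much more than a factor of 10 per step, the runtime is more than linear, and a full-size run may take correspondingly longer.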
In your example, you can do:
my_function <- function(data) {
data %>%
summarize(rv = sum(ret, vol))
}
library(tidyverse)
df %>%
group_by(ticker, date) %>%
my_function()
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date rv
<chr> <dbl> <dbl>
1 A 1 7
2 A 2 5
3 B 1 10
4 B 2 11
But as mentioned in my comment, I'm not sure if this general example would help in your real-life use case.
It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off summarizing directly instead of wrapping it in a function.
Apply a function to groups within a data.frame in R
I would use ave. If you look at the source of ave, you'll see it essentially wraps Martin Morgan's solution.
R> g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
R> v <- c(1,4,1,4,1,4,2,8,2,8,2,8)
R> d <- data.frame(g,v)
R> d$cs <- ave(v, g, FUN=cumsum)
R> d
g v cs
1 a 1 1
2 b 4 4
3 a 1 2
4 b 4 8
5 a 1 3
6 b 4 12
7 a 2 5
8 b 8 20
9 a 2 7
10 b 8 28
11 a 2 9
12 b 8 36
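The referenced solution is not reproduced here, so the following is only an illustration of the kind of split-apply-recombine idiom that `ave` packages up, written with base R's `split()`, `lapply()` and `unsplit()`.

```r
g <- factor(c("a","b","a","b","a","b","a","b","a","b","a","b"))
v <- c(1,4,1,4,1,4,2,8,2,8,2,8)

# ave(): apply FUN within each level of g, keep the original row order
cs_ave <- ave(v, g, FUN = cumsum)

# The same result by hand: split by group, apply, put back in place
cs_manual <- unsplit(lapply(split(v, g), cumsum), g)

isTRUE(all.equal(cs_ave, cs_manual))
```

Both vectors line up with the original rows, so either can be assigned directly as a new column.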
R apply function to groups within data frame adding result as additional column
You can try this:
library(dplyr)
df %>% group_by(group) %>%
mutate(pnew=ifelse(is.na(time),time,time+5))
# A tibble: 18 x 4
# Groups: group [2]
group time p pnew
<dbl> <int> <int> <dbl>
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 6
7 1 2 7 7
8 1 3 8 8
9 1 4 9 9
10 1 5 10 10
11 1 6 11 11
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 6
17 2 2 17 7
18 2 3 18 8
Update
You can use this function:
increase <- function(data,n)
{
data %>% group_by(group) %>%
mutate(pnew=ifelse(is.na(time),time,time+n)) -> result
return(result)
}
increase(df,n = 10)
# A tibble: 18 x 4
# Groups: group [2]
group time p pnew
<dbl> <int> <int> <dbl>
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 11
7 1 2 7 12
8 1 3 8 13
9 1 4 9 14
10 1 5 10 15
11 1 6 11 16
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 11
17 2 2 17 12
18 2 3 18 13
Update 2
I hope this helps:
df %>% group_by(group) %>% rowwise() %>% mutate(pnew=ifelse(is.na(time),NA,pfunc(time)))
# A tibble: 18 x 4
# Rowwise: group
group time p pnew
<dbl> <int> <int> <dbl>
1 1 NA 1 NA
2 1 NA 2 NA
3 1 NA 3 NA
4 1 NA 4 NA
5 1 NA 5 NA
6 1 1 6 6
7 1 2 7 7
8 1 3 8 8
9 1 4 9 9
10 1 5 10 10
11 1 6 11 11
12 2 NA 12 NA
13 2 NA 13 NA
14 2 NA 14 NA
15 2 NA 15 NA
16 2 1 16 6
17 2 2 17 7
18 2 3 18 8
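`pfunc` is not defined in this answer. As a sketch, assume it is a function that only handles one value at a time, which is the situation where `rowwise()` helps. The data and the body of `pfunc` below are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with the same columns as the example above
df <- data.frame(group = c(1, 1, 1, 2, 2),
                 time  = c(NA, 1L, 2L, NA, 1L),
                 p     = 1:5)

# Hypothetical scalar-only function standing in for `pfunc`
pfunc <- function(t) {
  stopifnot(length(t) == 1)
  t + 5
}

res <- df %>%
  rowwise() %>%
  # each row is evaluated on its own, so `time` has length 1 here
  mutate(pnew = if (is.na(time)) NA_real_ else pfunc(time)) %>%
  ungroup()

res
```

If the function is already vectorized, `rowwise()` is not needed and the plain `mutate()` with `ifelse()` from the earlier examples should be enough.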
Applying function to each group and column of R dataframe
You can use the package dplyr. Use group_by to do it for each Category and mutate_if to apply the function to all numerical columns:
library(dplyr)
df <- read.table(header = TRUE, text =
" Category a b c
a 2.0 5.0 -5.0
a 1.5 10.0 10.0
b 3.2 14.5 100.2")
replace_outliers <- function(column) {
qnt <- quantile(column, probs=c(.25, .75))
upper_whisker <- 1.5 * IQR(column)
clean_data <- column
clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
clean_data
}
df %>% group_by(Category) %>%
mutate_if(is.numeric, replace_outliers)
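In recent dplyr versions (1.0.0 and later) the same step can also be written with `across()` and `where()`. The sketch below repeats `replace_outliers()` so that it runs on its own and uses made-up data in which one value is clearly an outlier within its group.

```r
library(dplyr)

# Same function as above, repeated so the example is self-contained
replace_outliers <- function(column) {
  qnt <- quantile(column, probs = c(.25, .75))
  upper_whisker <- 1.5 * IQR(column)
  clean_data <- column
  # values above Q3 + 1.5 * IQR are replaced by the group median
  clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
  clean_data
}

# Hypothetical toy data: 100 is an outlier within Category "a"
df <- data.frame(Category = rep(c("a", "b"), each = 5),
                 x = c(1, 2, 3, 4, 100, 5, 6, 7, 8, 9))

res <- df %>%
  group_by(Category) %>%
  mutate(across(where(is.numeric), replace_outliers)) %>%
  ungroup()

res
```

Within Category "a" the upper threshold is 4 + 1.5 * 2 = 7, so 100 is replaced by the group median 3; Category "b" is left unchanged.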
Apply a function across groups and columns in data.table and/or dplyr
The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.
Here are three approaches which should work for an arbitrary number of arbitrarily named pairs of columns:
- Working in long format
- EDIT: Update joins using get()
- EDIT 2: Computing on the language
1. Working in long format
The information on corresponding columns can be provided in a look-up table or translation table:
library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))
lut
vars1 vars2
1: a a2
2: b b2
3: c c2
In cases where column names are treated as data and the column data are of the same data type, my first approach is to reshape to long format.
# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape back to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
groupVar a b c
1: 1 11 22 33
2: 1 12 23 34
3: 1 13 24 35
4: 2 24 35 46
5: 2 25 36 47
6: 2 26 37 48
7: 3 37 48 59
8: 3 38 49 60
9: 3 39 50 61
10: 3 40 51 62
2. Update joins using get()
On second thought, here is an approach which is similar to the OP's proposed for loop and requires much less coding:
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]
for (iv in seq_along(vars1)) {
dt1[dt2, on = .(groupVar),
(vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}
dt1[]
a b c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3
Note that dt1 is updated by reference, i.e., without copying.
Prepending the variable names vars1[iv] by "x." and vars2[iv] by "i." on the right hand side of := is to ensure that the right columns from dt1 and dt2, resp., are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").
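To see what the "x." and "i." prefixes do in isolation, here is a minimal sketch with two made-up tables that share a column name. The table and column names are hypothetical and chosen only for illustration.

```r
library(data.table)

# Two hypothetical tables that both have a column called `a`
dt1 <- data.table(id = 1:3, a = c(10, 20, 30))
dt2 <- data.table(id = 1:3, a = c(1, 2, 3))

# In dt1[dt2, ...], dt1 is `x` and dt2 is `i`:
#   x.a refers to column a of dt1, i.a to column a of dt2
dt1[dt2, on = .(id), a := x.a + i.a]

dt1[]
```

After the update join, `dt1$a` holds the row-wise sums, and `dt1` has been modified in place.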
3. Computing on the language
This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server".
library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>%
EVAL()
a b c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3
It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement
dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]
as a character string. This string is then evaluated and executed in one go; no for loops required.
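To make the string construction easier to follow, the sketch below builds the statement step by step and stops before evaluating it. The intermediate variable names are introduced here for illustration only, and a plain data.frame stands in for the look-up table.

```r
library(glue)

# Look-up table of corresponding column names
lut <- data.frame(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))

# One assignment per row of the look-up table
assignments <- glue_data(lut, "{vars1} = x.{vars1} + i.{vars2}")

# Collapse them into the argument list for `:=`()
j_args <- glue_collapse(assignments, sep = ", ")

# Complete statement as a character string (not evaluated here)
statement <- glue("dt1[dt2, on = .(groupVar), `:=`({j_args})]")

statement
```

Printing `statement` before passing it to `eval(parse(text = ...))` is a simple way to check that the generated code is what you intended.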
As the helper function EVAL() already uses paste0(), the call to glue() can be omitted:
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}
Note that the dot . and curly brackets {} are used with different meanings in different contexts, which may appear somewhat confusing.
Grouped application of a function that returns a data.frame (without a for loop)
One option is to use dplyr::group_split() and purrr::map_dfr().
How this works: group_split() will divide your data.frame df into a list of data.frames based on the grouping variables you supply (e.g., g). Next, map_dfr() can be used to apply a function to each element of that list. Because your custom function ff() returns a data.frame without your grouping variable g, you'll want to add that information back to the ff() output; this can be accomplished with mutate() as in the example below:
library(dplyr)
library(purrr)
# set seed so that example is reproducible
set.seed(1)
# your example data and function
df <- data.frame(start=1:10,end=21:30,g=sample(LETTERS[1:2],10,replace=TRUE))
ff <- function(start,end,... ) {
out <- data.frame(T1=c(start,rev(start)),T2=c(end,rev(end)))
return(out)
}
# use group_split & map_dfr
df %>%
# divide df into a list of data.frames based on supplied grouping variables
group_split(g) %>%
# for each element in the list, apply this function
map_dfr(function(df.x) {
with(df.x,
# get the data.frame your function returns
ff(start, end) %>%
# add your grouping variables back-in (stripped by ff)
mutate(g = g[1]))
})
# a short-hand version of the above can be written as:
df %>%
group_split(g) %>%
map_dfr(~ff(.x$start, .x$end) %>% mutate(g = .x$g[1]))
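As a possible alternative (not part of the original answer), `dplyr::group_modify()` applies a function that returns a data.frame to each group and adds the grouping column back by itself. The sketch below reuses the example data and `ff()` from above.

```r
library(dplyr)

set.seed(1)
df <- data.frame(start = 1:10, end = 21:30,
                 g = sample(LETTERS[1:2], 10, replace = TRUE))

ff <- function(start, end, ...) {
  data.frame(T1 = c(start, rev(start)), T2 = c(end, rev(end)))
}

res <- df %>%
  group_by(g) %>%
  # .x is the data of one group; the key column g is prepended automatically
  group_modify(~ ff(.x$start, .x$end)) %>%
  ungroup()

res
```

Because `ff()` doubles the number of rows in each group, the result has twice as many rows as `df`, with `g` as the first column.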