Apply a function over groups of columns
This may be more generalizable to your situation, since you pass a list of indices. If speed is an issue (large data frame), I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
This works if you only have column names, too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
EDIT
It just occurred to me that you may want to automate this for every three columns. I know there's a better way, but here it is on a 100-column data set (the one leftover column that does not fill a complete group of three is dropped by na.omit):
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
EDIT 2
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though not satisfying, method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
Apply a function across groups and columns in data.table and/or dplyr
The sample datasets provided with the question indicate that the column names may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.
Here are three approaches which should work for an arbitrary number of arbitrarily named pairs of columns:
- Working in long format
- EDIT: Update joins using get()
- EDIT 2: Computing on the language
1. Working in long format
The information on corresponding columns can be provided in a look-up table or translation table:
library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))
lut
vars1 vars2
1: a a2
2: b b2
3: c c2
In cases where column names are treated as data and the columns all have the same data type, my first approach is to reshape to long format.
# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape back to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
groupVar a b c
1: 1 11 22 33
2: 1 12 23 34
3: 1 13 24 35
4: 2 24 35 46
5: 2 25 36 47
6: 2 26 37 48
7: 3 37 48 59
8: 3 38 49 60
9: 3 39 50 61
10: 3 40 51 62
2. Update joins using get()
On second thought, here is an approach which is similar to the for loop proposed by the OP and requires much less coding:
vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]
for (iv in seq_along(vars1)) {
dt1[dt2, on = .(groupVar),
(vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}
dt1[]
a b c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3
Note that dt1 is updated by reference, i.e., without copying.
Prepending the variable names vars1[iv] with "x." and vars2[iv] with "i." on the right-hand side of := ensures that the right columns from dt1 and dt2, respectively, are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").
3. Computing on the language
This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.
library(glue) # literal string interpolation
library(magrittr) # piping used to improve readability
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>%
EVAL()
a b c groupVar
1: 11 22 33 1
2: 12 23 34 1
3: 13 24 35 1
4: 24 35 46 2
5: 25 36 47 2
6: 26 37 48 2
7: 37 48 59 3
8: 38 49 60 3
9: 39 50 61 3
10: 40 51 62 3
It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement
dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]
as a character string. This string is then evaluated and executed in one go; no for loops required.
As the helper function EVAL() already uses paste0(), the call to glue() can be omitted:
data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
glue_collapse( sep = ", ") %>%
{EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}
Note that the dot . and curly brackets {} are used with different meanings in different contexts, which may appear somewhat confusing.
R: Group by and Apply a general function to two columns
Update based on real-life example:
You can do a direct approach like this:
library(tidyverse)
library(InfoTrad)
dat %>%
group_by(ticker, date) %>%
summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date PIN
<chr> <dbl> <dbl>
1 A 1 1.05e-17
2 A 2 1.56e- 1
3 B 1 1.12e- 8
4 B 2 7.07e- 9
The difficulty here was that the YZ function only accepts true data frames, not tibbles, and that it returns several values, not just PIN.
You could theoretically wrap this up into your own function and then run it as I've shown in the example below, but maybe this way already does the trick.
I also don't expect this to run much faster than a for loop. It seems that the YZ function has worse-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, then repeatedly increase its size by a factor of maybe 10 and check how fast it runs.
In your example, you can do:
my_function <- function(data) {
data %>%
summarize(rv = sum(ret, vol))
}
library(tidyverse)
df %>%
group_by(ticker, date) %>%
my_function()
# A tibble: 4 x 3
# Groups: ticker [2]
ticker date rv
<chr> <dbl> <dbl>
1 A 1 7
2 A 2 5
3 B 1 10
4 B 2 11
But as mentioned in my comment, I'm not sure if this general example would help in your real-life use case.
It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off summarizing directly instead of wrapping it in a function.
Apply function to certain groups of columns of a pandas dataframe
So we can do it row by row:
df['p_v'] = df.apply(lambda x : stats.wilcoxon(x['col1':'col2'], x['col3':'col4'])[1],axis=1)
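The row-wise pattern above relies on label slices such as x['col1':'col2'] including both endpoints. Here is a minimal sketch of that mechanism, assuming a made-up data frame and a hypothetical stand-in function (mean_gap) in place of stats.wilcoxon so that it runs with pandas alone:

```python
import pandas as pd

# Hypothetical data: col1..col2 form the first group, col3..col4 the second.
df = pd.DataFrame({
    "col1": [1.0, 2.0, 3.0],
    "col2": [3.0, 4.0, 5.0],
    "col3": [0.0, 1.0, 2.0],
    "col4": [2.0, 1.0, 0.0],
})

def mean_gap(row):
    # Stand-in for a two-sample statistic; not part of the original answer.
    # Label slices on a row include both the start and the end label.
    first = row["col1":"col2"]
    second = row["col3":"col4"]
    return first.mean() - second.mean()

# axis=1 passes each row to the function as a Series indexed by column name.
df["gap"] = df.apply(mean_gap, axis=1)
print(df)
```

Swapping mean_gap for a function that returns a p-value gives the same shape of result as the one-liner above.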
Pandas: Apply custom function to groups and store result in new columns in each group
I think you need:
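def func(x):
    # first non-missing difference between consecutive Column3 values
    d = x['Column3'].diff().dropna().iloc[0]
    # write the results into the last row of the group
    last = x.index[-1]
    x.loc[last, 'Difference'] = d
    x.loc[last, 'Message'] = "Calculated!"
    return x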
df1 = df.filter(lambda x: is_unique(x['Column1']))
df1 = df1.groupby(['Column2']).apply(func)
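The per-group logic can also be written as an explicit loop over the groups, which makes the "write into the last row of each group" step easy to inspect. This is only a sketch: the data frame is invented for illustration, the filtering step on Column1 is left out, and groups with fewer than two rows are skipped (an assumption, since they have no difference to compute).

```python
import pandas as pd

# Hypothetical data using the column names from the answer above.
df = pd.DataFrame({
    "Column2": ["x", "x", "y", "y"],
    "Column3": [10, 13, 20, 25],
})

# Create the result columns up front so every row has a defined value.
df["Difference"] = float("nan")
df["Message"] = None

for _, group in df.groupby("Column2"):
    if len(group) < 2:
        continue  # assumption: nothing to compute for single-row groups
    last = group.index[-1]
    # First non-missing difference between consecutive Column3 values.
    df.loc[last, "Difference"] = group["Column3"].diff().dropna().iloc[0]
    df.loc[last, "Message"] = "Calculated!"

print(df)
```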
pandas dataframe group columns based on name and apply a function
You can convert the columns without a separator to the index, then group the remaining columns by a lambda function and apply an aggregate function like max:
m = df.columns.str.contains('_')
df = (df.set_index(df.columns[~m].tolist())
.groupby(lambda x: x.split('_')[0], axis=1)
.max()
.reset_index())
print (df)
A B C D E K
0 a 2 r 4 6 9
1 e g 1 d 8 7
Solution with custom function:
import numpy as np
def rms(x):
    # root mean square across the columns of each group
    return np.sqrt(np.sum(x**2, axis=1)/len(x.columns))
m = df.columns.str.contains('_')
df1 = (df.set_index(df.columns[~m].tolist())
.groupby(lambda x: x.split('_')[0], axis=1)
.agg(rms)
.reset_index())
print (df1)
A B C D E K
0 a 2 r 4 3.915780 5.972158
1 e g 1 d 5.567764 4.690416
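As far as I know, newer pandas releases deprecate groupby(..., axis=1). A sketch of the same column grouping via a transpose follows; the data frame is a cut-down, made-up version with a single identifier column A, and has_sep, wide and out are illustrative names.

```python
import pandas as pd

# Hypothetical data: A is an identifier, E_* and K_* are grouped by prefix.
df = pd.DataFrame({
    "A": ["a", "e"],
    "E_1": [1, 5],
    "E_2": [6, 8],
    "K_1": [9, 2],
    "K_2": [3, 7],
})

has_sep = df.columns.str.contains("_")
# Move the columns without a separator into the index.
wide = df.set_index(df.columns[~has_sep].tolist())

# Transpose so that column names become row labels, group those labels by
# their prefix, aggregate, then transpose back.
out = (wide.T
           .groupby(lambda name: name.split("_")[0])
           .max()
           .T
           .reset_index())
print(out)
```

One thing to watch: transposing a frame with mixed column types produces object columns, so this sketch is most comfortable when the grouped columns share one dtype.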
Applying function to each group and column of R dataframe
You can use the package dplyr. Use group_by to do it for each Category and mutate_if to apply the function to all numerical columns:
library(dplyr)
df <- read.table(header = TRUE, text =
" Category a b c
a 2.0 5.0 -5.0
a 1.5 10.0 10.0
b 3.2 14.5 100.2")
replace_outliers <- function(column) {
qnt <- quantile(column, probs=c(.25, .75))
upper_whisker <- 1.5 * IQR(column)
clean_data <- column
clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
clean_data
}
df %>% group_by(Category) %>%
mutate_if(is.numeric, replace_outliers)
Groupby and apply a specific function to certain columns and another function to the rest of the df Pandas
You can add an if-else inside the function passed to agg:
df = df.groupby('id').agg(lambda x : x.count() if x.name in ['var1','var2'] else x.mean())
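An alternative spelling that does not depend on x.name is to build a dictionary mapping each column to its aggregation and pass that to agg. A minimal sketch, assuming a made-up data frame with one extra column var3; count_cols and spec are illustrative names:

```python
import pandas as pd

# Hypothetical data: var1 and var2 are counted, everything else is averaged.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "var1": [1.0, None, 3.0],
    "var2": [4.0, 5.0, 6.0],
    "var3": [10.0, 20.0, 30.0],
})

count_cols = ["var1", "var2"]

# Map every non-key column to the name of its aggregation.
spec = {
    col: ("count" if col in count_cols else "mean")
    for col in df.columns.drop("id")
}

out = df.groupby("id").agg(spec)
print(out)
```

Note that count ignores missing values, so var1 for id 1 is 1 rather than 2.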