Apply a Function Over Groups of Columns

This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame), I'd opt for lapply with do.call rather than sapply:
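
For the snippets below, assume a small numeric data frame dat along these lines (a hypothetical setup; any data frame with at least six columns works):

# hypothetical sample data, only needed to make the snippets below runnable
set.seed(1)
dat <- data.frame(matrix(rnorm(16 * 6), ncol = 6))
names(dat) <- letters[1:6]  # letter names so the column-name example works too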

x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

This works if you just have column names too:

x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))

EDIT

It just occurred to me that you may want to automate this to run over every three columns. I know there's a better way, but here it is on a 100-column data set:

dat <- data.frame(matrix(rnorm(16*100), ncol=100))

n <- 1:ncol(dat)
# pad the column indices with NAs so their number is a multiple of 3
ind <- matrix(c(n, rep(NA, 3 - ncol(dat) %% 3)), byrow = TRUE, ncol = 3)
# drop the padded row; each column of the resulting data frame is one index triple
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
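
Here is a minimal alternative sketch, assuming the same dat: split() builds the index groups directly and avoids the NA padding (note that a trailing group with fewer than three columns is kept rather than dropped):

# group the column indices in runs of three without NA padding
ind2 <- split(seq_len(ncol(dat)), ceiling(seq_len(ncol(dat)) / 3))
# drop = FALSE keeps a single-column group as a one-column data frame
do.call(cbind, lapply(ind2, function(i) rowMeans(dat[, i, drop = FALSE])))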

EDIT 2
I'm still not happy with the indexing; I think there's a better/faster way to pass the indices. Here's a second, though still not satisfying, method:

n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat) %% 3)), byrow = FALSE, nrow = 3))
# keep only the index columns that contain no NAs
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]

do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))

Apply a function across groups and columns in data.table and/or dplyr

The sample datasets provided with the question indicate that the names of the columns may differ between datasets, e.g., column b of dt1 and column b2 of dt2 are supposed to be added.

Here are three approaches which should work for an arbitrary number of arbitrarily named pairs of columns:

  1. Working in long format
  2. Update joins using get()
  3. Computing on the language

1. Working in long format

The information on corresponding columns can be provided in a look-up table or translation table:

library(data.table)
lut <- data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2"))

lut
   vars1 vars2
1:     a    a2
2:     b    b2
3:     c    c2

In cases where column names are treated as data and the column data are of the same data type, my first approach is to reshape to long format.
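
The snippets below assume sample data of roughly this shape (a reconstruction matching the printed result; the actual datasets come from the question):

# reconstructed sample data, shown only to make the example self-contained
dt1 <- data.table(a = 1:10, b = 11:20, c = 21:30,
                  groupVar = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))
dt2 <- data.table(a2 = c(10, 20, 30), b2 = c(11, 21, 31), c2 = c(12, 22, 32))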

# reshape to long format
mdt1 <- melt(dt1[, rn := .I], measure.vars = lut$vars1)
mdt2 <- melt(dt2[, groupVar := .I], measure.vars = lut$vars2)
# update join to translate variable names
mdt2[lut, on = .(variable = vars2), variable := vars1]
# update join to add corresponding values of both tables
mdt1[mdt2, on = .(groupVar, variable), value := x.value + i.value]
# reshape back to wide format
dt3 <- dcast(mdt1, rn + groupVar ~ ...)[, rn := NULL][]
dt3
    groupVar  a  b  c
 1:        1 11 22 33
 2:        1 12 23 34
 3:        1 13 24 35
 4:        2 24 35 46
 5:        2 25 36 47
 6:        2 26 37 48
 7:        3 37 48 59
 8:        3 38 49 60
 9:        3 39 50 61
10:        3 40 51 62

2. Update joins using get()

On second thought, here is an approach which is similar to the OP's proposed for loop but requires much less coding:

vars1 <- c("a", "b", "c")
vars2 <- c("a2", "b2", "c2")
dt2[, groupVar := .I]

for (iv in seq_along(vars1)) {
  dt1[dt2, on = .(groupVar),
      (vars1[iv]) := get(paste0("x.", vars1[iv])) + get(paste0("i.", vars2[iv]))][]
}

dt1[]
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

Note that dt1 is updated by reference, i.e., without copying.

Prepending the variable names vars1[iv] with "x." and vars2[iv] with "i." on the right-hand side of := ensures that the right columns from dt1 and dt2, respectively, are picked in case of duplicated column names. See the Advanced: section on the j parameter in help("data.table").
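
For example, the first loop iteration (iv = 1) is equivalent to the plain update join

dt1[dt2, on = .(groupVar), a := x.a + i.a2]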

3. Computing on the language

This follows Matt Dowle's advice to create one expression to be evaluated, "similar to constructing a dynamic SQL statement to send to a server". See here for another use case.

library(glue)     # literal string interpolation
library(magrittr) # piping used to improve readability

# paste the pieces into one string, parse it, and evaluate it two frames up,
# i.e., in the caller of the magrittr pipeline
EVAL <- function(...) eval(parse(text = paste0(...)), envir = parent.frame(2))

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
  glue_collapse(sep = ", ") %>%
  {glue("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`({.})][]")} %>%
  EVAL()
     a  b  c groupVar
 1: 11 22 33        1
 2: 12 23 34        1
 3: 13 24 35        1
 4: 24 35 46        2
 5: 25 36 47        2
 6: 26 37 48        2
 7: 37 48 59        3
 8: 38 49 60        3
 9: 39 50 61        3
10: 40 51 62        3

It starts with a look-up table which is created on-the-fly and subsequently manipulated to form a complete data.table statement

dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(a = x.a + i.a2, b = x.b + i.b2, c = x.c + i.c2)][]

as a character string. This string is then evaluated and executed in one go; no for loops required.

As the helper function EVAL() already uses paste0(), the call to glue() can be omitted:

data.table(vars1 = c("a", "b", "c"), vars2 = c("a2", "b2", "c2")) %>%
  glue_data("{vars1} = x.{vars1} + i.{vars2}") %>%
  glue_collapse(sep = ", ") %>%
  {EVAL("dt1[dt2[, groupVar := .I], on = .(groupVar), `:=`(", ., ")][]")}

Note that the dot . and curly brackets {} are used with different meanings in different contexts, which may appear somewhat confusing.

R: Group by and Apply a general function to two columns

Update based on real-life example:

You can do a direct approach like this:

library(tidyverse)
library(InfoTrad)
dat %>%
  group_by(ticker, date) %>%
  summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date      PIN
  <chr>  <dbl>    <dbl>
1 A          1 1.05e-17
2 A          2 1.56e- 1
3 B          1 1.12e- 8
4 B          2 7.07e- 9

The difficulty here was that the YZ function only accepts true data frames, not tibbles, and that it returns several values, not just PIN.

You could theoretically wrap this up into your own function and then run it like I've shown in the example below, but maybe this way already does the trick.

I also don't expect this to run much faster than a for loop. This YZ function seems to have more-than-linear runtime, so passing larger amounts of data will still take some time. You can start with a small set of data, repeatedly increase its size by a factor of maybe 10, and check how fast it runs.
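
A minimal sketch of such a scaling check (hypothetical code; it assumes the dat, ticker, date, and YZ from the example above):

# time the grouped PIN computation on random subsets of growing size
for (n in c(100, 1000, 10000)) {
  sub <- dat[sample(nrow(dat), min(n, nrow(dat))), ]
  print(system.time(
    sub %>%
      group_by(ticker, date) %>%
      summarize(PIN = YZ(as.data.frame(cur_data()))$PIN)
  ))
}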


In your example, you can do:

my_function <- function(data) {
  data %>%
    summarize(rv = sum(ret, vol))
}

library(tidyverse)
df %>%
  group_by(ticker, date) %>%
  my_function()

# A tibble: 4 x 3
# Groups:   ticker [2]
  ticker  date    rv
  <chr>  <dbl> <dbl>
1 A          1     7
2 A          2     5
3 B          1    10
4 B          2    11

But as mentioned in my comment, I'm not sure if this general example would help in your real-life use case.

It might also be that you don't need to create your own function because built-in functions already exist. As in the example, you are better off directly summarizing instead of wrapping it into a function.
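
That is, for the toy example you can drop the wrapper entirely:

df %>%
  group_by(ticker, date) %>%
  summarize(rv = sum(ret, vol))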

Apply function to certain groups of columns of a pandas dataframe

So we may do it row by row:

from scipy import stats

df['p_v'] = df.apply(lambda x: stats.wilcoxon(x['col1':'col2'], x['col3':'col4'])[1], axis=1)

Pandas: Apply custom function to groups and store result in new columns in each group

I think you need:

def func(x):
    # first difference of Column3 within the group
    d = x['Column3'].diff().dropna().iloc[0]
    last = x.index[-1]
    x.loc[last, 'Difference'] = d
    x.loc[last, 'Message'] = "Calculated!"
    return x

# keep only the groups whose Column1 values are all distinct
df1 = df.groupby('Column2').filter(lambda x: x['Column1'].is_unique)

df1 = df1.groupby(['Column2']).apply(func)

pandas dataframe group columns based on name and apply a function

You can move the columns without a separator into the index, then group the remaining columns by the part of the name before '_' with a lambda function, aggregating with a function like max:

m = df.columns.str.contains('_')

df = (df.set_index(df.columns[~m].tolist())
        .groupby(lambda x: x.split('_')[0], axis=1)
        .max()
        .reset_index())
print(df)
   A  B  C  D  E  K
0  a  2  r  4  6  9
1  e  g  1  d  8  7

Solution with custom function:

import numpy as np

def rms(x):
    # root mean square across the columns of each group
    return np.sqrt(np.sum(x**2, axis=1) / len(x.columns))

m = df.columns.str.contains('_')

df1 = (df.set_index(df.columns[~m].tolist())
         .groupby(lambda x: x.split('_')[0], axis=1)
         .agg(rms)
         .reset_index())
print(df1)
   A  B  C  D         E         K
0  a  2  r  4  3.915780  5.972158
1  e  g  1  d  5.567764  4.690416

Applying function to each group and column of R dataframe

You can use the package dplyr: use group_by to do it for each Category and mutate_if to apply the function to all numeric columns.

library(dplyr)
df <- read.table(header = TRUE, text = "
  Category    a    b     c
  a         2.0  5.0  -5.0
  a         1.5 10.0  10.0
  b         3.2 14.5 100.2")

replace_outliers <- function(column) {
  qnt <- quantile(column, probs = c(.25, .75))
  upper_whisker <- 1.5 * IQR(column)
  clean_data <- column
  clean_data[column > (qnt[2] + upper_whisker)] <- median(column)
  clean_data
}

df %>%
  group_by(Category) %>%
  mutate_if(is.numeric, replace_outliers)

Groupby and apply a specific function to certain columns and another function to the rest of the df Pandas

You can add an if/else inside agg:

df = df.groupby('id').agg(lambda x : x.count() if x.name in ['var1','var2'] else x.mean())

