R Applying a Function to a Subset of a Data Frame


> aggregate( . ~ z, data=temp, FUN=mean)
  z        x        y
1 1 1.505304 2.474642
2 3 1.533418 2.477191

When you need to apply the same function to several columns within the categories of another column, consider 'aggregate'. This is the version that takes a formula argument: the "dot" before the tilde means "take the mean of every column other than z".
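
The data frame temp is not shown in the answer. A minimal sketch, assuming a hypothetical temp with one grouping column z and two numeric columns x and y, shows the same formula interface on data small enough to check by hand:

```r
# Hypothetical stand-in for `temp`; the original data is not shown.
temp <- data.frame(
  z = c(1, 1, 3, 3),
  x = c(1, 2, 3, 5),
  y = c(10, 20, 30, 50)
)

# `.` on the left-hand side means "every column not named on the right".
res <- aggregate(. ~ z, data = temp, FUN = mean)
res
#   z   x  y
# 1 1 1.5 15
# 2 3 4.0 40
```

Naming columns explicitly, for example cbind(x, y) ~ z, should give the same result here and avoids picking up columns you did not intend to average.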

R Apply() function on specific dataframe columns

Using an example data.frame and an example function (which just adds 1 to every value):

A <- function(x) x + 1
wifi <- data.frame(replicate(9,1:4))
wifi

# X1 X2 X3 X4 X5 X6 X7 X8 X9
#1 1 1 1 1 1 1 1 1 1
#2 2 2 2 2 2 2 2 2 2
#3 3 3 3 3 3 3 3 3 3
#4 4 4 4 4 4 4 4 4 4

data.frame(wifi[1:3], apply(wifi[4:9],2, A) )
#or
cbind(wifi[1:3], apply(wifi[4:9],2, A) )

# X1 X2 X3 X4 X5 X6 X7 X8 X9
#1 1 1 1 2 2 2 2 2 2
#2 2 2 2 3 3 3 3 3 3
#3 3 3 3 4 4 4 4 4 4
#4 4 4 4 5 5 5 5 5 5

Or even:

data.frame(wifi[1:3], lapply(wifi[4:9], A) )
#or
cbind(wifi[1:3], lapply(wifi[4:9], A) )

# X1 X2 X3 X4 X5 X6 X7 X8 X9
#1 1 1 1 2 2 2 2 2 2
#2 2 2 2 3 3 3 3 3 3
#3 3 3 3 4 4 4 4 4 4
#4 4 4 4 5 5 5 5 5 5
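
The two variants return different types: apply() converts its input to a matrix, while lapply() returns a list of columns. A short sketch of that difference, plus an in-place assignment that keeps the result a data frame (the name wifi2 is introduced here only for illustration):

```r
A <- function(x) x + 1
wifi <- data.frame(replicate(9, 1:4))

# apply() coerces the selected columns to a matrix first.
is.matrix(apply(wifi[4:9], 2, A))   # TRUE

# lapply() works column by column and returns a list.
is.list(lapply(wifi[4:9], A))       # TRUE

# Assigning the list back replaces columns 4 to 9 without rebuilding the frame.
wifi2 <- wifi
wifi2[4:9] <- lapply(wifi2[4:9], A)
```

The lapply() form is usually the safer choice when the selected columns have mixed types, because no matrix conversion takes place.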

R: apply function to subsets based on column value

Using base R's split and lapply (Gini comes from the DescTools package):

library(DescTools)
lapply(split(df, df$region),
       function(x) Gini(x$income, n = rep(1, length(x$income)), unbiased = TRUE,
                        conf.level = NA, R = 1000, type = "bca", na.rm = TRUE))

Using tidyverse:

library(tidyverse)
library(DescTools)
df %>%
  group_by(region) %>%
  nest() %>%
  mutate(gini_coef = map(data, ~ Gini(.x$income, n = rep(1, length(.x$income)),
                                      unbiased = TRUE, conf.level = NA, R = 1000,
                                      type = "bca", na.rm = TRUE))) %>%
  select(-data) %>%
  unnest(gini_coef) %>%   # name the column explicitly; bare unnest() is deprecated
  left_join(df)

Joining, by = "region"
# A tibble: 10 x 4
   region gini_coef    ID income
   <fct>      <dbl> <int>  <int>
 1 rot       0.177      1   3700
 2 rot       0.177      9   4000
 3 rot       0.177     10   4400
 4 rot       0.177     12   2000
 5 ams       0.0698     2   2500
 6 ams       0.0698     6   3100
 7 ams       0.0698     8   3000
 8 utr       0.154      3   3300
 9 utr       0.154      4   5300
10 utr       0.154      5   4400
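
The split-apply pattern itself does not depend on Gini(). As a dependency-free sketch, the same grouping can be run on this answer's sample data with mean() standing in for the Gini coefficient (the stand-in statistic and the column name mean_income are illustrative assumptions, not part of the original answer):

```r
df <- read.table(text = "
ID region income
1 rot 3700
2 ams 2500
3 utr 3300
4 utr 5300
5 utr 4400
6 ams 3100
8 ams 3000
9 rot 4000
10 rot 4400
12 rot 2000
", header = TRUE)

# One value per region: split the income vector by region, then summarise.
by_region <- sapply(split(df$income, df$region), mean)

# Attach the per-region value to every row, similar to the left_join() step.
df$mean_income <- ave(df$income, df$region, FUN = mean)
```

Swapping mean for any other function that takes a numeric vector and returns a single number should work the same way.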

Data

df <- read.table(text = "
ID region income
1 rot 3700
2 ams 2500
3 utr 3300
4 utr 5300
5 utr 4400
6 ams 3100
8 ams 3000
9 rot 4000
10 rot 4400
12 rot 2000
", header = TRUE)

Use an apply function to a subset of rows in a data frame - vectorised solution

The other solutions assume that the function being called is vectorized. If it is not, here is an alternative:

# seq_len() is safe even when df.data has zero rows, unlike 1:nrow(df.data)
sapply(seq_len(nrow(df.data)), function(x) {
  fnATimesB(df.data[x, 'days'], df.data[x, 'sal'])
})

Alternatively, you can use apply here and avoid the anonymous function by slightly modifying your original function. Keep in mind that apply converts the data set to a matrix, so the input should not contain non-numeric columns; the example below drops the first column with df.data[-1L]. Here is an example:

# apply passes each row as a named vector, so `df` here is a single row
fnATimesB <- function(df, a, b) {
  df[a] * df[b]
}

apply(df.data[-1L], 1L, fnATimesB, a = 'days', b = 'sal')
## [1] 1000 12000 25000
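
Neither df.data nor the two-argument version of fnATimesB is shown in the answer. A self-contained sketch, assuming a hypothetical df.data with a character column name and numeric columns days and sal, and a hypothetical scalar function times_scalar, uses mapply() to walk the rows without converting the data frame to a matrix:

```r
# Hypothetical data; the values are chosen to reproduce the output above.
df.data <- data.frame(
  name = c("a", "b", "c"),
  days = c(10, 20, 25),
  sal  = c(100, 600, 1000)
)

# Hypothetical non-vectorised function: it accepts one pair of values only.
times_scalar <- function(a, b) {
  stopifnot(length(a) == 1, length(b) == 1)
  a * b
}

# mapply() calls the function once per row, pairing days[i] with sal[i].
res <- mapply(times_scalar, df.data$days, df.data$sal)
res
# [1]  1000 12000 25000
```

Because mapply() receives the columns directly, the character column name never has to be dropped or coerced.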

apply function to subsets of dataframe r

You could use the dplyr package, perhaps as follows:

library(dplyr)
data1 %>%
  group_by(Meteostation, Year) %>%
  do(data.frame(biovars(.$pr, .$tasmin, .$tasmax)))

Apply custom function to each subset of a data frame and result a dataframe

dplyr

You could use do in dplyr:

library(dplyr)
df %>%
  group_by(sample_id) %>%
  do(f.get_reg(.))

Which gives:

  sample_id     N       slope intercept            S
      (int) (int)       (dbl)     (dbl)        (dbl)
1      6724     3 -0.08518211  26.12125 7.716050e-15
2      6728     3 -0.22387160  41.41037 5.551115e-17

data.table

Use .SD in data.table:

library(data.table)

df <- data.table(df)
df[, f.get_reg(.SD), by = sample_id]

Which gives the same result:

   sample_id N       slope intercept            S
1:      6724 3 -0.08518211  26.12125 7.716050e-15
2:      6728 3 -0.22387160  41.41037 5.551115e-17

base R

Using by:

resultList <- by(df,df$sample_id,f.get_reg)
sample_id <- names(resultList)
result <- do.call(rbind,resultList)
result$sample_id <- sample_id
rownames(result) <- NULL

Which gives:

  N       slope intercept            S sample_id
1 3 -0.08518211  26.12125 7.716050e-15      6724
2 3 -0.22387160  41.41037 5.551115e-17      6728
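
f.get_reg is not defined in the answer. A minimal base R sketch, assuming a hypothetical summary function get_stats that returns a one-row data frame per group and hypothetical columns sample_id, x and y, shows the split, apply and rbind steps end to end:

```r
# Hypothetical data: two samples with three observations each.
dat <- data.frame(
  sample_id = c(6724, 6724, 6724, 6728, 6728, 6728),
  x = c(1, 2, 3, 1, 2, 3),
  y = c(2, 4, 6, 5, 4, 3)
)

# Hypothetical per-group function: fits y ~ x and returns a one-row data frame.
get_stats <- function(d) {
  fit <- lm(y ~ x, data = d)
  data.frame(
    sample_id = d$sample_id[1],
    N = nrow(d),
    slope = unname(coef(fit)["x"]),
    intercept = unname(coef(fit)["(Intercept)"])
  )
}

# Split by group, apply the function, then stack the one-row results.
result <- do.call(rbind, lapply(split(dat, dat$sample_id), get_stats))
rownames(result) <- NULL
```

Returning the group key from inside the function is a design choice made for this sketch; it removes the need to recover sample_id from the list names afterwards.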

creating a function to subset data frame in R

We can use [[ inside a function

f1 <- function(id) {
  df[df[["ID"]] == id, ]
}
f1(11)
# ID Item
#1 11 a
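
The function above reads df from the enclosing environment. A small sketch, assuming a hypothetical two-column data frame like the one implied by the output, passes the data frame in as an argument so the function does not depend on a global variable (subset_by_id and items are names invented for this example):

```r
# Hypothetical data matching the printed columns ID and Item.
items <- data.frame(ID = c(10, 11, 12), Item = c("x", "a", "b"))

# Taking the data frame as a parameter makes the function reusable.
subset_by_id <- function(data, id) {
  data[data[["ID"]] == id, , drop = FALSE]
}

subset_by_id(items, 11)
#   ID Item
# 2 11    a
```

An ID that is not present returns a data frame with zero rows rather than an error.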

Apply function to subsets of dataset

Here is one option:

  • Use ceiling to round the time values up to the next whole year.
  • For each ID and year, calculate the average value.
  • Use complete to create rows for the missing years.
  • Use fill to carry the previous average value forward.

library(dplyr)
library(tidyr)

df %>%
  group_by(ID, year = ceiling(time)) %>%
  summarise(mean_value = mean(value)) %>%
  complete(year = min(year):max(year)) %>%
  fill(mean_value) %>%
  ungroup()

#      ID  year mean_value
#   <int> <dbl>      <dbl>
# 1     1     1        4
# 2     1     2        4
# 3     1     3        6
# 4     1     4       12
# 5     2     1        3
# 6     2     2        6
# 7     2     3        8
# 8     3     1        1.7
# 9     3     2        5

data

df <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L), time = c(0.1, 0.5, 2.1, 3.3, 0.3, 0.4,
0.6, 1.2, 1.5, 2.6, 2.7, 0.1, 0.4, 1.3, 1.5, 1.6), value = c(3,
5, 6, 12, 1, 3, 5, 4, 8, 2, 14, 1.1, 2.3, 6, 3, 6)),
class = "data.frame", row.names = c(NA, -16L))
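
For comparison, the rounding and averaging steps can also be sketched in base R with aggregate(). This version repeats the data above so that it runs on its own, and it covers only the first two bullet points; it does not create or fill the missing years:

```r
df <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L),
  time = c(0.1, 0.5, 2.1, 3.3, 0.3, 0.4, 0.6, 1.2, 1.5, 2.6, 2.7,
           0.1, 0.4, 1.3, 1.5, 1.6),
  value = c(3, 5, 6, 12, 1, 3, 5, 4, 8, 2, 14, 1.1, 2.3, 6, 3, 6)
)

# Round each time up to its year, then average value within ID and year.
df$year <- ceiling(df$time)
means <- aggregate(value ~ ID + year, data = df, FUN = mean)

# Sort by ID, then year, to match the row order of the dplyr output.
means <- means[order(means$ID, means$year), ]
```

The result has eight rows instead of nine, because the year 2 row for ID 1 exists only after the complete and fill steps.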

