Performing Dplyr Mutate on Subset of Columns

Am I missing something or would this work as expected:

cols <- paste0("X", c(2,4))
dd %>% mutate(evensum = rowSums(.[cols]), evenmean = rowMeans(.[cols]))
# id X1 X2 X3 X4 X5 evensum evenmean
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768

Or are you specifically looking for a custom function to do this?
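
For comparison, here is a hedged sketch of the same row-wise sum and mean written with across(), assuming dplyr 1.0 or later. The `dd` below is made-up data with the same column layout as in the answer, not the original values.

```r
library(dplyr)

# Hypothetical data shaped like the `dd` used above (values are random).
set.seed(42)
dd <- data.frame(id = letters[1:4],
                 X1 = runif(4), X2 = runif(4), X3 = runif(4),
                 X4 = runif(4), X5 = runif(4))

cols <- paste0("X", c(2, 4))

# across() returns the selected columns as a data frame,
# which rowSums() and rowMeans() accept directly.
res <- dd %>%
  mutate(evensum  = rowSums(across(all_of(cols))),
         evenmean = rowMeans(across(all_of(cols))))
```

all_of() makes it explicit that `cols` is a character vector of column names rather than a column of `dd`.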


Not exactly what you are looking for but if you want to do it inside a pipe you could use select explicitly inside mutate like this:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% rowSums)
# id X1 X2 X3 X4 X5 xy
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535

However, it is a bit more complicated if you want to apply several functions. You could use a helper function along these lines (not thoroughly tested):

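f <- function(x, ...) {
  n <- nrow(x)
  # recycle scalar results (e.g. max) to one value per row
  x <- lapply(list(...), function(y) if (length(y) == 1L) rep(y, n) else y)
  # bind the results column-wise into a matrix with n rows
  matrix(unlist(x), nrow = n, byrow = FALSE)
}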

And then apply it like this:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% f(., rowSums(.), max(.)))
# id X1 X2 X3 X4 X5 xy.1 xy.2
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.9888592
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.9888592
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.9888592
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.9888592

R - mutate a subset of columns only on a subset of rows

See this post for more info

df1 %>%
  mutate_at(vars(starts_with("B")),
            .funs = list(~ if_else(Date %in% as.Date(c("2020-01-01", "2020-01-06")), 0.2 * ., .)))
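
The same update can probably be written with across(), which superseded the mutate_at() family in dplyr 1.0. This is a minimal sketch; `df1` and `target_dates` below are hypothetical stand-ins, since the original data is not shown.

```r
library(dplyr)

# Hypothetical data: one Date column, one untouched column, two "B" columns.
df1 <- data.frame(
  Date = as.Date("2020-01-01") + 0:5,
  A    = 1:6,
  B1   = rep(10, 6),
  B2   = rep(20, 6)
)

target_dates <- as.Date(c("2020-01-01", "2020-01-06"))

# Scale the B columns by 0.2 only on the rows whose Date matches.
res <- df1 %>%
  mutate(across(starts_with("B"),
                ~ if_else(Date %in% target_dates, 0.2 * .x, .x)))
```

Rows with other dates keep their original values because if_else() falls back to `.x`.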

dplyr mutate/replace several columns on a subset of rows

These solutions (1) maintain the pipeline, (2) do not overwrite the input and (3) only require that the condition be specified once:

1a) mutate_cond Create a simple function for data frames or data tables that can be incorporated into pipelines. This function is like mutate but only acts on the rows satisfying the condition:

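mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  # evaluate the condition with the columns of .data in scope
  condition <- eval(substitute(condition), .data, envir)
  # mutate only the matching rows and write them back
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}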

DF %>% mutate_cond(measure == 'exit', qty.exit = qty, cf = 0, delta.watts = 13)
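
To see the effect on something small, here is a sketch that repeats the definition and applies it to a hypothetical four-row data frame (`toy` is invented for illustration; the real DF is given in Note 1).

```r
library(dplyr)

mutate_cond <- function(.data, condition, ..., envir = parent.frame()) {
  condition <- eval(substitute(condition), .data, envir)
  .data[condition, ] <- .data[condition, ] %>% mutate(...)
  .data
}

# Hypothetical toy data with the same column names as DF.
toy <- data.frame(measure  = c("cfl", "exit", "led", "exit"),
                  qty      = c(3, 5, 7, 9),
                  qty.exit = 0,
                  cf       = c(0.1, 0.2, 0.3, 0.4))

out <- toy %>% mutate_cond(measure == "exit", qty.exit = qty, cf = 0)

# Only the "exit" rows change; `toy` itself is left untouched.
out
```

Note that mutate_cond() can only update columns that already exist, because the modified rows are assigned back into the original data frame.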

1b) mutate_last This is an alternative function for data frames or data tables which again is like mutate but is only used within group_by (as in the example below) and only operates on the last group rather than every group. Note that TRUE > FALSE so if group_by specifies a condition then mutate_last will only operate on rows satisfying that condition.

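mutate_last <- function(.data, ...) {
  n <- n_groups(.data)
  # Older dplyr versions stored 0-based row indices per group in the
  # "indices" attribute, hence the + 1. In dplyr >= 0.8 this attribute is
  # gone; group_rows(.data)[[n]] gives the same rows (already 1-based).
  indices <- attr(.data, "indices")[[n]] + 1
  .data[indices, ] <- .data[indices, ] %>% mutate(...)
  .data
}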


DF %>%
  group_by(is.exit = measure == 'exit') %>%
  mutate_last(qty.exit = qty, cf = 0, delta.watts = 13) %>%
  ungroup() %>%
  select(-is.exit)

2) factor out condition Factor out the condition by making it an extra column which is later removed. Then use ifelse, replace or arithmetic with logicals as illustrated. This also works for data tables.

library(dplyr)

DF %>% mutate(is.exit = measure == 'exit',
              qty.exit = ifelse(is.exit, qty, qty.exit),
              cf = (!is.exit) * cf,
              delta.watts = replace(delta.watts, is.exit, 13)) %>%
  select(-is.exit)

3) sqldf We could use SQL update via the sqldf package in the pipeline for data frames (but not data tables unless we convert them -- this may represent a bug in dplyr. See dplyr issue 1579). It may seem that we are undesirably modifying the input in this code due to the existence of the update but in fact the update is acting on a copy of the input in the temporarily generated database and not on the actual input.

library(sqldf)

DF %>%
do(sqldf(c("update '.'
set 'qty.exit' = qty, cf = 0, 'delta.watts' = 13
where measure = 'exit'",
"select * from '.'")))

4) row_case_when Also check out row_case_when defined in "Returning a tibble: how to vectorize with case_when?". It uses a syntax similar to case_when but applies to rows.

library(dplyr)

DF %>%
  row_case_when(
    measure == "exit" ~ data.frame(qty.exit = qty, cf = 0, delta.watts = 13),
    TRUE ~ data.frame(qty.exit, cf, delta.watts)
  )

Note 1: We used this as DF

set.seed(1)
DF <- data.frame(site = sample(1:6, 50, replace = TRUE),
                 space = sample(1:4, 50, replace = TRUE),
                 measure = sample(c('cfl', 'led', 'linear', 'exit'), 50,
                                  replace = TRUE),
                 qty = round(runif(50) * 30),
                 qty.exit = 0,
                 delta.watts = sample(10.5:100.5, 50, replace = TRUE),
                 cf = runif(50))

Note 2: The problem of how to easily specify updating a subset of rows is also discussed in dplyr issues 134, 631, 1518 and 1573 with 631 being the main thread and 1573 being a review of the answers here.

Mutate across all but some columns using dplyr

It's easier than you think:

mtcars %>% mutate(across(-c(gear, carb), demean))
mpg cyl disp hp drat wt
Mazda RX4 0.909375 -0.1875 -70.721875 -36.6875 0.3034375 -0.59725
Mazda RX4 Wag 0.909375 -0.1875 -70.721875 -36.6875 0.3034375 -0.34225
Datsun 710 2.709375 -2.1875 -122.721875 -53.6875 0.2534375 -0.89725
Hornet 4 Drive 1.309375 -0.1875 27.278125 -36.6875 -0.5165625 -0.00225
Hornet Sportabout -1.390625 1.8125 129.278125 28.3125 -0.4465625 0.22275
Valiant -1.990625 -0.1875 -5.721875 -41.6875 -0.8365625 0.24275
Duster 360 -5.790625 1.8125 129.278125 98.3125 -0.3865625 0.35275
Merc 240D 4.309375 -2.1875 -84.021875 -84.6875 0.0934375 -0.02725
Merc 230 2.709375 -2.1875 -89.921875 -51.6875 0.3234375 -0.06725
qsec vs am gear carb
Mazda RX4 -1.38875 -0.4375 0.59375 4 4
Mazda RX4 Wag -0.82875 -0.4375 0.59375 4 4
Datsun 710 0.76125 0.5625 0.59375 4 1
Hornet 4 Drive 1.59125 0.5625 -0.40625 3 1
Hornet Sportabout -0.82875 -0.4375 -0.40625 3 2
Valiant 2.37125 0.5625 -0.40625 3 1
Duster 360 -2.00875 -0.4375 -0.40625 3 4
Merc 240D 2.15125 0.5625 -0.40625 4 2
Merc 230 5.05125 0.5625 -0.40625 4 2
[ reached 'max' / getOption("max.print") -- omitted 23 rows ]
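
The snippet relies on a `demean` function that is not shown here. A minimal sketch, assuming `demean` simply subtracts the column mean (which is consistent with the printed output, e.g. 21 - 20.090625 = 0.909375 for the first mpg value):

```r
library(dplyr)

# Assumed definition: centre a numeric vector on its mean.
demean <- function(x) x - mean(x)

# Negative selection: every column except gear and carb is demeaned.
res <- mtcars %>% mutate(across(-c(gear, carb), demean))

head(res, 3)
```

The columns excluded with `-c(gear, carb)` pass through unchanged.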

R How to mutate a subset of rows

Using data.table, we'd do:

setDT(data)[colA == "ABC", ColB := "XXXX"]

and the values are modified in place, unlike ifelse(), which would copy the entire column just to replace the rows where the condition holds.

We call this sub-assign by reference. You can read more about it in the new HTML vignettes.
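
A small self-contained sketch of the same call, assuming the data.table package is installed. The `data` object here is a hypothetical three-row example; only the column names come from the snippet above.

```r
library(data.table)

# Hypothetical toy data using the column names from the snippet above.
data <- data.frame(colA = c("ABC", "DEF", "ABC"),
                   ColB = c("a", "b", "c"),
                   stringsAsFactors = FALSE)

setDT(data)                          # convert to a data.table by reference
data[colA == "ABC", ColB := "XXXX"]  # update only the matching rows, in place

print(data)
```

Because `:=` updates by reference, there is no need to assign the result back to `data`.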

dplyr on subset of columns while keeping the rest of the data.frame

dg <- iris %>% mutate_each(funs(Replace15), matches("^Petal"))

Alternatively (as posted by @aosmith) you could use starts_with. Have a look at ?select for the other special functions available within select, summarise_each and mutate_each.
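
mutate_each() and funs() were deprecated in later dplyr releases, so on a current installation the same idea would more likely be written with across(). The sketch below is hedged accordingly: `Replace15` is not defined in this answer, so a hypothetical version (raise values below 1.5 to 1.5) stands in for it.

```r
library(dplyr)

# Hypothetical stand-in for the question's Replace15 helper.
Replace15 <- function(x) ifelse(x < 1.5, 1.5, x)

# Apply it only to the columns whose names start with "Petal";
# all other columns of iris are kept as they are.
dg <- iris %>% mutate(across(matches("^Petal"), Replace15))
```

starts_with("Petal") could be used in place of matches("^Petal") with the same result here.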

Copying values from one subset to all others for selected columns using dplyr

I think you could calculate the correct median for each user using only the first record for each user, and then left_join.

df <-
  tibble(
    UserId = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "D", "D", "D"),
    DFFU = c(0, 1, 2, 3, 4, 0, 2, 4, 5, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4),
    Q1 = c(3, 1, 1, 0, 1, 4, 2, 5, 4, 5, 2, 5, 6, 6, 5, 5, 4, 0, 1),
    Q2 = c(2, 0, 1, 2, 1, 8, 2, 6, 5, 7, 2, 5, 5, 6, 3, 3, 2, 0, 1),
    Q3 = c(1, 0, 0, 0, 1, 2, 1, 5, 5, 2, 2, 4, 3, 4, 5, 4, 6, 1, 1)
  )

# per-user standard deviation, repeated on every row of that user
df <- df %>%
  group_by(UserId) %>%
  mutate(across(all_of(c("Q1", "Q2", "Q3")), sd, .names = "Sigma_{.col}")) %>%
  ungroup()

# median of the per-user values (one row per user), joined back onto df
df %>%
  filter(DFFU == 0) %>%
  transmute(UserId = UserId,
            across(all_of(paste0("Sigma_", c("Q1", "Q2", "Q3"))), median, .names = "Median_{.col}")) %>%
  {left_join(df, .)}

Yielding:

> df
# A tibble: 19 x 11
UserId DFFU Q1 Q2 Q3 Sigma_Q1 Sigma_Q2 Sigma_Q3 Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 0 3 2 1 1.10 0.837 0.548 1.45 1.59 1.53
2 A 1 1 0 0 1.10 0.837 0.548 1.45 1.59 1.53
3 A 2 1 1 0 1.10 0.837 0.548 1.45 1.59 1.53
4 A 3 0 2 0 1.10 0.837 0.548 1.45 1.59 1.53
5 A 4 1 1 1 1.10 0.837 0.548 1.45 1.59 1.53
6 B 0 4 8 2 1.26 2.5 2.06 1.45 1.59 1.53
7 B 2 2 2 1 1.26 2.5 2.06 1.45 1.59 1.53
8 B 4 5 6 5 1.26 2.5 2.06 1.45 1.59 1.53
9 B 5 4 5 5 1.26 2.5 2.06 1.45 1.59 1.53
10 C 0 5 7 2 1.64 1.87 1 1.45 1.59 1.53
11 C 1 2 2 2 1.64 1.87 1 1.45 1.59 1.53
12 C 2 5 5 4 1.64 1.87 1 1.45 1.59 1.53
13 C 3 6 5 3 1.64 1.87 1 1.45 1.59 1.53
14 C 4 6 6 4 1.64 1.87 1 1.45 1.59 1.53
15 D 0 5 3 5 2.35 1.30 2.30 1.45 1.59 1.53
16 D 1 5 3 4 2.35 1.30 2.30 1.45 1.59 1.53
17 D 2 4 2 6 2.35 1.30 2.30 1.45 1.59 1.53
18 D 3 0 0 1 2.35 1.30 2.30 1.45 1.59 1.53
19 D 4 1 1 1 2.35 1.30 2.30 1.45 1.59 1.53

One of the reasons your analysis is getting so weird, though, is that you are breaking tidy data principles. In your original data set, each row represents one survey, but the standard deviation applies to each student, not to each survey. So the standard deviation values should appear in a table with 4 rows, one row for each student. The median then describes the population of students. There is only one population, so there should be only one row. Therefore, I'd recommend:

sd_df <-
  df %>%
  group_by(UserId) %>%
  summarize(
    across(
      all_of(c("Q1", "Q2", "Q3")),
      .fns = sd,
      .names = "Sigma_{.col}"
    ),
    n = n()  # number of surveys per student
  )

median_sd_df <-
  sd_df %>%
  summarize(
    across(
      all_of(paste0("Sigma_", c("Q1", "Q2", "Q3"))),
      .fns = median,
      .names = "Median_{.col}"
    )
  )

which gives you:

> sd_df
# A tibble: 4 x 5
UserId Sigma_Q1 Sigma_Q2 Sigma_Q3 n
<chr> <dbl> <dbl> <dbl> <int>
1 A 1.10 0.837 0.548 5
2 B 1.26 2.5 2.06 4
3 C 1.64 1.87 1 5
4 D 2.35 1.30 2.30 5

> median_sd_df
# A tibble: 1 x 3
Median_Sigma_Q1 Median_Sigma_Q2 Median_Sigma_Q3
<dbl> <dbl> <dbl>
1 1.45 1.59 1.53

Rowwise average over increasing no. of columns using for loop inside mutate : dplyr R

You can use purrr::reduce (or base::Reduce) to do the iteration.

library(tidyverse)

reduce(2:4, ~ mutate(.x, !!paste0("col1to", .y) := mean(c_across(1:.y))), .init = rowwise(a))

# A tibble: 3 x 7
# Rowwise:
A B C D col1to2 col1to3 col1to4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 4 1.5 2 2.5
2 5 6 7 8 5.5 6 6.5
3 9 10 11 12 9.5 10 10.5

The base::Reduce version:

Reduce(\(x, y) mutate(x, !!paste0("col1to", y) := mean(c_across(1:y))), 2:4, init = rowwise(a))

To fix your for loop, you need to give each new column a different name. Otherwise, every new column gets the same name, i.e. "mean(c_across(1:i))", and overwrites the previous one.

b <- rowwise(a)
for (i in 2:4) {
  b <- b %>% mutate(!!paste0("col1to", i) := mean(c_across(1:i)))
}

b
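
For a self-contained run of the corrected loop, the sketch below rebuilds `a` from the printed output earlier in this answer (the original definition of `a` is not shown, so treat it as an assumption) and checks the new columns.

```r
library(dplyr)

# Input reconstructed from the printed result above; assumed, not original.
a <- tibble(A = c(1, 5, 9), B = c(2, 6, 10), C = c(3, 7, 11), D = c(4, 8, 12))

b <- rowwise(a)
for (i in 2:4) {
  # !! injects the computed string as the name of the new column
  b <- b %>% mutate(!!paste0("col1to", i) := mean(c_across(1:i)))
}
b <- ungroup(b)  # drop the rowwise grouping once the loop is done

b
```

Since each new column is appended at the end, `c_across(1:i)` keeps selecting only the original leading columns.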

Another choice using tidyr::unnest_wider():

a %>%
  rowwise() %>%
  mutate(mean = list(cummean(c_across(1:4))[-1])) %>%
  unnest_wider(mean, names_sep = "_")

How to write for loop to mutate several columns using dplyr?

When you want to mutate several columns the same way, the answer is across(), not a loop. I'm having trouble matching your code/description with your desired output, so here's a small example that (almost) matches your desired output. The difference is that I kept the original data under the original column names and added _new to the names of the modified columns, which is easier.

df %>%
  mutate(across(everything(),
    ~ coalesce(as.integer(.x > 0), 0),
    .names = "{.col}_new"
  )) %>%
  mutate(across(!contains("new"), I, .names = "{.col}_backup"))
# q1_1 q1_2 q1_1_new q1_2_new q1_1_backup q1_2_backup
# 1 1 2 1 1 1 2
# 2 1 2 1 1 1 2
# 3 1 2 1 1 1 2
# 4 NA NA 0 0 NA NA
# 5 0 0 0 0 0 0

You can see how the new names are defined with {.col} being the original column name.

The colwise vignette is a good read if you want to learn more about across().
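
To make the naming rule concrete, here is a minimal sketch of the first step only. The `df` is rebuilt from the printed output above, so it is an assumption rather than the asker's data; `0L` is used as the fallback so that both arguments of coalesce() are integers.

```r
library(dplyr)

# Input reconstructed from the printed output above; assumed, not original.
df <- data.frame(q1_1 = c(1, 1, 1, NA, 0),
                 q1_2 = c(2, 2, 2, NA, 0))

res <- df %>%
  mutate(across(everything(),
                ~ coalesce(as.integer(.x > 0), 0L),  # NA becomes 0
                .names = "{.col}_new"))              # q1_1 -> q1_1_new

names(res)
```

Without `.names`, across() would overwrite q1_1 and q1_2 in place instead of adding new columns.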


