Using Mutate Rowwise Over a Subset of Columns

Using mutate rowwise over a subset of columns

Here's one way to approach row-wise computation in the tidyverse using purrr::pmap. It is best suited to functions that genuinely need to run row by row; simple addition can be done faster with a vectorized function such as rowSums. The idea is to use select to build the input list for pmap, which makes the select helpers available: starts_with, or matches if you need a regex.

library(tidyverse)
df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

df %>%
  mutate(
    SumA = pmap_dbl(
      .l = select(., starts_with("A")),
      .f = function(...) sum(...)
    )
  )
#> # A tibble: 3 x 5
#> ID A1 A2 A3 SumA
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 one 1 2 3 6
#> 2 two 1 2 3 6
#> 3 three 1 2 3 6

Created on 2019-01-30 by the reprex package (v0.2.1)
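
For the simple-addition case, a vectorized rowSums is usually enough and avoids calling a function once per row. A minimal sketch on the same data, assuming dplyr >= 1.0.0 so that across() is available:

```r
library(dplyr)

df <- tibble(
  ID = c("one", "two", "three"),
  A1 = c(1, 1, 1),
  A2 = c(2, 2, 2),
  A3 = c(3, 3, 3)
)

# across() returns a data frame of the selected columns,
# which rowSums() then sums in a single vectorized call
out <- df %>%
  mutate(SumA = rowSums(across(starts_with("A"))))

out$SumA
```

The pmap approach remains the better fit when the per-row function has no vectorized equivalent.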

Mutate, row-wise, based on matching strings or NA across subset of columns

An efficient, vectorized base R option is rowSums on a logical matrix:

nm1 <- startsWith(names(df), 'b')
df$b_is_na <- rowSums(df[nm1] == 'c'|is.na(df[nm1])) > 0
df$b_is_na
#[1] FALSE TRUE TRUE TRUE

It can also be used inside mutate:

library(dplyr)
df %>%
  mutate(b_is_na = rowSums(select(., starts_with('b')) == 'c' |
                             is.na(select(., starts_with('b')))) > 0)
# A tibble: 4 x 6
# a b1 b2 b3 b4 b_is_na
# <chr> <chr> <chr> <chr> <chr> <lgl>
#1 a b d e f FALSE
#2 a c <NA> <NA> <NA> TRUE
#3 a <NA> <NA> <NA> <NA> TRUE
#4 <NA> <NA> <NA> <NA> <NA> TRUE

NOTE: rowwise is an inefficient way to do this.

It can also be done with c_across, although this is unlikely to be the fastest option:

df %>%
  rowwise() %>%
  mutate(b_is_na = {
    tmp <- c_across(starts_with('b'))
    any(is.na(tmp) | tmp == 'c')
  }) %>%
  ungroup()
# A tibble: 4 x 6
# a b1 b2 b3 b4 b_is_na
# <chr> <chr> <chr> <chr> <chr> <lgl>
#1 a b d e f FALSE
#2 a c <NA> <NA> <NA> TRUE
#3 a <NA> <NA> <NA> <NA> TRUE
#4 <NA> <NA> <NA> <NA> <NA> TRUE
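
Another vectorized option inside dplyr is if_any, which applies a predicate to each selected column and combines the results with OR. A minimal sketch, assuming dplyr >= 1.0.4 (where if_any was introduced) and an input reconstructed from the printed tibble in this answer:

```r
library(dplyr)

# Input reconstructed from the printed output; the original `df` is not shown
df <- tibble(
  a  = c("a", "a", "a", NA),
  b1 = c("b", "c", NA, NA),
  b2 = c("d", NA, NA, NA),
  b3 = c("e", NA, NA, NA),
  b4 = c("f", NA, NA, NA)
)

# TRUE when any b* column in the row is NA or equal to "c"
out <- df %>%
  mutate(b_is_na = if_any(starts_with("b"), ~ is.na(.x) | .x == "c"))

out$b_is_na
```

Because is.na(.x) is checked first, a missing value yields TRUE rather than NA for that cell, so the combined result contains no NA.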

Performing dplyr mutate on subset of columns

Am I missing something or would this work as expected:

cols <- paste0("X", c(2,4))
dd %>% mutate(evensum = rowSums(.[cols]), evenmean = rowMeans(.[cols]))
# id X1 X2 X3 X4 X5 evensum evenmean
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.4380811
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.8477439
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.8387535
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.5478768

Or are you specifically looking for a custom function to do this?
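
The same idea can be written with across, which keeps the column selection inside the pipe. A minimal sketch, assuming dplyr >= 1.0.0; the data frame here is a hypothetical stand-in for `dd`, whose original values are random:

```r
library(dplyr)

# Hypothetical stand-in for `dd` with easy-to-check values
dd <- tibble(
  id = c("a", "b", "c", "d"),
  X1 = c(1, 2, 3, 4),
  X2 = c(10, 20, 30, 40),
  X3 = c(5, 6, 7, 8),
  X4 = c(100, 200, 300, 400),
  X5 = c(0, 0, 0, 0)
)

cols <- paste0("X", c(2, 4))

# all_of() selects exactly the columns named in `cols`,
# so the newly created evensum column is never picked up
out <- dd %>%
  mutate(
    evensum  = rowSums(across(all_of(cols))),
    evenmean = rowMeans(across(all_of(cols)))
  )
```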


Not exactly what you are looking for but if you want to do it inside a pipe you could use select explicitly inside mutate like this:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% rowSums)
# id X1 X2 X3 X4 X5 xy
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535

However, it is a bit more complicated if you want to apply several functions. You could use a helper function along these lines (not thoroughly tested):

f <- function(x, ...) {
  n <- nrow(x)
  x <- lapply(list(...), function(y) if (length(y) == 1L) rep(y, n) else y)
  matrix(unlist(x), nrow = n, byrow = FALSE)
}

And then apply it like this:

dd %>% mutate(xy = select(., num_range("X", c(2,4))) %>% f(., rowSums(.), max(.)))
# id X1 X2 X3 X4 X5 xy.1 xy.2
#1 a 0.6021140 0.3670719 0.6872308 0.5090904 0.4474437 0.8761623 0.9888592
#2 b 0.1950439 0.9888592 0.8314290 0.7066286 0.9646670 1.6954878 0.9888592
#3 c 0.9664587 0.8151934 0.1046694 0.8623137 0.1411871 1.6775071 0.9888592
#4 d 0.6509055 0.2539684 0.6461509 0.8417851 0.7767125 1.0957535 0.9888592

Conditional rowwise sum of subset of columns in specific row dplyr

If you want to efficiently update a single row (or small subset of rows) I would use direct assignment, not dplyr.

var_cols = grep("var[0-9]+", names(df), value = TRUE)
recalc_id = 2
df[df$ID %in% recalc_id, "var_total"] = apply(df[df$ID %in% recalc_id, var_cols], 1, \(x) sum(x[x > 0]))

As akrun points out in comments, if it is just a single row, the apply can be skipped:

i = which(df$ID == recalc_id)
row = unlist(df[i, var_cols])
df$var_total[i] = sum(row[row > 0])

Here's the same thing with dplyr::case_when, for a dplyr solution:

df = df %>%
  rowwise() %>%
  mutate(var_total = case_when(
    ID %in% 2 ~ {
      # matches() takes a regex; starts_with() only matches a literal prefix
      x <- c_across(matches('^var[0-9]+$'))
      sum(x[x > 0])
    },
    TRUE ~ var_total
  ))

(Note that in both cases we need to change the column name pattern to not include var_total in the sum.)

rowwise breaks some vectorization and slows things down, so if you are so concerned about efficiency that recalculating the sum is "too slow", I'd strongly recommend the base solution. You might even find a non-conditional base solution to be plenty fast enough for this row-wise operation.
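
To make the non-conditional base option concrete, here is a minimal sketch that recomputes the positive-only total for every row at once. The data frame and its values are hypothetical, and it assumes the value columns contain no NA:

```r
# Hypothetical data: three value columns plus a total to refresh
df <- data.frame(
  ID = 1:3,
  var1 = c(1, -2, 3),
  var2 = c(4, 5, -6),
  var3 = c(-7, 8, 9),
  var_total = c(0, 0, 0)
)

# Anchored pattern so that var_total is never part of the sum
var_cols <- grep("^var[0-9]+$", names(df), value = TRUE)

m <- as.matrix(df[var_cols])
m[m < 0] <- 0                 # negative values contribute nothing
df$var_total <- rowSums(m)    # one vectorized call for all rows

df$var_total
```

Zeroing the negatives gives the same result as sum(x[x > 0]) for each row, since dropped values and zeros both add nothing.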

Combine: rowwise(), mutate(), across(), for multiple functions

Using pmap() from purrr may be preferable, since you select the data just once and can use the select helpers:

df %>%
  mutate(pmap_dfr(across(where(is.numeric)),
                  ~ data.frame(max = max(c(...)),
                               min = min(c(...)),
                               avg = mean(c(...)))))

a b c d e max min avg
<int> <int> <int> <chr> <int> <int> <int> <dbl>
1 1 6 11 a 1 11 1 4.75
2 2 7 12 b 2 12 2 5.75
3 3 8 13 c 3 13 3 6.75
4 4 9 14 d 4 14 4 7.75
5 5 10 15 e 5 15 5 8.75

Or with the addition of tidyr:

df %>%
  mutate(res = pmap(across(where(is.numeric)),
                    ~ list(max = max(c(...)),
                           min = min(c(...)),
                           avg = mean(c(...))))) %>%
  unnest_wider(res)
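
For these three particular summaries there are also vectorized equivalents (pmax, pmin, rowMeans), which avoid iterating over rows altogether. A minimal sketch, with the input reconstructed from the printed result above; this is an alternative technique, not the pmap approach itself:

```r
library(dplyr)

# Input reconstructed from the printed result
df <- tibble(
  a = 1:5, b = 6:10, c = 11:15,
  d = c("a", "b", "c", "d", "e"),
  e = 1:5
)

# Pick the numeric columns once, before any new columns are added
num  <- df[sapply(df, is.numeric)]
cols <- unname(as.list(num))      # plain list of column vectors for do.call()

out <- df %>%
  mutate(
    max = do.call(pmax, cols),    # element-wise maximum across columns
    min = do.call(pmin, cols),    # element-wise minimum across columns
    avg = rowMeans(num)
  )
```

Selecting the columns up front matters: a where(is.numeric) selection evaluated later inside the same mutate would also include the freshly created max and min columns.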

Is there a way to use rowwise to get means across rows the correct way?

You need to wrap your columns into c_across:

mydf %>%
  rowwise() %>%
  mutate(allmeanrow = mean(c_across(a1:a3))) %>%
  ungroup()

which gives:

# A tibble: 10 x 4
a1 a2 a3 allmeanrow
<dbl> <dbl> <dbl> <dbl>
1 3 9 4 5.33
2 4 7 7 6
3 8 7 4 6.33
4 5 6 5 5.33
5 5 4 6 5
6 8 11 2 7
7 5 7 10 7.33
8 2 0 8 3.33
9 3 8 4 5
10 4 4 11 6.33

Note: I would always ungroup after the rowwise operation, because rowwise groups your data by row, so any following operation would still be performed row by row.

See also here: https://dplyr.tidyverse.org/articles/rowwise.html
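
The effect of forgetting ungroup can be seen in a small sketch. It uses the first three rows of the output above; the total_a1 column is a hypothetical follow-up step added for illustration:

```r
library(dplyr)

mydf <- tibble(a1 = c(3, 4, 8), a2 = c(9, 7, 7), a3 = c(4, 7, 4))

# Still rowwise: sum(a1) sees one row at a time
still_rowwise <- mydf %>%
  rowwise() %>%
  mutate(total_a1 = sum(a1))

# Ungrouped first: sum(a1) sees the whole column
ungrouped <- mydf %>%
  rowwise() %>%
  mutate(allmeanrow = mean(c_across(a1:a3))) %>%
  ungroup() %>%
  mutate(total_a1 = sum(a1))

still_rowwise$total_a1   # 3 4 8
ungrouped$total_a1       # 15 15 15
```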

Rowwise average over increasing no. of columns using for loop inside mutate : dplyr R

You can use purrr::reduce (or base::Reduce) to do the iteration.

library(tidyverse)

reduce(2:4, ~ mutate(.x, !!paste0("col1to", .y) := mean(c_across(1:.y))), .init = rowwise(a))

# A tibble: 3 x 7
# Rowwise:
A B C D col1to2 col1to3 col1to4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2 3 4 1.5 2 2.5
2 5 6 7 8 5.5 6 6.5
3 9 10 11 12 9.5 10 10.5

The base::Reduce version:

Reduce(\(x, y) mutate(x, !!paste0("col1to", y) := mean(c_across(1:y))), 2:4, init = rowwise(a))

To fix your for loop, you need to give each new column a different name. Otherwise every new column gets the same name, i.e. "mean(c_across(1:i))", and overwrites the previous one.

b <- rowwise(a)
for (i in 2:4) {
  b <- b %>% mutate(!!paste0("col1to", i) := mean(c_across(1:i)))
}

b

Another choice using tidyr::unnest_wider():

a %>%
  rowwise() %>%
  mutate(mean = list(cummean(c_across(1:4))[-1])) %>%
  unnest_wider(mean, names_sep = "_")
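
If the rowwise overhead matters, the same running means can be computed column-wise in base R. A minimal sketch, with the input reconstructed from the printed result above; it accumulates column sums with Reduce and divides by the number of columns seen so far:

```r
# Input reconstructed from the printed result
a <- data.frame(A = c(1, 5, 9), B = c(2, 6, 10), C = c(3, 7, 11), D = c(4, 8, 12))

# Element k is A + ... + (k-th column), one value per row
running <- Reduce(`+`, as.list(a), accumulate = TRUE)

res <- a
for (k in 2:ncol(a)) {
  # Mean of the first k columns = running sum / k
  res[[paste0("col1to", k)]] <- running[[k]] / k
}

res
```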

Row-wise cor() on subset of columns using dplyr::mutate()

You could try

df %>%
  rowwise() %>%
  do(data.frame(., Cor = cor(unlist(.[1:3]), unlist(.[4:6]))))
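
Since do() is superseded as of dplyr 1.0.0, the same computation can also be written with c_across. A minimal sketch on hypothetical data (the column names x1..x3 and y1..y3 are assumptions, chosen so the correlations are easy to check):

```r
library(dplyr)

# Hypothetical data: correlate x1..x3 with y1..y3 within each row
df <- tibble(
  x1 = c(1, 1), x2 = c(2, 2), x3 = c(3, 3),
  y1 = c(2, 3), y2 = c(4, 2), y3 = c(6, 1)
)

out <- df %>%
  rowwise() %>%
  mutate(Cor = cor(c_across(x1:x3), c_across(y1:y3))) %>%
  ungroup()

out$Cor
```

Row 1 pairs (1, 2, 3) with (2, 4, 6) and row 2 pairs it with (3, 2, 1), so the two correlations are 1 and -1 up to floating-point error.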

How to change the direction of application of mutate across from column-wise to row-wise?

Try this. (Note row 4 also sums across to 1.)

library(tidyverse)

data <-
  data.frame(
    To = c("A", "B", "C"),
    A = c(1, 2, 4),
    B = c(3, 4, 5),
    C = c(5, 6, 7)
  )

data <- data %>%
  replace(is.na(.), 0) %>%
  bind_rows(summarise_all(., ~ (if (is.numeric(.)) sum(.) else "Sum")))
data <- cbind(data, Sum = rowSums(data[, -1]))

data %>%
  rowwise() %>%
  mutate(across(A:Sum, ~ sum(.) / Sum))
#> # A tibble: 4 × 5
#> # Rowwise:
#> To A B C Sum
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 0.111 0.333 0.556 1
#> 2 B 0.167 0.333 0.5 1
#> 3 C 0.25 0.312 0.438 1
#> 4 Sum 0.189 0.324 0.486 1

Created on 2022-05-04 by the reprex package (v2.0.1)
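
The row-proportion step on its own does not need rowwise, because division by a per-row total is already vectorized. A minimal sketch covering only that step (it leaves out the extra "Sum" row), assuming dplyr >= 1.0.0:

```r
library(dplyr)

data <- tibble(
  To = c("A", "B", "C"),
  A = c(1, 2, 4),
  B = c(3, 4, 5),
  C = c(5, 6, 7)
)

out <- data %>%
  mutate(Sum = A + B + C) %>%        # row totals, computed column-wise
  mutate(across(A:C, ~ .x / Sum))    # Sum is outside the selection, so it stays fixed

out
```

Keeping Sum out of the across selection means every column is divided by the original row total.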

How to get the selected max/min value (i.e. second largest/smallest) across row by dplyr

You can use c_across along with sort. The use of rev here reverses the sorted data, making it easy to select the largest value with index 1, the second-largest with index 2, etc.

Note that column "max2" in your example output is wrong in certain rows (I think you may have been including the "max1" column in some cases).

a %>%
  rowwise() %>%
  mutate(
    max1 = max(gp1, gp2, gp3),
    max2 = rev(sort(c_across(c(gp1, gp2, gp3))))[2]
  )

gp1 gp2 gp3 max1 max2
<int> <int> <dbl> <dbl> <dbl>
1 3 1 8 8 3
2 4 2 8 8 4
3 5 3 2 5 3
4 6 4 6 6 6
5 7 5 6 7 6
6 8 6 6 8 6
7 9 7 12 12 9
8 10 8 12 12 10
9 11 9 6 11 9
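
The same sort-and-index idea generalizes to any rank. A minimal base R sketch, with the input reconstructed from the printed result above; nth_largest is a hypothetical helper introduced here for illustration:

```r
# Input reconstructed from the printed result
a <- data.frame(
  gp1 = 3:11,
  gp2 = 1:9,
  gp3 = c(8, 8, 2, 6, 6, 6, 12, 12, 6)
)

# Hypothetical helper: n-th largest value of a vector (tied values count separately)
nth_largest <- function(x, n) sort(x, decreasing = TRUE)[n]

# Apply the helper to each row; n = 2 gives the second-largest value
max2 <- unname(apply(a, 1, nth_largest, n = 2))

max2
```

Passing n = 1 reproduces the row maximum, and sorting with decreasing = FALSE instead would give the n-th smallest.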

