R Dplyr Rowwise Mean or Min and Other Methods

What is the fastest way to use `dplyr` to find the row-wise mean and variance of a large tibble?

It seems like the processing time for the rowwise approach explodes quadratically:

Sample Image

Pivoting longer makes the calculation about 300x faster. For 50k rows, the code below took 1.2 seconds, compared to 372 seconds for the rowwise approach.

df %>%
mutate(row = row_number()) %>%
tidyr::pivot_longer(-row) %>%
group_by(row) %>%
summarize(mean = mean(value),
var = var(value)) %>%
bind_cols(df, .)

Is there a way to use rowwise to get means across rows the correct way?

You need to wrap your columns into c_across:

mydf %>%
rowwise() %>%
mutate(allmeanrow = mean(c_across(a1:a3))) %>%
ungroup()

which gives:

# A tibble: 10 x 4
# Rowwise:
a1 a2 a3 allmeanrow
<dbl> <dbl> <dbl> <dbl>
1 3 9 4 5.33
2 4 7 7 6
3 8 7 4 6.33
4 5 6 5 5.33
5 5 4 6 5
6 8 11 2 7
7 5 7 10 7.33
8 2 0 8 3.33
9 3 8 4 5
10 4 4 11 6.33

Note, i would always ungroup after the rowwise operation because rowwise groups your data by row, so any following action would still be performed rowwise.

See also here: https://dplyr.tidyverse.org/articles/rowwise.html

switching between row wise and normal calculation in dplyr

Use pmax instead:

df %>%
group_by(cat) %>%
mutate(MAX = pmax(a,b,c), RANK = rank(MAX))

#Source: local data frame [25 x 6]
#Groups: cat [5]
#
# cat a b c MAX RANK
# (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
#1 a 0.8514743 0.4051721 0.5547097 0.8514743 3
#2 a 0.2599851 0.3559089 0.2878979 0.3559089 1
#3 a 0.1269608 0.6784622 0.1088647 0.6784622 2
#4 a 0.6764552 0.9013046 0.8811285 0.9013046 4
#5 a 0.9284667 0.3482491 0.9241917 0.9284667 5
#6 b 0.7799539 0.8811926 0.6788580 0.8811926 3
#7 b 0.8974519 0.8499074 0.1847301 0.8974519 4
#8 b 0.6819014 0.9334878 0.6150482 0.9334878 5
#9 b 0.2539325 0.1707934 0.2902184 0.2902184 1
#10 b 0.7913689 0.5555206 0.3349121 0.7913689 2
#.. ... ... ... ... ... ...

pmax and pmin return the parallel maxima and minima of the input values. That means, you don't need to group by row (using rowwise).

Small example:

pmax(1:5, 5:1)
#[1] 5 4 3 4 5

If you need more flexibility, you can do it in two steps:

df %>%
rowwise() %>%
mutate(MAX = max(a,b,c)) %>%
group_by(cat) %>%
mutate(RANK = rank(MAX))

But for the described example, pmax is clearly a better approach.

Another option is to use rowMeans:

df %>% 
mutate(MAX = rowMeans(.[c('a','b','c')])) %>%
group_by(cat) %>%
mutate(RANK = rank(MAX))

If you don't want to specify the column names manually, you could use do.call and exclude the first column:

df %>% mutate(MAX = do.call(pmax, .[-1]))

R dplyr get name of which.min rowwise

This seems kind of (well, actually very) clunky but...

DF %>% 
mutate(y = apply(.[,2:4], 1, function(x) names(x)[which.min(x)]))
   id x1 x2 x3  y
1 1 2 0 5 x2
2 2 4 1 3 x2
3 3 5 2 4 x2
4 4 3 6 5 x1
5 5 6 7 8 x1
6 6 4 6 3 x3
7 7 3 0 4 x2
8 8 6 8 2 x3
9 9 7 2 5 x2
10 10 7 2 6 x2

Return statistics like min or max from columns into rows with dplyr pipeline

I would suggest a mix of tidyverse functions like next. You have to reshape your data, then aggregate with the summary functions you want and then as strategy you can re format again and obtain the expected output:

library(tidyverse)

sampleData %>% pivot_longer(cols = names(sampleData)) %>%
group_by(name) %>% summarise(Min=min(value,na.rm=T),
Max=max(value,na.rm=T)) %>%
rename(var=name) %>%
pivot_longer(cols = -var) %>%
pivot_wider(names_from = var,values_from=value)

The output:

# A tibble: 2 x 6
name A I O R U
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Min -2.21 0.197 1 1 1
2 Max 2.40 14.3 9 81 81

Make rowwise operations on a vector of columns

We can use pmap_dbl from purrr.

library(dplyr)
library(purrr)

t %>% mutate(min_x_y = pmap_dbl(select(., vars.), min))

# x y z min_x_y
#1 1 2 4 1
#2 2 3 5 2
#3 3 4 6 3
#4 4 5 7 4

A base R version would be

t$min_x_y <- do.call(pmin, t[vars.])


Related Topics



Leave a reply



Submit