Dplyr Rowwise Sum and Other Functions Like Max

In short: you are expecting sum() to be aware of dplyr data structures such as a data frame grouped by row. It is not, so it simply takes the sum of the whole data.frame.

Here is a brief explanation. This:

select(iris, starts_with('Petal')) %>% rowwise() %>% sum()

Can be rewritten without using the pipe operator as the following:

data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)

As you can see, you were constructing something called a tibble. The rowwise() call then attaches additional information to this object, specifying that it should be grouped row-wise.

However, only functions that are aware of this grouping, such as summarize() and mutate(), work as intended. Base R functions like sum() are not aware of it and treat the object as a standard data.frame, and the standard behavior of sum() on a data frame is to add up every value in it.
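
To check this claim, here is a minimal sketch. The tibble df and its values are made up for illustration and are not from the original question. It shows that base sum() returns the same grand total whether or not the data has been marked row-wise:

```r
library(dplyr)

# Hypothetical toy data, not from the original question
df <- tibble(a = c(1, 2, 3), b = c(10, 20, 30))

# rowwise() only adds grouping metadata; the values are unchanged
rw <- rowwise(df)

total_plain   <- sum(df)  # sums every cell of the data frame
total_rowwise <- sum(rw)  # base sum() ignores the row-wise grouping

total_plain
total_rowwise
```

Both totals come out the same, which is why piping into sum() directly never gives one value per row.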

Using mutate works:

select(iris, starts_with('Petal')) %>%
  rowwise() %>%
  mutate(sum = sum(Petal.Width, Petal.Length))

Result:

Source: local data frame [150 x 3]
Groups: <by row>

# A tibble: 150 x 3
Petal.Length Petal.Width sum
<dbl> <dbl> <dbl>
1 1.40 0.200 1.60
2 1.40 0.200 1.60
3 1.30 0.200 1.50
...
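
If you would rather not type every column name, the same idea can be sketched with c_across() and a selection helper (this assumes dplyr >= 1.0.0). The column names row_sum and row_max are my own choice; they deliberately do not start with "Petal", so the helper cannot pick them up again:

```r
library(dplyr)

res <- iris %>%
  select(starts_with("Petal")) %>%
  rowwise() %>%
  mutate(
    # c_across() collects the selected columns of the current row into one vector
    row_sum = sum(c_across(starts_with("Petal"))),
    row_max = max(c_across(starts_with("Petal")))
  ) %>%
  ungroup()  # drop the row-wise grouping once the per-row work is done

head(res, 3)
```

Calling ungroup() at the end is a design choice: it keeps later verbs from silently operating one row at a time.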

data.table row-wise sum, mean, min, max like dplyr?

You can use the efficient row-wise functions from the matrixStats package.

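library(data.table)
library(matrixStats)
# .SDcols takes quoted column names; rowMins/rowMaxs need a matrix
dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=TRUE),
          MAX = rowMaxs(as.matrix(.SD), na.rm=TRUE),
          AVG = rowMeans(.SD, na.rm=TRUE),
          SUM = rowSums(.SD, na.rm=TRUE)), .SDcols=c("Q1", "Q2", "Q3", "Q4")]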

dt
# ProductName Country Q1 Q2 Q3 Q4 MIN MAX AVG SUM
# 1: Lettuce CA NA 22 51 79 22 79 50.66667 152
# 2: Beetroot FR 61 8 NA 10 8 61 26.33333 79
# 3: Spinach FR 40 NA 79 49 40 79 56.00000 168
# 4: Kale CA 54 5 16 NA 5 54 25.00000 75
# 5: Carrot CA NA NA NA NA Inf -Inf NaN 0
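
The construction of dt is not shown, so the sketch below rebuilds it from the printed output; treat that reconstruction as an assumption. It also shows a base R alternative for min and max, pmin()/pmax() with do.call, which returns NA rather than Inf/-Inf for a row where every value is missing:

```r
library(data.table)

# Reconstructed from the printed output; the original construction is not shown
dt <- data.table(
  ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale", "Carrot"),
  Country     = c("CA", "FR", "FR", "CA", "CA"),
  Q1 = c(NA, 61, 40, 54, NA),
  Q2 = c(22, 8, NA, 5, NA),
  Q3 = c(51, NA, 79, 16, NA),
  Q4 = c(79, 10, 49, NA, NA)
)
qcols <- c("Q1", "Q2", "Q3", "Q4")

# pmin()/pmax() work element-wise across the columns passed as arguments;
# with na.rm = TRUE a row that is entirely NA yields NA
dt[, `:=`(MIN = do.call(pmin, c(.SD, na.rm = TRUE)),
          MAX = do.call(pmax, c(.SD, na.rm = TRUE))),
   .SDcols = qcols]

dt
```

Whether NA or Inf/-Inf is the better answer for an all-missing row depends on what you do with the result afterwards.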

For a dataset with 500,000 rows (using data.table from CRAN):

dt <- rbindlist(lapply(1:100000, function(i) dt))
system.time(dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=TRUE),
                      MAX = rowMaxs(as.matrix(.SD), na.rm=TRUE),
                      AVG = rowMeans(.SD, na.rm=TRUE),
                      SUM = rowSums(.SD, na.rm=TRUE)), .SDcols=c("Q1", "Q2", "Q3", "Q4")])
# user system elapsed
# 0.089 0.004 0.093

rowwise (or by=1:nrow(dt)) is a "euphemism" for a for loop, as the following timings show:

library(dplyr) ; library(magrittr)
system.time(dt %>% rowwise() %>%
  transmute(ProductName, Country, Q1, Q2, Q3, Q4,
            MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
            SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)))
# user system elapsed
# 80.832 0.111 80.974

system.time(dt[, `:=`(AVG = mean(as.numeric(.SD), na.rm=TRUE),
                      MIN = min(.SD, na.rm=TRUE),
                      MAX = max(.SD, na.rm=TRUE),
                      SUM = sum(.SD, na.rm=TRUE)),
               .SDcols=c("Q1", "Q2", "Q3", "Q4"), by=1:nrow(dt)])
# user system elapsed
# 141.492 0.196 141.757

Tidyverse Rowwise sum of columns that may or may not exist

If I understood your problem correctly, this would be a solution (a slight modification of @Duck's comment):

library(tidyverse)

data <- tibble(x = c(rnorm(n = 10, mean = 5, sd = 2)*1000, NA, 1000),
               y = c(rnorm(n = 10, mean = 1, sd = 1)*1000, NA, NA),
               a = c(rnorm(n = 10, mean = 1, sd = 1)*1000, NA, NA))

wishlist <- c("x","y","w")

data %>%
  dplyr::rowwise() %>%
  dplyr::mutate(Sum = sum(c_across(colnames(data)[colnames(data) %in% wishlist]), na.rm = TRUE))

x y a Sum
<dbl> <dbl> <dbl> <dbl>
1 3496. 439. -47.7 3935.
2 6046. 460. 2419. 6506.
3 6364. 672. 1030. 7036.
4 1068. 1282. 2811. 2350.
5 2455. 990. 689. 3445.
6 6477. -612. -1509. 5865.
7 7623. 1554. 2828. 9177.
8 5120. 482. -765. 5602.
9 1547. 1328. 817. 2875.
10 5602. -1019. 695. 4582.
11 NA NA NA 0
12 1000 NA NA 1000
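
The column filtering can also be left to tidyselect: any_of() silently skips names that are not present. Below is a minimal sketch with small fixed data, which is hypothetical and chosen so the result is reproducible:

```r
library(dplyr)

# Hypothetical fixed data so the result is reproducible
data <- tibble(x = c(1, 2, NA), y = c(10, NA, NA), a = c(100, 200, 300))
wishlist <- c("x", "y", "w")  # "w" does not exist in data

res <- data %>%
  rowwise() %>%
  # any_of() keeps only the wishlist names that are actual columns
  mutate(Sum = sum(c_across(any_of(wishlist)), na.rm = TRUE)) %>%
  ungroup()

res
```

Column a is ignored because it is not in wishlist, and the missing w causes no error.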

Rowwise operations across columns

For a general solution, add rowwise():

library(dplyr)

data.frame(a = c(1:5, 6:10),
b = c(6:10, 1:5)) %>%
rowwise() %>%
mutate(MAX_COLUMN = max(c_across(a:b)))

# a b MAX_COLUMN
# <int> <int> <int>
# 1 1 6 6
# 2 2 7 7
# 3 3 8 8
# 4 4 9 9
# 5 5 10 10
# 6 6 1 6
# 7 7 2 7
# 8 8 3 8
# 9 9 4 9
#10 10 5 10

If you only want the max, a faster option is pmax with do.call:

data.frame(a = c(1:5, 6:10),
b = c(6:10, 1:5)) %>%
mutate(MAX_COLUMN = do.call(pmax, .))
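
Note that do.call(pmax, .) passes every column of the data frame to pmax. If the data also holds non-numeric columns, one possible variation is to restrict the arguments with across(). The data below is made up for illustration:

```r
library(dplyr)

# Hypothetical data with a non-numeric column mixed in
df <- data.frame(id = c("r1", "r2", "r3"),
                 a  = c(1, 5, 3),
                 b  = c(4, 2, 3))

res <- df %>%
  # across() returns a data frame of the selected columns, which do.call()
  # spreads into separate arguments: pmax(a = ..., b = ...)
  mutate(MAX_COLUMN = do.call(pmax, across(where(is.numeric))))

res
```

This stays vectorised, so it avoids the per-row overhead of rowwise().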

Sum across multiple columns with dplyr

dplyr >= 1.0.0 using across

sum up each row using rowSums (rowwise works for any aggregation, but is slower)

df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(across(where(is.numeric))))
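
If you want to keep the NA values in the data itself, a possible variation is to skip the replace() step and let rowSums() ignore them. Here is a sketch on a small made-up tibble:

```r
library(dplyr)

# Hypothetical data: one character column, two numeric columns with an NA
df <- tibble(g = c("u", "v"), p = c(1, NA), q = c(2, 3))

res <- df %>%
  # na.rm = TRUE drops missing values per row; the source columns stay untouched
  mutate(sum = rowSums(across(where(is.numeric)), na.rm = TRUE))

res
```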

sum down each column

df %>%
  summarise(across(everything(), ~ sum(., na.rm = TRUE)))

dplyr < 1.0.0

sum up each row

df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(.[1:5]))

sum down each column using the superseded summarise_all:

df %>%
  replace(is.na(.), 0) %>%
  summarise_all(sum)

Combine: rowwise(), mutate(), across(), for multiple functions

Using pmap() from purrr may be preferable, since you need to select the data just once and you can use the select helpers:

df %>%
  mutate(pmap_dfr(across(where(is.numeric)),
                  ~ data.frame(max = max(c(...)),
                               min = min(c(...)),
                               avg = mean(c(...)))))

a b c d e max min avg
<int> <int> <int> <chr> <int> <int> <int> <dbl>
1 1 6 11 a 1 11 1 4.75
2 2 7 12 b 2 12 2 5.75
3 3 8 13 c 3 13 3 6.75
4 4 9 14 d 4 14 4 7.75
5 5 10 15 e 5 15 5 8.75

Or with the addition of tidyr:

df %>%
  mutate(res = pmap(across(where(is.numeric)),
                    ~ list(max = max(c(...)),
                           min = min(c(...)),
                           avg = mean(c(...))))) %>%
  unnest_wider(res)
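
For comparison, here is a sketch of the plain rowwise() + c_across() version. The df below is reconstructed from the printed output, since the original is not shown, and the helper vector num_cols is my own. The columns are named explicitly because a helper such as where(is.numeric) can also match columns created earlier in the same mutate(), which would distort the later statistics:

```r
library(dplyr)

# Reconstructed from the printed output; the original df is not shown
df <- tibble(a = 1:5, b = 6:10, c = 11:15, d = letters[1:5], e = 1:5)

# Fix the input columns up front so newly created columns are never included
num_cols <- c("a", "b", "c", "e")

res <- df %>%
  rowwise() %>%
  mutate(max = max(c_across(all_of(num_cols))),
         min = min(c_across(all_of(num_cols))),
         avg = mean(c_across(all_of(num_cols)))) %>%
  ungroup()

res
```

The selection has to be repeated for each statistic here, which is the trade-off the pmap() approach avoids.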

Apply `dplyr::rowwise` in all variables

This can be done using purrr::pmap, which passes a list of arguments to a function that accepts "dots". Since functions such as mean and sd take a single vector rather than dots, you need to pair the call with a domain lifter such as lift_vd, which turns a function of one vector into a function of dots:

df_1 %>% select(-y) %>% mutate( var = pmap(., lift_vd(mean)) )
# x.1 x.2 x.3 x.4 var
# 1 70.12072 62.99024 54.00672 86.81358 68.48282
# 2 49.40462 47.00752 21.99248 78.87789 49.32063

df_1 %>% select(-y) %>% mutate( var = pmap(., lift_vd(sd)) )
# x.1 x.2 x.3 x.4 var
# 1 70.12072 62.99024 54.00672 86.81358 13.88555
# 2 49.40462 47.00752 21.99248 78.87789 23.27958

The function sum accepts dots directly, so you don't need to lift its domain:

df_1 %>% select(-y) %>% mutate( var = pmap(., sum) )
# x.1 x.2 x.3 x.4 var
# 1 70.12072 62.99024 54.00672 86.81358 273.9313
# 2 49.40462 47.00752 21.99248 78.87789 197.2825

Everything conforms to the standard dplyr data processing, so all three can be combined as separate arguments to mutate:

df_1 %>% select(-y) %>% 
mutate( v1 = pmap(., lift_vd(mean)),
v2 = pmap(., lift_vd(sd)),
v3 = pmap(., sum) )
# x.1 x.2 x.3 x.4 v1 v2 v3
# 1 70.12072 62.99024 54.00672 86.81358 68.48282 13.88555 273.9313
# 2 49.40462 47.00752 21.99248 78.87789 49.32063 23.27958 197.2825
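
As far as I know, the lift_*() family is deprecated from purrr 1.0.0 onward. A sketch that avoids it collects the dots into a vector inside a lambda and uses pmap_dbl(), so the results are plain numeric columns instead of list columns. The df_1 below is a hypothetical stand-in, because the original data is not shown:

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for df_1 (the original data is not shown)
df_1 <- tibble(x.1 = c(1, 2), x.2 = c(3, 4), x.3 = c(5, 9), y = c("p", "q"))

res <- df_1 %>%
  select(-y) %>%
  mutate(
    # c(...) gathers the values of one row into a single vector for mean()/sd()
    v1 = pmap_dbl(across(starts_with("x.")), ~ mean(c(...))),
    v2 = pmap_dbl(across(starts_with("x.")), ~ sd(c(...))),
    # sum() accepts dots directly, so no c() is needed
    v3 = pmap_dbl(across(starts_with("x.")), ~ sum(...))
  )

res
```

Selecting with starts_with("x.") rather than everything() keeps v1 and v2 out of the inputs of the later expressions.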
