dplyr rowwise sum and other functions like max
In short: you are expecting the sum function to be aware of dplyr data structures such as a data frame grouped by row. sum is not aware of them, so it just takes the sum of the whole data.frame.
Here is a brief explanation. This:
select(iris, starts_with('Petal')) %>% rowwise() %>% sum()
Can be rewritten without using the pipe operator as the following:
data <- select(iris, starts_with('Petal'))
data <- rowwise(data)
sum(data)
As you can see, you were constructing something called a tibble. The rowwise call then adds information to this object, specifying that it should be grouped row-wise. However, only functions that are aware of this grouping, like summarize and mutate, work as intended. Base R functions like sum are not aware of these objects and treat them like any standard data.frame. And the standard behavior of sum() is to sum the entire data frame.
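A quick way to check this is to compare sum() on the row-wise object with sum() on the plain data frame. This is a minimal sketch, assuming dplyr is installed; the variable names (petals, row_grouped, and so on) are illustrative and not from the original question.

```r
library(dplyr)

# Select the two Petal columns from the built-in iris data
petals <- select(iris, starts_with("Petal"))

# Add the row-wise grouping information
row_grouped <- rowwise(petals)

# sum() is a base R function: it ignores the grouping and adds up every cell
total_rowwise <- sum(row_grouped)
total_plain   <- sum(petals)

# Both totals are the same single number, not one value per row
same_total <- isTRUE(all.equal(total_rowwise, total_plain))
print(same_total)
```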
Using mutate works:
select(iris, starts_with('Petal')) %>%
rowwise() %>%
mutate(sum = sum(Petal.Width, Petal.Length))
Result:
Source: local data frame [150 x 3]
Groups: <by row>
# A tibble: 150 x 3
Petal.Length Petal.Width sum
<dbl> <dbl> <dbl>
1 1.40 0.200 1.60
2 1.40 0.200 1.60
3 1.30 0.200 1.50
...
data.table row-wise sum, mean, min, max like dplyr?
You can use the efficient row-wise functions from the matrixStats package (rowMins, rowMaxs) together with base R's rowMeans and rowSums.
library(matrixStats)
# .SDcols expects the column names as character strings
dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=TRUE),
          MAX = rowMaxs(as.matrix(.SD), na.rm=TRUE),
          AVG = rowMeans(.SD, na.rm=TRUE),
          SUM = rowSums(.SD, na.rm=TRUE)), .SDcols=c("Q1", "Q2", "Q3", "Q4")]
dt
# ProductName Country Q1 Q2 Q3 Q4 MIN MAX AVG SUM
# 1: Lettuce CA NA 22 51 79 22 79 50.66667 152
# 2: Beetroot FR 61 8 NA 10 8 61 26.33333 79
# 3: Spinach FR 40 NA 79 49 40 79 56.00000 168
# 4: Kale CA 54 5 16 NA 5 54 25.00000 75
# 5: Carrot CA NA NA NA NA Inf -Inf NaN 0
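If matrixStats is not available, a similar result can be sketched in base R with pmin and pmax, which are also vectorized across rows. The data frame below re-creates the first four rows of the printed table as an assumption, since the original dt is not shown; the name sales is hypothetical. The all-NA Carrot row is left out because pmin/pmax with na.rm = TRUE return NA for such a row rather than Inf/-Inf.

```r
# Hypothetical re-creation of the first four rows of the table above
quarter_cols <- c("Q1", "Q2", "Q3", "Q4")
sales <- data.frame(
  ProductName = c("Lettuce", "Beetroot", "Spinach", "Kale"),
  Q1 = c(NA, 61, 40, 54),
  Q2 = c(22, 8, NA, 5),
  Q3 = c(51, NA, 79, 16),
  Q4 = c(79, 10, 49, NA)
)

q <- sales[quarter_cols]

# pmin/pmax take the columns as separate arguments, hence do.call
sales$MIN <- do.call(pmin, c(q, na.rm = TRUE))
sales$MAX <- do.call(pmax, c(q, na.rm = TRUE))

# rowMeans/rowSums accept the data frame directly
sales$AVG <- rowMeans(q, na.rm = TRUE)
sales$SUM <- rowSums(q, na.rm = TRUE)

print(sales)
```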
For a dataset with 500,000 rows (using data.table from CRAN):
dt <- rbindlist(lapply(1:100000, function(i)dt))
system.time(dt[, `:=`(MIN = rowMins(as.matrix(.SD), na.rm=T),
MAX = rowMaxs(as.matrix(.SD), na.rm=T),
AVG = rowMeans(.SD, na.rm=T),
SUM = rowSums(.SD, na.rm=T)), .SDcols=c("Q1", "Q2","Q3","Q4")])
# user system elapsed
# 0.089 0.004 0.093
rowwise (or by=1:nrow(dt)) is a "euphemism" for a for loop, as exemplified by:
library(dplyr) ; library(magrittr)
system.time(dt %>% rowwise() %>%
transmute(ProductName, Country, Q1, Q2, Q3, Q4,
MIN = min (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
MAX = max (c(Q1, Q2, Q3, Q4), na.rm=TRUE),
AVG = mean(c(Q1, Q2, Q3, Q4), na.rm=TRUE),
SUM = sum (c(Q1, Q2, Q3, Q4), na.rm=TRUE)))
# user system elapsed
# 80.832 0.111 80.974
system.time(dt[, `:=`(AVG = mean(as.numeric(.SD), na.rm=TRUE),
                      MIN = min(.SD, na.rm=TRUE),
                      MAX = max(.SD, na.rm=TRUE),
                      SUM = sum(.SD, na.rm=TRUE)),
               .SDcols=c("Q1", "Q2","Q3","Q4"), by=1:nrow(dt)])
# user system elapsed
# 141.492 0.196 141.757
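The point about the hidden loop can be illustrated in base R alone. The sketch below uses a made-up numeric matrix (not the data from the question) and shows that an explicit per-row loop and the vectorized rowSums compute the same values; only the amount of per-row overhead differs.

```r
set.seed(1)

# Made-up data: 500 rows, 4 numeric columns
m <- matrix(runif(2000), ncol = 4)

# Explicit loop: one sum() call per row, which is roughly what
# rowwise() or by = 1:nrow(dt) end up doing
loop_sums <- numeric(nrow(m))
for (i in seq_len(nrow(m))) {
  loop_sums[i] <- sum(m[i, ])
}

# Vectorized: a single call that handles all rows at once
vec_sums <- rowSums(m)

# Same values either way
print(isTRUE(all.equal(loop_sums, vec_sums)))
```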
Tidyverse Rowwise sum of columns that may or may not exist
If I understood your problem correctly, this would be a solution (a slight modification of @Duck's comment):
library(tidyverse)
data <- tibble(x = c(rnorm(5,2,n = 10)*1000,NA,1000),
y = c(rnorm(1,1,n = 10)*1000,NA,NA),
a = c(rnorm(1,1,n = 10)*1000,NA,NA))
wishlist <- c("x","y","w")
data %>%
dplyr::rowwise() %>%
dplyr::mutate(Sum=sum(c_across(colnames(data)[colnames(data) %in% wishlist]),na.rm=T))
x y a Sum
<dbl> <dbl> <dbl> <dbl>
1 3496. 439. -47.7 3935.
2 6046. 460. 2419. 6506.
3 6364. 672. 1030. 7036.
4 1068. 1282. 2811. 2350.
5 2455. 990. 689. 3445.
6 6477. -612. -1509. 5865.
7 7623. 1554. 2828. 9177.
8 5120. 482. -765. 5602.
9 1547. 1328. 817. 2875.
10 5602. -1019. 695. 4582.
11 NA NA NA 0
12 1000 NA NA 1000
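The column filtering can probably be shortened with the tidyselect helper any_of(), which silently skips names that are not present. This is a sketch under that assumption, with small made-up values in place of the random data above so the result is easy to check by hand.

```r
library(dplyr)

# Small made-up data; "w" from the wishlist does not exist here
data <- tibble(x = c(1, 2, NA),
               y = c(10, NA, NA),
               a = c(100, 200, 300))
wishlist <- c("x", "y", "w")

result <- data %>%
  rowwise() %>%
  # any_of() keeps only the wishlist columns that actually exist (x and y)
  mutate(Sum = sum(c_across(any_of(wishlist)), na.rm = TRUE)) %>%
  ungroup()

print(result)
```

Column a is ignored because it is not on the wishlist, and a row with only NA values in x and y gets a Sum of 0, as in the output above.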
Rowwise operations across columns
For a general solution, add rowwise:
library(dplyr)
data.frame(a = c(1:5, 6:10),
b = c(6:10, 1:5)) %>%
rowwise() %>%
mutate(MAX_COLUMN = max(c_across(a:b)))
# a b MAX_COLUMN
# <int> <int> <int>
# 1 1 6 6
# 2 2 7 7
# 3 3 8 8
# 4 4 9 9
# 5 5 10 10
# 6 6 1 6
# 7 7 2 7
# 8 8 3 8
# 9 9 4 9
#10 10 5 10
If you want the max, a faster option is pmax with do.call.
data.frame(a = c(1:5, 6:10),
b = c(6:10, 1:5)) %>%
mutate(MAX_COLUMN = do.call(pmax, .))
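do.call(pmax, .) uses every column of the data frame, which fails or misleads once a non-numeric column is present. One way to restrict it, sketched here with made-up data and a hypothetical id column, is to pass only the selected columns via across() and to forward na.rm = TRUE as an extra argument.

```r
library(dplyr)

# Made-up data with an id column that must not enter the comparison
df <- data.frame(id = letters[1:4],
                 a  = c(1, 5, NA, 2),
                 b  = c(3, 2, 4, NA))

out <- df %>%
  mutate(MAX_COLUMN = do.call(
    pmax,
    # columns a and b become separate arguments, followed by na.rm = TRUE
    c(as.list(across(c(a, b))), list(na.rm = TRUE))
  ))

print(out)
```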
Sum across multiple columns with dplyr
dplyr >= 1.0.0 using across
sum up each row using rowSums (rowwise works for any aggregation, but is slower)
df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(across(where(is.numeric))))
sum down each column
df %>%
summarise(across(everything(), ~ sum(., na.rm = TRUE)))
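As a possible variation, the NA handling can be left to na.rm = TRUE instead of replacing the missing values first, which keeps the NAs visible in the original columns. The tibble below is made up for illustration; g is a hypothetical non-numeric column that where(is.numeric) skips.

```r
library(dplyr)

df <- tibble(g = c("a", "b", "c"),
             x = c(1, NA, 3),
             y = c(10, 20, NA))

# Sum up each row over the numeric columns, ignoring NA
row_totals <- df %>%
  mutate(sum = rowSums(across(where(is.numeric)), na.rm = TRUE))

# Sum down each numeric column, ignoring NA
col_totals <- df %>%
  summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))

print(row_totals)
print(col_totals)
```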
dplyr < 1.0.0
sum up each row
df %>%
replace(is.na(.), 0) %>%
mutate(sum = rowSums(.[1:5]))
sum down each column using the superseded summarise_all:
df %>%
replace(is.na(.), 0) %>%
summarise_all(funs(sum))
Combine: rowwise(), mutate(), across(), for multiple functions
Using pmap() from purrr might be preferable, since you need to select the data just once and you can use the select helpers:
df %>%
mutate(pmap_dfr(across(where(is.numeric)),
~ data.frame(max = max(c(...)),
min = min(c(...)),
avg = mean(c(...)))))
a b c d e max min avg
<int> <int> <int> <chr> <int> <int> <int> <dbl>
1 1 6 11 a 1 11 1 4.75
2 2 7 12 b 2 12 2 5.75
3 3 8 13 c 3 13 3 6.75
4 4 9 14 d 4 14 4 7.75
5 5 10 15 e 5 15 5 8.75
Or with the addition of tidyr:
df %>%
mutate(res = pmap(across(where(is.numeric)),
~ list(max = max(c(...)),
min = min(c(...)),
avg = mean(c(...))))) %>%
unnest_wider(res)
Apply `dplyr::rowwise` in all variables
This can be done using purrr::pmap, which passes a list of arguments to a function that accepts "dots". Since most functions like mean, sd, etc. work with vectors, you need to pair the call with a domain lifter:
df_1 %>% select(-y) %>% mutate( var = pmap(., lift_vd(mean)) )
# x.1 x.2 x.3 x.4 var
# 1 70.12072 62.99024 54.00672 86.81358 68.48282
# 2 49.40462 47.00752 21.99248 78.87789 49.32063
df_1 %>% select(-y) %>% mutate( var = pmap(., lift_vd(sd)) )
# x.1 x.2 x.3 x.4 var
# 1 70.12072 62.99024 54.00672 86.81358 13.88555
# 2 49.40462 47.00752 21.99248 78.87789 23.27958
The function sum accepts dots directly, so you don't need to lift its domain:
df_1 %>% select(-y) %>% mutate( var = pmap(., sum) )
# x.1 x.2 x.3 x.4 var
# 1 70.12072 62.99024 54.00672 86.81358 273.9313
# 2 49.40462 47.00752 21.99248 78.87789 197.2825
Everything conforms to standard dplyr data processing, so all three can be combined as separate arguments to mutate:
df_1 %>% select(-y) %>%
mutate( v1 = pmap(., lift_vd(mean)),
v2 = pmap(., lift_vd(sd)),
v3 = pmap(., sum) )
# x.1 x.2 x.3 x.4 v1 v2 v3
# 1 70.12072 62.99024 54.00672 86.81358 68.48282 13.88555 273.9313
# 2 49.40462 47.00752 21.99248 78.87789 49.32063 23.27958 197.2825
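Newer purrr releases (1.0.0 and later) deprecate the lift_*() helpers, so a version without lift_vd may be useful. The sketch below is one possible equivalent, not the original answer's code: it collects the dots into a vector inside a lambda and uses pmap_dbl so the new columns are plain numeric vectors instead of lists. The df_1 here is re-typed from the printed output above, without the y column, which is an assumption about the original data.

```r
library(dplyr)
library(purrr)

# Re-typed from the printed output above (y column omitted)
df_1 <- data.frame(x.1 = c(70.12072, 49.40462),
                   x.2 = c(62.99024, 47.00752),
                   x.3 = c(54.00672, 21.99248),
                   x.4 = c(86.81358, 78.87789))

out <- df_1 %>%
  mutate(v1 = pmap_dbl(., ~ mean(c(...))),  # c(...) turns the row into a vector
         v2 = pmap_dbl(., ~ sd(c(...))),
         v3 = pmap_dbl(., sum))             # sum accepts dots directly

print(out)
```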