Return pmin or pmax of data.frame with multiple columns
Use do.call to call pmax to compare all the columns together for each row value, e.g.:
dat <- data.frame(a=1:5,b=rep(3,5))
# a b
#1 1 3
#2 2 3
#3 3 3
#4 4 3
#5 5 3
do.call(pmax,dat)
#[1] 3 3 3 4 5
When you call pmax on an entire data.frame directly, only one argument is passed to the function, so there is nothing to compare it to. It therefore just returns the supplied argument, as that must be the maximum. This works for non-numeric and numeric arguments alike, even though it may not make much sense:
pmax(7)
#[1] 7
pmax("a")
#[1] "a"
pmax(data.frame(1,2,3))
# X1 X2 X3
#1 1 2 3
Using do.call(pmax, ...) with a data.frame means you pass each column of the data.frame as a list of arguments to pmax:
do.call(pmax,dat)
is thus equivalent to:
pmax(dat$a, dat$b)
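As a quick check of that equivalence, here is a minimal base-R sketch reusing the dat frame from above. It confirms that do.call matches the spelled-out call and that the same pattern works for pmin. The na.rm step and the names row_max, row_min and row_max_narm are illustrative additions, not part of the original answer.

```r
# Same toy frame as in the answer above
dat <- data.frame(a = 1:5, b = rep(3, 5))

# Each column becomes one argument to pmax()/pmin()
row_max <- do.call(pmax, dat)
row_min <- do.call(pmin, dat)

# Same values as naming the columns explicitly
stopifnot(isTRUE(all.equal(row_max, pmax(dat$a, dat$b))))
stopifnot(isTRUE(all.equal(row_min, pmin(dat$a, dat$b))))

# Extra arguments such as na.rm can be appended to the argument list,
# because c() on a data.frame yields a plain list
dat$b[2] <- NA
row_max_narm <- do.call(pmax, c(dat, na.rm = TRUE))
print(row_max_narm)
```

One caveat with this pattern: column names are passed as argument names, so a column that happens to be called na.rm would be interpreted as that argument.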
How to use a range for columns instead of names for pmax / pmin
Here's an option that does one function-call on all rows, all columns at once.
foo %>%
mutate(maxcol = do.call(pmax, subset(., select = a:e)))
# a b c d e f g h i j k l m n o p q r s t u v w x y z maxcol
# 1 1 4 9 2 4 4 1 10 2 3 10 4 7 1 10 9 8 2 8 9 5 1 9 1 10 9 9
# 2 5 2 5 3 5 2 8 8 5 8 2 3 6 10 9 3 5 8 7 4 6 9 8 5 8 3 5
# 3 10 9 6 1 7 10 6 4 4 7 6 6 2 7 5 5 4 1 10 7 3 10 5 10 1 7 10
# 4 8 1 4 8 9 3 3 9 10 1 8 5 8 4 4 8 6 10 5 2 9 5 7 7 3 1 9
# 5 2 10 2 9 8 9 9 6 7 5 9 2 5 5 7 4 2 5 4 8 4 6 6 2 9 6 10
You can select some or all of the columns using the colon notation, even arbitrary columns:
foo %>%
mutate(maxcol = do.call(pmax, subset(., select = c(a:e,g))))
# a b c d e f g h i j k l m n o p q r s t u v w x y z maxcol
# 1 1 4 9 2 4 4 1 10 2 3 10 4 7 1 10 9 8 2 8 9 5 1 9 1 10 9 9
# 2 5 2 5 3 5 2 8 8 5 8 2 3 6 10 9 3 5 8 7 4 6 9 8 5 8 3 8
# 3 10 9 6 1 7 10 6 4 4 7 6 6 2 7 5 5 4 1 10 7 3 10 5 10 1 7 10
# 4 8 1 4 8 9 3 3 9 10 1 8 5 8 4 4 8 6 10 5 2 9 5 7 7 3 1 9
# 5 2 10 2 9 8 9 9 6 7 5 9 2 5 5 7 4 2 5 4 8 4 6 6 2 9 6 10
This should be preferred over the other answers (which generally use allegedly idiomatic methods) because:
- in Dom's answer, the max function is called once for each row of the frame; R's vectorized ops are not being used, which is inefficient and should be avoided if possible;
- in akrun's answer, pmax is called once for each column of the frame, which in this case might sound worse but is actually closer to the best one can do. My answer is closest to akrun's in that we are selecting data within the mutate.
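To make the three calling patterns concrete, here is a base-R sketch (no dplyr required) that computes the same row maxima once per row, column by column, and all at once. The frame is generated in the same way as the benchmark data. Note that apply and Reduce are stand-ins for the rowwise and reduce approaches, not the exact code from those answers.

```r
set.seed(42)
foo <- data.frame(sapply(letters, function(x) sample(1:10, 5)))

# A range of columns plus one arbitrary column
cols <- subset(foo, select = c(a:e, g))

by_row  <- apply(cols, 1, max)   # max() called once per row
by_col  <- Reduce(pmax, cols)    # pmax() called pairwise, column by column
at_once <- do.call(pmax, cols)   # a single pmax() call on all columns

# All three strategies agree on the result
stopifnot(all(by_row == by_col), all(by_col == at_once))
```

The results are the same in all three cases; only the number of function calls differs, which is what the timings reflect.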
If you'd prefer to use dplyr::select over base::subset, it needs to be broken out as
foo %>%
mutate(maxcol = select(., a:e, g) %>% do.call(pmax, .))
I think this is demonstrated a little better with benchmarks. Using the provided 5x26 frame, we see a clear improvement:
set.seed(42)
foo <- data.frame(sapply(letters, function(x) x = sample(1:10,5)))
microbenchmark::microbenchmark(
Dom = {
foo %>%
rowwise() %>%
summarise(max= max(c_across(a:z)))
},
akr = {
foo %>%
mutate(maxcol = reduce(select(., a:z), pmax))
},
r2 = {
foo %>%
mutate(maxcol = do.call(pmax, subset(., select = a:z)))
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Dom 6.6561 7.15260 7.61574 7.38345 7.90375 11.0387 100
# akr 4.2849 4.69920 4.96278 4.86110 5.18130 7.0908 100
# r2 2.3290 2.49285 2.68671 2.59180 2.78960 4.7086 100
Let's try with a slightly larger 5000x26:
set.seed(42)
foo <- data.frame(sapply(letters, function(x) x = sample(1:10,5000,replace=TRUE)))
microbenchmark::microbenchmark(
Dom = {
foo %>%
rowwise() %>%
summarise(max= max(c_across(a:z)))
},
akr = {
foo %>%
mutate(maxcol = reduce(select(., a:z), pmax))
},
r2 = {
foo %>%
mutate(maxcol = do.call(pmax, subset(., select = a:z)))
}
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Dom 515.6437 563.6060 763.97348 811.45815 883.00115 1775.2366 100
# akr 4.6660 5.1619 11.92847 5.74050 6.50625 293.7444 100
# r2 2.9253 3.4371 4.24548 3.71845 4.27380 14.0958 100
This last one definitely shows a consequence of using rowwise. The relative performance between akrun's answer and this one is almost the same as with 5 rows, reinforcing the premise that column-wise is better than row-wise (and all-at-once is faster than both).
This can also be done with purrr::invoke, if truly desired, though it does not speed it up:
library(purrr)
foo %>%
mutate(maxcol = invoke(pmax, subset(., select = a:z)))
### microbenchmark(...)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Dom 7.8292 8.40275 9.02813 8.97345 9.38500 12.4368 100
# akr 4.9622 5.28855 8.78909 5.60090 6.11790 309.2607 100
# r2base 2.5521 2.74635 3.01949 2.90415 3.21060 4.6512 100
# r2purrr 2.5063 2.77510 3.11206 2.93415 3.33015 5.2403 100
How to get pmax over multiple variables with dplyr?
Using do.call we can evaluate pmax without specifying the variables, i.e.
mtcars %>%
mutate(new = do.call(pmax, c(select(., c(1, 7)), na.rm = TRUE)))
# mpg cyl disp hp drat wt qsec vs am gear carb new
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 21.00
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 21.00
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.80
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21.40
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 18.70
#6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 20.22
#7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 15.84
#...
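For comparison, here is a base-R sketch of the same idea, selecting columns by position without dplyr. Columns 1 and 7 of the built-in mtcars are mpg and qsec. The names picked and new_col are only illustrative.

```r
# Columns 1 and 7 of mtcars are mpg and qsec
picked <- mtcars[c(1, 7)]

# c() on a data.frame yields a plain list, so na.rm can be appended
new_col <- do.call(pmax, c(picked, na.rm = TRUE))

head(new_col, 7)
```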
Min and Max across multiple columns with NAs
You can use hablar's min_ and max_ functions, which return NA if all values are NA.
library(dplyr)
library(hablar)
dat %>%
rowwise() %>%
mutate(min = min_(c_across(-ID)),
max = max_(c_across(-ID)))
You can also use this with apply:
cbind(dat, t(apply(dat[-1], 1, function(x) c(min = min_(x), max = max_(x)))))
# ID PM TP2 Sigma min max
#1 1 1 2 3 1 3
#2 2 0 NA 1 0 1
#3 3 2 1 NA 1 2
#4 4 1 0 2 0 2
#5 NA NA NA NA NA NA
#6 5 2 0 7 0 7
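If an extra dependency is not wanted, base pmin and pmax with na.rm = TRUE behave similarly here: missing values are skipped, and a row whose values are all NA still yields NA. The sketch below rebuilds the input from the printed output above, so the data itself is an assumption.

```r
# Input reconstructed from the printed result above (an assumption)
dat <- data.frame(
  ID    = c(1, 2, 3, 4, NA, 5),
  PM    = c(1, 0, 2, 1, NA, 2),
  TP2   = c(2, NA, 1, 0, NA, 0),
  Sigma = c(3, 1, NA, 2, NA, 7)
)

# Every column except ID
vals <- dat[-1]

# One vectorised call per statistic; an all-NA row stays NA
dat$min <- do.call(pmin, c(vals, na.rm = TRUE))
dat$max <- do.call(pmax, c(vals, na.rm = TRUE))
dat
```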
Find corresponding value from multiple columns according to pmin in multiple columns
We can use row/column indexing to extract, for each row, the element of the 'P1', 'P2', 'P3' columns that corresponds to the 'D1', 'D2', 'D3' name stored in 'num'
m1 <- cbind(seq_len(nrow(df)), match(df$num, c("D1", "D2", "D3")))
df$NP <- df[c("P1", "P2", "P3")][m1]
df$NP
#[1] 11 11 11 40 22
data
df <- structure(list(Item = c("A", "B", "C", "D", "E"), P = c(10L,
10L, 10L, 50L, 20L), P1 = c(8L, 8L, 8L, 40L, 15L), P2 = c(11L,
11L, 11L, 35L, 22L), P3 = c(20L, 20L, 20L, 70L, 30L), D1 = c(2L,
2L, 2L, 10L, 5L), D2 = c(1L, 1L, 1L, 15L, 2L), D3 = c(10L, 10L,
10L, 20L, 10L), pmin = c(1L, 1L, 1L, 10L, 2L), num = c("D2",
"D2", "D2", "D1", "D2"), NP = c(11L, 11L, 11L, 40L, 22L)),
class = "data.frame", row.names = c(NA,
-5L))
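The row/column indexing step may be easier to follow on a smaller frame. This sketch uses three rows taken from the data above. As an illustrative extension that is not in the original answer, it derives the position of the smallest 'D' value with max.col instead of reading it from a num column.

```r
df <- data.frame(
  P1 = c(8L, 40L, 15L), P2 = c(11L, 35L, 22L), P3 = c(20L, 70L, 30L),
  D1 = c(2L, 10L, 5L),  D2 = c(1L, 15L, 2L),   D3 = c(10L, 20L, 10L)
)

d_mat <- as.matrix(df[c("D1", "D2", "D3")])
p_mat <- as.matrix(df[c("P1", "P2", "P3")])

# Column position of the row-wise minimum of D1..D3 (first one wins on ties)
min_col <- max.col(-d_mat, ties.method = "first")

# A two-column matrix of (row, column) pairs picks exactly one cell per row
idx <- cbind(seq_len(nrow(df)), min_col)
df$NP <- p_mat[idx]
df$NP
```

Indexing a matrix with a two-column matrix treats each row of the index as one (row, column) coordinate, which is why a single subsetting call is enough.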
Using pmax/pmin with vector of variable string names in R
We may use invoke (similar to do.call in base R) with across
library(purrr)
library(dplyr)
out <- mtcars %>%
mutate(maxval = invoke(pmax, c(across(all_of(values)), na.rm = TRUE)))
# or use do.call
# mutate(maxval = do.call(pmax, c(across(all_of(values)), na.rm = TRUE)))
-output
> head(out)
mpg cyl disp hp drat wt qsec vs am gear carb maxval
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.900
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.900
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.850
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.215
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3.440
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3.460
Or we may use exec as well
out2 <- mtcars %>%
mutate(maxval = exec(pmax, !!! rlang::syms(values), na.rm = TRUE))
-output
> head(out2)
mpg cyl disp hp drat wt qsec vs am gear carb maxval
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.900
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.900
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.850
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.215
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3.440
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3.460
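The snippets above do not show how values is defined. As a sketch, assume it is a character vector of column names; the choice of drat and wt below is hypothetical. Base R can then index the frame by those names directly:

```r
# Hypothetical column names; replace with your own
values <- c("drat", "wt")

# Indexing by a character vector keeps only the named columns
cols <- mtcars[values]

out <- mtcars
out$maxval <- do.call(pmax, c(cols, na.rm = TRUE))
out$minval <- do.call(pmin, c(cols, na.rm = TRUE))

head(out[c(values, "maxval", "minval")])
```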
Find the earliest and latest date within each row in R
We can use pmax and pmin on the 'date' columns to return the earliest and latest date for each row
library(dplyr)
df %>%
mutate(max_date = do.call(pmax, c(select(., starts_with('date')), na.rm = TRUE)),
min_date = do.call(pmin, c(select(., starts_with('date')),
na.rm = TRUE)))
# ID Other_columns date_column date_column2 date_column3 max_date min_date
#1 1 numeric 2019-11-04 19:33:50 2019-11-05 15:33:50 2019-11-05 16:33:50 2019-11-05 16:33:50 2019-11-04 19:33:50
#2 2 numeric <NA> 2019-11-04 17:20:10 2019-11-09 19:12:50 2019-11-09 19:12:50 2019-11-04 17:20:10
#3 3 numeric 2019-11-07 20:33:50 <NA> 2019-11-04 18:31:50 2019-11-07 20:33:50 2019-11-04 18:31:50
#4 4 <NA> <NA> <NA> <NA> <NA> <NA>
Another option is rowwise with c_across
df %>%
rowwise() %>%
mutate(max_date = max(as.POSIXct(c_across(starts_with('date'))),
na.rm = TRUE),
min_date = min(as.POSIXct(c_across(starts_with('date'))),
na.rm = TRUE))
-output
# A tibble: 4 x 7
# Rowwise:
# ID Other_columns date_column date_column2 date_column3 max_date min_date
# <int> <chr> <chr> <chr> <chr> <dttm> <dttm>
#1 1 numeric 2019-11-04 19:33:50 2019-11-05 15:33:50 2019-11-05 16:33:50 2019-11-05 16:33:50 2019-11-04 19:33:50
#2 2 numeric <NA> 2019-11-04 17:20:10 2019-11-09 19:12:50 2019-11-09 19:12:50 2019-11-04 17:20:10
#3 3 numeric 2019-11-07 20:33:50 <NA> 2019-11-04 18:31:50 2019-11-07 20:33:50 2019-11-04 18:31:50
#4 4 <NA> <NA> <NA> <NA> NA NA NA NA
data
df <- structure(list(ID = 1:4, Other_columns = c("numeric", "numeric",
"numeric", NA), date_column = c("2019-11-04 19:33:50", NA, "2019-11-07 20:33:50",
NA), date_column2 = c("2019-11-05 15:33:50", "2019-11-04 17:20:10",
NA, NA), date_column3 = c("2019-11-05 16:33:50", "2019-11-09 19:12:50",
"2019-11-04 18:31:50", NA)), class = "data.frame", row.names = c(NA,
-4L))
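A base-R variant of the first approach is sketched below. It parses the character columns to POSIXct first, so the comparison is done on real date-times rather than on strings. The UTC time zone and the helper names date_cols and parsed are assumptions made for illustration.

```r
df <- data.frame(
  ID = 1:4,
  date_column  = c("2019-11-04 19:33:50", NA, "2019-11-07 20:33:50", NA),
  date_column2 = c("2019-11-05 15:33:50", "2019-11-04 17:20:10", NA, NA),
  date_column3 = c("2019-11-05 16:33:50", "2019-11-09 19:12:50",
                   "2019-11-04 18:31:50", NA)
)

# Names of all columns starting with "date"
date_cols <- grep("^date", names(df), value = TRUE)

# Parse every date column; the UTC time zone is an assumption
parsed <- lapply(df[date_cols], as.POSIXct, tz = "UTC")

# pmax()/pmin() also work on date-time vectors; all-NA rows stay NA
df$max_date <- do.call(pmax, c(parsed, na.rm = TRUE))
df$min_date <- do.call(pmin, c(parsed, na.rm = TRUE))
df[c("ID", "max_date", "min_date")]
```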
How to repeat rows by their value by multiple columns and divide back
One option would be to get the max value from columns B, C and D using pmax, then use uncount to repeat the rows. Finally, use pmin to cap values greater than 1 at 1.
library(dplyr)
library(tidyr)
df %>%
mutate(repeat_row = pmax(B, C, D)) %>%
uncount(repeat_row) %>%
mutate(across(-A, ~ pmin(.x, 1)))
# A B C D
#1 1 0 1 0
#2 2 0 0 1
#3 2 0 0 1
#4 3 1 0 0
#5 3 1 0 0
#6 3 1 0 0
#7 4 0 1 0
#8 5 0 0 1
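The same steps can be sketched in base R, with rep standing in for uncount. The input frame is not shown in the answer, so the one below is reconstructed to be consistent with the printed result and should be treated as an assumption.

```r
# Input reconstructed from the printed result above (an assumption)
df <- data.frame(
  A = 1:5,
  B = c(0, 0, 3, 0, 0),
  C = c(1, 0, 0, 1, 0),
  D = c(0, 2, 0, 0, 1)
)

# Number of copies per row: the largest of B, C and D
times <- pmax(df$B, df$C, df$D)

# Repeat each row index `times` times
out <- df[rep(seq_len(nrow(df)), times = times), ]
rownames(out) <- NULL

# Cap every non-A column at 1
out[-1] <- lapply(out[-1], pmin, 1)
out
```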
How to truncate multiple columns in R
Write the code that you want to apply to each column in a function and apply it with across.
library(dplyr)
# Clamp a numeric vector to the interval [-3, 3]
func <- function(a) {
  case_when(a >= 3.0 ~ 3.0,
            a <= -3.0 ~ -3.0,
            TRUE ~ a)
}
MyData %>%
  mutate(across(everything(), func, .names = 'T{col}'))
# a b c Ta Tb Tc
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2.3 3.6 1 2.3 3 1
#2 3 1.52 -2.6 3 1.52 -2.6
#3 -1.5 -5.4 -1.2 -1.5 -3 -1.2
#4 3.7 4.6 2.5 3 3 2.5
#5 -4.7 1.5 -4 -3 1.5 -3
#6 5.2 2.2 3 3 2.2 3
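Since truncation to a range is just a lower and an upper bound, it can also be written with pmin and pmax. The sketch below is a base-R alternative rather than the answer's method. MyData is rebuilt from the printed output above, and the clamp helper is a hypothetical name.

```r
# Input reconstructed from the printed result above (an assumption)
MyData <- data.frame(
  a = c(2.3, 3, -1.5, 3.7, -4.7, 5.2),
  b = c(3.6, 1.52, -5.4, 4.6, 1.5, 2.2),
  c = c(1, -2.6, -1.2, 2.5, -4, 3)
)

# Hypothetical helper: cap at `hi` first, then raise to at least `lo`
clamp <- function(x, lo = -3, hi = 3) pmax(pmin(x, hi), lo)

# Add one truncated column per original column, prefixed with "T"
MyData[paste0("T", names(MyData))] <- lapply(MyData, clamp)
MyData
```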
Return max for each column, grouped by ID
For summarisation, tidyverse is more flexible, especially with across
library(dplyr)
df %>%
group_by(ID) %>%
summarise(across(everything(), max))
-output
# A tibble: 3 x 3
# ID col1 col2
#* <chr> <dbl> <dbl>
#1 A 5 10
#2 B 2 4
#3 C 3 6
data
df <- data.frame(ID, col1, col2)
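The vectors ID, col1 and col2 are not defined in the answer. As a rough illustration, the sketch below uses hypothetical values chosen to be consistent with the printed output, and shows a base-R equivalent with aggregate.

```r
# Hypothetical input; the original vectors are not shown in the answer
df <- data.frame(
  ID   = c("A", "A", "B", "C", "C"),
  col1 = c(5, 1, 2, 3, 0),
  col2 = c(10, 2, 4, 1, 6)
)

# One max() per column within each ID group
res <- aggregate(cbind(col1, col2) ~ ID, data = df, FUN = max)
res
```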