Return pmin or pmax of data.frame with multiple columns

Use do.call to call pmax to compare all the columns together for each row value, e.g.:

dat <- data.frame(a=1:5,b=rep(3,5))

# a b
#1 1 3
#2 2 3
#3 3 3
#4 4 3
#5 5 3

do.call(pmax,dat)
#[1] 3 3 3 4 5

When you call pmax on an entire data.frame directly, the function receives a single argument and has nothing to compare it against, so it simply returns that argument unchanged. This holds for numeric and non-numeric arguments alike, even though a one-argument call does not make much sense:

pmax(7)
#[1] 7

pmax("a")
#[1] "a"

pmax(data.frame(1,2,3))
# X1 X2 X3
#1 1 2 3

Using do.call(pmax, ...) with a data.frame passes each column of the data.frame as a separate argument to pmax, because a data.frame is a list of columns:

do.call(pmax,dat) 

is thus equivalent to:

pmax(dat$a, dat$b)
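
The same pattern applies to pmin, which the title asks about but the example above does not show. A minimal sketch, reusing the dat frame from above (the variable names row_max, row_min, same_max and same_min are only for illustration):

```r
dat <- data.frame(a = 1:5, b = rep(3, 5))

# Row-wise maximum and minimum across every column of the frame
row_max <- do.call(pmax, dat)
row_min <- do.call(pmin, dat)

# do.call() spreads the columns out as separate arguments,
# so these match the explicit two-argument calls
same_max <- isTRUE(all.equal(row_max, pmax(dat$a, dat$b)))
same_min <- isTRUE(all.equal(row_min, pmin(dat$a, dat$b)))

row_min
#[1] 1 2 3 3 3
```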

How to use a range for columns instead of names for pmax / pmin

Here's an option that makes a single function call covering all rows and all selected columns at once.

foo %>%
mutate(maxcol = do.call(pmax, subset(., select = a:e)))
# a b c d e f g h i j k l m n o p q r s t u v w x y z maxcol
# 1 1 4 9 2 4 4 1 10 2 3 10 4 7 1 10 9 8 2 8 9 5 1 9 1 10 9 9
# 2 5 2 5 3 5 2 8 8 5 8 2 3 6 10 9 3 5 8 7 4 6 9 8 5 8 3 5
# 3 10 9 6 1 7 10 6 4 4 7 6 6 2 7 5 5 4 1 10 7 3 10 5 10 1 7 10
# 4 8 1 4 8 9 3 3 9 10 1 8 5 8 4 4 8 6 10 5 2 9 5 7 7 3 1 9
# 5 2 10 2 9 8 9 9 6 7 5 9 2 5 5 7 4 2 5 4 8 4 6 6 2 9 6 10

You can select some or all of the columns using the colon notation, even arbitrary columns:

foo %>%
mutate(maxcol = do.call(pmax, subset(., select = c(a:e,g))))
# a b c d e f g h i j k l m n o p q r s t u v w x y z maxcol
# 1 1 4 9 2 4 4 1 10 2 3 10 4 7 1 10 9 8 2 8 9 5 1 9 1 10 9 9
# 2 5 2 5 3 5 2 8 8 5 8 2 3 6 10 9 3 5 8 7 4 6 9 8 5 8 3 8
# 3 10 9 6 1 7 10 6 4 4 7 6 6 2 7 5 5 4 1 10 7 3 10 5 10 1 7 10
# 4 8 1 4 8 9 3 3 9 10 1 8 5 8 4 4 8 6 10 5 2 9 5 7 7 3 1 9
# 5 2 10 2 9 8 9 9 6 7 5 9 2 5 5 7 4 2 5 4 8 4 6 6 2 9 6 10

This should be preferred over the other answers (which generally use allegedly idiomatic methods) because:

  • in Dom's answer, the max function is called once for each row of the frame; R's vectorized operations are not used, which is inefficient and should be avoided where possible;
  • in akrun's answer, pmax is called once for each column of the frame, which in this case might sound worse but is actually closer to the best one can do. My answer is closest to akrun's in that we are selecting data within the mutate.
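
To make the three calling patterns concrete, here is a rough base-R sketch. It is not the code from either of the other answers: apply stands in for the row-wise approach and Reduce for the column-wise one, and the small frame df is made up for illustration.

```r
set.seed(1)
# Hypothetical 4x3 frame, only for illustration
df <- data.frame(a = sample(1:10, 4), b = sample(1:10, 4), c = sample(1:10, 4))

# Once per row: apply() calls max() nrow(df) times
by_row <- unname(apply(df, 1, max))

# Pairwise over columns: Reduce() calls pmax() ncol(df) - 1 times
by_col <- Reduce(pmax, df)

# Once in total: a single pmax() call receives every column
at_once <- do.call(pmax, df)

# All three give the same row maxima; only the number of calls differs
all(by_row == by_col) && all(by_col == at_once)
```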

If you'd prefer to use dplyr::select over base::subset, it needs to be broken out as

foo %>%
mutate(maxcol = select(., a:e, g) %>% do.call(pmax, .))

I think this is demonstrated a little better with benchmarks. Using the provided 5x26 frame, we see a clear improvement:

library(dplyr)
library(purrr)  # for reduce()

set.seed(42)
foo <- data.frame(sapply(letters, function(x) x = sample(1:10,5)))
microbenchmark::microbenchmark(
  Dom = {
    foo %>%
      rowwise() %>%
      summarise(max= max(c_across(a:z)))
  },
  akr = {
    foo %>%
      mutate(maxcol = reduce(select(., a:z), pmax))
  },
  r2 = {
    foo %>%
      mutate(maxcol = do.call(pmax, subset(., select = a:z)))
  }
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Dom 6.6561 7.15260 7.61574 7.38345 7.90375 11.0387 100
# akr 4.2849 4.69920 4.96278 4.86110 5.18130 7.0908 100
# r2 2.3290 2.49285 2.68671 2.59180 2.78960 4.7086 100

Let's try with a slightly larger 5000x26:

set.seed(42)
foo <- data.frame(sapply(letters, function(x) x = sample(1:10,5000,replace=TRUE)))
microbenchmark::microbenchmark(
  Dom = {
    foo %>%
      rowwise() %>%
      summarise(max= max(c_across(a:z)))
  },
  akr = {
    foo %>%
      mutate(maxcol = reduce(select(., a:z), pmax))
  },
  r2 = {
    foo %>%
      mutate(maxcol = do.call(pmax, subset(., select = a:z)))
  }
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Dom 515.6437 563.6060 763.97348 811.45815 883.00115 1775.2366 100
# akr 4.6660 5.1619 11.92847 5.74050 6.50625 293.7444 100
# r2 2.9253 3.4371 4.24548 3.71845 4.27380 14.0958 100

This last one clearly shows the cost of using rowwise. The relative performance of akrun's answer and this one is almost the same as with 5 rows, reinforcing the premise that column-wise is better than row-wise (and all-at-once is faster than both).

This can also be done with purrr::invoke, if truly desired, though it does not speed it up (note that invoke() is deprecated as of purrr 1.0.0 in favor of exec()):

library(purrr)
foo %>%
mutate(maxcol = invoke(pmax, subset(., select = a:z)))

### microbenchmark(...)
# Unit: milliseconds
# expr min lq mean median uq max neval
# Dom 7.8292 8.40275 9.02813 8.97345 9.38500 12.4368 100
# akr 4.9622 5.28855 8.78909 5.60090 6.11790 309.2607 100
# r2base 2.5521 2.74635 3.01949 2.90415 3.21060 4.6512 100
# r2purrr 2.5063 2.77510 3.11206 2.93415 3.33015 5.2403 100

How to get pmax over multiple variables with dplyr?

Using do.call, we can evaluate pmax without naming each variable, e.g.:

mtcars %>% 
mutate(new = do.call(pmax, c(select(., c(1, 7)), na.rm = TRUE)))

# mpg cyl disp hp drat wt qsec vs am gear carb new
#1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 21.00
#2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 21.00
#3 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 22.80
#4 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 21.40
#5 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 18.70
#6 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 20.22
#7 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 15.84
#...
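
The c(..., na.rm = TRUE) idiom works because c() on a list of columns appends a named element, which do.call then hands to pmax as its na.rm argument. A small sketch, assuming a made-up frame df_na with missing values:

```r
# Hypothetical frame with NAs, only for illustration
df_na <- data.frame(x = c(1, NA, 3), y = c(2, NA, NA))

# The argument list now has three entries: x, y and na.rm
arg_list <- c(df_na, na.rm = TRUE)
names(arg_list)

# With na.rm = TRUE an NA is returned only where every column is NA
with_na_rm <- do.call(pmax, arg_list)   # 2 NA 3

# Without it, any NA in a row makes the result NA
without_na_rm <- do.call(pmax, df_na)   # 2 NA NA
```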

Min and Max across multiple columns with NAs

You can use hablar's min_ and max_ functions, which return NA if all values are NA.

library(dplyr)
library(hablar)

dat %>%
  rowwise() %>%
  mutate(min = min_(c_across(-ID)),
         max = max_(c_across(-ID)))

You can also use these with apply:

cbind(dat, t(apply(dat[-1], 1, function(x) c(min = min_(x), max = max_(x)))))

# ID PM TP2 Sigma min max
#1 1 1 2 3 1 3
#2 2 0 NA 1 0 1
#3 3 2 1 NA 1 2
#4 4 1 0 2 0 2
#5 NA NA NA NA NA NA
#6 5 2 0 7 0 7
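
If adding a package is not an option, pmin and pmax with na.rm = TRUE also return NA for a row whose values are all NA. A base-R sketch, assuming the input frame looks like the printed output above (the original data is not shown, so dat is reconstructed here and cols is an illustrative name):

```r
# Frame reconstructed from the printed output; an assumption
dat <- data.frame(ID    = c(1, 2, 3, 4, NA, 5),
                  PM    = c(1, 0, 2, 1, NA, 2),
                  TP2   = c(2, NA, 1, 0, NA, 0),
                  Sigma = c(3, 1, NA, 2, NA, 7))

# Columns to compare, i.e. everything except ID
cols <- c("PM", "TP2", "Sigma")

dat$min <- do.call(pmin, c(dat[cols], na.rm = TRUE))
dat$max <- do.call(pmax, c(dat[cols], na.rm = TRUE))
```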

r - Find corresponding value from multiple columns according to pmin in multiple columns

We can use row/column matrix indexing to extract the elements of the 'P1', 'P2', 'P3' columns that correspond to the 'D1', 'D2', 'D3' name stored in 'num':

m1 <- cbind(seq_len(nrow(df)), match(df$num, c("D1", "D2", "D3")))
df$NP <- df[c("P1", "P2", "P3")][m1]
df$NP
#[1] 11 11 11 40 22

data

df <- structure(list(Item = c("A", "B", "C", "D", "E"), P = c(10L, 
10L, 10L, 50L, 20L), P1 = c(8L, 8L, 8L, 40L, 15L), P2 = c(11L,
11L, 11L, 35L, 22L), P3 = c(20L, 20L, 20L, 70L, 30L), D1 = c(2L,
2L, 2L, 10L, 5L), D2 = c(1L, 1L, 1L, 15L, 2L), D3 = c(10L, 10L,
10L, 20L, 10L), pmin = c(1L, 1L, 1L, 10L, 2L), num = c("D2",
"D2", "D2", "D1", "D2"), NP = c(11L, 11L, 11L, 40L, 22L)),
class = "data.frame", row.names = c(NA,
-5L))
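
The indexing trick relies on the fact that a two-column numeric matrix used as an index is read as (row, column) pairs. A reduced sketch with made-up names (prices, choice, idx, picked):

```r
# Hypothetical 3x3 lookup table and one column choice per row
prices <- data.frame(P1 = c(8, 40, 15), P2 = c(11, 35, 22), P3 = c(20, 70, 30))
choice <- c("D2", "D1", "D2")

# Each row of idx is a (row, column) pair:
# match() turns "D1"/"D2"/"D3" into column positions 1/2/3
idx <- cbind(seq_len(nrow(prices)), match(choice, c("D1", "D2", "D3")))

# One element is picked per row of idx
picked <- prices[idx]
picked
#[1] 11 40 22
```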

Using pmax/pmin with vector of variable string names in R

We may use invoke (similar to do.call in base R) with across, where values is the character vector of column names:

library(purrr)
library(dplyr)
out <- mtcars %>%
mutate(maxval = invoke(pmax, c(across(all_of(values)), na.rm = TRUE)))
# or use do.call
# mutate(maxval = do.call(pmax, c(across(all_of(values)), na.rm = TRUE)))

-output

> head(out)
mpg cyl disp hp drat wt qsec vs am gear carb maxval
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.900
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.900
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.850
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.215
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3.440
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3.460

Or we may use exec as well:

out2 <- mtcars %>%
mutate(maxval = exec(pmax, !!! rlang::syms(values), na.rm = TRUE))

-output

> head(out2)
mpg cyl disp hp drat wt qsec vs am gear carb maxval
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.900
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.900
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.850
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.215
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3.440
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 3.460
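
For comparison, a base-R sketch of the same idea that needs no packages. The vector values is not shown above, so the definition below is an assumption, picked because it is consistent with the printed maxval column:

```r
# Assumed for illustration: the question's `values` is not shown above
values <- c("drat", "wt")

# mtcars[values] selects the columns by name; do.call spreads them into pmax
maxval <- do.call(pmax, c(mtcars[values], na.rm = TRUE))
head(maxval)
#[1] 3.900 3.900 3.850 3.215 3.440 3.460
```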

Find the earliest and latest date within each row in R

We can use pmax and pmin on the 'date' columns to return the earliest and latest date for each row

library(dplyr)
df %>%
  mutate(max_date = do.call(pmax, c(select(., starts_with('date')), na.rm = TRUE)),
         min_date = do.call(pmin, c(select(., starts_with('date')),
                                    na.rm = TRUE)))
# ID Other_columns date_column date_column2 date_column3 max_date min_date
#1 1 numeric 2019-11-04 19:33:50 2019-11-05 15:33:50 2019-11-05 16:33:50 2019-11-05 16:33:50 2019-11-04 19:33:50
#2 2 numeric <NA> 2019-11-04 17:20:10 2019-11-09 19:12:50 2019-11-09 19:12:50 2019-11-04 17:20:10
#3 3 numeric 2019-11-07 20:33:50 <NA> 2019-11-04 18:31:50 2019-11-07 20:33:50 2019-11-04 18:31:50
#4 4 <NA> <NA> <NA> <NA> <NA> <NA>

Another option is rowwise with c_across:

df %>%
  rowwise() %>%
  mutate(max_date = max(as.POSIXct(c_across(starts_with('date'))),
                        na.rm = TRUE),
         min_date = min(as.POSIXct(c_across(starts_with('date'))),
                        na.rm = TRUE))

-output

# A tibble: 4 x 7
# Rowwise:
# ID Other_columns date_column date_column2 date_column3 max_date min_date
# <int> <chr> <chr> <chr> <chr> <dttm> <dttm>
#1 1 numeric 2019-11-04 19:33:50 2019-11-05 15:33:50 2019-11-05 16:33:50 2019-11-05 16:33:50 2019-11-04 19:33:50
#2 2 numeric <NA> 2019-11-04 17:20:10 2019-11-09 19:12:50 2019-11-09 19:12:50 2019-11-04 17:20:10
#3 3 numeric 2019-11-07 20:33:50 <NA> 2019-11-04 18:31:50 2019-11-07 20:33:50 2019-11-04 18:31:50
#4 4 <NA> <NA> <NA> <NA> NA NA NA NA

data

df <- structure(list(ID = 1:4, Other_columns = c("numeric", "numeric", 
"numeric", NA), date_column = c("2019-11-04 19:33:50", NA, "2019-11-07 20:33:50",
NA), date_column2 = c("2019-11-05 15:33:50", "2019-11-04 17:20:10",
NA, NA), date_column3 = c("2019-11-05 16:33:50", "2019-11-09 19:12:50",
"2019-11-04 18:31:50", NA)), class = "data.frame", row.names = c(NA,
-4L))
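
The date columns here are character strings, so the pmax/pmin call above compares them as text. If true date-time results are wanted, one possible variant is to parse the columns first. This is a sketch on a made-up two-row frame d, and the UTC time zone is an assumption:

```r
# Hypothetical frame with date-times stored as character
d <- data.frame(date_a = c("2019-11-04 19:33:50", NA),
                date_b = c("2019-11-05 15:33:50", "2019-11-04 17:20:10"),
                stringsAsFactors = FALSE)

# Convert each column to POSIXct so comparisons are chronological
parsed <- lapply(d, as.POSIXct, tz = "UTC")

# pmax/pmin also work on POSIXct vectors and keep the class
latest   <- do.call(pmax, c(parsed, na.rm = TRUE))
earliest <- do.call(pmin, c(parsed, na.rm = TRUE))

format(latest, "%Y-%m-%d %H:%M:%S")
```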


How to repeat rows by their value by multiple columns and divide back

One option is to get the maximum of columns B, C and D using pmax and use uncount to repeat each row that many times. Then use pmin to cap any values greater than 1 at 1.

library(dplyr)
library(tidyr)

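df %>%
  mutate(repeat_row = pmax(B, C, D)) %>%
  uncount(repeat_row) %>%
  # cap every column except A at 1; the lambda form avoids passing
  # extra arguments through across(), which dplyr 1.1.0 deprecated
  mutate(across(-A, ~ pmin(.x, 1)))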

# A B C D
#1 1 0 1 0
#2 2 0 0 1
#3 2 0 0 1
#4 3 1 0 0
#5 3 1 0 0
#6 3 1 0 0
#7 4 0 1 0
#8 5 0 0 1
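
The same steps can be sketched in base R with rep(). The input frame is not shown above, so the df below is a guess chosen to be consistent with the printed result, and the names times and out are illustrative:

```r
# Hypothetical input, chosen to be consistent with the output above
df <- data.frame(A = 1:5,
                 B = c(0, 0, 3, 0, 0),
                 C = c(1, 0, 0, 1, 0),
                 D = c(0, 2, 0, 0, 1))

# Number of copies per row: the largest of B, C, D
times <- pmax(df$B, df$C, df$D)

# Repeat each row index `times` times, then cap B, C, D at 1
out <- df[rep(seq_len(nrow(df)), times = times), ]
out[-1] <- lapply(out[-1], pmin, 1)
rownames(out) <- NULL
```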

How to truncate multiple columns in R

Write the code that you want to apply to each column in a function and apply it with across.

library(dplyr)

func <- function(a) {
  # clamp values to the interval [-3, 3]
  case_when(a >= 3.0 ~ 3.0,
            a <= -3.0 ~ -3.0,
            TRUE ~ a)
}

MyData %>%
  mutate(across(everything(), func, .names = 'T{col}'))

# a b c Ta Tb Tc
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2.3 3.6 1 2.3 3 1
#2 3 1.52 -2.6 3 1.52 -2.6
#3 -1.5 -5.4 -1.2 -1.5 -3 -1.2
#4 3.7 4.6 2.5 3 3 2.5
#5 -4.7 1.5 -4 -3 1.5 -3
#6 5.2 2.2 3 3 2.2 3
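
Since truncating to a range is a clamp, it can also be written with pmin and pmax directly. A base-R sketch, where MyData is reconstructed from the printed output and the helper name clamp with its lo and hi parameters is made up for illustration:

```r
# Hypothetical data, matching the columns printed above
MyData <- data.frame(a = c(2.3, 3, -1.5, 3.7, -4.7, 5.2),
                     b = c(3.6, 1.52, -5.4, 4.6, 1.5, 2.2),
                     c = c(1, -2.6, -1.2, 2.5, -4, 3))

# pmin caps values at hi, then pmax raises anything below lo
clamp <- function(x, lo = -3, hi = 3) pmax(pmin(x, hi), lo)

# Apply to every column and prefix the new names with "T"
truncated <- as.data.frame(lapply(MyData, clamp))
names(truncated) <- paste0("T", names(MyData))
result <- cbind(MyData, truncated)
```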

Return max for each column, grouped by ID

For summarisation, the tidyverse is more flexible, especially with across:

library(dplyr)
df %>%
group_by(ID) %>%
summarise(across(everything(), max))

-output

# A tibble: 3 x 3
# ID col1 col2
#* <chr> <dbl> <dbl>
#1 A 5 10
#2 B 2 4
#3 C 3 6

data

 df <- data.frame(ID, col1, col2)
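
A base-R counterpart, in case dplyr is not available, is aggregate. The vectors ID, col1 and col2 are not shown above, so the values below are assumptions chosen to reproduce the printed result:

```r
# Hypothetical input, chosen to reproduce the output above
ID   <- c("A", "A", "B", "B", "C")
col1 <- c(1, 5, 2, 1, 3)
col2 <- c(10, 2, 4, 3, 6)
df <- data.frame(ID, col1, col2)

# `. ~ ID` means: summarise every other column, grouped by ID
res <- aggregate(. ~ ID, data = df, FUN = max)
```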

