Dplyr Filter: Get Rows With Minimum of Variable, But Only the First If Multiple Minima

Just for completeness: Here's the final dplyr solution, derived from the comments of @hadley and @Arun:

library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method="first")==1)
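To see what ties.method="first" changes, here is a minimal sketch on made-up data. The column names A and x follow the answer above; the values and the id column are hypothetical and only there to show which tied row survives.

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied for the minimum of x
df <- data.frame(
  A  = c("a", "a", "a", "b", "b"),
  x  = c(2, 2, 5, 7, 3),
  id = 1:5
)

# Keeps exactly one row per group: the first row that reaches the minimum
first_min <- df %>%
  group_by(A) %>%
  filter(rank(x, ties.method = "first") == 1) %>%
  ungroup()

# Keeps every row that reaches the minimum, so ties are all returned
all_min <- df %>%
  group_by(A) %>%
  filter(x == min(x)) %>%
  ungroup()

first_min$id
all_min$id
```

With this data, first_min keeps ids 1 and 5, while all_min also keeps id 2 because it ties with id 1.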

filter rows by minimum value relative to a factor column

df %>%
group_by(Some_Factor) %>%
filter(Value == min(Value))

Filter maximum and minimum values of multiple columns in R

You can reshape the data to long format, convert the factor values to numeric with parse_number, and then select the min and max rows for each column name.

library(dplyr)

df %>%
tidyr::pivot_longer(cols = c(month_pct, year_pct)) %>%
mutate(value = readr::parse_number(as.character(value))) %>%
group_by(name) %>%
slice(which.min(value), which.max(value)) %>%
mutate(max_min = c('min', 'max'), .before = 'id')

# max_min id price name value
# <chr> <int> <dbl> <chr> <dbl>
#1 min 10 1.77 month_pct -19.9
#2 max 1 40.6 month_pct 8.53
#3 min 1 40.6 year_pct -35.3
#4 max 7 54.8 year_pct 1.54

Filter based on minimum date differences greater than zero dplyr

We can replace the values that are less than or equal to 0 with an NA, then use which.min inside of slice so that we don't have to create a new column.

library(tidyverse)

cheese %>%
mutate(measurement_date = as.Date(measurement_date)) %>%
group_by(sample_id, variable, measurement_date) %>%
slice(which.min((measurement_date - date2)*NA^((measurement_date - date2) <=0)))

Output

  sample_id variable value measurement_date date2     
<dbl> <chr> <dbl> <date> <date>
1 1 a 3.39 2021-06-01 2021-03-26
2 1 b 9.50 2021-06-01 2021-03-20
3 1 b 2.85 2021-08-22 2021-08-05

You could also directly use replace instead of the shorter notation.

cheese %>%
mutate(measurement_date = as.Date(measurement_date)) %>%
group_by(sample_id, variable, measurement_date) %>%
slice(which.min(replace((measurement_date - date2), (measurement_date - date2)<=0, NA)))
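The NA^(condition) shorthand works because, in R, NA^0 is 1 and NA^1 is NA, so multiplying by it blanks out the entries where the condition is TRUE. A small base R sketch with a hypothetical vector d of differences shows that it matches the replace() form:

```r
# Hypothetical differences; only the strictly positive ones should compete
d <- c(-3, 0, 5, 2)

# TRUE is treated as 1 and FALSE as 0, so NA^TRUE is NA and NA^FALSE is 1
mask <- NA^(d <= 0)

masked_short   <- d * mask
masked_replace <- replace(d, d <= 0, NA)

# which.min() skips NA, so it returns the position of the smallest positive value
which.min(masked_short)
which.min(masked_replace)
```

Both calls return position 4 here, the entry with value 2.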

Data

cheese <- structure(list(sample_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1), variable = c("a", "a", "a", "a", "a", "a", "b", "b", "b",
"b", "b", "b"), value = c(3.3895779682789, 4.34911509673111,
6.15568027016707, 9.17387010995299, 2.8151373793371, 9.08550716470927,
9.50207741744816, 6.94718013238162, 6.66202639508992, 1.55607643420808,
2.85377117409371, 2.58901077276096), measurement_date = c("2021-06-01",
"2021-06-01", "2021-06-01", "2021-08-22", "2021-08-22", "2021-08-22",
"2021-06-01", "2021-06-01", "2021-06-01", "2021-08-22", "2021-08-22",
"2021-08-22"), date2 = structure(c(18712, 18904, 18989, 18957,
18890, 18956, 18706, 18840, 18664, 18732, 18844, 18792), class = "Date")), class = "data.frame", row.names = c(NA,
-12L))

dplyr: Get rows with minimum and maximum of variable

Try

df %>% 
group_by(set_nbr) %>%
filter(time==max(time))
# time Mz set_nbr
#1 29495.45 -0.50902297 1
#2 39297.27 -0.22218980 2
#3 29495.45 -0.00999671 3

Or

 df %>%
group_by(set_nbr) %>%
slice(which.max(time))
# time Mz set_nbr
#1 29495.45 -0.50902297 1
#2 39297.27 -0.22218980 2
#3 29495.45 -0.00999671 3

Regarding why your code didn't work

 df %>% 
group_by(set_nbr) %>%
slice(which(Mz <0)) %>%
mutate(rn = rank(time, ties.method='max'))
# time Mz set_nbr rn
#1 24594.55 -0.04729751 1 2
#2 29495.45 -0.50902297 1 3
#3 24594.55 -0.04376393 1 2
#4 39297.27 -0.22218980 2 3
#5 24594.55 -0.36407263 2 1
#6 34396.36 -0.38341534 2 2
#7 19693.64 -0.34597255 3 2
#8 14792.73 -0.01480776 3 1
#9 29495.45 -0.00999671 3 3

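If you look at the output, group '1' of 'set_nbr' has no 'rn' equal to 1: the two tied times both receive rank 2 under ties.method='max'. Ranking -time with ties.method='first' instead gives exactly one row per group a rank of 1, namely the row with the maximum time. You could do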

 df %>% 
group_by(set_nbr) %>%
slice(which(Mz <0)) %>%
filter(rank(-time, ties.method='first')==1)
# time Mz set_nbr
#1 29495.45 -0.50902297 1
#2 39297.27 -0.22218980 2
#3 29495.45 -0.00999671 3
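The effect of ties.method can be checked in base R alone. This sketch uses the three 'time' values of group 1 from the output above:

```r
# 'time' values of set_nbr group 1 after keeping rows with Mz < 0
time <- c(24594.55, 29495.45, 24594.55)

# With "max", tied values share the highest rank of the tie, so no value gets rank 1
rank_max <- rank(time, ties.method = "max")

# Negating time ranks the largest value first; "first" breaks ties by position
rank_first_desc <- rank(-time, ties.method = "first")

rank_max
rank_first_desc
```

rank_max is 2 3 2, which is why comparing it to 1 matches nothing in this group, while rank_first_desc is 2 1 3 and picks out the row with the maximum time.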

Simplify dplyr code in R for selecting minimum value in a dataset

If we want to avoid the repeated assignment, use a chain (%>%). Each step does a distinct job, so there may not be much left to simplify in dplyr.

library(dplyr)
product %>%
# filter on Grade first, because select() would drop that column
filter(Grade == 'Premium') %>%
select(Date, Price) %>%
arrange(Price) %>%
slice_head(n = 3)

In base R, we may simplify this

out <- subset(product, select = c(Date, Price), subset = Grade == 'Premium')
head(out[order(out$Price),], 3)
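As a possible shortening on the dplyr side, arrange() plus slice_head() can be folded into slice_min(). This is a sketch that assumes dplyr 1.0.0 or later; the product data below is hypothetical and only mirrors the column names used above.

```r
library(dplyr)

# Hypothetical data with the columns referenced in the answer
product <- data.frame(
  Date  = as.Date("2021-01-01") + 0:5,
  Grade = c("Premium", "Basic", "Premium", "Premium", "Basic", "Premium"),
  Price = c(30, 5, 10, 20, 8, 40)
)

# slice_min() sorts by Price and keeps the n smallest rows;
# with_ties = FALSE guarantees at most 3 rows even when prices tie
cheapest <- product %>%
  filter(Grade == "Premium") %>%
  slice_min(Price, n = 3, with_ties = FALSE) %>%
  select(Date, Price)

cheapest
```

Whether this reads better than the explicit arrange() and slice_head() pair is a matter of taste; the result is the same three cheapest Premium rows.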

Extract row corresponding to minimum value of a variable by group

Slightly more elegant:

library(data.table)
DT[ , .SD[which.min(Employees)], by = State]

State Company Employees
1: AK D 24
2: RI E 19

Slightly less elegant than using .SD, but a bit faster (for data with many groups):

DT[DT[ , .I[which.min(Employees)], by = State]$V1]

Also, just replace the expression which.min(Employees) with Employees == min(Employees), if your data set has multiple identical min values and you'd like to subset all of them.
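The difference between the two expressions shows up only when a group has tied minima. A minimal sketch on a hypothetical DT (the values are invented so that AK has a tie):

```r
library(data.table)

# Hypothetical data: both AK companies share the minimum of 24 employees
DT <- data.table(
  State     = c("AK", "AK", "RI", "RI"),
  Company   = c("A", "D", "E", "F"),
  Employees = c(24, 24, 19, 30)
)

# which.min() returns a single position, so one row per State
first_only <- DT[, .SD[which.min(Employees)], by = State]

# The logical comparison keeps every row equal to the group minimum
all_ties <- DT[, .SD[Employees == min(Employees)], by = State]

# The faster .I variant returns the same rows as first_only
first_only_fast <- DT[DT[, .I[which.min(Employees)], by = State]$V1]
```

Here first_only and first_only_fast have two rows each, while all_ties has three because both AK rows are kept.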

See also Subset rows corresponding to max value by group using data.table.

Filter data but keep at least one row for each ID

You can use tidyr::complete():

df %>%
filter(col1 == 1 | col2 == 1) %>%
tidyr::complete(id = df$id, fill = list(col3 = "-"))

# # A tibble: 4 × 4
# id col1 col2 col3
# <chr> <dbl> <dbl> <chr>
# 1 a 1 0 A
# 2 a 1 1 B
# 3 b NA NA -
# 4 c 0 1 E

find minimum of 2 columns from a data frame (minimize 2 columns at the same time) in R

If you arrange the data by X and Y, you can select the 1st row of the dataframe.

In dplyr that would be -

library(dplyr)

df %>% arrange(X, Y) %>% slice(1L)

# X Y
#1 1 1

Or in base R -

df[order(df$X, df$Y)[1], ]
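Note that ordering by X and then Y is a lexicographic choice: X is minimized first, and Y only breaks ties among rows with the smallest X. A small sketch on hypothetical data makes that visible:

```r
# Hypothetical data: the overall smallest Y (0) belongs to a row with a large X
df <- data.frame(X = c(2, 1, 1, 3), Y = c(1, 4, 1, 0))

# order() sorts by X first, then by Y within equal X; [1] takes the top row
best <- df[order(df$X, df$Y)[1], ]

best
```

The selected row is X = 1, Y = 1. The row with Y = 0 is not chosen because its X is the largest.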

Efficient way to filter only first rows where condition is met?

You can use:

data %>% 
filter(data.table::rleid(label) == 1)

# # A tibble: 2 x 2
#   label index
#   <chr> <int>
# 1 a         1
# 2 a         2
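rleid() numbers consecutive runs of equal values, so comparing it to 1 keeps only the first run. For readers without data.table, the same run ids can be sketched in base R with rle(); the label vector below is hypothetical.

```r
# Hypothetical labels: "a" appears again later, but only the first run should be kept
label <- c("a", "a", "b", "a", "a")

# rle() gives the length of each run; repeating the run index by those lengths
# yields one run id per element, here 1 1 2 3 3
runs   <- rle(label)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Positions belonging to the first run only
keep <- run_id == 1

which(keep)
```

Only positions 1 and 2 are kept; the later "a" values belong to run 3 and are dropped.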

