dplyr filter: Get rows with minimum of variable, but only the first if multiple minima
Just for completeness: Here's the final dplyr solution, derived from the comments of @hadley and @Arun:
library(dplyr)
df.g <- group_by(df, A)
filter(df.g, rank(x, ties.method="first")==1)
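As a minimal sketch of why this works, the example below uses a made-up data frame (the columns A, x and id and their values are assumptions, not the asker's data) in which group "a" has a tied minimum. It also shows slice_min() with with_ties = FALSE, available in dplyr 1.0.0 and later, as one possible alternative.

```r
library(dplyr)

# Hypothetical toy data: group "a" has a tied minimum (x == 1 twice).
df <- data.frame(A  = c("a", "a", "a", "b", "b"),
                 x  = c(1, 1, 3, 5, 2),
                 id = 1:5)

# rank(..., ties.method = "first") breaks ties by position,
# so exactly one row per group gets rank 1.
by_rank <- df %>%
  group_by(A) %>%
  filter(rank(x, ties.method = "first") == 1) %>%
  ungroup()

# slice_min() with with_ties = FALSE also keeps a single row per group.
by_slice <- df %>%
  group_by(A) %>%
  slice_min(x, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Both results should contain one row per group, even though group "a" has two rows sharing the minimum.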
filter rows by minimum value relative to a factor column
df %>%
group_by(Some_Factor) %>%
filter(Value == min(Value))
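Note that comparing against min(Value) keeps every row that ties for the minimum. A small sketch with invented data (the values are assumptions for illustration only):

```r
library(dplyr)

# Hypothetical data: group "u" has two rows sharing the minimum Value.
df <- data.frame(Some_Factor = c("u", "u", "u", "v", "v"),
                 Value       = c(2, 2, 7, 4, 9))

all_minima <- df %>%
  group_by(Some_Factor) %>%
  filter(Value == min(Value)) %>%   # keeps all tied minima
  ungroup()

n_rows <- nrow(all_minima)          # both tied rows of "u" plus one row of "v"
```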
Filter maximum and minimum values of multiple columns in R
You can get the data in long format, convert factor values to numeric using parse_number, and for each column name select the max and min rows.
library(dplyr)
df %>%
tidyr::pivot_longer(cols = c(month_pct, year_pct)) %>%
mutate(value = readr::parse_number(as.character(value))) %>%
group_by(name) %>%
slice(which.min(value), which.max(value)) %>%
mutate(max_min = c('min', 'max'), .before = 'id')
# max_min id price name value
# <chr> <int> <dbl> <chr> <dbl>
#1 min 10 1.77 month_pct -19.9
#2 max 1 40.6 month_pct 8.53
#3 min 1 40.6 year_pct -35.3
#4 max 7 54.8 year_pct 1.54
Filter based on minimum date differences greater than zero dplyr
We can replace the values that are less than or equal to 0 with an NA, then use which.min inside of slice so that we don't have to create a new column.
library(tidyverse)
cheese %>%
mutate(measurement_date = as.Date(measurement_date)) %>%
group_by(sample_id, variable, measurement_date) %>%
slice(which.min((measurement_date - date2)*NA^((measurement_date - date2) <=0)))
Output
sample_id variable value measurement_date date2
<dbl> <chr> <dbl> <date> <date>
1 1 a 3.39 2021-06-01 2021-03-26
2 1 b 9.50 2021-06-01 2021-03-20
3 1 b 2.85 2021-08-22 2021-08-05
You could also directly use replace instead of the shorter notation.
cheese %>%
mutate(measurement_date = as.Date(measurement_date)) %>%
group_by(sample_id, variable, measurement_date) %>%
slice(which.min(replace((measurement_date - date2), (measurement_date - date2)<=0, NA)))
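The NA^(condition) shorthand may look cryptic, so here is a small base R sketch on a plain numeric vector (the vector d is made up for illustration). It relies on NA^0 being 1 and NA^1 being NA, and on which.min() skipping NA values.

```r
# Hypothetical differences; only the positive ones should be candidates.
d <- c(-3, 0, 5, 2)

# TRUE is treated as 1 and FALSE as 0, so the mask is NA where d <= 0 and 1 elsewhere.
mask <- NA^(d <= 0)

masked    <- d * mask                             # NA NA 5 2
idx_short <- which.min(masked)                    # position of the smallest positive value
idx_long  <- which.min(replace(d, d <= 0, NA))    # same result via replace()

# If every value is masked, which.min() returns integer(0),
# which is why slice() then drops the whole group.
idx_none <- which.min(c(NA_real_, NA_real_))
```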
Data
cheese <- structure(list(sample_id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1), variable = c("a", "a", "a", "a", "a", "a", "b", "b", "b",
"b", "b", "b"), value = c(3.3895779682789, 4.34911509673111,
6.15568027016707, 9.17387010995299, 2.8151373793371, 9.08550716470927,
9.50207741744816, 6.94718013238162, 6.66202639508992, 1.55607643420808,
2.85377117409371, 2.58901077276096), measurement_date = c("2021-06-01",
"2021-06-01", "2021-06-01", "2021-08-22", "2021-08-22", "2021-08-22",
"2021-06-01", "2021-06-01", "2021-06-01", "2021-08-22", "2021-08-22",
"2021-08-22"), date2 = structure(c(18712, 18904, 18989, 18957,
18890, 18956, 18706, 18840, 18664, 18732, 18844, 18792), class = "Date")), class = "data.frame", row.names = c(NA,
-12L))
dplyr: Get rows with minimum and maximum of variable
Try
df %>%
group_by(set_nbr) %>%
filter(time==max(time))
# time Mz set_nbr
#1 29495.45 -0.50902297 1
#2 39297.27 -0.22218980 2
#3 29495.45 -0.00999671 3
Or
df %>%
group_by(set_nbr) %>%
slice(which.max(time))
# time Mz set_nbr
#1 29495.45 -0.50902297 1
#2 39297.27 -0.22218980 2
#3 29495.45 -0.00999671 3
Regarding why your code didn't work
df %>%
group_by(set_nbr) %>%
slice(which(Mz <0)) %>%
mutate(rn = rank(time, ties.method='max'))
# time Mz set_nbr rn
#1 24594.55 -0.04729751 1 2
#2 29495.45 -0.50902297 1 3
#3 24594.55 -0.04376393 1 2
#4 39297.27 -0.22218980 2 3
#5 24594.55 -0.36407263 2 1
#6 34396.36 -0.38341534 2 2
#7 19693.64 -0.34597255 3 2
#8 14792.73 -0.01480776 3 1
#9 29495.45 -0.00999671 3 3
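If you look at the output, for the 'set_nbr' group '1', there is no '1' for 'rn' because the two tied times both get rank 2 with ties.method='max'. You could instead rank the negated times with ties.method='first', so that exactly one row per group gets rank 1 (note that filter() takes a bare condition, not a named argument)
df %>%
group_by(set_nbr) %>%
slice(which(Mz <0)) %>%
filter(rank(-time, ties.method='first')==1)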
# time Mz set_nbr
#1 29495.45 -0.50902297 1
#2 39297.27 -0.22218980 2
#3 29495.45 -0.00999671 3
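To see the difference between the two tie-breaking rules in isolation, here is a short base R sketch using the three time values of group 1 from the output above:

```r
# The three 'time' values of set_nbr group 1 (two of them are tied).
time <- c(24594.55, 29495.45, 24594.55)

# With "max", tied values share the highest rank they span, so rank 1 is never assigned.
rank_max <- rank(time, ties.method = "max")

# Negating 'time' ranks the largest value first; "first" breaks ties by position.
rank_first <- rank(-time, ties.method = "first")
```

Here rank_max is 2 3 2 and rank_first is 2 1 3, so only the second approach yields a row with rank 1.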
Simplify dplyr code in R for selecting minimum value in a dataset
If we want to avoid the repeated assignment, use a chain (%>%). Each of these steps does something distinct, so there may not be much left to simplify in dplyr
library(dplyr)
product %>%
filter(Grade == 'Premium') %>% # filter first, while Grade is still in the data
select(Date, Price) %>%
arrange(Price) %>%
slice_head(n = 3) # n must be passed by name
In base R, we may simplify this
out <- subset(product, select = c(Date, Price), subset = Grade == 'Premium')
head(out[order(out$Price),], 3)
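For comparison, a self-contained sketch on an invented product table (column values are assumptions; only the column names come from the question). It runs the chained version and, as one possible shortening, slice_min() from dplyr 1.0.0 or later, which combines the sort and the row selection.

```r
library(dplyr)

# Hypothetical data with the columns used above.
product <- data.frame(
  Date  = as.Date("2024-01-01") + 0:5,
  Grade = c("Premium", "Basic", "Premium", "Premium", "Premium", "Basic"),
  Price = c(12, 3, 9, 15, 10, 4)
)

# Chained version: filter, keep two columns, sort, take the first three rows.
cheapest_chain <- product %>%
  filter(Grade == "Premium") %>%
  select(Date, Price) %>%
  arrange(Price) %>%
  slice_head(n = 3)

# Possible alternative: slice_min() picks the three lowest prices directly.
cheapest_slice <- product %>%
  filter(Grade == "Premium") %>%
  slice_min(Price, n = 3, with_ties = FALSE) %>%
  select(Date, Price)
```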
Extract row corresponding to minimum value of a variable by group
Slightly more elegant:
library(data.table)
DT[ , .SD[which.min(Employees)], by = State]
State Company Employees
1: AK D 24
2: RI E 19
Slightly less elegant than using .SD, but a bit faster (for data with many groups):
DT[DT[ , .I[which.min(Employees)], by = State]$V1]
Also, just replace the expression which.min(Employees) with Employees == min(Employees), if your data set has multiple identical min values and you'd like to subset all of them.
See also Subset rows corresponding to max value by group using data.table.
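A small sketch of the two variants side by side, on made-up data (the companies and counts are assumptions) where state RI has a tied minimum:

```r
library(data.table)

# Hypothetical data: RI has two companies sharing the minimum of 19 employees.
DT <- data.table(State     = c("AK", "AK", "RI", "RI", "RI"),
                 Company   = c("A", "D", "E", "F", "G"),
                 Employees = c(30, 24, 19, 19, 40))

# .I holds row numbers of DT; which.min() keeps only the first minimum per group.
first_min <- DT[DT[, .I[which.min(Employees)], by = State]$V1]

# Comparing against min() keeps every row that ties for the minimum.
all_min <- DT[DT[, .I[Employees == min(Employees)], by = State]$V1]
```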
Filter data but keep at least one row for each ID
You can use tidyr::complete():
df %>%
filter(col1 == 1 | col2 == 1) %>%
tidyr::complete(id = df$id, fill = list(col3 = "-"))
# # A tibble: 4 × 4
# id col1 col2 col3
# <chr> <dbl> <dbl> <chr>
# 1 a 1 0 A
# 2 a 1 1 B
# 3 b NA NA -
# 4 c 0 1 E
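The input data is not shown here, so the sketch below uses an invented df chosen to be consistent with the printed result (the rows for id "b" are an assumption). It illustrates how complete() re-adds ids that the filter removed entirely.

```r
library(dplyr)
library(tidyr)

# Hypothetical input: no row of id "b" satisfies the filter.
df <- data.frame(id   = c("a", "a", "b", "b", "c"),
                 col1 = c(1, 1, 0, 0, 0),
                 col2 = c(0, 1, 0, 0, 1),
                 col3 = c("A", "B", "C", "D", "E"))

res <- df %>%
  filter(col1 == 1 | col2 == 1) %>%
  # df$id refers to the unfiltered data, so missing ids come back;
  # col3 is filled with "-", the other columns stay NA.
  complete(id = df$id, fill = list(col3 = "-"))
```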
find minimum of 2 columns from a data frame (minimize 2 columns at the same time) in R
If you arrange the data by X and Y, you can select the 1st row of the dataframe.
In dplyr that would be -
library(dplyr)
df %>% arrange(X, Y) %>% slice(1L)
# X Y
#1 1 1
Or in base R -
df[order(df$X, df$Y)[1], ]
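Keep in mind that this ordering is lexicographic: X is minimized first and Y only breaks ties. A short base R sketch with made-up values shows that a row with a smaller Y but larger X is not selected:

```r
# Hypothetical data: the smallest Y (0) occurs only with larger X values.
df <- data.frame(X = c(2, 1, 1, 3),
                 Y = c(0, 5, 1, 0))

# order() sorts by X first, then by Y within equal X.
first_row <- df[order(df$X, df$Y)[1], ]
```

Here first_row is the row with X = 1 and Y = 1.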
Efficient way to filter only first rows where condition is met?
You can use:
data %>%
filter(data.table::rleid(label) == 1)
# A tibble: 2 x 2
label index
<chr> <int>
1 a 1
2 a 2
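data.table::rleid() numbers consecutive runs of equal values, so run 1 is the leading block of identical labels. As a rough illustration of the idea, the sketch below rebuilds such run ids with base R's rle() on a made-up label vector; it is an equivalent for this simple case, not how data.table implements it.

```r
# Hypothetical labels: "a" appears again later, but only the first run should be kept.
label <- c("a", "a", "b", "a")

# rle() gives the length of each run; repeating the run index gives a run id per element.
runs   <- rle(label)
run_id <- rep(seq_along(runs$lengths), runs$lengths)   # 1 1 2 3

first_run <- label[run_id == 1]
```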