Find value closest to x by group in dplyr
which.min() returns the index of the first minimum of a numeric (or logical) vector, so it keeps only one value per group. If several values are equally close to 1.43 and you want to keep all of them, you can use filter():
a %>% group_by(id) %>% filter(abs(b - 1.43) == min(abs(b - 1.43)))
#Source: local data frame [2 x 2]
#Groups: id [2]
# id b
# <chr> <dbl>
#1 A 1.5
#2 B 1.4
If you prefer sticking with the nth() function, and it is OK to have only one value for each group, you can wrap it within summarise() so that it is applied to each group. According to ?nth, you also need to pass the vector to the function as an argument:
a %>% group_by(id) %>% summarise(b = nth(b, which.min(abs(b-1.43))))
# A tibble: 2 × 2
# id b
# <chr> <dbl>
#1 A 1.5
#2 B 1.4
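The difference between the two approaches only shows up when a group contains ties. The following is a minimal base R sketch of that behaviour, using made-up data and a made-up target of 2 (whole numbers are used so that the equality test is exact; with arbitrary doubles an exact == comparison may miss ties):

```r
# Hypothetical data: group A has two values (1 and 3) equally close to 2.
a <- data.frame(id = c("A", "A", "A", "B", "B"),
                b  = c(1, 3, 7, 4, 10))
target <- 2

# which.min() returns only the first closest value in each group.
first_only <- sapply(split(a$b, a$id),
                     function(v) v[which.min(abs(v - target))])

# Comparing each distance with its group minimum keeps every tie,
# which is what the filter() version does.
dist <- abs(a$b - target)
all_ties <- a[dist == ave(dist, a$id, FUN = min), ]

print(first_only)
print(all_ties)
```

Here first_only holds one value per group, while all_ties keeps both tied rows of group A.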
In R: find the closest value within group_by excluding self comparisons
Answered it using a question I asked years ago, "Count values less than x and find nearest values to x by multiple groups":
temp1 <- data%>%
group_by(river) %>%
mutate(n_ds = match(dist,sort(dist))-1) %>%
mutate(closest_uid=apply(sapply(dist, function(i)abs(i-dist)), 2, function(n) id[which(n==sort(n)[2])])) %>%
data.frame()
tempdist <- temp1 %>% select(dist, id) %>% rename(rivDist = dist)
temp2 <- temp1 %>% left_join(tempdist, by = c('closest_uid' = 'id')) %>%
mutate(mindist = abs(dist - rivDist))
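The idea of excluding self comparisons can also be sketched with a distance matrix whose diagonal is set to Inf, so that a row can never be its own nearest neighbour. This is a swapped-in base R variant, not the code above; nearest_other() and the sample vectors are hypothetical names used for illustration, and the sketch assumes each group has at least two rows:

```r
# For each element, return the id of the closest *other* element.
nearest_other <- function(dist, id) {
  d <- abs(outer(dist, dist, "-"))  # all pairwise absolute differences
  diag(d) <- Inf                    # rule out matching a row with itself
  id[apply(d, 1, which.min)]        # first closest neighbour per row
}

# Hypothetical single group (for example, one river).
id   <- c("a", "b", "c", "d")
dist <- c(0, 1, 5, 6)
closest <- nearest_other(dist, id)
print(closest)
```

The same function could be called per group, for instance inside group_by(river) %>% mutate(...).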
Find the nearest value in a column of grouped data, and then their corresponding rows in R
The which should be inside the summarise:
library(dplyr)
df %>%
group_by(treat) %>%
summarise(i = which.min(abs(x - my.val)))
Or, if 'my.val' has multiple elements:
library(purrr)
df %>%
group_by(treat) %>%
summarise(i = map_int(my.val, ~ which.min(abs(x - .x))))
Or we may use findInterval as well:
df %>%
group_by(treat) %>%
summarise(i = findInterval(my.val, x))
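Note that the two indices are not interchangeable. findInterval() expects its second argument to be sorted in non-decreasing order and returns the index of the interval that contains each value, whereas which.min(abs(...)) returns the index of the nearest element. A small sketch with made-up vectors:

```r
x      <- c(1, 4, 10)  # hypothetical values, already sorted
my.val <- c(3, 9)      # hypothetical targets

# Index of the nearest element of x for each target.
nearest_idx <- sapply(my.val, function(v) which.min(abs(x - v)))

# Index of the last element of x that is <= each target.
interval_idx <- findInterval(my.val, x)

print(nearest_idx)
print(interval_idx)
```

For these inputs the nearest elements are 4 and 10, while findInterval() points at 1 and 4, the lower ends of the enclosing intervals.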
Find the closest value in the group for each value in the group R
Are you after something like below?
library(data.table)
setDT(df)[
,
c(
.SD[sex == "F"],
.(closestM_id = id[sex == "M"][max.col(-abs(outer(
time[sex == "F"],
time[sex == "M"], "-"
)))])
), group
]
which gives
group id sex time closestM_id
1: 1 2 F 11.0 3
2: 2 6 F 15.0 5
3: 2 7 F 9.0 5
4: 2 8 F 7.4 5
5: 3 12 F 21.0 9
Data
> dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), sex = c("M",
"F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"), time = c(10,
11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)), class = "data.frame", row.names = c(NA,
-12L))
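The core of the answer above is the max.col(-abs(outer(...))) idiom. As a rough sketch of how it works on its own, with made-up times: outer() builds a matrix of differences with one row per "F" time and one column per "M" time, and max.col() on the negated absolute values returns, per row, the column with the smallest difference. max.col() breaks ties at random by default, so ties.method = "first" is passed here to make the result reproducible:

```r
f_time <- c(11, 15)        # hypothetical "F" times
m_time <- c(10, 11.5, 13)  # hypothetical "M" times

# Rows correspond to f_time, columns to m_time.
d <- abs(outer(f_time, m_time, "-"))

# Column index of the smallest difference in each row.
closest <- max.col(-d, ties.method = "first")

print(d)
print(m_time[closest])
```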
Find the closest value for a certain year in R
Here are three approaches. The first is the clearest: it shows that the problem is really an aggregated and filtered self-join, models it directly, and handles the edge case mentioned in the comments without additional code. The second uses an lapply loop to get the desired effect; it involves more tedious manipulation but has the advantage of zero package dependencies. The last gets around the fact that dplyr lacks complex self joins by performing a left join twice.
1) sqldf Using DF, defined reproducibly in the Note at the end, perform a self join such that the difference in years is -2, -1, 1 or 2, the iso3 codes are the same, and cata10 is not NA in the matching instance. Among those rows we use min(...) to find the row having the minimum absolute difference in the year. This relies on the SQLite feature that min(...) causes the entire row satisfying the minimizing condition to be returned. Finally, take only the 2012 and 2017 rows. The ability of SQL to express the constraints in a complex join lets us translate the requirements directly into code.
library(sqldf)
sqldf("select
a.iso3year iso3year_UHC,
a.UHC,
substr(b.iso3year, 5, 8) year_cata,
b.cata10,
substr(a.iso3year, 5, 8) year,
min(abs(substr(a.iso3year, 5, 8) - substr(b.iso3year, 5, 8))) min_value
from DF a
left join DF b on year - year_cata in (-2, -1, 1, 2) and
substr(a.iso3year, 1, 3) = substr(b.iso3year, 1, 3) and
b.cata10 is not null
group by a.iso3year
having year in ('2012', '2017')")[1:4]
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
2) Base R This solution uses only base R. We first create the year and iso variables by breaking iso3year into two parts. ix is an index into DF giving the rows that have 2012 or 2017 as their year. For each of those rows we find the nearest year having a cata10 value and create a row of the output data frame; lapply returns these as a list of rows, L. Finally we rbind those rows together. This is not as straightforward as (1) but does have the advantage of no package dependencies.
to.year <- function(x) as.numeric(substr(x, 5, 8))
year <- to.year(DF$iso3year)
iso <- substr(DF$iso3year, 1, 3)
ix <- which(year %in% c(2012, 2017))
L <- lapply(ix, function(i) {
DF0 <- na.omit(DF[iso[i] == iso & (year[i] - year) %in% c(-2, -1, 1, 2), ])
if (nrow(DF0)) {
with(DF0[which.min(abs(to.year(DF0$iso3year) - year[i])), c("iso3year", "cata10")],
data.frame(iso3year_UHC = DF$iso3year[i],
UHC = DF$UHC[i],
year_cata = as.numeric(substr(iso3year, 5, 8)),
cata10))
} else {
data.frame(iso3year_UHC = DF$iso3year[i],
UHC = DF$UHC[i],
year_cata = NA,
cata10 = NA)
}
})
do.call("rbind", L)
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
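The per-row step of the base R approach, finding the nearest other year within two years that has a non-missing value, can be isolated into a small helper. This is only a simplified sketch: nearest_year() and its window argument are hypothetical names, and the sample values are rounded from the AFG rows in the Note. If two years are equally near, which.min() picks the earlier one.

```r
# Nearest year (other than target) within `window` years that has a value.
nearest_year <- function(target, years, values, window = 2) {
  ok <- !is.na(values) & years != target & abs(years - target) <= window
  if (!any(ok)) return(NA_integer_)  # no usable neighbour
  cand <- years[ok]
  cand[which.min(abs(cand - target))]
}

years <- 2010:2017
cata  <- c(NA, NA, NA, 14.6, NA, NA, 4.8, NA)

y2012 <- nearest_year(2012L, years, cata)
y2017 <- nearest_year(2017L, years, cata)
y2010 <- nearest_year(2010L, years, cata)  # 2013 is three years away
print(c(y2012, y2017, y2010))
```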
3) dplyr/tidyr First separate iso3year into iso and year columns, giving DF2. Then pick out the 2012 and 2017 rows, giving DF3. Now left join DF3 to DF2 using iso and keep those rows for which cata10 in the joined instance is not NA and the absolute difference in years between the two joined data frames is 1 or 2. Then use slice to pick out the row having the least distance in years and select the desired columns, giving DF4. Finally, left join DF3 with DF4, which will fill out any rows for which there was no match.
library(dplyr)
library(tidyr)
DF2 <- DF %>%
separate(iso3year, c("iso", "year"), remove = FALSE, convert = TRUE)
DF3 <- DF2 %>%
filter(year %in% c(2012, 2017))
DF4 <- DF3 %>%
left_join(DF2, "iso") %>%
drop_na(cata10.y) %>%
filter(abs(year.x - year.y) %in% 1:2) %>%
group_by(iso3year.x) %>%
slice(which.min(abs(year.x - year.y))) %>%
ungroup %>%
select(iso3year = iso3year.x, UHC = UHC.x, year_cata = year.y, cata10 = cata10.y)
DF3 %>%
select(iso3year, UHC) %>%
left_join(DF4, c("iso3year", "UHC"))
giving:
# A tibble: 4 x 4
iso3year UHC year_cata cata10
<chr> <dbl> <int> <dbl>
1 AFG 2012 0.347 2013 14.6
2 AFG 2017 0.395 2016 4.84
3 AGO 2012 0.340 2011 12.4
4 AGO 2017 0.376 2015 16.9
Note
Lines <- "iso3year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2016 0.3855273 4.837534
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA"
DF <- read.csv(text = gsub(" +", ",", Lines), as.is = TRUE)
Find closest value between vectors in data frame by group
Could go for:
library(dplyr)
df %>%
group_by(id) %>%
mutate(
closest_visit = case_when(
visit_id == true_visit ~ true_visit,
TRUE ~ true_visit[sapply(visit_id,
function(x) which.min(abs(x - true_visit)))]
)
)
Output:
# A tibble: 9 x 4
# Groups: id [3]
id visit_id true_visit closest_visit
<chr> <dbl> <dbl> <dbl>
1 a 0 NA 3
2 a 5 3 3
3 a 10 NA 3
4 b 0 0 0
5 b 5 5 5
6 b 10 10 10
7 c 0 1 1
8 c 5 7 7
9 c 10 NA 7
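The NA entries in true_visit do not get in the way because which.min() discards missing values when it looks for the minimum. A small sketch of the lookup step on its own, using the values of group "c" from the output above:

```r
visit_id   <- c(0, 5, 10)
true_visit <- c(1, 7, NA)

# For each visit, index of the nearest non-missing true visit.
idx <- sapply(visit_id, function(x) which.min(abs(x - true_visit)))
closest_visit <- true_visit[idx]

print(closest_visit)
```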
Group By and Locate the Closest Number
We can use a rolling join after creating the combinations with the unique elements of 'expiration' from the second dataset
library(data.table)
library(tidyr)
df1N <- crossing(df1, expiration = unique(df2$expiration))
setDT(df2)[, Price := strike][df1N, on = .(ticker, expiration, Price), roll = -Inf]
# ticker expiration strike Price
#1: SPY 621 205 200
#2: SPY 719 205 200
#3: AAPL 621 100 100
#4: AAPL 719 100 100
Or do a full_join and then slice based on the minimum absolute difference between the 'Price' and 'strike' columns after grouping by 'ticker' and 'expiration'
library(dplyr)
full_join(df1, df2) %>%
group_by(ticker, expiration) %>%
slice(which.min(abs(Price - strike)))
# A tibble: 4 x 4
# Groups: ticker, expiration [4]
# ticker Price expiration strike
# <fct> <dbl> <dbl> <dbl>
#1 AAPL 100 621 100
#2 AAPL 100 719 100
#3 SPY 200 621 205
#4 SPY 200 719 205
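For readers without data.table or dplyr, the join-then-pick-the-minimum idea can be approximated in base R. This is a sketch under assumptions: df1 and df2 below are invented stand-ins with the same column names, not the original data.

```r
# Hypothetical inputs with the same column names as in the answer.
df1 <- data.frame(ticker = c("SPY", "AAPL"), Price = c(200, 100))
df2 <- data.frame(ticker     = c("SPY", "SPY", "SPY", "SPY", "AAPL", "AAPL"),
                  expiration = c(621, 621, 719, 719, 621, 621),
                  strike     = c(190, 205, 205, 220, 100, 110))

# Join on ticker, then keep the closest strike per ticker/expiration.
m <- merge(df1, df2, by = "ticker")
m$diff <- abs(m$Price - m$strike)
groups <- split(m, list(m$ticker, m$expiration), drop = TRUE)
best <- do.call(rbind, lapply(groups, function(g) g[which.min(g$diff), ]))

print(best)
```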
Find closest match, then next closest, between groups until a specified number of matches has been made
I edited the code above that produces df.result by removing the line top_n(-5, DIFF) %>%. Now res1 contains all matches of Sample.x and Sample.y.
Then I used res1 in the code below. This is probably not perfect, but what it does is find the closest Sample.y match for the first entry of Sample.x. Then both of these samples are filtered out of the data frame. The matching repeats until matches are found for every unique value of Sample.y. The result might differ depending on which match was made first.
library(dplyr)
library(purrr) # for map_df()
fun <- function(df) {
HowMany <- length(unique(df$Sample.y))
i <- 1
MyList_FF <- list()
df_f <- df
while (i <= HowMany){
res1 <- df_f %>%
group_by(grp, VAR, Sample.x) %>%
filter(DIFF == min(DIFF)) %>%
ungroup() %>%
mutate(Rank1 = dense_rank(DIFF))
res2 <- res1 %>% group_by(grp, VAR) %>% filter(rank(Rank1, ties.method="first")==1)
SY <- as.numeric(res2$Sample.y)
SX <- as.numeric(res2$Sample.x)
res3 <- df_f %>% filter(Sample.y != SY) # filter Sample.y
res4 <- res3 %>% filter(Sample.x != SX) # filter Sample.x
df_f <- res4
MyList_FF[[i]] <- res2
i <- i + 1
}
do.call("rbind", MyList_FF) # https://stackoverflow.com/a/55542822/1670053
}
df <- res1
MyResult <- df %>%
dplyr::group_split(grp, VAR) %>%
map_df(fun)
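A related but different greedy rule is to take the globally smallest remaining DIFF at each step, rather than starting from the first entry of Sample.x. The sketch below shows that variant in base R on a made-up table of candidate pairs; greedy_match() is a hypothetical helper, and like the code above it gives no guarantee of an overall optimal assignment.

```r
# Greedy one-to-one matching: repeatedly take the pair with the smallest
# difference, then drop every remaining pair that reuses either sample.
greedy_match <- function(pairs) {
  matched <- pairs[0, ]
  while (nrow(pairs) > 0) {
    best <- pairs[which.min(pairs$DIFF), ]
    matched <- rbind(matched, best)
    keep <- pairs$Sample.x != best$Sample.x & pairs$Sample.y != best$Sample.y
    pairs <- pairs[keep, ]
  }
  matched
}

# Hypothetical candidate pairs: (1,3), (2,3), (1,4), (2,4).
pairs <- expand.grid(Sample.x = 1:2, Sample.y = 3:4)
pairs$DIFF <- c(0.1, 0.2, 0.5, 0.9)

result <- greedy_match(pairs)
print(result)
```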
Closest value to a specific column in R
Use max.col(-abs(data[, 3] - data[, -3])) to find the column positions of the closest values, and use this result as part of a matrix to extract the desired values from your data. The matrix is returned by cbind:
col <- 3
data[, -col][cbind(1:nrow(data),
max.col(-abs(data[, col] - data[, -col])))]
#[1] 24 30 20
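As a self-contained illustration, here is the same idea on an invented data frame whose third column is the reference; the column names and values are assumptions chosen for the example. The other columns are converted to a matrix first so that the two-column index from cbind() selects one cell per row, and ties.method = "first" is used because max.col() otherwise breaks ties at random.

```r
# Hypothetical data: find, per row, the value in a or b closest to ref.
data <- data.frame(a = c(10, 30, 5), b = c(24, 50, 20), ref = c(25, 31, 18))
col <- 3

others <- as.matrix(data[, -col])

# Column position of the closest value in each row.
idx <- max.col(-abs(others - data[, col]), ties.method = "first")

# Matrix indexing with (row, column) pairs picks one value per row.
closest <- others[cbind(seq_len(nrow(data)), idx)]
print(closest)
```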