Find Value Closest to X by Group in Dplyr

find value closest to x by group in dplyr

which.min() returns the index of the (first) minimum or maximum of a numeric (or logical) vector. If there are multiple equal values as close to 1.43 as each other and you want to keep all of them, you can use filter():

a %>% group_by(id) %>% filter(abs(b - 1.43) == min(abs(b - 1.43)))

#Source: local data frame [2 x 2]
#Groups: id [2]

#     id     b
#  <chr> <dbl>
#1     A   1.5
#2     B   1.4

If you prefer sticking with the nth() function, and it is OK to have only one value for each group, you can wrap it within a summarize function so that it will be applied to each group, and also according to ?nth(), you need to pass the vector to the function as an argument as well:

a %>% group_by(id) %>% summarise(b = nth(b, which.min(abs(b-1.43))))

# A tibble: 2 × 2
#     id     b
#  <chr> <dbl>
#1     A   1.5
#2     B   1.4

In R: find the closest value within group_by excluding self comparisons

Answered it using a question I asked years ago Count values less than x and find nearest values to x by multiple groups

temp1 <- data%>%
  group_by(river) %>%
  mutate(n_ds = match(dist,sort(dist))-1) %>%
  mutate(closest_uid=apply(sapply(dist, function(i)abs(i-dist)), 2, function(n) id[which(n==sort(n)[2])])) %>%
  data.frame()

tempdist <- temp1 %>% select(dist, id) %>% rename(rivDist = dist)

temp2 <- temp1 %>% left_join(tempdist, by = c('closest_uid' = 'id')) %>%
  mutate(mindist = abs(dist - rivDist)

Find the nearest value in a column of grouped data, and then their corresponding rows in R

The which should be inside the summarise

library(dplyr)
df %>%
      group_by(treat) %>%
      summarise(i = which.min(abs(x - my.val)))

Or if we have multiple element in 'my.val'

library(purrr)
df %>%
    group_by(treat) %>%
    summarise(i = map_int(my.val, ~ which.min(abs(x - .x))))

Or may use findInterval as well

df %>%
    group_by(treat) %>% 
    summarise(i = findInterval(my.val, x))

Find the closest value in the group for each value in the group R

Are you after something like below?

setDT(df)[
  ,
  c(
    .SD[sex == "F"],
    .(closestM_id = id[sex == "M"][max.col(-abs(outer(
      time[sex == "F"],
      time[sex == "M"], "-"
    )))])
  ), group
]

which gives

   group id sex time closestM_id
1:     1  2   F 11.0           3
2:     2  6   F 15.0           5
3:     2  7   F  9.0           5
4:     2  8   F  7.4           5
5:     3 12   F 21.0           9

Data

> dput(df)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
    group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3), sex = c("M",
    "F", "M", "M", "M", "F", "F", "F", "M", "M", "M", "F"), time = c(10,
    11, 11.5, 13, 13.2, 15, 9, 7.4, 18, 12, 34.5, 21)), class = "data.frame", row.names = c(NA,
-12L))

Find the closest value for a certain year in R

Here are three approaches. The first one is the clearest as it shows that the problem is really an aggregated and filtered self-join and directly models this and automatically handles the edge case mentioned in the comments without additional code. The second one uses a lapply loop to get the desired effect but it involves more tedious manipulation although it does have the advantage of zero package dependencies. The last one gets around the fact that dplyr lacks complex self joins by performing a left join twice.

1) sqldf Using DF defined reproducibly in the Note at the end perform a self join such that the difference in years is -2, -1, 1 or 2 and the iso3 codes are the same and cata10 is not NA in matching instance and among those rows we use min(...) to find the row having the minimum absolute difference in the year. This uses the fact that SQLite has the feature that min(...) will cause the entire row to be returned that satisfies the minimizing condition. Finally take only the 2012 and 2017 rows. The ability of SQL to directly model the constraints using a complex join allows us to directly model the requirements into code.

library(sqldf)

sqldf("select 
      a.iso3year iso3year_UHC, 
      a.UHC, 
      substr(b.iso3year, 5, 8) year_cata, 
      b.cata10, 
      substr(a.iso3year, 5, 8) year, 
      min(abs(substr(a.iso3year, 5, 8) - substr(b.iso3year, 5, 8))) min_value 
    from DF a  
    left join DF b on year - year_cata in (-2, -1, 1, 2) and
      substr(a.iso3year, 1, 3) = substr(b.iso3year, 1, 3) and
      b.cata10 is not null
    group by a.iso3year
    having year in ('2012', '2017')")[1:4]

giving:

  iso3year_UHC       UHC year_cata    cata10
1     AFG 2012 0.3468012      2013 14.631331
2     AFG 2017 0.3948606      2016  4.837534
3     AGO 2012 0.3400455      2011 12.379809
4     AGO 2017 0.3764945      2015 16.902584

2) Base R This solution uses only base R. We first create year and iso variables by breaking up the iso3year into two parts. ix is an index into DF giving the rows having 2012 or 2017 as their year. For each of those rows we find the nearest year having a cata10 value and create a row of the output data frame which lapply returns as a list of rows, L. Finally we rbind those rows together. This is not as straight forward as (1) but does have the advantage of no package dependencies.

to.year <- function(x) as.numeric(substr(x, 5, 8))
year <- to.year(DF$iso3year)
iso <- substr(DF$iso3year, 1, 3)
ix <- which(year %in% c(2012, 2017))
L <- lapply(ix, function(i) {
  DF0 <- na.omit(DF[iso[i] == iso & (year[i] - year) %in% c(-2, -1, 1, 2), ])
  if (nrow(DF0)) {
    with(DF0[which.min(abs(to.year(DF0$iso3year) - year[i])), c("iso3year", "cata10")], 
      data.frame(iso3year_UHC = DF$iso3year[i], 
               UHC = DF$UHC[i], 
               year_cata = as.numeric(substr(iso3year, 5, 8)), 
               cata10))
  } else {
      data.frame(iso3year_UHC = DF$iso3year[i], 
               UHC = DF$UHC[i], 
               year_cata = NA,
               cata10 = NA)
  }
})
do.call("rbind", L)

giving:

  iso3year_UHC       UHC year_cata    cata10
1     AFG 2012 0.3468012      2013 14.631331
2     AFG 2017 0.3948606      2016  4.837534
3     AGO 2012 0.3400455      2011 12.379809
4     AGO 2017 0.3764945      2015 16.902584

3) dplyr/tidyr

First separate iso3year into iso and year columns giving DF2. Then pick out the 2012 and 2017 rows giving DF3. Now left join DF3 to DF2 using iso and get those rows for cata10 in the joined instance that are not NA and the absolute difference in years between the two joined data frames is 1 or 2. Then use slice to pick out the row having least distance in years and select out the desired columns giving DF4 Finally left join DF3 with DF4 which will fill out any rows for which there was no match.

library(dplyr)
library(tidyr)

DF2 <- DF %>%
  separate(iso3year, c("iso", "year"), remove = FALSE, convert = TRUE)

DF3 <- DF2 %>%
  filter(year %in% c(2012, 2017))

DF4 <- DF3 %>%
  left_join(DF2, "iso") %>%
  drop_na(cata10.y) %>%
  filter(abs(year.x - year.y) %in% 1:2) %>%
  group_by(iso3year.x) %>%
  slice(which.min(abs(year.x - year.y))) %>%
  ungroup %>%
  select(iso3year = iso3year.x, UHC = UHC.x, year_cata = year.y, cata10 = cata10.y)

DF3 %>% 
  select(iso3year, UHC) %>%
  left_join(DF4,  c("iso3year", "UHC"))

giving:

# A tibble: 4 x 4
  iso3year   UHC year_cata cata10
  <chr>    <dbl>     <int>  <dbl>
1 AFG 2012 0.347      2013  14.6 
2 AFG 2017 0.395      2016   4.84
3 AGO 2012 0.340      2011  12.4 
4 AGO 2017 0.376      2015  16.9

Note

Lines <- "iso3year    UHC         cata10
AFG 2010    0.3551409   NA
AFG 2011    0.3496452   NA
AFG 2012    0.3468012   NA
AFG 2013    0.3567721   14.631331
AFG 2014    0.3647436   NA
AFG 2015    0.3717983   NA
AFG 2016    0.3855273   4.837534
AFG 2017    0.3948606   NA
AGO 2011    0.3250651   12.379809
AGO 2012    0.3400455   NA
AGO 2013    0.3397722   NA
AGO 2014    0.3385741   NA
AGO 2015    0.3521086   16.902584
AGO 2016    0.3636765   NA
AGO 2017    0.3764945   NA"
DF <- read.csv(text = gsub("  +", ",", Lines), as.is = TRUE)

Find closest value between vectors in data frame by group

Could go for:

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(
    closest_visit = case_when(
      visit_id == true_visit ~ true_visit,
      TRUE ~ true_visit[sapply(visit_id, 
                               function(x) which.min(abs(x - true_visit)))]
      )
    )

Output:

# A tibble: 9 x 4
# Groups:   id [3]
  id    visit_id true_visit closest_visit
  <chr>    <dbl>      <dbl>         <dbl>
1 a            0         NA             3
2 a            5          3             3
3 a           10         NA             3
4 b            0          0             0
5 b            5          5             5
6 b           10         10            10
7 c            0          1             1
8 c            5          7             7
9 c           10         NA             7

Group By and Locate the Closest Number

We can use a rolling join after creating the combination with unique elements of 'expiration' from second dataset

library(data.table)
library(tidyr)
df1N <- crossing(df1, expiration = unique(df2$expiration))
setDT(df2)[, Price := strike][df1N, on = .(ticker, expiration, Price), roll = -Inf]
#    ticker expiration strike Price
#1:    SPY        621    205   200
#2:    SPY        719    205   200
#3:   AAPL        621    100   100
#4:   AAPL        719    100   100

Or do a full_join and then slice based on the minimum absolute difference between the 'Price' and 'strike' column after grouping by 'ticker', 'expiration'

library(dplyr)
full_join(df1, df2) %>% 
    group_by(ticker, expiration) %>% 
    slice(which.min(abs(Price - strike)))
# A tibble: 4 x 4
# Groups:   ticker, expiration [4]
#  ticker Price expiration strike
#  <fct>  <dbl>      <dbl>  <dbl>
#1 AAPL     100        621    100
#2 AAPL     100        719    100
#3 SPY      200        621    205
#4 SPY      200        719    205

Find closest match, then next closest, between groups until a specified number of matches has been made

I edit the code above that produces df.result by removing the line top_n(-5, DIFF) %>%. Now res1 contains all matches of Sample.x and Sample.y.

Then I used the res1 in the code below. This probably is not perfect, but what it does is finds the closest Sample.y match for the first entry of Sample.x. Then both of these Samples are filtered from the dataframe. The matching repeats until matches are found for every unique value of Sample.y. The result might differ depending which match was made first.

  fun <- function(df) {
  HowMany <- length(unique(df$Sample.y))
  i <- 1
  MyList_FF <- list()
  df_f <- df
  while (i <= HowMany){
    res1 <- df_f %>%
      group_by(grp, VAR, Sample.x) %>%
      filter(DIFF == min(DIFF)) %>%
      ungroup() %>%
      mutate(Rank1 = dense_rank(DIFF))

    res2 <- res1 %>% group_by(grp, VAR) %>% filter(rank(Rank1, ties.method="first")==1)

    SY <- as.numeric(res2$Sample.y)
    SX <- as.numeric(res2$Sample.x)
    res3 <- df_f %>% filter(Sample.y != SY) # filter Sample.y
    res4 <- res3 %>% filter(Sample.x != SX) # filter Sample.x
    df_f <- res4

    MyList_FF[[i]] <- res2

    i <- i + 1
  }
  do.call("rbind", MyList_FF) # https://stackoverflow.com/a/55542822/1670053
}

df <- res1
MyResult <- df %>%
  dplyr::group_split(grp, VAR) %>%
  map_df(fun)

Closest value to a specific column in R

Use max.col(-abs(data[, 3] - data[, -3])) to find the column positions of the closest values and use this result as part of a matrix to extract desired values from your data. The matrix is returned by cbind

col <- 3
data[, -col][cbind(1:nrow(data),
                   max.col(-abs(data[, col] - data[, -col])))]
#[1] 24 30 20

Find Value Closest to X by Group in Dplyr