Return Rows Establishing a "Closest Value To" in R

Find the closest value above the current row that meets a condition, and apply a function to it, for each group

df %>% 
  group_by(name) %>%
  mutate(t = na_if(lag(value * (tag == 'Y' & !is.na(code))), 0)) %>%
  fill(t) %>%
  mutate(results = t * value)

# A tibble: 8 x 7
# Groups:   name [2]
  name       id tag    code value     t results
  <chr>   <int> <chr> <dbl> <dbl> <dbl>   <dbl>
1 apples      1 X         1     1    NA      NA
2 apples      2 Y         1    11    NA      NA
3 apples      3 Y        NA     4    11      44
4 apples      4 X         1     3    11      33
5 oranges     5 X        NA     9    NA      NA
6 oranges     6 Y         1     5    NA      NA
7 oranges     7 X        NA     7     5      35
8 oranges     8 X        NA     8     5      40
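
Data

The question's input is not reproduced in the answer, but a data frame reconstructed from the printed tibble above (the pipeline needs dplyr and tidyr loaded) would be:

library(dplyr)
library(tidyr)

# reconstructed from the output shown above
df <- tibble(
  name  = rep(c("apples", "oranges"), each = 4),
  id    = 1:8,
  tag   = c("X", "Y", "Y", "X", "X", "Y", "X", "X"),
  code  = c(1, 1, NA, 1, NA, 1, NA, NA),
  value = c(1, 11, 4, 3, 9, 5, 7, 8)
)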

Closest value from all previous rows in R

You can try the code below using dist + apply. dist computes the pairwise absolute differences of Revenue; setting the diagonal to Inf and dividing by upper.tri(m) turns the diagonal and the lower triangle into Inf, so for each column which.min only considers rows that come before it. (The first row has no previous rows, so it falls back to its own value.)

transform(
  df,
  close_prev = Revenue[apply(`diag<-`(m <- as.matrix(dist(Revenue)), Inf) / upper.tri(m), 2, which.min)]
)

which gives

   id Revenue close_prev
1   1       4          4
2   2       7          4
3   3       1          4
4   4       2          1
5   5      13          7
6   6      19         13
7   7      11         13
8   8      17         19
9   9      14         13
10 10       3          4
11 11      18         19
12 12       5          4
13 13       9          7
14 14      16         17
15 15       6          7
16 16      15         14
17 17      12         13
18 18      10         11
19 19      20         19
20 20       8          7
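
Data

The df used above is not shown in the answer; reconstructed from the output, it would be:

# reconstructed from the id and Revenue columns printed above
df <- data.frame(
  id = 1:20,
  Revenue = c(4, 7, 1, 2, 13, 19, 11, 17, 14, 3, 18, 5, 9, 16, 6, 15, 12, 10, 20, 8)
)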

How to find the closest value and return the value of the other column?

Or, following what you tried already (using dfdf and index as defined in the side note below):

dfdf$b[which.min(abs(index - dfdf$a))]
# [1] 300

As a side note (not sure what your outcome should be if there are two matches):

dfdf <- data.frame(a = c(80, 90, 105, 105, 120),
                   b = c(500, 400, 300, 200, 100))
index <- 105

dfdf$b[which.min(abs(index - dfdf$a))]
# [1] 300

dfdf[findInterval(index, dfdf$a),"b"]
# [1] 200

One more fun example:

dfdf <- data.frame(a = c(80, 90, 100, 105, 120),
                   b = c(500, 400, 300, 200, 100))
index <- 95

dfdf$b[which.min(abs(index - dfdf$a))]
# [1] 400

dfdf[findInterval(index, dfdf$a),"b"]
# [1] 400

Find the closest value for a certain year in R

Here are three approaches. The first is the clearest, as it shows that the problem is really an aggregated and filtered self-join, models this directly, and automatically handles the edge case mentioned in the comments without additional code. The second uses an lapply loop to get the desired effect; it involves more tedious manipulation but has the advantage of zero package dependencies. The last one works around the fact that dplyr lacks complex self-joins by performing a left join twice.

1) sqldf Using DF, defined reproducibly in the Note at the end, perform a self-join such that the difference in years is -2, -1, 1 or 2, the iso3 codes are the same, and cata10 is not NA in the matching instance; among those rows use min(...) to find the row having the minimum absolute difference in year. This relies on the SQLite feature that min(...) causes the entire row satisfying the minimizing condition to be returned. Finally, take only the 2012 and 2017 rows. SQL's ability to model the constraints directly as a complex join lets us translate the requirements almost literally into code.

library(sqldf)

sqldf("select
a.iso3year iso3year_UHC,
a.UHC,
substr(b.iso3year, 5, 8) year_cata,
b.cata10,
substr(a.iso3year, 5, 8) year,
min(abs(substr(a.iso3year, 5, 8) - substr(b.iso3year, 5, 8))) min_value
from DF a
left join DF b on year - year_cata in (-2, -1, 1, 2) and
substr(a.iso3year, 1, 3) = substr(b.iso3year, 1, 3) and
b.cata10 is not null
group by a.iso3year
having year in ('2012', '2017')")[1:4]

giving:

  iso3year_UHC       UHC year_cata    cata10
1     AFG 2012 0.3468012      2013 14.631331
2     AFG 2017 0.3948606      2016  4.837534
3     AGO 2012 0.3400455      2011 12.379809
4     AGO 2017 0.3764945      2015 16.902584

2) Base R This solution uses only base R. We first create year and iso variables by splitting iso3year into its two parts. ix is an index into DF giving the rows having 2012 or 2017 as their year. For each of those rows we find the nearest year having a cata10 value and create a row of the output data frame; lapply returns these as a list of rows, L. Finally we rbind those rows together. This is not as straightforward as (1) but does have the advantage of no package dependencies.

to.year <- function(x) as.numeric(substr(x, 5, 8))

year <- to.year(DF$iso3year)
iso <- substr(DF$iso3year, 1, 3)
ix <- which(year %in% c(2012, 2017))
L <- lapply(ix, function(i) {
  DF0 <- na.omit(DF[iso[i] == iso & (year[i] - year) %in% c(-2, -1, 1, 2), ])
  if (nrow(DF0)) {
    with(DF0[which.min(abs(to.year(DF0$iso3year) - year[i])), c("iso3year", "cata10")],
         data.frame(iso3year_UHC = DF$iso3year[i],
                    UHC = DF$UHC[i],
                    year_cata = as.numeric(substr(iso3year, 5, 8)),
                    cata10))
  } else {
    data.frame(iso3year_UHC = DF$iso3year[i],
               UHC = DF$UHC[i],
               year_cata = NA,
               cata10 = NA)
  }
})
do.call("rbind", L)

giving:

  iso3year_UHC       UHC year_cata    cata10
1     AFG 2012 0.3468012      2013 14.631331
2     AFG 2017 0.3948606      2016  4.837534
3     AGO 2012 0.3400455      2011 12.379809
4     AGO 2017 0.3764945      2015 16.902584

3) dplyr/tidyr

First separate iso3year into iso and year columns, giving DF2. Then pick out the 2012 and 2017 rows, giving DF3. Now left join DF3 to DF2 on iso and keep the rows where cata10 in the joined instance is not NA and the absolute difference in years between the two joined data frames is 1 or 2. Then use slice to pick out the row with the least distance in years and select out the desired columns, giving DF4. Finally, left join DF3 with DF4, which fills out any rows for which there was no match.

library(dplyr)
library(tidyr)

DF2 <- DF %>%
  separate(iso3year, c("iso", "year"), remove = FALSE, convert = TRUE)

DF3 <- DF2 %>%
  filter(year %in% c(2012, 2017))

DF4 <- DF3 %>%
  left_join(DF2, "iso") %>%
  drop_na(cata10.y) %>%
  filter(abs(year.x - year.y) %in% 1:2) %>%
  group_by(iso3year.x) %>%
  slice(which.min(abs(year.x - year.y))) %>%
  ungroup %>%
  select(iso3year = iso3year.x, UHC = UHC.x, year_cata = year.y, cata10 = cata10.y)

DF3 %>%
  select(iso3year, UHC) %>%
  left_join(DF4, c("iso3year", "UHC"))

giving:

# A tibble: 4 x 4
  iso3year   UHC year_cata cata10
  <chr>    <dbl>     <int>  <dbl>
1 AFG 2012 0.347      2013  14.6 
2 AFG 2017 0.395      2016   4.84
3 AGO 2012 0.340      2011  12.4 
4 AGO 2017 0.376      2015  16.9 

Note

Lines <- "iso3year    UHC         cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2016 0.3855273 4.837534
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA"
DF <- read.csv(text = gsub(" +", ",", Lines), as.is = TRUE)

R - Obtain the closest smaller and bigger values compared to a reference value

You can calculate the differences of the two columns from the given reference values, then use filter to get the desired rows: min of the positive differences gives the closest value above the reference, and max of the negative differences gives the closest value below it.

library(tidyverse)

my_df %>%
  mutate(Diff_date = Date - given_date,
         Diff_dummy = Dummy - given_Dummy) %>%
  filter((Diff_date == min(Diff_date[Diff_date > 0]) |
            Diff_date == max(Diff_date[Diff_date < 0])) &
           (Diff_dummy == min(Diff_dummy[Diff_dummy > 0]) |
              Diff_dummy == max(Diff_dummy[Diff_dummy < 0])))

        Date Dummy      Value given_date given_Dummy Diff_date Diff_dummy
1 2022-01-09  0.15 -0.7267048 2022-01-10        0.17   -1 days      -0.02
2 2022-01-09  0.20 -1.3682810 2022-01-10        0.17   -1 days       0.03
3 2022-01-11  0.15  1.5757275 2022-01-10        0.17    1 days      -0.02
4 2022-01-11  0.20  0.6428993 2022-01-10        0.17    1 days       0.03

Data

my_df <- structure(list(Date = structure(c(18993, 18993, 18993, 18993, 
18993, 18993, 18993, 18993, 18993, 18995, 18995, 18995, 18995,
18995, 18995, 18995, 18995, 18995, 18997, 18997, 18997, 18997,
18997, 18997, 18997, 18997, 18997, 18999, 18999, 18999, 18999,
18999, 18999, 18999, 18999, 18999, 19001, 19001, 19001, 19001,
19001, 19001, 19001, 19001, 19001, 19003, 19003, 19003, 19003,
19003, 19003, 19003, 19003, 19003, 19005, 19005, 19005, 19005,
19005, 19005, 19005, 19005, 19005, 19007, 19007, 19007, 19007,
19007, 19007, 19007, 19007, 19007, 19009, 19009, 19009, 19009,
19009, 19009, 19009, 19009, 19009, 19011, 19011, 19011, 19011,
19011, 19011, 19011, 19011, 19011), class = "Date"), Dummy = c(-0.2,
-0.15, -0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1,
-0.05, 0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0,
0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1,
0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2,
-0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15,
-0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1, -0.05,
0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1,
0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2),
Value = c(1.37095844714667, -0.564698171396089, 0.363128411337339,
0.63286260496104, 0.404268323140999, -0.106124516091484,
1.51152199743894, -0.0946590384130976, 2.01842371387704,
-0.062714099052421, 1.30486965422349, 2.28664539270111, -1.38886070111234,
-0.278788766817371, -0.133321336393658, 0.635950398070074,
-0.284252921416072, -2.65645542090478, -2.44046692857552,
1.32011334573019, -0.306638594078475, -1.78130843398, -0.171917355759621,
1.2146746991726, 1.89519346126497, -0.4304691316062, -0.25726938276893,
-1.76316308519478, 0.460097354831271, -0.639994875960119,
0.455450123241219, 0.704837337228819, 1.03510352196992, -0.608926375407211,
0.50495512329797, -1.71700867907334, -0.784459008379496,
-0.850907594176518, -2.41420764994663, 0.0361226068922556,
0.205998600200254, -0.361057298548666, 0.758163235699517,
-0.726704827076575, -1.36828104441929, 0.432818025888717,
-0.811393176186672, 1.44410126172125, -0.431446202613345,
0.655647883402207, 0.321925265203947, -0.783838940880375,
1.57572751979198, 0.642899305717316, 0.0897606465996057,
0.276550747291463, 0.679288816055271, 0.0898328865790817,
-2.99309008315293, 0.284882953530659, -0.367234642740975,
0.185230564865609, 0.581823727365507, 1.39973682729268, -0.727292059474465,
1.30254263204414, 0.335848119752074, 1.03850609869762, 0.920728568290646,
0.720878162866862, -1.04311893856785, -0.0901863866107067,
0.623518161999544, -0.953523357772344, -0.542828814573857,
0.580996497681682, 0.768178737834591, 0.463767588540167,
-0.885776297409679, -1.09978089864786, 1.51270700980493,
0.257921437532031, 0.0884402291595864, -0.120896537539089,
-1.19432889516053, 0.611996898040387, -0.217139845746521,
-0.182756706331922, 0.93334632857116, 0.821773110508249),
given_date = structure(c(19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002), class = "Date"), given_Dummy = c(0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17)), class = "data.frame", row.names = c(NA,
-90L))

Closest value and data frame index of all data frame elements of a list

Here is a dplyr approach. We can generate the list.index and line.number.in.df for each dataframe and then bind_rows them together. Next, slice the rows where C2 contains the closest value for each number in that vector.

library(dplyr)

test <- list(structure(list(C1 = c(0.2, 0.4, 0.5), C2 = c(2, 3.5, 3.7
), C3 = c(0.3, 4, 5)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(C1 = c(0.1, 0.3, 0.6), C2 = c(3.9, 4.3,
8), C3 = c(3, 5.2, 10)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(C1 = c(0.4, 0.55, 0.8), C2 = c(8.9, 10.3,
14), C3 = c(7, 8.4, 11)), class = "data.frame", row.names = c(NA,
-3L)))

vector <- c(3, 14.4, 7, 0)

test %>%
  lapply(tibble::rowid_to_column, "line.number.in.df") %>%
  bind_rows(.id = "list.index") %>%
  slice(vapply(vector, \(x) which.min(abs(x - C2)), integer(1L)))  # \(x) lambda shorthand needs R >= 4.1

Output is

  list.index line.number.in.df  C1   C2   C3
1          1                 2 0.4  3.5  4.0
2          3                 3 0.8 14.0 11.0
3          2                 3 0.6  8.0 10.0
4          1                 1 0.2  2.0  0.3
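
If you also want to see which number from vector each row matched, one possible extension (not part of the original answer; the column name target is just illustrative) is to append it afterwards:

test %>%
  lapply(tibble::rowid_to_column, "line.number.in.df") %>%
  bind_rows(.id = "list.index") %>%
  slice(vapply(vector, \(x) which.min(abs(x - C2)), integer(1L))) %>%
  mutate(target = vector)   # one looked-up value per sliced row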

How would you look up the next-closest value?

1) Base R - sapply This uses base R. For each element of dt in data.df it finds all datetimes in lookup.df that are greater than or equal to it and fall on the same date, then returns the index of the first one. Finally it binds data.df together with the rows of lookup.df at those indices.

ix <- sapply(data.df$dt, function(dt) with(lookup.df,
  which(ldt >= dt & as.Date(ldt, tz = "") == as.Date(dt, tz = ""))[1]
))
res <- cbind(data.df, lookup.df[ix, ])
rownames(res) <- NULL

giving:

> res
                   dt v1                 ldt lv
1 2020-01-08 11:30:00  1 2020-01-08 11:30:00  2
2 2020-01-10 11:30:00  2 2020-01-10 11:31:00  5
3 2020-01-11 12:30:00  3                <NA> NA

2) Base R - merge This is an alternative base R approach. Append a date column to each input data frame and then merge the two on that column. Remove any row for which the lookup.df date/time is less than the data.df date/time and then take the first row of each set of rows derived from the same original data.df row. That gets the matches, but it misses the rows that have no match at all, so perform a second merge to bring those back.

data.df$date <- as.Date(data.df$dt, tz = "")
lookup.df$date <- as.Date(lookup.df$ldt, tz = "")

m <- merge(data.df, lookup.df, by = "date", all.x = TRUE, all.y = FALSE)
m <- subset(m, dt <= ldt)
m <- m[!duplicated(m[1:3]), ]
merge(data.df[-3], m[-1], by = c("dt", "v1"), all.x = TRUE, all.y = FALSE)

giving:

                   dt v1                 ldt lv
1 2020-01-08 11:30:00  1 2020-01-08 11:30:00  2
2 2020-01-10 11:30:00  2 2020-01-10 11:31:00  5
3 2020-01-11 12:30:00  3                <NA> NA

3) SQL Although the question asked for a base R solution, an SQL solution has been added here as well because it provides a particularly straightforward translation of the problem into code, as a join with a complex condition. It performs a left join on the indicated condition and takes the minimum ldt found over all rows derived from the same row in data.df.

library(sqldf)

data.df$date <- as.Date(data.df$dt, tz = "")
lookup.df$date <- as.Date(lookup.df$ldt, tz = "")

sqldf("select D.dt, D.v1, min(L.ldt) as ldt, L.lv
from [data.df] D left join [lookup.df] L
on D.dt <= L.ldt and D.date == L.date
group by D.rowid")

giving:

                   dt v1                 ldt lv
1 2020-01-08 11:30:00  1 2020-01-08 11:30:00  2
2 2020-01-10 11:30:00  2 2020-01-10 11:31:00  5
3 2020-01-11 12:30:00  3                <NA> NA

Note

There was a problem in the question with fancy quotes that R cannot read, so we used this as the input:

data.df <- data.frame(dt = as.POSIXct(c('2020-01-08 11:30:00',
                                        '2020-01-10 11:30:00',
                                        '2020-01-11 12:30:00')),
                      v1 = c(1, 2, 3))

lookup.df <- data.frame(ldt = as.POSIXct(c('2020-01-08 11:29:00',
                                           '2020-01-08 11:30:00',
                                           '2020-01-08 11:31:00',
                                           '2020-01-10 10:30:00',
                                           '2020-01-10 11:31:00',
                                           '2020-01-11 11:30:00',
                                           '2020-01-12 11:30:00')),
                        lv = 1:7)

Find Closest Matching Value in Data frame and Return Index

This can be done with findInterval, but make sure that the dataset is created correctly: cbind returns a matrix, and a matrix can have only a single class. Wrapping it with data.frame propagates that single class to every column, as either factor or character depending on stringsAsFactors = TRUE/FALSE (if there is at least one character element).

Order the 'Grades' dataset by the 'score' column, apply findInterval to get the index of the nearest matching value, and use that to extract the 'grade':

Scores <- data.frame(a, b, c, score)
Grades <- data.frame(score, grade)
Grades1 <- Grades[order(Grades$score), ]
Scores$Grade <- Grades1$grade[findInterval(Scores$score, Grades1$score) + 1]
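
As a quick illustration of what findInterval returns on a sorted vector (these cut scores are made up, not the question's actual Grades):

breaks <- c(5, 10, 15)                    # hypothetical sorted cut scores
findInterval(c(6.9, 10.2, 13.5), breaks)  # index of the largest break <= each score
# [1] 1 2 2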

Or another option is a rolling join:

library(data.table)
setDT(Scores)[Grades, Grade := grade, on = .(score), roll = Inf]
Scores
#     a   b   c score Grade
#1: 1.2 2.3 3.4   6.9     C
#2: 2.3 3.4 4.5  10.2     B
#3: 3.4 4.5 5.6  13.5     A

Closest value to a specific column in R

Use max.col(-abs(data[, 3] - data[, -3])) to find, for each row, the column position of the closest value among the other columns, then use this result as part of an index matrix (built with cbind) to extract the desired values from your data.

col <- 3
data[, -col][cbind(1:nrow(data),
                   max.col(-abs(data[, col] - data[, -col])))]
#[1] 24 30 20
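
As a self-contained sketch, here is a small hypothetical matrix (not the asker's data) that happens to reproduce the output shown above:

# hypothetical data: column 3 is the reference column
data <- rbind(c(10, 24, 25, 40),
              c(30, 55, 31, 90),
              c(20, 77, 19, 60))
col <- 3
idx <- max.col(-abs(data[, col] - data[, -col]))  # per-row position of the closest value among the other columns
data[, -col][cbind(1:nrow(data), idx)]
# [1] 24 30 20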

