Find closest value above current row that meets condition and apply function for each group
df %>%
group_by(name) %>%
mutate(t = na_if(lag(value * (tag == 'Y' & !is.na(code))), 0)) %>%
fill(t) %>%
mutate(results = t * value)
# A tibble: 8 x 7
# Groups: name [2]
name id tag code value t results
<chr> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 apples 1 X 1 1 NA NA
2 apples 2 Y 1 11 NA NA
3 apples 3 Y NA 4 11 44
4 apples 4 X 1 3 11 33
5 oranges 5 X NA 9 NA NA
6 oranges 6 Y 1 5 NA NA
7 oranges 7 X NA 7 5 35
8 oranges 8 X NA 8 5 40
closest value from all previous rows r
You can try the code below using dist
+ apply
transform(
df,
close_prev = Revenue[apply(`diag<-`(m <- as.matrix(dist(Revenue)), Inf) / upper.tri(m), 2, which.min)]
)
which gives
id Revenue close_prev
1 1 4 4
2 2 7 4
3 3 1 4
4 4 2 1
5 5 13 7
6 6 19 13
7 7 11 13
8 8 17 19
9 9 14 13
10 10 3 4
11 11 18 19
12 12 5 4
13 13 9 7
14 14 16 17
15 15 6 7
16 16 15 14
17 17 12 13
18 18 10 11
19 19 20 19
20 20 8 7
How to find the closest value and return the value of the other column?
Or following what you tried already:
dfdf$b[which.min(abs(index - dfdf$a))]
# [1] 300
As a side note (not sure what your outcome should be if there are two matches):
dfdf<-data.frame(a= c(80,90,105,105,120),
b= c(500,400,300,200,100))
index= 105
dfdf$b[which.min(abs(index - dfdf$a))]
# [1] 300
dfdf[findInterval(index, dfdf$a),"b"]
# [1] 200
One more fun example:
dfdf<-data.frame(a= c(80,90,100,105,120),
b= c(500,400,300,200,100))
index= 95
dfdf$b[which.min(abs(index - dfdf$a))]
# [1] 400
dfdf[findInterval(index, dfdf$a),"b"]
# [1] 400
Find the closest value for a certain year in R
Here are three approaches. The first one is the clearest as it shows that the problem is really an aggregated and filtered self-join and directly models this and automatically handles the edge case mentioned in the comments without additional code. The second one uses a lapply
loop to get the desired effect but it involves more tedious manipulation although it does have the advantage of zero package dependencies. The last one gets around the fact that dplyr lacks complex self joins by performing a left join twice.
1) sqldf Using DF
defined reproducibly in the Note at the end perform a self join such that the difference in years is -2, -1, 1 or 2 and the iso3 codes are the same and cata10 is not NA in matching instance and among those rows we use min(...)
to find the row having the minimum absolute difference in the year. This uses the fact that SQLite has the feature that min(...)
will cause the entire row to be returned that satisfies the minimizing condition. Finally take only the 2012 and 2017 rows. The ability of SQL to directly model the constraints using a complex join allows us to directly model the requirements into code.
library(sqldf)
sqldf("select
a.iso3year iso3year_UHC,
a.UHC,
substr(b.iso3year, 5, 8) year_cata,
b.cata10,
substr(a.iso3year, 5, 8) year,
min(abs(substr(a.iso3year, 5, 8) - substr(b.iso3year, 5, 8))) min_value
from DF a
left join DF b on year - year_cata in (-2, -1, 1, 2) and
substr(a.iso3year, 1, 3) = substr(b.iso3year, 1, 3) and
b.cata10 is not null
group by a.iso3year
having year in ('2012', '2017')")[1:4]
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
2) Base R This solution uses only base R. We first create year
and iso
variables by breaking up the iso3year
into two parts. ix
is an index into DF
giving the rows having 2012 or 2017 as their year. For each of those rows we find the nearest year having a cata10 value and create a row of the output data frame which lapply
returns as a list of rows, L
. Finally we rbind
those rows together. This is not as straight forward as (1) but does have the advantage of no package dependencies.
to.year <- function(x) as.numeric(substr(x, 5, 8))
year <- to.year(DF$iso3year)
iso <- substr(DF$iso3year, 1, 3)
ix <- which(year %in% c(2012, 2017))
L <- lapply(ix, function(i) {
DF0 <- na.omit(DF[iso[i] == iso & (year[i] - year) %in% c(-2, -1, 1, 2), ])
if (nrow(DF0)) {
with(DF0[which.min(abs(to.year(DF0$iso3year) - year[i])), c("iso3year", "cata10")],
data.frame(iso3year_UHC = DF$iso3year[i],
UHC = DF$UHC[i],
year_cata = as.numeric(substr(iso3year, 5, 8)),
cata10))
} else {
data.frame(iso3year_UHC = DF$iso3year[i],
UHC = DF$UHC[i],
year_cata = NA,
cata10 = NA)
}
})
do.call("rbind", L)
giving:
iso3year_UHC UHC year_cata cata10
1 AFG 2012 0.3468012 2013 14.631331
2 AFG 2017 0.3948606 2016 4.837534
3 AGO 2012 0.3400455 2011 12.379809
4 AGO 2017 0.3764945 2015 16.902584
3) dplyr/tidyr
First separate iso3year
into iso
and year
columns giving DF2
. Then pick out the 2012 and 2017 rows giving DF3
. Now left join DF3
to DF2
using iso
and get those rows for cata10
in the joined instance that are not NA and the absolute difference in years between the two joined data frames is 1 or 2. Then use slice
to pick out the row having least distance in years and select
out the desired columns giving DF4
Finally left join DF3
with DF4
which will fill out any rows for which there was no match.
library(dplyr)
library(tidyr)
DF2 <- DF %>%
separate(iso3year, c("iso", "year"), remove = FALSE, convert = TRUE)
DF3 <- DF2 %>%
filter(year %in% c(2012, 2017))
DF4 <- DF3 %>%
left_join(DF2, "iso") %>%
drop_na(cata10.y) %>%
filter(abs(year.x - year.y) %in% 1:2) %>%
group_by(iso3year.x) %>%
slice(which.min(abs(year.x - year.y))) %>%
ungroup %>%
select(iso3year = iso3year.x, UHC = UHC.x, year_cata = year.y, cata10 = cata10.y)
DF3 %>%
select(iso3year, UHC) %>%
left_join(DF4, c("iso3year", "UHC"))
giving:
# A tibble: 4 x 4
iso3year UHC year_cata cata10
<chr> <dbl> <int> <dbl>
1 AFG 2012 0.347 2013 14.6
2 AFG 2017 0.395 2016 4.84
3 AGO 2012 0.340 2011 12.4
4 AGO 2017 0.376 2015 16.9
Note
Lines <- "iso3year UHC cata10
AFG 2010 0.3551409 NA
AFG 2011 0.3496452 NA
AFG 2012 0.3468012 NA
AFG 2013 0.3567721 14.631331
AFG 2014 0.3647436 NA
AFG 2015 0.3717983 NA
AFG 2016 0.3855273 4.837534
AFG 2017 0.3948606 NA
AGO 2011 0.3250651 12.379809
AGO 2012 0.3400455 NA
AGO 2013 0.3397722 NA
AGO 2014 0.3385741 NA
AGO 2015 0.3521086 16.902584
AGO 2016 0.3636765 NA
AGO 2017 0.3764945 NA"
DF <- read.csv(text = gsub(" +", ",", Lines), as.is = TRUE)
R - Obtain the closest smaller and bigger values compared to a reference value
You can calculate the differences of the two columns, then use filter
to get your desired rows. We use min
when the difference is larger than 0 and use max
when the difference is smaller than 0.
library(tidyverse)
my_df %>%
mutate(Diff_date = Date - given_date,
Diff_dummy = Dummy - given_Dummy) %>%
filter((Diff_date == min(Diff_date[Diff_date > 0]) |
Diff_date == max(Diff_date[Diff_date < 0])) &
(Diff_dummy == min(Diff_dummy[Diff_dummy > 0]) |
Diff_dummy == max(Diff_dummy[Diff_dummy < 0])))
Date Dummy Value given_date given_Dummy Diff_date Diff_dummy
1 2022-01-09 0.15 -0.7267048 2022-01-10 0.17 -1 days -0.02
2 2022-01-09 0.20 -1.3682810 2022-01-10 0.17 -1 days 0.03
3 2022-01-11 0.15 1.5757275 2022-01-10 0.17 1 days -0.02
4 2022-01-11 0.20 0.6428993 2022-01-10 0.17 1 days 0.03
Data
my_df <- structure(list(Date = structure(c(18993, 18993, 18993, 18993,
18993, 18993, 18993, 18993, 18993, 18995, 18995, 18995, 18995,
18995, 18995, 18995, 18995, 18995, 18997, 18997, 18997, 18997,
18997, 18997, 18997, 18997, 18997, 18999, 18999, 18999, 18999,
18999, 18999, 18999, 18999, 18999, 19001, 19001, 19001, 19001,
19001, 19001, 19001, 19001, 19001, 19003, 19003, 19003, 19003,
19003, 19003, 19003, 19003, 19003, 19005, 19005, 19005, 19005,
19005, 19005, 19005, 19005, 19005, 19007, 19007, 19007, 19007,
19007, 19007, 19007, 19007, 19007, 19009, 19009, 19009, 19009,
19009, 19009, 19009, 19009, 19009, 19011, 19011, 19011, 19011,
19011, 19011, 19011, 19011, 19011), class = "Date"), Dummy = c(-0.2,
-0.15, -0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1,
-0.05, 0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0,
0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1,
0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2,
-0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15,
-0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1, -0.05,
0, 0.05, 0.1, 0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1,
0.15, 0.2, -0.2, -0.15, -0.1, -0.05, 0, 0.05, 0.1, 0.15, 0.2),
Value = c(1.37095844714667, -0.564698171396089, 0.363128411337339,
0.63286260496104, 0.404268323140999, -0.106124516091484,
1.51152199743894, -0.0946590384130976, 2.01842371387704,
-0.062714099052421, 1.30486965422349, 2.28664539270111, -1.38886070111234,
-0.278788766817371, -0.133321336393658, 0.635950398070074,
-0.284252921416072, -2.65645542090478, -2.44046692857552,
1.32011334573019, -0.306638594078475, -1.78130843398, -0.171917355759621,
1.2146746991726, 1.89519346126497, -0.4304691316062, -0.25726938276893,
-1.76316308519478, 0.460097354831271, -0.639994875960119,
0.455450123241219, 0.704837337228819, 1.03510352196992, -0.608926375407211,
0.50495512329797, -1.71700867907334, -0.784459008379496,
-0.850907594176518, -2.41420764994663, 0.0361226068922556,
0.205998600200254, -0.361057298548666, 0.758163235699517,
-0.726704827076575, -1.36828104441929, 0.432818025888717,
-0.811393176186672, 1.44410126172125, -0.431446202613345,
0.655647883402207, 0.321925265203947, -0.783838940880375,
1.57572751979198, 0.642899305717316, 0.0897606465996057,
0.276550747291463, 0.679288816055271, 0.0898328865790817,
-2.99309008315293, 0.284882953530659, -0.367234642740975,
0.185230564865609, 0.581823727365507, 1.39973682729268, -0.727292059474465,
1.30254263204414, 0.335848119752074, 1.03850609869762, 0.920728568290646,
0.720878162866862, -1.04311893856785, -0.0901863866107067,
0.623518161999544, -0.953523357772344, -0.542828814573857,
0.580996497681682, 0.768178737834591, 0.463767588540167,
-0.885776297409679, -1.09978089864786, 1.51270700980493,
0.257921437532031, 0.0884402291595864, -0.120896537539089,
-1.19432889516053, 0.611996898040387, -0.217139845746521,
-0.182756706331922, 0.93334632857116, 0.821773110508249),
given_date = structure(c(19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002, 19002,
19002, 19002, 19002, 19002), class = "Date"), given_Dummy = c(0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17,
0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17)), class = "data.frame", row.names = c(NA,
-90L))
closest value and data frame index index of all data frame elements of a list
Here is a dplyr
approach. We can generate the list.index
and line.number.in.df
for each dataframe and then bind_rows
them together. Next, slice
the rows where C2 contains the closest value for each number in that vector.
library(dplyr)
test <- list(structure(list(C1 = c(0.2, 0.4, 0.5), C2 = c(2, 3.5, 3.7
), C3 = c(0.3, 4, 5)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(C1 = c(0.1, 0.3, 0.6), C2 = c(3.9, 4.3,
8), C3 = c(3, 5.2, 10)), class = "data.frame", row.names = c(NA,
-3L)), structure(list(C1 = c(0.4, 0.55, 0.8), C2 = c(8.9, 10.3,
14), C3 = c(7, 8.4, 11)), class = "data.frame", row.names = c(NA,
-3L)))
vector <- c(3, 14.4, 7, 0)
test %>%
lapply(tibble::rowid_to_column, "line.number.in.df") %>%
bind_rows(.id = "list.index") %>%
slice(vapply(vector, \(x) which.min(abs(x - C2)), integer(1L)))
Output is
list.index line.number.in.df C1 C2 C3
1 1 2 0.4 3.5 4.0
2 3 3 0.8 14.0 11.0
3 2 3 0.6 8.0 10.0
4 1 1 0.2 2.0 0.3
How would you lookup the next-closest value?
1) Base R - sapply This uses base R. For each component of dt
in data.df
it finds all datetimes greater than it in lookup.df
on the same date and then returns the index of the first. Finally it puts together data.df and the rows of those indexes of lookup.df
.
ix <- sapply(data.df$dt, function(dt) with(lookup.df,
which(ldt >= dt & as.Date(ldt, tz = "") == as.Date(dt, tz = ""))[1]
))
res <- cbind(data.df, lookup.df[ix, ])
rownames(res) <- NULL
giving:
> res
dt v1 ldt lv
1 2020-01-08 11:30:00 1 2020-01-08 11:30:00 2
2 2020-01-10 11:30:00 2 2020-01-10 11:31:00 5
3 2020-01-11 12:30:00 3 <NA> NA
2) Base R - merge This is an alternate base R approach. Append a date column to each input data frame and then merge the two by that column. Remove any row for which the lookup.df date/time is less than the data.df date/time and then take the first row of each set of rows derived from the same original data.df row. That will get the matches except that it will miss the rows that have no matches at all so perform a second merge to get those back.
data.df$date <- as.Date(data.df$dt, tz = "")
lookup.df$date <- as.Date(lookup.df$ldt, tz = "")
m <- merge(data.df, lookup.df, by = "date", all.x = TRUE, all.y = FALSE)
m <- subset(m, dt <= ldt)
m <- m[!duplicated(m[1:3]), ]
merge(data.df[-3], m[-1], by = c("dt", "v1"), all.x = TRUE, all.y = FALSE)
giving:
dt v1 ldt lv
1 2020-01-08 11:30:00 1 2020-01-08 11:30:00 2
2 2020-01-10 11:30:00 2 2020-01-10 11:31:00 5
3 2020-01-11 12:30:00 3 <NA> NA
3) SQL Although the question asked for a base R solution an sql solution has been added in addition here because it provides a particularly straight forward translation of the problem into code as a self join with a complex condition. It performs a left join on the indicated condition and takes the minimum ldt
found over all rows derived from the same row in data.df.
library(sqldf)
data.df$date <- as.Date(data.df$dt, tz = "")
lookup.df$date <- as.Date(lookup.df$ldt, tz = "")
sqldf("select D.dt, D.v1, min(L.ldt) as ldt, L.lv
from [data.df] D left join [lookup.df] L
on D.dt <= L.ldt and D.date == L.date
group by D.rowid")
giving:
dt v1 ldt lv
1 2020-01-08 11:30:00 1 2020-01-08 11:30:00 2
2 2020-01-10 11:30:00 2 2020-01-10 11:31:00 5
3 2020-01-11 12:30:00 3 <NA> NA
Note
There was a problem in the question with fancy quotes that R cannot read so we used this as the input:
data.df <- data.frame(dt = as.POSIXct(c('2020-01-08 11:30:00',
'2020-01-10 11:30:00', '2020-01-11 12:30:00')),
v1=c(1,2,3))
lookup.df <- data.frame(ldt = as.POSIXct(c('2020-01-08 11:29:00',
'2020-01-08 11:30:00', '2020-01-08 11:31:00', '2020-01-10 10:30:00',
'2020-01-10 11:31:00', '2020-01-11 11:30:00', '2020-01-12 11:30:00')),
lv = 1:7)
Find Closest Matching Value in Data frame and Return Index
This can be done with findInterval
, but make sure that the dataset is created correctly, cbind
returnss a matrix
and a matrix
can have only a single class. Wrapping with data.frame
propagate the same class to either factor
or character
depending on stringsAsFactors = TRUE/FALSE
(if there is at least one character
element).
Order the 'Grades' dataset by the 'score' column and apply findInterval
to get the index of nearest matching value and use that for extracting the 'grade'
Scores <- data.frame(a,b,c,score)
Grades <- data.frame(score,grade)
Grades1 <- Grades[order(Grades$score),]
Scores$Grade <- Grades1$grade[findInterval(Scores$score, Grades1$score) +1]
or another option is a rolling join
library(data.table)
setDT(Scores)[Grades, Grade := grade, on = .(score), roll = Inf]
Scores
# a b c score Grade
#1: 1.2 2.3 3.4 6.9 C
#2: 2.3 3.4 4.5 10.2 B
#3: 3.4 4.5 5.6 13.5 A
Closest value to a specific column in R
Use max.col(-abs(data[, 3] - data[, -3]))
to find the column positions of the closest values and use this result as part of a matrix to extract desired values from your data. The matrix is returned by cbind
col <- 3
data[, -col][cbind(1:nrow(data),
max.col(-abs(data[, col] - data[, -col])))]
#[1] 24 30 20
Related Topics
Arrange Within a Group with Dplyr
How to Flatten The Data of Different Data Types by Using Sparklyr Package
Creating Categorical Variables from Mutually Exclusive Dummy Variables
Ggplot: Recommended Colour Palettes Also Distinguishable for B&W Printing
Standard Deviation on Dataframe Does Not Work
Create New Variable by Multiple Conditions via Mutate Case_When
Xaringan Slide Separator Not Separating Slides
Tidyr Spread Function Generates Sparse Matrix When Compact Vector Expected
Is There an Equivalent in Ggplot to The Varwidth Option in Plot
How to Plot Multiple Lines in R
Function to Change Blanks to Na
Split Violin Plot with Ggplot2 with Quantiles
Using: = in Data.Table with Paste()
Remove Certain Words in String from Column in Dataframe in R