Fuzzy Left Join with R

fuzzy LEFT join with R

Voila :)

fuzzy_left_join(df1, df2, match_fun = ci_str_detect, by = c(col1 = "col4"))
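
The call above relies on ci_str_detect, which is not defined in the answer as quoted here. A minimal sketch of what such a helper might look like, assuming it is meant to be a case-insensitive, pairwise regex test (the body below is an assumption, written in base R so it needs no extra packages):

```r
# Hypothetical reconstruction: ci_str_detect is not defined in the quoted answer.
# fuzzy_left_join hands match_fun two equal-length vectors, so the helper
# returns one TRUE/FALSE per (x, y) pair.
ci_str_detect <- function(x, y) {
  vapply(
    seq_along(x),
    function(i) grepl(y[i], x[i], ignore.case = TRUE),
    logical(1)
  )
}

# Illustrative strings only; they are not taken from df1 or df2.
ci_hits <- ci_str_detect(c("Banana Shipping", "FedEx USA"), c("banana", "ups"))
print(ci_hits)
```

Note that y is treated as a regular expression here, so special characters in it would need escaping.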

Fuzzy Left Join exact + partial string match

The function passed as match_fun is not called on one combination at a time. It is called once with vectors covering all candidate combinations, so the detect function has to be vectorized:

detect <- function(x, y) {
  mapply(function(x, y) any(x == y), strsplit(x, '/'), strsplit(y, '/'))
}

and then try:

fuzzyjoin::fuzzy_left_join(x, y, by = c("X1" = "Y1", "X2" = "Y2", "Names"),
                           match_fun = list(`==`, `==`, detect))

# X1 X2 Names.x Y1 Y2 Names.y
# <chr> <chr> <chr> <chr> <chr> <chr>
#1 5000 a a/b/c 5000 a a/j/k
#2 6000 b d/e/f NA NA NA
#3 7000 c g/h/i NA NA NA
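
One thing worth knowing about detect as written: any(x == y) compares the split pieces position by position, so "a/b/c" matches "a/j/k" only because both start with "a". If the intent is "any piece in common, regardless of position" (an assumption, since the question is not quoted here), a variant using %in% might be closer. The name detect_any and the sample strings below are made up for illustration:

```r
# Positional comparison, as in the answer above.
detect <- function(x, y) {
  mapply(function(a, b) any(a == b), strsplit(x, "/"), strsplit(y, "/"),
         USE.NAMES = FALSE)
}

# Hypothetical variant: TRUE when the two strings share any piece at all.
detect_any <- function(x, y) {
  mapply(function(a, b) any(a %in% b), strsplit(x, "/"), strsplit(y, "/"),
         USE.NAMES = FALSE)
}

left  <- c("a/b/c", "d/e/f")
right <- c("j/a/k", "x/y/z")

positional <- detect(left, right)      # "a" is shared but sits at a different position
set_based  <- detect_any(left, right)
print(positional)
print(set_based)
```

Both functions take and return whole vectors, which is what match_fun expects.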

R fuzzy_left_join with time

I ran your code and it ran without error, but the result was all NAs. I changed the last two items of the match_fun list to

match_fun = list(`==`, `==`, `>=`, `<=`)

and got your desired result!

library(fuzzyjoin, quietly = TRUE); library(anytime, quietly = TRUE); library(hms, quietly = TRUE)
#> Warning: package 'fuzzyjoin' was built under R version 3.6.3
#> Warning: package 'anytime' was built under R version 3.6.3
calls.sample <- data.frame(ccode = c("MMM", "K", "A", "CAG", "PM"),
                           Date = c(20111020, 20111021, 20120102, 20110510, 20080710),
                           Time = c("09:30:00", "14:30:00", "11:00:00", "15:30:00", "13:00:00"),
                           Timeint = c("9:28:00", "14:28:00", "10:58:00", "15:28:00", "12:58:00"))
calls.sample$Time <- as_hms(as.character(calls.sample$Time))
calls.sample$Timeint <- as_hms(as.character(calls.sample$Timeint))
stocks.sample <- data.frame(Ticker = c("MMM", "K", "A", "CAG", "PM"),
                            Date = c(20111020, 20111021, 20120102, 20110510, 20080710),
                            Timestamp = c("9:28:00", "14:30:00", "11:00:00", "15:30:00", "13:00:00"),
                            OpenPrice = c(5, 1, 6, 7, 8))
stocks.sample$Timestamp <- as_hms(as.character(stocks.sample$Timestamp))

fuzzy_left_join(calls.sample, stocks.sample,
                by = c("ccode" = "Ticker",
                       "Date" = "Date",
                       "Time" = "Timestamp",
                       "Timeint" = "Timestamp"),
                match_fun = list(`==`, `==`, `>=`, `<=`))
#> ccode Date.x Time Timeint Ticker Date.y Timestamp OpenPrice
#> 1 MMM 20111020 09:30:00 09:28:00 MMM 20111020 09:28:00 5
#> 2 K 20111021 14:30:00 14:28:00 K 20111021 14:30:00 1
#> 3 A 20120102 11:00:00 10:58:00 A 20120102 11:00:00 6
#> 4 CAG 20110510 15:30:00 15:28:00 CAG 20110510 15:30:00 7
#> 5 PM 20080710 13:00:00 12:58:00 PM 20080710 13:00:00 8

Created on 2020-10-20 by the reprex package (v0.3.0)
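
To see what the last two match functions are doing, here is a small base-R sketch of the window test for the first row of the sample data (MMM). The helper to_seconds is a made-up stand-in for hms, used only so the sketch has no package dependencies; the strict comparison is shown for contrast and is not claimed to be what the question originally used:

```r
# Hypothetical helper: convert "H:MM:SS" strings to seconds since midnight.
to_seconds <- function(hms) {
  parts <- lapply(strsplit(hms, ":"), as.numeric)
  vapply(parts, function(p) p[1] * 3600 + p[2] * 60 + p[3], numeric(1))
}

call_time    <- to_seconds("09:30:00")  # calls.sample$Time, row 1
window_start <- to_seconds("9:28:00")   # calls.sample$Timeint, row 1
stamp        <- to_seconds("9:28:00")   # stocks.sample$Timestamp, row 1

# Time >= Timestamp and Timeint <= Timestamp: the window includes its endpoints.
inclusive <- (call_time >= stamp) & (window_start <= stamp)

# With strict comparisons, a timestamp sitting exactly on the boundary is excluded.
strict <- (call_time > stamp) & (window_start < stamp)

print(c(inclusive = inclusive, strict = strict))
```

A pair of rows is kept only when every function in match_fun returns TRUE for its column pair.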

Fuzzy Join with Partial String Match in R

We can use fuzzyjoin. Do a regex_left_join after extracting the relevant substring from the 'course' column in both datasets (to make them easier to match):

library(fuzzyjoin)
library(dplyr)
library(stringr)
df2 %>%
  mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>%
  regex_left_join(df1 %>%
                    mutate(grp = toupper(str_remove(course, "\\s+grade$")),
                           course = NULL),
                  by = c('student_id', "grp")) %>%
  select(student_id = student_id.x, course, grade)

-output

# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
9 003 8th Mathematics C

OP's expected output is

 df_final
# A tibble: 9 x 3
student_id course grade
<chr> <chr> <chr>
1 001 5th Social Studies A
2 001 5th ELA A
3 001 5th Mathematics A
4 002 6th Social Studies B
5 002 6th ELA B
6 002 6th Mathematics B
7 003 8th Social Studies C
8 003 8th ELA C
9 003 8th Mathematics C
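
For readers who want to see what the two str_remove patterns do on their own, here is a base-R sketch using sub. The input strings are invented for illustration (df1 and df2 are not shown in the answer), so treat them as assumptions:

```r
# Strip a leading ordinal such as "5th " and upper-case the rest
# (same pattern as the first mutate above).
strip_ordinal <- function(course) toupper(sub("^\\d+th\\s+", "", course))

# Strip a trailing " grade" and upper-case the rest
# (same pattern as the second mutate above).
strip_grade <- function(course) toupper(sub("\\s+grade$", "", course))

key_a <- strip_ordinal("5th Social Studies")    # hypothetical value
key_b <- strip_grade("Social Studies grade")    # hypothetical value
print(c(key_a, key_b))
```

Normalizing both sides to the same upper-case key is what lets the regex join line the rows up.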

fuzzyjoin with dates in R

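There are three issues:

  1. Inside match_fun, the operators must be wrapped in backquotes rather than double quotes, so that the functions themselves are passed instead of strings.

  2. The by values should be reversed.

  3. The 'date' columns should be converted to the Date class.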


library(fuzzyjoin)
library(dplyr)
individual_data$date <- as.Date(individual_data$date)
match_data$match_date_minus3 <- as.Date(match_data$match_date_minus3)
match_data$match_date_plus3 <- as.Date(match_data$match_date_plus3)
fuzzy_left_join(individual_data, match_data,
                by = c("country" = "country",
                       'date' = "match_date_minus3",
                       'date' = "match_date_plus3"),
                match_fun = list(`==`, `>`, `<`)) %>%
  select(country = country.x, date = date.x, outcome,
         opponent, match_outcome, match_date = date.y)
# country date outcome opponent match_outcome match_date
#1 Country A 2000-01-01 1.4003662 Country B L 2000-01-02
#2 Country A 2000-01-02 0.5526607 Country B L 2000-01-02
#3 Country A 2000-01-03 0.4316405 Country B L 2000-01-02
#4 Country A 2000-01-04 -0.1171910 Country B L 2000-01-02
#5 Country B 2000-01-01 1.3433921 Country A W 2000-01-02
#6 Country B 2000-01-01 -1.1773011 Country A W 2000-01-02
#7 Country B 2000-01-02 -0.6953120 Country A W 2000-01-02
#8 Country B 2000-01-03 1.3484053 Country A W 2000-01-02
#9 Country B 2000-01-03 -0.7266405 Country A W 2000-01-02
#10 Country B 2000-01-03 -0.9139988 Country A W 2000-01-02
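
The first and third fixes listed for this answer can be checked in isolation with base R. The dates below are illustrative values chosen to resemble a three-day window around 2000-01-02; they are assumptions, not rows copied from match_data:

```r
# Backquotes refer to the operator as a function; double quotes make a string.
backquoted_is_function <- is.function(`==`)
quoted_is_function     <- is.function("==")

# Once the columns are Date objects, > and < compare them chronologically.
d      <- as.Date("2000-01-01")   # hypothetical individual_data$date
minus3 <- as.Date("1999-12-30")   # hypothetical match_date_minus3
plus3  <- as.Date("2000-01-05")   # hypothetical match_date_plus3
in_window <- (d > minus3) & (d < plus3)

print(c(backquoted_is_function, quoted_is_function, in_window))
```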

fuzzy_left_join with match_fun %in%

If we want to do a partial match of the word before the / in 'url' against the 'string' column in 'lookup_df', we could extract that substring as a new column and then do a regex_left_join:

library(dplyr)
library(fuzzyjoin)
library(stringr)
example_df %>%
  mutate(string = str_remove(url, "\\/.*")) %>%
  regex_left_join(lookup_df, by = 'string') %>%
  select(url, numbs, group)

-output

# url numbs group
#1 blog/blah 1 blog
#2 blog/?utm_medium=foo 2 blog
#3 blah 3 <NA>
#4 subscription/apples 4 subs
#5 UK/something 5 UK
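
The two steps (cut the url at the first /, then regex-match the remainder against the lookup strings) can be sketched in base R as follows. The urls come from the output above; the lookup patterns are an assumption, since lookup_df itself is not shown:

```r
urls <- c("blog/blah", "blog/?utm_medium=foo", "blah",
          "subscription/apples", "UK/something")

# Step 1: keep only the part before the first "/".
keys <- sub("/.*", "", urls)

# Step 2: find the first lookup pattern that matches each key, if any.
lookup <- c("blog", "subs", "UK")   # hypothetical lookup_df$string values
first_match <- vapply(keys, function(k) {
  hit <- lookup[vapply(lookup, grepl, logical(1), x = k)]
  if (length(hit) > 0) hit[1] else NA_character_
}, character(1), USE.NAMES = FALSE)

print(keys)
print(first_match)
```

Because the match is a regex search rather than an equality test, "subs" matches "subscription", while "blah" matches nothing and stays NA.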

Join data frames based fuzzy matching of strings

Perhaps this is what you're looking for?

library(dplyr)
library(fuzzyjoin)
library(stringr)
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = str_detect)
## A tibble: 2 x 4
# col1 col2 col3 col4
# <chr> <int> <chr> <dbl>
#1 Banana Shipping 2 Banana 700
#2 FedEX USA Ground 3 FedEX USA 900

If you wanted to ignore case, you could define your own str_detect.

my_str_detect <- function(x,y){str_detect(x,regex(y, ignore_case = TRUE))}
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_str_detect)
## A tibble: 3 x 4
# col1 col2 col3 col4
# <chr> <int> <chr> <dbl>
#1 Banana Shipping 2 Banana 700
#2 FedEX USA Ground 3 FedEX USA 900
#3 FedEx USA Commercial 4 FedEX USA 900

For bonus points, you can use agrepl for approximate (fuzzy) matching.

You can modify the max.distance = argument and potentially add cost =. See help(agrepl) for more.

my_match_fun <- Vectorize(function(x,y) agrepl(x, y, ignore.case=TRUE, max.distance = 0.7, useBytes = TRUE))
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_match_fun)
## A tibble: 4 x 4
# col1 col2 col3 col4
# <chr> <int> <chr> <dbl>
#1 Banana Shipping 2 Banana 700
#2 FedEX USA Ground 3 FedEX USA 900
#3 FedEx USA Commercial 4 FedEX USA 900
#4 FedEx International 5 FedEX USA 900
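
As a rough illustration of why the agrepl call is wrapped in Vectorize and how max.distance changes the outcome, here is a base-R sketch. The strings are toy examples rather than rows from df1 or df2, and pairwise_agrepl is a made-up name:

```r
# agrepl takes a single pattern, so it is wrapped to work pair by pair,
# which is what a match_fun needs.
pairwise_agrepl <- Vectorize(function(x, y) agrepl(x, y, ignore.case = TRUE))

# The default tolerance allows a small number of edits relative to pattern length.
approx_hits <- unname(pairwise_agrepl(c("lasy", "banana"),
                                      c("1 lazy 2", "FedEX USA")))

# max.distance = 0 requires the pattern to appear exactly.
exact_only <- agrepl("lasy", "1 lazy 2", max.distance = 0)

print(approx_hits)
print(exact_only)
```

A value as large as 0.7 is very permissive, so it is worth checking the matches by eye before trusting the join.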

Selective left join in r

1) This left joins df1 with df2 on zipcode but only joins rows for which exp.id is 0. For other rows pct is NA as in the expected result shown in the question. Note that dot is an SQL operator so we surround exp.id with square brackets to escape the name.

library(sqldf)

sqldf("select a.id, a.zipcode, b.pct
from df1 a
left join df2 b on a.zipcode = b.zipcode and [exp.id] = 0")
## id zipcode pct
## 1 1 11111 0.1
## 2 2 44444 0.7
## 3 3 33333 <NA>

2) This is like (1) but returns only the rows for which exp.id is 0. This is different from what is asked for in the question, but a comment suggested that it is of interest.

The difference between the code here and (1) illustrates the subtle difference between putting a condition in on and putting it in where. Because the join condition is simple in this case, we can use the using clause instead of on. using results in a single zipcode column, so we don't need to distinguish between a.zipcode and b.zipcode.

sqldf("select a.id, zipcode, b.pct
from df1 a left join df2 b using(zipcode)
where [exp.id] = 0")
## id zipcode pct
## 1 1 11111 0.1
## 2 2 44444 0.7
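
The same on-versus-where distinction can be mimicked in base R with merge, which may help if sqldf is not available. The data frames below are guesses consistent with the output shown (in particular, the exp.id value for id 3 and the pct stored for zipcode 33333 are assumptions):

```r
# Hypothetical inputs reconstructed from the printed results.
df1 <- data.frame(id = 1:3, zipcode = c(11111, 44444, 33333), exp.id = c(0, 0, 1))
df2 <- data.frame(zipcode = c(11111, 44444, 33333), pct = c(0.1, 0.7, 0.3))

joined <- merge(df1, df2, by = "zipcode", all.x = TRUE)

# Condition in ON: every df1 row is kept, but rows with exp.id != 0 get no pct.
# (Equivalent to the SQL only while zipcode is unique in df2.)
on_style <- joined
on_style$pct[on_style$exp.id != 0] <- NA

# Condition in WHERE: rows with exp.id != 0 are dropped after the join.
where_style <- joined[joined$exp.id == 0, ]

print(on_style)
print(where_style)
```

Note that merge sorts the result by the key column, so the row order differs from the sqldf output.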

Note that the SQL engine internally creates a query plan to optimize the calculation while maintaining the same output. It does not necessarily perform the operations in the order written, i.e. it does not necessarily perform the join and then reduce the result but may reduce df1 first to improve performance as that gives the same result. We display information on the query plan below and we see that, indeed, it scans df1 first.

sqldf("explain query plan select a.id, zipcode, b.pct
from df1 a left join df2 b using(zipcode)
where [exp.id] = 0")
## id parent notused detail
## 1 3 0 0 SCAN TABLE df1 AS a
## 2 16 0 0 SEARCH TABLE df2 AS b USING AUTOMATIC COVERING INDEX (zipcode=?)

