Fuzzy Left Join with R

fuzzy LEFT join with R

Voila :)

fuzzy_left_join(df1, df2, match_fun = ci_str_detect, by = c(col1 = "col4"))
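
ci_str_detect is not defined in the snippet above. A minimal sketch of what such a helper might look like, assuming it is meant as a case-insensitive, pairwise "does string x contain pattern y" check. The body below is an assumption written in base R, not the original definition:

```r
# Hypothetical helper: TRUE where x[i] contains the pattern y[i], ignoring case.
# grepl() uses a single pattern per call, so mapply() pairs the two vectors up.
ci_str_detect <- function(x, y) {
  mapply(function(string, pattern) grepl(pattern, string, ignore.case = TRUE),
         x, y, USE.NAMES = FALSE)
}

ci_str_detect(c("Apple Pie", "banana"), c("apple", "Cherry"))
```

Any function with this shape (two equal-length vectors in, one logical vector out) should be usable as match_fun.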

Fuzzy Left Join exact + partial string match

The function passed to match_fun is not called on one combination at a time. It is called once with whole vectors covering all combinations, so the detect function has to be vectorised:

detect <- function(x, y){ 
  mapply(function(x, y) any(x == y), strsplit(x, '/'), strsplit(y, '/'))
}

and then try:

fuzzyjoin::fuzzy_left_join(x, y, by = c("X1" = "Y1", "X2" = "Y2", "Names"),
                           match_fun = list(`==`, `==`, detect))

#  X1    X2    Names.x Y1    Y2    Names.y
#  <chr> <chr> <chr>   <chr> <chr> <chr>  
#1 5000  a     a/b/c   5000  a     a/j/k  
#2 6000  b     d/e/f   NA    NA    NA     
#3 7000  c     g/h/i   NA    NA    NA
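
Note that detect compares the split pieces position by position. If a shared piece should count regardless of where it appears, a variant based on %in% may be closer to what is wanted. This is a sketch under that assumption; detect_any and the sample vectors are hypothetical:

```r
# The answer's function: pieces are compared position by position.
detect <- function(x, y) {
  mapply(function(x, y) any(x == y), strsplit(x, "/"), strsplit(y, "/"))
}

# Hypothetical variant: TRUE if the two strings share any piece at all.
detect_any <- function(x, y) {
  mapply(function(a, b) any(a %in% b), strsplit(x, "/"), strsplit(y, "/"))
}

# Both functions take whole vectors and return one logical per pair.
x <- c("a/b/c", "d/e/f", "g/h/i")
y <- c("a/j/k", "x/d/z", "p/q/r")

res_positional <- unname(detect(x, y))      # "d" is shared but in a different slot
res_any        <- unname(detect_any(x, y))  # the shared "d" now counts
```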

R fuzzy_left_join with time

I ran your code and it completed without error, but the result was all NAs. I changed one thing, the last two items of the match_fun list:

match_fun = list(`==`, `==`, `>=`, `<=`)

and got your desired result!

library(fuzzyjoin, quietly = TRUE); library(anytime, quietly = TRUE); library(hms, quietly = TRUE)
calls.sample <- data.frame(ccode = c("MMM", "K", "A", "CAG", "PM"),
                           Date = c(20111020, 20111021, 20120102, 20110510, 20080710),
                           Time = c("09:30:00", "14:30:00", "11:00:00", "15:30:00", "13:00:00"),
                           Timeint = c("9:28:00", "14:28:00", "10:58:00", "15:28:00", "12:58:00"))
calls.sample$Time <- as_hms(as.character(calls.sample$Time))
calls.sample$Timeint <- as_hms(as.character(calls.sample$Timeint))
stocks.sample <- data.frame(Ticker = c("MMM", "K", "A", "CAG", "PM"),
                            Date = c(20111020, 20111021, 20120102, 20110510, 20080710),
                            Timestamp = c("9:28:00", "14:30:00", "11:00:00", "15:30:00", "13:00:00"),
                            OpenPrice = c(5, 1,6,7,8))
stocks.sample$Timestamp <- as_hms(as.character(stocks.sample$Timestamp))

fuzzy_left_join(calls.sample, stocks.sample,
                by = c("ccode" = "Ticker", 
                       "Date" = "Date", 
                       "Time" = "Timestamp",
                       "Timeint" = "Timestamp"),
                match_fun = list(`==`, `==`, `>=`, `<=`))
#>   ccode   Date.x     Time  Timeint Ticker   Date.y Timestamp OpenPrice
#> 1   MMM 20111020 09:30:00 09:28:00    MMM 20111020  09:28:00         5
#> 2     K 20111021 14:30:00 14:28:00      K 20111021  14:30:00         1
#> 3     A 20120102 11:00:00 10:58:00      A 20120102  11:00:00         6
#> 4   CAG 20110510 15:30:00 15:28:00    CAG 20110510  15:30:00         7
#> 5    PM 20080710 13:00:00 12:58:00     PM 20080710  13:00:00         8

Created on 2020-10-20 by the reprex package (v0.3.0)
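
To see why each by pair needs its own operator, it can help to spell the join condition out by hand. The following is a conceptual sketch in base R only; it is not how fuzzyjoin is implemented, and the small data frames (with times stored as minutes since midnight) are made up for illustration:

```r
# Made-up data: times are minutes since midnight to keep the example base-R only.
calls  <- data.frame(ccode = c("MMM", "K"),
                     Time = c(570, 870),       # 09:30, 14:30
                     Timeint = c(568, 868),    # 09:28, 14:28
                     stringsAsFactors = FALSE)
stocks <- data.frame(Ticker = c("MMM", "K", "K"),
                     Timestamp = c(568, 870, 900),  # 09:28, 14:30, 15:00
                     stringsAsFactors = FALSE)

# Every pairing of a calls row (i) with a stocks row (j).
pairs <- expand.grid(i = seq_len(nrow(calls)), j = seq_len(nrow(stocks)))

# One vectorised comparison per by/match_fun pair; a pairing must pass all of them.
keep <- calls$ccode[pairs$i]   == stocks$Ticker[pairs$j] &
        calls$Time[pairs$i]    >= stocks$Timestamp[pairs$j] &
        calls$Timeint[pairs$i] <= stocks$Timestamp[pairs$j]

matched <- pairs[keep, ]
```

Here the pairing of K with the 15:00 timestamp is dropped because Time >= Timestamp fails, which is the window logic that `>=` and `<=` express in match_fun.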

Fuzzy Join with Partial String Match in R

We can use fuzzyjoin. Do a regex_left_join after extracting the shared substring from the 'course' column in both datasets (to make them easier to match):

library(fuzzyjoin)
library(dplyr)
library(stringr)
df2 %>% 
   mutate(grp = toupper(str_remove(course, "^\\d+th\\s+"))) %>% 
   regex_left_join(df1 %>%
       mutate(grp = toupper(str_remove(course, 
     "\\s+grade$")), course = NULL), by = c('student_id', "grp")) %>% 
   select(student_id = student_id.x, course, grade)

-output

# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C

OP's expected output is

 df_final
# A tibble: 9 x 3
  student_id course             grade
  <chr>      <chr>              <chr>
1 001        5th Social Studies A    
2 001        5th ELA            A    
3 001        5th Mathematics    A    
4 002        6th Social Studies B    
5 002        6th ELA            B    
6 002        6th Mathematics    B    
7 003        8th Social Studies C    
8 003        8th ELA            C    
9 003        8th Mathematics    C
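
The normalisation step can be checked on its own before joining. A small base R sketch of the same two regular expressions, assuming course values shaped like the ones below (the sample strings are guesses, since the original df1 and df2 are not shown):

```r
# Assumed sample values; the real df1 and df2 come from the question.
course_df2 <- c("5th Social Studies", "6th ELA", "8th Mathematics")
course_df1 <- c("Social Studies grade", "ELA grade", "Mathematics grade")

# Strip the leading "5th " style prefix, then upper-case.
grp_df2 <- toupper(sub("^\\d+th\\s+", "", course_df2, perl = TRUE))

# Strip the trailing " grade" suffix, then upper-case.
grp_df1 <- toupper(sub("\\s+grade$", "", course_df1, perl = TRUE))
```

Under these assumptions both vectors reduce to the same keys, which is what lets the join on 'grp' line the rows up.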

fuzzyjoin with dates in R

There are three issues:

1) The operators inside match_fun must be wrapped in backquotes, not double quotes.
2) The by values should be reversed.
3) The 'date' columns need to be converted to the Date class.

library(fuzzyjoin)
library(dplyr)
individual_data$date <- as.Date(individual_data$date)
match_data$match_date_minus3 <- as.Date(match_data$match_date_minus3)
match_data$match_date_plus3 <- as.Date(match_data$match_date_plus3)
fuzzy_left_join(individual_data, match_data,
                                 by = c("country" = "country",
                                        'date' = "match_date_minus3",
                                        'date' = "match_date_plus3"),
                                 match_fun = list(`==`, `>`, `<`)) %>%
  select(country = country.x, date = date.x, outcome, 
          opponent, match_outcome, match_date = date.y)
#     country       date    outcome  opponent match_outcome match_date
#1  Country A 2000-01-01  1.4003662 Country B             L 2000-01-02
#2  Country A 2000-01-02  0.5526607 Country B             L 2000-01-02
#3  Country A 2000-01-03  0.4316405 Country B             L 2000-01-02
#4  Country A 2000-01-04 -0.1171910 Country B             L 2000-01-02
#5  Country B 2000-01-01  1.3433921 Country A             W 2000-01-02
#6  Country B 2000-01-01 -1.1773011 Country A             W 2000-01-02
#7  Country B 2000-01-02 -0.6953120 Country A             W 2000-01-02
#8  Country B 2000-01-03  1.3484053 Country A             W 2000-01-02
#9  Country B 2000-01-03 -0.7266405 Country A             W 2000-01-02
#10 Country B 2000-01-03 -0.9139988 Country A             W 2000-01-02
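
The backquote and Date points can be checked in isolation. A minimal base R sketch with made-up dates:

```r
# Backquotes give the operator itself (a function); double quotes give a string.
backquoted_is_function <- is.function(`>`)
quoted_is_function     <- is.function(">")

# With Date class columns the comparisons work on actual dates.
d     <- as.Date(c("2000-01-01", "2000-01-05"))
lower <- as.Date("1999-12-30")
upper <- as.Date("2000-01-04")

# Same test as match_fun = list(`>`, `<`) applies to each date pair.
in_window <- `>`(d, lower) & `<`(d, upper)
```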

fuzzy_left_join with match_fun %in%

To partially match the part of 'url' before the first / against the 'string' column in 'lookup_df', we can extract that substring into a new column and then do a regex_left_join:

library(dplyr)
library(fuzzyjoin)
library(stringr)
example_df %>%
    mutate(string = str_remove(url, "\\/.*")) %>% 
    regex_left_join(lookup_df, by = 'string') %>%
    select(url, numbs, group)

-output

#                   url numbs group
#1            blog/blah     1  blog
#2 blog/?utm_medium=foo     2  blog
#3                 blah     3  <NA>
#4  subscription/apples     4  subs
#5         UK/something     5    UK

Join data frames based fuzzy matching of strings

Perhaps this is what you're looking for?

library(dplyr)
library(fuzzyjoin)
library(stringr)
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = str_detect)
## A tibble: 2 x 4
#  col1              col2 col3       col4
#  <chr>            <int> <chr>     <dbl>
#1 Banana Shipping      2 Banana      700
#2 FedEX USA Ground     3 FedEX USA   900

If you wanted to ignore case, you could define your own str_detect.

my_str_detect <- function(x,y){str_detect(x,regex(y, ignore_case = TRUE))}
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_str_detect)
## A tibble: 3 x 4
#  col1                  col2 col3       col4
#  <chr>                <int> <chr>     <dbl>
#1 Banana Shipping          2 Banana      700
#2 FedEX USA Ground         3 FedEX USA   900
#3 FedEx USA Commercial     4 FedEX USA   900

For bonus points you can use agrepl from this question.

You can modify the max.distance argument and potentially add costs. See help(agrepl) for more.

my_match_fun <- Vectorize(function(x,y) agrepl(x, y, ignore.case=TRUE, max.distance = 0.7, useBytes = TRUE))
df1 %>% fuzzy_inner_join(df2,by=c("col1" = "col3"),match_fun = my_match_fun)
## A tibble: 4 x 4
#  col1                  col2 col3       col4
#  <chr>                <int> <chr>     <dbl>
#1 Banana Shipping          2 Banana      700
#2 FedEX USA Ground         3 FedEX USA   900
#3 FedEx USA Commercial     4 FedEX USA   900
#4 FedEx International      5 FedEX USA   900
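
Vectorize is needed because agrepl uses a single pattern per call, while match_fun receives two equal-length vectors. A base R sketch of the same pairing done with mapply; pairwise_agrepl and the sample strings are made up, and the default max.distance is used rather than 0.7:

```r
# Hypothetical helper: approximate match of pattern x[i] inside string y[i].
pairwise_agrepl <- function(x, y) {
  # unname() drops the names mapply() takes from the character input
  unname(mapply(function(pattern, string) {
    agrepl(pattern, string, ignore.case = TRUE)
  }, x, y))
}

patterns <- c("lasy", "Banana", "zzzz")
strings  <- c("1 lazy 2", "banana shipping", "abc")

res <- pairwise_agrepl(patterns, strings)
```

The first pair matches through one substitution, the second through case-insensitivity, and the third does not match.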

Selective left join in r

1) This left joins df1 with df2 on zipcode but only joins rows for which exp.id is 0. For other rows pct is NA as in the expected result shown in the question. Note that dot is an SQL operator so we surround exp.id with square brackets to escape the name.

library(sqldf)

sqldf("select a.id, a.zipcode, b.pct
  from df1 a 
  left join df2 b on a.zipcode = b.zipcode and [exp.id] = 0")
##   id zipcode  pct
## 1  1   11111  0.1
## 2  2   44444  0.7
## 3  3   33333 <NA>

2) This is like (1) but returns only the rows for which exp.id is zero. This is different from what is asked for in the question, but a comment suggested that it is of interest.

Comparing the code here with (1) illustrates the subtle difference between putting a condition in on and putting it in where. Because we have a simple join condition in this case, we can use the using clause instead of on. using results in a single zipcode column, so we don't need to distinguish between a.zipcode and b.zipcode.

sqldf("select a.id, zipcode, b.pct
  from df1 a left join df2 b using(zipcode)
  where [exp.id] = 0")
##   id zipcode pct
## 1  1   11111 0.1
## 2  2   44444 0.7
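
For comparison, the on versus where distinction can be reproduced in base R with merge. This is a sketch under assumptions: the data frames are invented to resemble the outputs above, and exp.id is assumed to live in df1:

```r
# Invented data shaped like the example; exp.id is assumed to be a df1 column.
df1 <- data.frame(id = 1:3, zipcode = c(11111, 44444, 33333), exp.id = c(0, 0, 1))
df2 <- data.frame(zipcode = c(11111, 44444, 33333), pct = c(0.1, 0.7, 0.5))

# Condition in ON: rows with exp.id != 0 stay in the result but find no partner.
# Giving df2 a constant exp.id = 0 makes the extra condition part of the join key.
on_result <- merge(df1, cbind(df2, exp.id = 0),
                   by = c("zipcode", "exp.id"), all.x = TRUE)
on_result <- on_result[order(on_result$id), c("id", "zipcode", "pct")]

# Condition in WHERE: join first, then drop the rows that fail the condition.
joined <- merge(df1, df2, by = "zipcode", all.x = TRUE)
where_result <- joined[joined$exp.id == 0, c("id", "zipcode", "pct")]
where_result <- where_result[order(where_result$id), ]
```

on_result keeps all three ids with pct missing for id 3, while where_result keeps only ids 1 and 2.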

Note that the SQL engine internally creates a query plan to optimize the calculation while maintaining the same output. It does not necessarily perform the operations in the order written, i.e. it does not necessarily perform the join and then reduce the result but may reduce df1 first to improve performance as that gives the same result. We display information on the query plan below and we see that, indeed, it scans df1 first.

sqldf("explain query plan select a.id, zipcode, b.pct
      from df1 a left join df2 b using(zipcode)
      where [exp.id] = 0")
##   id parent notused                                                           detail
## 1  3      0       0                                              SCAN TABLE df1 AS a
## 2 16      0       0 SEARCH TABLE df2 AS b USING AUTOMATIC COVERING INDEX (zipcode=?)
