Fuzzyjoin Two Data Frames Using Data.Table

fuzzyjoin two data frames using data.table

To clarify terminology:

The data.table approach for your problem does not require a fuzzyjoin with data.table [at least not in the sense of inexact matching]. Instead, you just want to join on data.table columns using non-equal binary operators >=,>, <= and/or <. In data.table terminology those are called "non equi joins".

Where you titled your question "fuzzyjoin two data frames using data.table" that is just, understandably, after you used library(fuzzyjoin) in your first working attempt. (No problem, just clarifying for readers.)

Solution using data.table non equi joins to compare date columns:

You were very close to a working data.table solution where you had:

dt_final_data <- setDT(df2)[df1, 
on = .(ID, date > start_date, date <= end_date)]

To modify it to make it work as you want, simply add a data.table j expression to select the columns you want, in the order you want them EDIT: and prefix the problem column with x. (to tell data.table to return the column from the x side of the dt_x[dt_i,] join) For example, as below calling the column x.date:

dt_final_data <- setDT(df2)[df1, 
.(ID, f_date, ACCNUM, flmNUM, start_date, end_date, x.date, fyear, at, lt),
on = .(ID, date > start_date, date <= end_date)]

This now gives you the output you are after:

dt_final_data
ID f_date ACCNUM flmNUM start_date end_date x.date fyear at lt
1: 50341 2002-03-08 0001104659-02-000656 2571187 2002-09-07 2003-08-30 2002-12-31 2002 190453.000 173620.000
2: 1067983 2009-11-25 0001047469-09-010426 91207220 2010-05-27 2011-05-19 2010-12-31 2010 372229.000 209295.000
3: 804753 2004-05-14 0001193125-04-088404 4805453 2004-11-13 2005-11-05 2004-12-31 2004 982.265 383.614
4: 1090727 2013-05-22 0000712515-13-000022 13865105 2013-11-21 2014-11-13 2013-12-31 2013 36212.000 29724.000
5: 1467858 2010-02-26 0001193125-10-043035 10640035 2010-08-28 2011-08-20 2010-12-31 2010 138898.000 101739.000
6: 858877 2019-01-31 0001166691-19-000005 19556540 2019-08-02 2020-07-24 <NA> NA NA NA
7: 2488 2016-02-24 0001193125-16-476010 161452982 2016-08-25 2017-08-17 2016-12-31 2016 3321.000 2905.000
8: 1478242 2004-03-12 0001193125-04-039482 4664082 2004-09-11 2005-09-03 <NA> NA NA NA
9: 1467858 2017-02-16 0001555280-17-000044 17618235 2017-08-18 2018-08-10 2017-12-31 2017 212482.000 176282.000
10: 14693 2015-10-28 0001193125-15-356351 151180619 2016-04-28 2017-04-20 2016-04-30 2015 4183.000 2621.000

As above, your result for ID=50341 now has date=2002-12-31. In other words, the result column date now comes from df2.date.

You can of course rename the x.date column in your j expression:

setDT(df2)[ df1, 
.(ID,
f_date,
ACCNUM,
flmNUM,
start_date,
end_date,
my_result_date_name = x.date,
fyear,
at,
lt),
on = .(ID, date > start_date, date <= end_date)]

Why does data.table (currently) rename columns in non-equi joins and return data from a different column:

This explanation from @ScottRitchie sums it up quite nicely:

When performing any join, only one copy of each key column is returned in the result. Currently, the column from i is returned, and labelled with the column name from x, making equi joins consistent with the behaviour of base merge().

Above makes sense if you keep in mind back before version 1.9.8 data.table didn't have non-equi joins.

Through and including the current 1.12.2 release of data.table, this (and several overlapping issues) have been the source a lot of discussion on the data.table github issues list. For example:
possible inconsistency in non-equi join, returning join columns #3437 and
SQL-like column return for non-equi and rolling joins #2706 are just 2 of many.

However, watch this github issue: Continuing from the above discussions the keen analytical minds of the data.table team are working to make this less confusing in some (hopefully not too distant) future version:
Both columns for rolling and non-equi joins #3093

Merge two data frames in R by variable that is regular expression in one and string in other

Answer from Comment by @jblood94

With a and b as data.tables: a[, regex_inner_join(b[year == .BY], .SD, by = "ID"), year]
jblood94

Join two data frames using the data.table package in R based on dates that are within +/- 3 months

Rolling joins in data.table are extremely useful but can be a little bit tough to get the hang of. The syntax for rollends is quite a bit different than what you've got there, it's not designed to handle any kind of complex logic, just simple a TRUE/FALSE case.

At any rate, here's one way to approach this problem. Using month arithmetic as a filtering criteria in combination with the nearest value requirement makes this a multi-step problem instead of a one-step join (at least any way I can see it).

While the join/filter/copy result values operation is technically a one-liner, I did my best to add in plenty of explanation of the nested operations.

## Make a copy of Date2 to use as key, as it will be inaccessible within the joined table
df2[, Date2Copy := Date2]

## Set Keys
setkey(df1,ID,Date1)
setkey(df2,Id,Date2Copy)

## Step 3: (read the inner nested steps first!)
## After performing the steps 1/2, join the intermediate result table back to `df1`...
df1[
## Step 1:
## First use the key of `df1` to subset `df2`` with a rolling join
df2[df1,.(ID, Date1, Date2), roll = "nearest"
## Step 2:
## Then apply the +/- 3 month filtering critera
][between(Date2,
Date1 %m-% months(3),
Date1 %m+% months(3))]
## Step 3:
## ...on the `ID` column and add the intermediate results
## for `Date2` and `Value` columns to `df1` by reference
, c("Date2","Value") := .(i.Date2,i.Value), on = .(ID)]

## Results
print(df1)
# ID Date1 Date2 Value
# 1: 1 2019-09-09 2019-10-09 7
# 2: 2 2019-09-09 <NA> NA
# 3: 3 2019-09-09 <NA> NA
# 4: 4 2019-09-09 2019-10-27 15

These are my three go-to resources (other than the package documentation) for rolling joins, they've all helped me understand some of the quirks at multiple points over the years.

  • https://r-norberg.blogspot.com/2016/06/understanding-datatable-rolling-joins.html
  • https://www.gormanalysis.com/blog/r-data-table-rolling-joins/
  • http://franknarf1.github.io/r-tutorial/_book/tables.html#tables

R: Fuzzy merge using agrep and data.table

A possible solution using 'fuzzyjoin':

library(fuzzyjoin)
f <- Vectorize(function(x,y) agrepl(x, y,
ignore.case=TRUE,
max.distance = 0.05, useBytes = TRUE))

dt1 %>% fuzzy_inner_join(dt2, by="Name", match_fun=f)
# Name.x A Name.y B
#1 ASML HOLDING 1 ASML HOLDING NV p
#2 ABN AMRO GROUP 2 ABN AMRO GROUP q

NOTE : The main problem, that you encountered too, was that agrep and agrepl don't seem to expect the first argument to be a vector. That's the reason why I wrapped the call with Vectorize.

This method can be used together with an equi-join (mind the order of columns in the by!):

dt1 = data.frame(Name = c("ASML HOLDING","ABN AMRO GROUP"), A = c(1,2),Date=c(1,2))
dt2 = data.frame(Name = c("ASML HOLDING NV", "ABN AMRO GROUP", "ABN AMRO GROUP"), B = c("p", "q","r"),Date=c(1,2,3))

dt1 %>% fuzzy_inner_join(dt2, by=c("Date","Name"), match_fun=f) %>% filter(Date.x==Date.y)

How to fuzzyjoin several dataframes in one go using IRanges

Put the dataframes in a list and join the dataframes with Reduce.

library(fuzzyjoin)
library(dplyr)

join_two_dataframes <- function(df1, df2) {
interval_left_join(x = df1,
y = df2,
by = c("start", "end")) %>%
select(-c(start.y, end.y)) %>%
rename(start = start.x, end = end.x)
}

list_df <- list(df1, df2, df3)
Reduce(join_two_dataframes, list_df)

# line start end v2 v3
#1 1 75 100 D a
#2 1 75 100 D b
#3 2 100 150 D b
#4 2 100 150 E b
#5 2 100 150 F b
#6 3 170 190 I b
#7 4 240 300 J <NA>
#8 4 240 300 K <NA>
#9 4 240 300 F <NA>

How to fuzzy join 2 dataframes on 2 variables with differing fuzzy logic?

You can create a cartesian product of two dataframes using merge and then subset the rows which follow our required conditions.

subset(merge(a, b, by = NULL), abs(KW.x - KW.y) <= 1 & 
abs(price.x - price.y) <= 0.02)

# name.x KW.x price.x KW.y price.y name.y
#1 A 201902 1.99 201903 1.98 a
#5 B 201904 3.02 201904 3.00 b
#9 C 201905 5.00 201904 5.00 c


Related Topics



Leave a reply



Submit