Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join

I think the error occurs because extra arguments get forwarded to every match_fun in the list; an argument like ignore_case, intended only for the string-distance match_fun, cannot be handled by a match_fun such as `>=`.

The solution is to define your own match_fun with the parameters fixed inside the function. Below I define my own match_fun_stringdist with hard-coded parameters; I also used this approach in another question/answer: https://stackoverflow.com/a/44383103/4663008.

# First, need to define match_fun_stringdist
# Code adapted from stringdist_join in https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {

  # These parameters can't be passed in from fuzzy_join because the other
  # match_funs are incompatible with them, so they are fixed here.
  ignore_case  <- FALSE
  method       <- "dl"
  max_dist     <- 99
  distance_col <- "dist"

  if (ignore_case) {
    v1 <- stringr::str_to_lower(v1)
    v2 <- stringr::str_to_lower(v2)
  }

  # Shortcut for Levenshtein-like methods: if the difference in string length
  # is greater than the maximum string distance, the edit distance must be at
  # least that large. Length is much faster to compute than string distance.
  if (method %in% c("osa", "lv", "dl")) {
    length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
    include <- length_diff <= max_dist

    dists <- rep(NA, length(v1))
    dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
  } else {
    # Have to compute them all
    dists <- stringdist::stringdist(v1, v2, method = method)
  }

  ret <- dplyr::tibble(include = (dists <= max_dist))  # data_frame() is deprecated; use tibble()
  if (!is.null(distance_col)) {
    ret[[distance_col]] <- dists
  }
  ret
}

and then call fuzzy_join, supplying one match_fun per pair of by columns:

fuzzy_join(data1, data2,
           by = list(x = c("Address1", "AREACODE", "Year1"),
                     y = c("Address2", "AREA_CODE", "Year2")),
           match_fun = list(match_fun_stringdist, `==`, `<=`),
           mode = "left")
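
data1 and data2 are not shown here, so for illustration only, here is a minimal sketch of the shape this call expects (the column names come from the by list above; the values are made up):

# Hypothetical inputs, only to show the expected shape: a string column for
# match_fun_stringdist, an exact-match column for `==`, and a numeric column
# compared with `<=` (i.e. Year1 <= Year2).
data1 <- data.frame(Address1 = c("786, GALI NO 5, XYZ", "23/4, STREET 2, PQR"),
                    AREACODE = c(110, 220),
                    Year1    = c(2005, 2006))
data2 <- data.frame(Address2  = c("786, GALI NO 4 XYZ", "23/4, 23RD FLOOR, STREET 2, ABC-E, PQR"),
                    AREA_CODE = c(110, 220),
                    Year2     = c(2006, 2007))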

Limiting the amount of fuzzy string comparisons by comparing by subgroup

You were on the right track; there were just a few typos/bugs, and you need to finish replacing the column names.

Also, with the first approach, you will need to decide how you want to pick the "best match" based on Municipality.dist, Province.dist, and Year.dist (see the sketch after the reprex below).

Maybe the second one works better if you get the years and provinces sorted out first.


DT1 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002), Municipality = c("Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing"), Values = c(0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99, 0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99, 0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))

DT2 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002), Municipality = c("Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None", "Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None", "Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None"), `Other Values` = c(0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))

library(fuzzyjoin)
library(dplyr)

stringdist_join(DT1, DT2,
                by = c("Municipality", "Year", "Province"),
                mode = "left",
                ignore_case = TRUE,
                method = "jw",
                max_dist = 10,
                distance_col = "dist") %>%
  group_by(Municipality.x) %>%
  slice_min(Municipality.dist)
#> # A tibble: 135 x 12
#> # Groups:   Municipality.x [3]
#>    Province.x Year.x Municipality.x Values Province.y Year.y Municipality.y
#>         <dbl>  <dbl> <chr>           <dbl>      <dbl>  <dbl> <chr>
#>  1          1   2000 Anything        0.580          1   2000 Anything
#>  2          1   2000 Anything        0.580          1   2001 Anything
#>  3          1   2000 Anything        0.580          1   2002 Anything
#>  4          1   2000 Anything        0.580          2   2000 Anything
#>  5          1   2000 Anything        0.580          2   2001 Anything
#>  6          1   2000 Anything        0.580          2   2002 Anything
#>  7          1   2000 Anything        0.580          3   2000 Anything
#>  8          1   2000 Anything        0.580          3   2001 Anything
#>  9          1   2000 Anything        0.580          3   2002 Anything
#> 10          1   2001 Anything         0.94          1   2000 Anything
#> # ... with 125 more rows, and 5 more variables: `Other Values` <dbl>,
#> #   Municipality.dist <dbl>, Province.dist <dbl>, Year.dist <dbl>, dist <lgl>

stringdist_join(DT1, DT2,
                by = "Municipality",
                mode = "left",
                ignore_case = TRUE,
                method = "jw",
                max_dist = 10,
                distance_col = "dist") %>%
  group_by(Municipality.x, Year.x, Province.x) %>%
  slice_min(dist)
#> # A tibble: 135 x 9
#> # Groups:   Municipality.x, Year.x, Province.x [27]
#>    Province.x Year.x Municipality.x Values Province.y Year.y Municipality.y
#>         <dbl>  <dbl> <chr>           <dbl>      <dbl>  <dbl> <chr>
#>  1          1   2000 Anything        0.580          1   2000 Anything
#>  2          1   2000 Anything        0.580          1   2001 Anything
#>  3          1   2000 Anything        0.580          1   2002 Anything
#>  4          1   2000 Anything        0.580          2   2000 Anything
#>  5          1   2000 Anything        0.580          2   2001 Anything
#>  6          1   2000 Anything        0.580          2   2002 Anything
#>  7          1   2000 Anything        0.580          3   2000 Anything
#>  8          1   2000 Anything        0.580          3   2001 Anything
#>  9          1   2000 Anything        0.580          3   2002 Anything
#> 10          2   2000 Anything        0.580          1   2000 Anything
#> # ... with 125 more rows, and 2 more variables: `Other Values` <dbl>,
#> #   dist <dbl>

Created on 2020-12-07 by the reprex package (v0.3.0)
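
If you go with the first approach, one hedged way to pick a single "best match" per group is to rank on the sum of the three distance columns; this assumes equal weighting of the municipality, province, and year distances, which you may well want to change:

stringdist_join(DT1, DT2,
                by = c("Municipality", "Year", "Province"),
                mode = "left",
                ignore_case = TRUE,
                method = "jw",
                max_dist = 10,
                distance_col = "dist") %>%
  group_by(Municipality.x, Year.x, Province.x) %>%
  # Equal weighting of the three distances; adjust the weights to taste.
  # with_ties = FALSE arbitrarily keeps one row when several tie.
  slice_min(Municipality.dist + Province.dist + Year.dist, n = 1, with_ties = FALSE)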

R fuzzy string match to return specific column based on matched string

You are 90% of the way there...

You say you want to

know with which row of data the string was matched from df2

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

In other words, amatch gives you the index of the row in df2 (your table) that is the closest match to each address in df1 (your x). You are throwing that index away prematurely by using it to return the matched address instead.

Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.
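
For example, a quick standalone illustration (toy vectors rather than your data; the returned value is indicative):

library(stringdist)

# amatch() returns a position in `table`, not the matched string itself
amatch("45-B, GALI NO5, XYZ",
       c("abc, pqr, xyz", "786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ"),
       method = "dl", maxDist = 3)
#> [1] 3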

Illustration of both approaches:

library(data.table)  # you forgot this in your example
library(stringdist)

df1 <- data.table(
  Address1 = c("786, GALI NO 5, XYZ", "rambo, 45, strret 4, atlast, pqr",
               "23/4, 23RD FLOOR, STREET 2, ABC-E, PQR", "45-B, GALI NO5, XYZ",
               "HECTIC, 99 STREET, PQR", "786, GALI NO 5, XYZ",
               "rambo, 45, strret 4, atlast, pqr"),
  Year1 = 2001:2007  # already a vector, no need to combine
)
df2 <- data.table(
  Address2 = c("abc, pqr, xyz", "786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ",
               "del, 546, strret2, towards east, pqr", "23/4, STREET 2, PQR",
               "abc, pqr, xyz", "786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ",
               "del, 546, strret2, towards east, pqr", "23/4, STREET 2, PQR"),
  Year2 = 2001:2010
)
df2[, unique_id := sprintf("%06d", .I)]  # use .I, it's neater

# Return the position in strVec of the closest match to str
match_pos <- function(str, strVec, n) {
  amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE)  # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for a left join
# (exclude not only NA_character_ but also empty strings, perhaps even strings of length < 3)
df1[!is.na(Address1) & nchar(Address1) > 0,
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2, 3)]]
merge(df1, df2, by = "unique_id", all.x = TRUE)  # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0,
    df2_pos := match_pos(Address1, df2$Address2, 3)]
df1[!is.na(df2_pos),
    c("Address2", "Year2", "UniqueID") := df2[df2_pos, .(Address2, Year2, unique_id)]][]

Joining two dataframes on a condition (grepl)

TLDR

You just need to fix match_fun:

# ...
match_fun = list(`==`, stringr::str_detect),
# ...


Background

You had the right idea, but you went wrong in your interpretation of the match_fun parameter in fuzzyjoin::fuzzy_join(). Per the documentation, match_fun should be a

Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in by (if a named list, it uses the names in x). If only one function is given it is used on all column pairs.

Solution

A simple correction will do the trick, with further formatting by dplyr. For conceptual clarity, I've typographically aligned the by columns with the functions used to match them:

library(dplyr)

# ...
# Existing code
# ...

joined_dfs <- fuzzy_join(
  df1, df2,

  by        =    c("ages", "fullnames" = "lastnames"),
  #                |----|   |-----------------------|
  match_fun = list(`==`  , stringr::str_detect      ),
  #                |--|     |-----------------|
  # Match by equality ^     ^ Match by detection of `lastnames` in `fullnames`

  mode = "left"
) %>%
  # Format resulting dataset as you requested.
  select(fullnames, ages = ages.x, homestate)

Result

Given your sample data reproduced here

df1 <- data.frame(
  fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
  ages      = c(30, 51, 45, 38, 20)
)

df2 <- data.frame(
  lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
  ages      = c(30, 45, 20, 28, 51, 38),
  homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)

this solution should produce the following data.frame for joined_dfs, formatted as requested:

       fullnames ages homestate
1       Jane Doe   30        NJ
2 Mr. John Smith   51        MA
3 Nate Cox, Esq.   45        CT
4   Bill Lee III   38        NY
5 Ms. Kate Smith   20        MA

Note

Because ages happens to be a unique key in this sample data, the following join on only the *names columns

fuzzy_join(
  df1, df2,
  by = c("fullnames" = "lastnames"),
  match_fun = stringr::str_detect,
  mode = "left"
)

will better illustrate the behavior of matching on substrings:

       fullnames ages.x lastnames ages.y homestate
1       Jane Doe     30       Doe     30        NJ
2 Mr. John Smith     51     Smith     20        MA
3 Mr. John Smith     51     Smith     51        MA
4 Nate Cox, Esq.     45       Cox     45        CT
5   Bill Lee III     38       Lee     38        NY
6 Ms. Kate Smith     20     Smith     20        MA
7 Ms. Kate Smith     20     Smith     51        MA

Where You Went Wrong

Error in Type

The value passed to match_fun should be either (the symbol for) a function

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = grepl
  # ...
)

or a list of such (symbols for) functions:

fuzzyjoin::fuzzy_join(
  # ...
  match_fun = list(`=`, grepl)
  # ...
)

Instead of providing a list of symbols

match_fun = list(`=`, grepl)

you incorrectly provided a vector of character strings:

match_fun = c("=", "grepl()")

Error in Syntax

The user should name the functions

`=`
grepl

yet you incorrectly attempted to call them:

=
grepl()

Naming them will pass the functions themselves to match_fun, as intended, whereas calling them will pass their return values*. In R, an operator like = is named using backticks: `=`.

* Assuming the calls didn't fail with errors. Here, they would fail.

Inappropriate Functions

To compare two values for equality, here the columns df1$ages and df2$ages, you should use the relational operator ==; yet you incorrectly supplied the assignment operator =.

Furthermore grepl() is not vectorized in quite the way match_fun desires. While its second argument (x) is indeed a vector

a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.

its first argument (pattern) is (treated as) a single character string:

character string containing a regular expression (or character string for fixed = TRUE) to be matched in the given character vector. Coerced by as.character to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except for regexpr, gregexpr and regexec.

Thus, grepl() is not a

Vectorized function given two columns...

but rather a function given one string (scalar) and one column (vector) of strings.

The answer to your prayers is not grepl() but rather something like stringr::str_detect(), which is

Vectorised over string and pattern. Equivalent to grepl(pattern, x).

and which wraps stringi::stri_detect().
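
To make the difference concrete, here is a small sketch with toy vectors (not your data):

# grepl() uses only the first element of a length-2 pattern (with a warning),
# so the second name is never checked against "Smith".
grepl(c("Doe", "Smith"), c("Jane Doe", "Mr. John Smith"))
#> Warning: argument 'pattern' has length > 1 and only the first element will be used
#> [1]  TRUE FALSE

# str_detect() is vectorised over both string and pattern, element-wise.
stringr::str_detect(c("Jane Doe", "Mr. John Smith"), c("Doe", "Smith"))
#> [1] TRUE TRUE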

Note

Since you're simply trying to detect whether a literal string in df1$fullnames contains a literal string in df2$lastnames, you don't want to accidentally treat the strings in df2$lastnames as regular-expression patterns. Granted, your df2$lastnames column is unlikely to contain names with special regex characters, with the possible exception of -, and even that is interpreted literally outside of square brackets [], which are themselves very unlikely to appear in a name.

If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed() or stringi::stri_detect_coll(). These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.
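
For example, a sketch of the fixed-string variant, assuming the same df1 and df2 as above:

fuzzy_join(
  df1, df2,
  by = c("ages", "fullnames" = "lastnames"),
  # stri_detect_fixed(str, pattern) treats each lastname as a literal string,
  # not a regular expression, and is vectorised over both arguments.
  match_fun = list(`==`, stringi::stri_detect_fixed),
  mode = "left"
)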

Passing string arguments into functions via for loops

The error is thrown because format() is being applied to a number, so its second positional argument is trim, which here receives the character value "%Y-%m-%d"; trim must be a logical. That is, this code:

format(1, '%Y-%m-%d')

throws:

Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width,
3L, : invalid 'trim' argument
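
To illustrate the dispatch difference (a minimal sketch):

format(as.Date("2011-01-01"), "%Y-%m-%d")  # Date method: second argument is the format string
#> [1] "2011-01-01"
format(1, trim = TRUE)                     # default method: second argument is trim, a logical
#> [1] "1"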

You can use the get() and assign() functions to get and set variables by name. Please see the code below: it calculates, for each zoo series, the average value per weekday across the year, using the wday() function from the lubridate package to identify the weekday.

library(zoo)
library(lubridate)

date.range <- seq.Date(from = as.Date("2011-01-01"),
                       to   = as.Date("2011-12-31"), by = "1 day")
a.zoo <- zoo(1:365, date.range)
b.zoo <- zoo(2:366, date.range)
c.zoo <- zoo(3:367, date.range)
zooNames <- c("a.zoo", "b.zoo", "c.zoo")

for (name in zooNames) {
  z <- get(name)                      # fetch the zoo object by its name
  w_days <- wday(date.range, label = TRUE, locale = "English_United States")  # Windows-style locale name
  agg <- aggregate(z, w_days, mean)   # mean per weekday
  assign(paste0(name, ".avg"), agg)   # create e.g. a.zoo.avg in the workspace
}

a.zoo.avg
b.zoo.avg
c.zoo.avg

Output:

> a.zoo.avg
  Sun   Mon   Tue   Wed   Thu   Fri   Sat
180.5 181.5 182.5 183.5 184.5 185.5 183.0
> b.zoo.avg
  Sun   Mon   Tue   Wed   Thu   Fri   Sat
181.5 182.5 183.5 184.5 185.5 186.5 184.0
> c.zoo.avg
  Sun   Mon   Tue   Wed   Thu   Fri   Sat
182.5 183.5 184.5 185.5 186.5 187.5 185.0

R dplyr join on range of dates

First of all, thank you for trying to help me. I realize my question is incomplete. I moved away from fuzzyjoin because of all the bioconductor dependencies.

I used sqldf instead to accomplish the task:

library(sqldf)

sqldf("SELECT * FROM xxx
       LEFT JOIN yyy
         ON xxx.ID = yyy.ID
        AND xxx.NRA = yyy.NRA
        AND yyy.date BETWEEN xxx.date_low AND xxx.date_high")

The result is almost identical to this question, but I suspect it could also be handled with the data.table non-equi join from Uwe's answer there.

I am also linking this rstudio response
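
For completeness, here is a hedged data.table sketch of the same join, using the column names from the sqldf query above and assuming xxx and yyy start out as data.frames:

library(data.table)
setDT(xxx)
setDT(yyy)

# Non-equi left join: X[i] keeps every row of i, so putting xxx in i
# reproduces the LEFT JOIN above. Note that the naming of the range columns
# in the result differs slightly from the sqldf output.
yyy[xxx, on = .(ID, NRA, date >= date_low, date <= date_high)]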


