Passing arguments into multiple match_fun functions in R fuzzyjoin::fuzzy_join
I think the error is because the arguments passed into each of the multiple match_fun's mess it up i.e. can't pass extra arguments like ignore_case
, originally intended for just the string_dist match_fun, into a match_fun of >=
The solution would be to define my own match_fun's with fixed parameters for arguments. See below where I define my own match_fun_stringdist with fixed parameters. I also implemented it here in another question/answer https://stackoverflow.com/a/44383103/4663008.
# First, need to define match_fun_stringdist
# Code from stringdist_join from https://github.com/dgrtwo/fuzzyjoin
match_fun_stringdist <- function(v1, v2) {
# Can't pass these parameters in from fuzzy_join because of multiple incompatible match_funs, so I set them here.
ignore_case = FALSE
method = "dl"
max_dist = 99
distance_col = "dist"
if (ignore_case) {
v1 <- stringr::str_to_lower(v1)
v2 <- stringr::str_to_lower(v2)
}
# shortcut for Levenshtein-like methods: if the difference in
# string length is greater than the maximum string distance, the
# edit distance must be at least that large
# length is much faster to compute than string distance
if (method %in% c("osa", "lv", "dl")) {
length_diff <- abs(stringr::str_length(v1) - stringr::str_length(v2))
include <- length_diff <= max_dist
dists <- rep(NA, length(v1))
dists[include] <- stringdist::stringdist(v1[include], v2[include], method = method)
} else {
# have to compute them all
dists <- stringdist::stringdist(v1, v2, method = method)
}
ret <- dplyr::data_frame(include = (dists <= max_dist))
if (!is.null(distance_col)) {
ret[[distance_col]] <- dists
}
ret
}
and call fuzzy_join
fuzzy_join(data1, data2,
by = list(x = c("Address1", "AREACODE", "Year1"), y = c("Address2", "AREA_CODE", "Year2")),
match_fun = list(match_fun_stringdist, `==`, `<=`),
mode = "left")
Limiting the amount of fuzzy string comparisons by comparing by subgroup
You were on the right track - just a few typos/bugs and you need to finish changing/replacing the column names.
Also, in your first one, you will need to figure out how you want to pick the "best match" based on Municipality.dist, Province.dist, and Year.dist.
Maybe the second one works better if you get the years and provinces sorted out first.
DT1 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002), Municipality = c("Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing"), Values = c(0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99, 0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99, 0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))
DT2 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002), Municipality = c("Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None", "Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None", "Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None"), `Other Values` = c(0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))
library(fuzzyjoin); library(dplyr);
stringdist_join(DT1, DT2,
by = c("Municipality", "Year", "Province"),
mode = "left",
ignore_case = TRUE,
method = "jw",
max_dist = 10,
distance_col = "dist") %>%
group_by(Municipality.x) %>%
slice_min(Municipality.dist)
#> # A tibble: 135 x 12
#> # Groups: Municipality.x [3]
#> Province.x Year.x Municipality.x Values Province.y Year.y Municipality.y
#> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 2000 Anything 0.580 1 2000 Anything
#> 2 1 2000 Anything 0.580 1 2001 Anything
#> 3 1 2000 Anything 0.580 1 2002 Anything
#> 4 1 2000 Anything 0.580 2 2000 Anything
#> 5 1 2000 Anything 0.580 2 2001 Anything
#> 6 1 2000 Anything 0.580 2 2002 Anything
#> 7 1 2000 Anything 0.580 3 2000 Anything
#> 8 1 2000 Anything 0.580 3 2001 Anything
#> 9 1 2000 Anything 0.580 3 2002 Anything
#> 10 1 2001 Anything 0.94 1 2000 Anything
#> # ... with 125 more rows, and 5 more variables: `Other Values` <dbl>,
#> # Municipality.dist <dbl>, Province.dist <dbl>, Year.dist <dbl>, dist <lgl>
stringdist_join(DT1, DT2,
by = "Municipality",
mode = "left",
ignore_case = TRUE,
method = "jw",
max_dist = 10,
distance_col = "dist") %>%
group_by(Municipality.x, Year.x, Province.x) %>%
slice_min(dist)
#> # A tibble: 135 x 9
#> # Groups: Municipality.x, Year.x, Province.x [27]
#> Province.x Year.x Municipality.x Values Province.y Year.y Municipality.y
#> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 2000 Anything 0.580 1 2000 Anything
#> 2 1 2000 Anything 0.580 1 2001 Anything
#> 3 1 2000 Anything 0.580 1 2002 Anything
#> 4 1 2000 Anything 0.580 2 2000 Anything
#> 5 1 2000 Anything 0.580 2 2001 Anything
#> 6 1 2000 Anything 0.580 2 2002 Anything
#> 7 1 2000 Anything 0.580 3 2000 Anything
#> 8 1 2000 Anything 0.580 3 2001 Anything
#> 9 1 2000 Anything 0.580 3 2002 Anything
#> 10 2 2000 Anything 0.580 1 2000 Anything
#> # ... with 125 more rows, and 2 more variables: `Other Values` <dbl>,
#> # dist <dbl>
Created on 2020-12-07 by the reprex package (v0.3.0)
R fuzzy string match to return specific column based on matched string
You are 90% of the way there...
You say you want to
know with which row of data the string was matched from df2
You just need to understand the code you already have. See ?amatch
:
amatch
returns the position of the closest match ofx
intable
. When multiple matches with the same smallest distance metric exist, the first one is returned.
In other words, amatch
gives you the index for the row in df2
(which is your table
) that is the closest match of each address in df1
(which is your x
). You are prematurely wrapping this index by returning the new address instead.
Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.
Illustration of both approaches:
library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater
# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
amatch(str, strVec, method = "dl", maxDist=n,useBytes = T) # are you sure you want useBytes = TRUE?
}
# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) | nchar(Address1>0), # I would exclude only on NA_character_ but also empty string, perhaps string of length < 3
unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options
# Option 2: use the row index
df1[!is.na(Address1) | nchar(Address1>0),
df2_pos := match_pos(Address1, df2$Address2,3) ]
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]
Joining two dataframes on a condition (grepl)
TLDR
You just need to fix match_fun
:
# ...
match_fun = list(`==`, stringr::str_detect),
# ...
Background
You had the right idea, but you went wrong in your interpretation of the match_fun
parameter in fuzzyjoin::fuzzy_join()
. Per the documentation, match_fun
should be a
Vectorized function given two columns, returning TRUE or FALSE as to whether they are a match. Can be a list of functions one for each pair of columns specified in
by
(if a named list, it uses the names in x). If only one function is given it is used on all column pairs.
Solution
A simple correction will do the trick, with further formatting by dplyr
. For conceptual clarity, I've typographically aligned the by
columns with the function
s used to match them:
library(dplyr)
# ...
# Existing code
# ...
joined_dfs <- fuzzy_join(
df1, df2,
by = c("ages", "fullnames" = "lastnames"),
# |----| |-----------------------|
match_fun = list(`==` , stringr::str_detect ),
# |--| |-----------------|
# Match by equality ^ ^ Match by detection of `lastnames` in `fullnames`
mode = "left"
) %>%
# Format resulting dataset as you requested.
select(fullnames, ages = ages.x, homestate)
Result
Given your sample data reproduced here
df1 <- data.frame(
fullnames = c("Jane Doe", "Mr. John Smith", "Nate Cox, Esq.", "Bill Lee III", "Ms. Kate Smith"),
ages = c(30, 51, 45, 38, 20)
)
df2 <- data.frame(
lastnames = c("Doe", "Cox", "Smith", "Jung", "Smith", "Lee"),
ages = c(30, 45, 20, 28, 51, 38),
homestate = c("NJ", "CT", "MA", "RI", "MA", "NY")
)
this solution should produce the following data.frame
for joined_dfs
, formatted as requested:
fullnames ages homestate
1 Jane Doe 30 NJ
2 Mr. John Smith 51 MA
3 Nate Cox, Esq. 45 CT
4 Bill Lee III 38 NY
5 Ms. Kate Smith 20 MA
Note
Because each ages
is coincidentally a unique key, the following join on only *names
fuzzy_join(
df1, df2,
by = c("fullnames" = "lastnames"),
match_fun = stringr::str_detect,
mode = "left"
)
will better illustrate the behavior of matching on substrings:
fullnames ages.x lastnames ages.y homestate
1 Jane Doe 30 Doe 30 NJ
2 Mr. John Smith 51 Smith 20 MA
3 Mr. John Smith 51 Smith 51 MA
4 Nate Cox, Esq. 45 Cox 45 CT
5 Bill Lee III 38 Lee 38 NY
6 Ms. Kate Smith 20 Smith 20 MA
7 Ms. Kate Smith 20 Smith 51 MA
Where You Went Wrong
Error in Type
The value passed to match_fun
should be either (the symbol
for) a function
fuzzyjoin::fuzzy_join(
# ...
match_fun = grepl
# ...
)
or a list
of such (symbol
s for) function
s:
fuzzyjoin::fuzzy_join(
# ...
match_fun = list(`=`, grepl)
# ...
)
Instead of providing a list
of symbol
s
match_fun = list(=, grepl)
you incorrectly provided a vector
of character
strings:
match_fun = c("=", "grepl()")
Error in Syntax
The user should name the function
s
`=`
grepl
yet you incorrectly attempted to call them:
=
grepl()
Naming them will pass the function
s themselves to match_fun
, as intended, whereas calling them will pass their return values*. In R, an operator like =
is named using backticks: `=`
.
* Assuming the calls didn't fail with errors. Here, they would fail.
Inappropriate Functions
To compare two values for equality, here the character
vectors df1$fullnames
and df2$lastnames
, you should use the relational operator ==
; yet you incorrectly supplied the assignment operator =
.
Furthermore grepl()
is not vectorized in quite the way match_fun
desires. While its second argument (x
) is indeed a vector
a character vector where matches are sought, or an object which can be coerced by as.character to a character vector. Long vectors are supported.
its first argument (pattern
) is (treated as) a single character
string:
character string containing a regular expression (or character string for
fixed = TRUE
) to be matched in the given character vector. Coerced byas.character
to a character string if possible. If a character vector of length 2 or more is supplied, the first element is used with a warning. Missing values are allowed except forregexpr
,gregexpr
andregexec
.
Thus, grepl()
is not a
Vectorized function given two columns...
but rather a function
given one string (scalar) and one column (vector) of strings.
The answer to your prayers is not grepl()
but rather something like stringr::str_detect()
, which is
Vectorised over
string
andpattern
. Equivalent togrepl(pattern, x)
.
and which wraps stringi::stri_detect()
.
Note
Since you're simply trying to detect whether a literal string in df1$fullnames
contains a literal string in df2$lastnames
, you don't want to accidentally treat the strings in df2$lastnames
as regular expression patterns. Now your df2$lastnames
column is statistically unlikely to contain names with special regex characters; with the lone exception of -
, which is interpreted literally outside of []
, which are very unlikely to be found in a name.
If you're still worried about accidental regex, you might want to consider alternative search methods with stringi::stri_detect_fixed()
or stringi::stri_detect_coll()
. These perform literal matching, respectively by either byte or "canonical equivalence"; the latter adjusts for locale and special characters, in keeping with natural language processing.
Passing string arguments into functions via for loops
The reason of the error thrown that you are trying to apply format
function withtrim
argument assigned to character
type value, namely %Y-%m-%d
. However trim
should be of boolean
type, i.e. code:
format(1, '%Y-%m-%d')
throws:
Error in prettyNum(.Internal(format(x, trim, digits, nsmall, width,
3L, : invalid 'trim' argument
You can use get
and assign
function to get and set variables. Please see the code below. The algorithm below calculates averaged data for the week day during a year. For the weekday identification lubridate
package wday
function was used.
library(zoo)
library(lubridate)
date.range <- seq.Date(from=as.Date('2011-01-01'),
to=as.Date('2011-12-31'), by='1 day')
a.zoo <- zoo(1:365,date.range);
b.zoo <- zoo(2:366, date.range);
c.zoo <- zoo(3:367, date.range)
zooNames <- c("a.zoo", "b.zoo", "c.zoo")
for(name in zooNames){
z <- get(name)
w_days <- wday(date.range, label = TRUE, locale = "English_United States")
agg <- aggregate(z, w_days, mean)
assign(paste0(name, ".avg"), agg)
}
a.zoo.avg
b.zoo.avg
c.zoo.avg
Output:
> a.zoo.avg
Sun Mon Tue Wed Thu Fri Sat
180.5 181.5 182.5 183.5 184.5 185.5 183.0
> a.zoo.avg
Sun Mon Tue Wed Thu Fri Sat
180.5 181.5 182.5 183.5 184.5 185.5 183.0
> b.zoo.avg
Sun Mon Tue Wed Thu Fri Sat
181.5 182.5 183.5 184.5 185.5 186.5 184.0
> c.zoo.avg
Sun Mon Tue Wed Thu Fri Sat
182.5 183.5 184.5 185.5 186.5 187.5 185.0
R dplyr join on range of dates
First of all, thank you for trying to help me. I realize my question is incomplete. I moved away from fuzzyjoin
because of all the bioconductor
dependencies.
I used sqldf
instead to accomplish the task:
library(sqldf)
sqldf("SELECT * FROM xxx
LEFT JOIN yyy
ON xxx.ID = yyy.ID
AND xxx.NRA = yyy.NRA
AND yyy.date BETWEEN xxx.date_low AND xxx.date_high")
The result is almost identical to this question but I suspect it can also be solved with that question as per Uwe's data.table
solution.
I am also linking this rstudio response
Related Topics
How to Create a Variable of Rownames
Removing Traces by Name Using Plotlyproxy (Or Accessing Output Schema in Reactive Context)
Converting a "Map" Object to a "Spatialpolygon" Object
Binning Data, Finding Results by Group, and Plotting Using R
Compute All Pairwise Differences Within a Vector in R
Add a Vector to All Rows of a Matrix
How to Load a Matlab Struct into a R Data Frame
Efficient Multiplication of Columns in a Data Frame
Calculate Summary Statistics (E.G. Mean) on All Numeric Columns Using Data.Table
Flexdashboard - Change Title Bar Color
How to Output a Stem and Leaf Plot as a Plot
How to Create a Plot with Customized Points in R
Are Factors Stored More Efficiently in Data.Table Than Characters