R Fuzzy String Match to Return Specific Column Based on Matched String

You are 90% of the way there...

You say you want to "know with which row of data the string was matched from df2".

You just need to understand the code you already have. See ?amatch:

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

In other words, amatch gives you the index of the row in df2 (your table) that is the closest match for each address in df1 (your x). You are throwing this index away too early by returning only the matched address.
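To make the "position, not string" point concrete, here is a minimal sketch with toy vectors (hypothetical values, not your data). It assumes amatch's default "osa" method and a maxDist of 2.

```r
library(stringdist)

# Toy lookup table (hypothetical values, for illustration only)
tbl <- c("hallo", "world")

# amatch returns the POSITION in tbl of the closest match, not the string itself
pos <- amatch(c("hello", "wrld", "xyz"), tbl, maxDist = 2)
pos        # integer positions; NA where nothing is within maxDist

# The position can then be used to look up anything from the matched row
tbl[pos]
```

Because the result is an index, it can be used to subset any column of the table, not just the one that was matched on.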

Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.

Illustration of both approaches:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return the position in strVec of the closest match to str
match_pos <- function(str, strVec, n) {
  amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) & nchar(Address1) > 0, # I would exclude not only NA_character_ but also empty strings, perhaps strings of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2, 3)]]
merge(df1, df2, by = 'unique_id', all.x = TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0,
    df2_pos := match_pos(Address1, df2$Address2, 3)]
df1[!is.na(df2_pos), (c('Address2', 'Year2', 'UniqueID')) := df2[df2_pos, .(Address2, Year2, unique_id)]][]

How to fuzzy match by words (not letters) in R?

This is a tidyverse way to do the join. For each nombre_completo in A, it finds the nombre_completo in B that shares the largest number of words with it.

library(tidyverse)

A1 <- tibble(
  nombre_completo = c("martin gallardo", "raul gimenez")
) %>%
  mutate(
    id_A = row_number()
  )

B1 <- tibble(
  nombre_completo = c("martin ricardo gallardo", "gimenez raul"),
  other_data = c("A", "B")
) %>%
  mutate(
    id_B = row_number()
  )

# One row per word in each name
A2 <- A1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words))

B2 <- B1 %>%
  mutate(
    name_words = str_split(nombre_completo, pattern = " ")
  ) %>%
  unnest(cols = c(name_words)) %>%
  select(name_words, id_B)

# Join on words, count shared words per (A, B) pair, keep the best B for each A
left_join(A2, B2, by = "name_words") %>%
  group_by(nombre_completo, id_A, id_B) %>%
  count() %>% ungroup() %>%
  group_by(nombre_completo, id_A) %>%
  slice_max(order_by = n) %>%
  select("nombre_completo_A" = nombre_completo, id_A, id_B) %>%
  left_join(B1, by = "id_B")
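
The score that the pipeline above maximises is the number of words two names share. As a rough base R sketch of that idea for a single pair of names (shared_words is a hypothetical helper, not part of the answer, and it counts distinct shared words, so it can differ from the join-based count when a name repeats a word):

```r
# Hypothetical helper: number of distinct words two names have in common
shared_words <- function(x, y) {
  length(intersect(strsplit(x, " ", fixed = TRUE)[[1]],
                   strsplit(y, " ", fixed = TRUE)[[1]]))
}

# Word order does not matter, and extra middle names do not hurt the score
shared_words("martin gallardo", "martin ricardo gallardo")
shared_words("raul gimenez", "gimenez raul")
shared_words("raul gimenez", "martin ricardo gallardo")
```

Matching on whole words like this is what makes the approach robust to reordered or inserted name parts, which letter-based edit distances penalise heavily.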

Replace string with most frequent fuzzy match

This dplyr pipeline returns a data frame with 9 rows, one for each unique element of your original words vector. First we group_by the raw column, which creates a group for each unique word; then we filter by your distance threshold; finally we pick the word in clean with the highest frequency in the original dataset. In your example every word matches itself except for the two variants of "dog".

Code

words_df %>%
  group_by(raw) %>%
  filter(dist < 0.085) %>%
  summarize(clean = clean[which.max(count)])

Output

# A tibble: 9 x 2
  raw   clean
  <chr> <chr>
1 cat   cat
2 con   con
3 croak croak
4 cry   cry
5 dog   dog
6 dogg  dog
7 dogy  dog
8 don   don
9 dot   dot
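
The answer does not show how words_df was built. A minimal sketch of one way to get a table with the same columns (raw, clean, dist, count), assuming a Jaro-Winkler distance from stringdist and a made-up words vector; the input values and the construction are assumptions, only the final pipeline comes from the answer:

```r
library(stringdist)
library(dplyr)

# Hypothetical input: "dog" is the most frequent spelling
words <- c("dog", "dog", "dog", "dogg", "dogy", "cat")

# Frequency of each unique word
counts <- as.data.frame(table(words), stringsAsFactors = FALSE)
u <- counts$words

# All (raw, clean) pairs with their distance and the frequency of the candidate
words_df <- expand.grid(raw = u, clean = u, stringsAsFactors = FALSE)
words_df$dist  <- stringdist(words_df$raw, words_df$clean, method = "jw")
words_df$count <- counts$Freq[match(words_df$clean, counts$words)]

# Same pipeline as in the answer
result <- words_df %>%
  group_by(raw) %>%
  filter(dist < 0.085) %>%
  summarize(clean = clean[which.max(count)])
result
```

With these toy counts, "dogg" and "dogy" are each within the threshold of "dog" but not of each other, so both collapse to the more frequent spelling.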

How can I match fuzzy match strings from two datasets?

The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

One-to-one case (requires assignment algorithm):

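When I've had to do this before, I treated it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment is used below). If you want an "optimal" solution, you'd be better off with a dedicated assignment solver (for example the Hungarian algorithm) than with this heuristic; a general-purpose optimizer such as optim is not designed for discrete assignment.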

I'm not familiar with agrep, but here's an example using stringdist for your distance matrix.
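
The input data frames a and b are not shown in this answer. The sketch below uses values reconstructed from the names and row numbers in the printed results further down, so treat them as an assumption rather than the asker's actual data. It also prints the same distances in wide form, which can be easier to read than the long form used in the answer.

```r
library(stringdist)

# Assumed inputs, reconstructed from the printed results (not the original data)
a <- data.frame(name = c("Ace Co", "Bayes", "asd", "Bcy", "Baes", "Bays"),
                stringsAsFactors = FALSE)
b <- data.frame(name = c("Ace Co.", "Bayes Inc.", "asdf"),
                stringsAsFactors = FALSE)

# Wide-form distance matrix: one row per name in a, one column per name in b
dm <- stringdistmatrix(a$name, b$name, method = "jw")
dimnames(dm) <- list(a$name, b$name)
round(dm, 5)
```

Each row's minimum corresponds to the many-to-one match for that name; the one-to-one case additionally requires that no column is used twice.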

library(stringdist)
d <- expand.grid(a$name,b$name) # Distance matrix in long form
names(d) <- c("a_name","b_name")
d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)

# Greedy assignment heuristic (your favorite heuristic here)
greedyAssign <- function(a, b, d) {
  x <- numeric(length(a)) # assignment variable: 0 for unassigned but assignable,
                          # 1 for already assigned, -1 for unassigned and unassignable
  while (any(x == 0)) {
    min_d <- min(d[x == 0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
    a_sel <- a[d == min_d & x == 0][1]
    b_sel <- b[d == min_d & a == a_sel & x == 0][1]
    x[a == a_sel & b == b_sel] <- 1                 # mark the chosen pair as assigned
    x[x == 0 & (a == a_sel | b == b_sel)] <- -1     # rule out other pairs using either name
  }
  cbind(a = a[x == 1], b = b[x == 1], d = d[x == 1])
}
data.frame(greedyAssign(as.character(d$a_name), as.character(d$b_name), d$dist))

Produces the assignment:

       a          b       d
1 Ace Co    Ace Co. 0.04762
2  Bayes Bayes Inc. 0.16667
3    asd       asdf 0.08333

I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

Many-to-one case (not an assignment problem):

do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))

Produces the result:

   a_name     b_name    dist
1  Ace Co    Ace Co. 0.04762
11   Baes Bayes Inc. 0.20000
8   Bayes Bayes Inc. 0.16667
12   Bays Bayes Inc. 0.20000
10    Bcy Bayes Inc. 0.37778
15    asd       asdf 0.08333

Edit: use method="jw" to produce desired results. See help("stringdist-package")

Selecting columns based on matching/fuzzy matching value of two columns

The following approach uses foreach instead of your two nested for loops. With %do% it still runs sequentially, but once a parallel backend is registered you can switch to %dopar%, which should make your computation much faster on large data frames. The package vignettes give a nice overview.

library(foreach)
library(stringdist)

match.cond <- function(ij, df1, df2) { ## 1.
  i <- floor((ij - 1) / nrow(df2)) + 1
  j <- ij - (i - 1) * nrow(df2)
  # ain() is TRUE if the arranger approximately matches any agent in row i
  if (ain(df2$Arranger[j], df1$Agent[[i]], maxDist = 0.3, method = "jw") &&
      df1$Year[i] == df2$Year[j]) {
    return(df2[j, c('Rank', 'Share', 'Issues')])
  }
  # no match: NULL is returned implicitly
}

leadrep <- foreach(ij = 1:(nrow(df1) * nrow(df2)), .combine = rbind) %do% ## 2.
  match.cond(ij, df1, df2)

Notes:

  1. match.cond is your match condition encapsulated into a kernel function that foreach applies to every pairing of rows from df1 and df2. Its inputs are ij, which is an index into the pairings, and the two data frames. Within match.cond:

    • ij is converted to the row indices i for df1 and j for df2.
    • Your condition is evaluated, and if the condition is met,
    • the columns from df2 for the matched row are returned.
  2. This is the foreach call.

    • We loop over the indices ij from 1 to nrow(df1)*nrow(df2), which enumerates all pairings of rows from df1 and df2, and %do% calls match.cond for each one. Note that this is all one statement.
    • The .combine=rbind argument states that we want to gather all results from match.cond and bind them as rows.
    • This returns the data frame leadrep.
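
The index arithmetic in note 1 can be checked on its own. A small sketch, assuming 3 rows in each data frame as in the test data (n1 and n2 are stand-ins for nrow(df1) and nrow(df2)):

```r
n1 <- 3  # stand-in for nrow(df1)
n2 <- 3  # stand-in for nrow(df2)

ij <- seq_len(n1 * n2)          # one index per pairing of rows
i  <- floor((ij - 1) / n2) + 1  # row of df1: changes every n2 steps
j  <- ij - (i - 1) * n2         # row of df2: cycles 1..n2

cbind(ij, i, j)
```

Every combination of i and j appears exactly once, so the single loop over ij visits the same pairs as two nested for loops would.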

I've tested on your data, which I've dput:

df1 <- structure(list(Year = c(1999, 1999, 1998), Agent = list(c("abn-amro-nv", 
"suntrust banks", "wachovia"), c("jp morgan", "abn-amro-nv"),
c("ba-corp", "boston bks", "nbd"))), .Names = c("Year", "Agent"
), row.names = c(NA, -3L), class = "data.frame")

df2 <- structure(list(Rank = 1:3, Arranger = c("jp morgan", "boston-bank",
"suntrust bk"), Share = c(1.2, 1.8, 2.1), Issues = c(7L, 4L,
3L), Year = c(1999L, 1998L, 1999L)), .Names = c("Rank", "Arranger",
"Share", "Issues", "Year"), class = "data.frame", row.names = c(NA,
-3L))

which gives me the desired result:

print(leadrep)
##    Rank Share Issues
## 3     3   2.1      3
## 2     1   1.2      7
## 21    2   1.8      4

Hope this helps.

Limiting the amount of fuzzy string comparisons by comparing by subgroup

You were on the right track - just a few typos/bugs and you need to finish changing/replacing the column names.

Also, in your first one, you will need to figure out how you want to pick the "best match" based on Municipality.dist, Province.dist, and Year.dist.

Maybe the second one works better if you get the years and provinces sorted out first.


DT1 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002), Municipality = c("Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing", "Something", "Anything", "Nothing"), Values = c(0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99, 0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99, 0.59, 0.58, 0.66, 0.53, 0.94, 0.2, 0.86, 0.85, 0.99)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))

DT2 <- structure(list(Province = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3), Year = c(2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002, 2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002), Municipality = c("Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None", "Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None", "Some", "Anything", "Nothing", "Someth.", "Anything", "Not", "Something", "Anything", "None"), `Other Values` = c(0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01, 0.41, 0.42, 0.34, 0.47, 0.0600000000000001, 0.8, 0.14, 0.15, 0.01)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))

library(fuzzyjoin); library(dplyr);

stringdist_join(DT1, DT2,
                by = c("Municipality", "Year", "Province"),
                mode = "left",
                ignore_case = TRUE,
                method = "jw",
                max_dist = 10,
                distance_col = "dist") %>%
  group_by(Municipality.x) %>%
  slice_min(Municipality.dist)
#> # A tibble: 135 x 12
#> # Groups:   Municipality.x [3]
#>    Province.x Year.x Municipality.x Values Province.y Year.y Municipality.y
#>         <dbl>  <dbl> <chr>           <dbl>      <dbl>  <dbl> <chr>
#>  1          1   2000 Anything        0.580          1   2000 Anything
#>  2          1   2000 Anything        0.580          1   2001 Anything
#>  3          1   2000 Anything        0.580          1   2002 Anything
#>  4          1   2000 Anything        0.580          2   2000 Anything
#>  5          1   2000 Anything        0.580          2   2001 Anything
#>  6          1   2000 Anything        0.580          2   2002 Anything
#>  7          1   2000 Anything        0.580          3   2000 Anything
#>  8          1   2000 Anything        0.580          3   2001 Anything
#>  9          1   2000 Anything        0.580          3   2002 Anything
#> 10          1   2001 Anything        0.94           1   2000 Anything
#> # ... with 125 more rows, and 5 more variables: `Other Values` <dbl>,
#> #   Municipality.dist <dbl>, Province.dist <dbl>, Year.dist <dbl>, dist <lgl>

stringdist_join(DT1, DT2,
                by = "Municipality",
                mode = "left",
                ignore_case = TRUE,
                method = "jw",
                max_dist = 10,
                distance_col = "dist") %>%
  group_by(Municipality.x, Year.x, Province.x) %>%
  slice_min(dist)
#> # A tibble: 135 x 9
#> # Groups:   Municipality.x, Year.x, Province.x [27]
#>    Province.x Year.x Municipality.x Values Province.y Year.y Municipality.y
#>         <dbl>  <dbl> <chr>           <dbl>      <dbl>  <dbl> <chr>
#>  1          1   2000 Anything        0.580          1   2000 Anything
#>  2          1   2000 Anything        0.580          1   2001 Anything
#>  3          1   2000 Anything        0.580          1   2002 Anything
#>  4          1   2000 Anything        0.580          2   2000 Anything
#>  5          1   2000 Anything        0.580          2   2001 Anything
#>  6          1   2000 Anything        0.580          2   2002 Anything
#>  7          1   2000 Anything        0.580          3   2000 Anything
#>  8          1   2000 Anything        0.580          3   2001 Anything
#>  9          1   2000 Anything        0.580          3   2002 Anything
#> 10          2   2000 Anything        0.580          1   2000 Anything
#> # ... with 125 more rows, and 2 more variables: `Other Values` <dbl>,
#> #   dist <dbl>

Created on 2020-12-07 by the reprex package (v0.3.0)
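
Both calls above still compare every municipality with every other one and only filter afterwards. If the goal is to limit the number of comparisons, one option is to join exactly on Province and Year first and compute string distances only within each subgroup. The following is a rough sketch of that idea on toy data (the left and right tables are hypothetical and much smaller than DT1 and DT2); it uses dplyr plus stringdist directly rather than fuzzyjoin.

```r
library(dplyr)
library(stringdist)

# Toy data (hypothetical), same column layout as DT1 / DT2
left  <- tibble(Province = c(1, 1, 2), Year = c(2000, 2000, 2001),
                Municipality = c("Something", "Nothing", "Anything"))
right <- tibble(Province = c(1, 1, 2, 2), Year = c(2000, 2000, 2001, 2001),
                Municipality = c("Some", "Not", "Anything", "None"))

# Exact join on the grouping columns, so candidates come from the same
# Province and Year only (newer dplyr versions may warn that this join
# is many-to-many, which is intended here)
matched <- inner_join(left, right, by = c("Province", "Year")) %>%
  # Fuzzy distance within each subgroup; tolower mimics ignore_case = TRUE
  mutate(dist = stringdist(tolower(Municipality.x), tolower(Municipality.y),
                           method = "jw")) %>%
  group_by(Province, Year, Municipality.x) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%  # keep the single closest candidate
  ungroup()
matched
```

The number of distance computations then grows with the size of each Province/Year block rather than with the product of the two full tables.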


