Find Matching Strings Between Two Vectors in R

Simple solution:

streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße")
streets = tolower(streets) #Lowercase all
names = c("Berber", "Weg")
names = tolower(names)

sapply(names, function (y) sapply(streets, function (x) grepl(y, x)))

#                   berber   weg
#berberichweg         TRUE  TRUE
#otto-klemperer-weg  FALSE  TRUE
#feldmeierbogen      FALSE FALSE
#altostraße          FALSE FALSE
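
Because grepl is vectorized over the strings being searched, the inner sapply can be dropped. The following is a minimal sketch rather than part of the original answer: the names patterns, hits and matched are illustrative, and fixed = TRUE assumes the search terms should be treated as literal text, not as regular expressions.

```r
streets  <- tolower(c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen", "Altostraße"))
patterns <- tolower(c("Berber", "Weg"))

# One grepl call per pattern; each call tests all streets at once.
hits <- sapply(patterns, grepl, x = streets, fixed = TRUE)
rownames(hits) <- streets   # label the rows so the matrix is readable

# Streets that contain at least one of the search terms
matched <- streets[rowSums(hits) > 0]
matched
# [1] "berberichweg"       "otto-klemperer-weg"
```

Using rowSums(hits) > 0 keeps a street if any term matches; rowSums(hits) == length(patterns) would instead require every term to match.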

Find partial matching strings between two vectors in R

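I would use the function adist, which ships with base R in the utils package (the separate stringdist package offers similar functions such as stringdist and stringdistmatrix).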

Minimal working example:

Create a vector of non-sense words and call the vector a:

a <- c("gkhk", "ololsol", "tyuil", "tyuio", "etytyuli")

Modify some of the words (with more or less degree of modification) and call that vector b:

b <- c("gwrwkhk", "olseotyuioplsol", "thsyuil", "tasyuio", "etytyuli")

Then calculate the edit distance between every pair of elements:

yourdistance <- adist(x = a, y = b, ignore.case = TRUE)

yourdistance is a matrix of generalized Levenshtein (edit) distances: the entry in row i and column j is the distance between a[i] and b[j].

     [,1] [,2] [,3] [,4] [,5]
[1,]    3   15    7    7    8
[2,]    7    8    6    7    7
[3,]    7   10    2    3    5
[4,]    7   10    3    2    5
[5,]    8   11    5    5    0

For example, the distance between "etytyuli" in a (row 5) and "etytyuli" in b (column 5) is 0, because that string is identical in both vectors.

Once you have this matrix you can decide what is "close enough" for you and select only those elements. You can also adjust the costs argument, which lets you assign different costs to insertions, deletions and substitutions.
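
One way to turn the matrix into matches is to take, for each element of a, the closest element of b and keep it only if the distance is below a cutoff. This is a sketch under assumptions, not the original author's code: the cutoff of 3 is arbitrary, and the names d, cutoff, nearest and best are illustrative.

```r
a <- c("gkhk", "ololsol", "tyuil", "tyuio", "etytyuli")
b <- c("gwrwkhk", "olseotyuioplsol", "thsyuil", "tasyuio", "etytyuli")

d <- adist(a, b, ignore.case = TRUE)

cutoff  <- 3                         # arbitrary threshold, chosen for illustration
nearest <- apply(d, 1, which.min)    # column index of the closest element of b

# Keep the closest candidate only when it is within the cutoff
best <- ifelse(d[cbind(seq_along(a), nearest)] <= cutoff, b[nearest], NA_character_)
data.frame(a = a, best_match = best)

# Unequal edit costs: here a substitution costs twice as much as an
# insertion or a deletion (the weights are an assumption for illustration).
d_weighted <- adist(a, b, costs = c(insertions = 1, deletions = 1, substitutions = 2))
```

With this cutoff, "ololsol" is left unmatched because even its nearest candidate is too far away.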

You might want to learn more about this at:

https://www.r-bloggers.com/fuzzy-string-matching-a-survival-skill-to-tackle-unstructured-information/

Hope it helps.

R partial string matching between two elements of two vectors, anywhere within element

In base R, use sapply and then use max.col to look at which value was matched:

max.col(sapply(a, grepl, b))
#[1] 4 3 2

This works because the core sapply part returns this matrix:

sapply(a, grepl, b)
#        R2    R3   N_3    R1
#[1,] FALSE FALSE FALSE  TRUE
#[2,] FALSE FALSE  TRUE FALSE
#[3,] FALSE  TRUE FALSE FALSE
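
The snippet above does not show its inputs, so here is a self-contained sketch with assumed data. The vector a is inferred from the column names in the output; b is hypothetical. Two details are worth making explicit: max.col breaks ties at random by default, and a row with no TRUE at all still gets a column index, so such rows are masked with NA here.

```r
a <- c("R2", "R3", "N_3", "R1")                  # patterns (inferred from the column names above)
b <- c("xx_R1", "yy_N_3", "zz_R3", "no_match")   # hypothetical strings to search

m <- sapply(a, grepl, b)    # rows follow b, columns follow a

idx <- max.col(m, ties.method = "first")   # deterministic instead of the random default
idx[rowSums(m) == 0] <- NA                 # rows without any match are not real hits

idx
# [1]  4  3  2 NA
a[idx]
# [1] "R1"  "N_3" "R3"  NA
```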

match two vectors by similar characters/strings in R

You could try:

s <- which(adist(v1, v2) <= 1, arr.ind = TRUE) # 1 is the maximum allowed change
data.frame(v1, v2 = replace(rep(NA_character_, length(v1)), s[, 1], v2[s[, 2]]))
#       v1    v2
# 1 yellow  <NA>
# 2    red  redx
# 3 orange  <NA>
# 4   blue blues
# 5  green grean
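
For a version that runs as-is, the sketch below supplies inputs. v1 is taken from the output shown above; v2 is an assumption (the original does not show it), as are the names s, matched and res. The result vector is pre-sized to length(v1) so that the data frame still builds when the last elements of v1 have no match.

```r
v1 <- c("yellow", "red", "orange", "blue", "green")
v2 <- c("grean", "blues", "redx", "purple")   # hypothetical candidates

# Row/column index pairs where the edit distance is at most 1
s <- which(adist(v1, v2) <= 1, arr.ind = TRUE)

matched <- rep(NA_character_, length(v1))     # one slot per element of v1
matched[s[, "row"]] <- v2[s[, "col"]]

res <- data.frame(v1 = v1, v2 = matched, stringsAsFactors = FALSE)
res
```

If an element of v1 is within the threshold of several candidates, the last pair listed in s wins; choosing the minimum distance per row would be a stricter alternative.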

Matching words from vectors of strings in R

Here is a flexible solution using regex_full_join from the fuzzyjoin package:

library( fuzzyjoin )
library( data.table )
#make data.frames
messy.df <- data.frame( messy ); approved.df <- data.frame( approved )
#create regexes
messy.df$regex <- gsub( " ", "|", messy.df$messy )
#regex join
ans <- regex_full_join( approved.df, messy.df, by = c("approved" = "regex") )
#cast to wide
dcast( setDT(ans), messy~approved, value.var = "messy")[, -1]

#    Cotswold Water Park Pit 14 Cotswold Water Park Pit 28 Robinswood Hill
# 1:                         14                       <NA>            <NA>
# 2:                       <NA>                         28            <NA>
# 3:                 CWP Pit 28                 CWP Pit 28            <NA>
# 4:                Cotswold 28                Cotswold 28            <NA>
# 5:                     Pit 28                     Pit 28            <NA>
# 6:                       <NA>                       <NA>      Robinswood
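
To see what the regex step contributes without installing any packages, the matching can be approximated in base R. This is a sketch, not the fuzzyjoin implementation: the approved and messy vectors are reconstructed from the output above and are therefore assumptions, and the names regex and hits are illustrative.

```r
# Assumed inputs, reconstructed from the printed result
approved <- c("Cotswold Water Park Pit 14", "Cotswold Water Park Pit 28", "Robinswood Hill")
messy    <- c("14", "28", "CWP Pit 28", "Cotswold 28", "Pit 28", "Robinswood")

# "Pit 28" becomes "Pit|28": a match on any single word counts
regex <- gsub(" ", "|", messy, fixed = TRUE)

# hits[i, j] is TRUE when approved[i] matches the regex built from messy[j]
hits <- sapply(regex, grepl, x = approved)
dimnames(hits) <- list(approved, messy)
t(hits)
```

Because any one word is enough, "Pit 28" matches both pit names. Messy strings that contain regex metacharacters such as "(" or "." would need escaping before being used this way.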

Compare two character vectors in R

Here are some basics to try out:

> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog" "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"

Similarly, you could get counts simply as:

> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2

How to find the exact match between 2 vectors?

which(y %in% to_find)
# [1] 4 9 10 15 18
which(to_find %in% y)
# [1] 1 2 3 4 5
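
The inputs are not shown above, so the following sketch uses hypothetical vectors to contrast the two common lookups: which(... %in% ...) returns every position in y that holds a wanted value, while match returns, for each wanted value, the position of its first occurrence in y (or NA).

```r
y       <- c("a", "b", "c", "b", "d")   # hypothetical data
to_find <- c("b", "d", "z")             # hypothetical lookup values

which(y %in% to_find)   # all positions in y holding any wanted value
# [1] 2 4 5

match(to_find, y)       # first position of each wanted value, NA if absent
# [1]  2  5 NA
```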

Match vectors by pattern

How about using nested sapply:

sapply(x, function(el) type[sapply(type, function(prefix) grepl(paste0("^", prefix), el))])
#   value   class  value2  value3  class2
# "value" "class" "value" "value" "class"

Or if you have unmatched classes:

sapply(x, function(el) {
  z <- type[sapply(type, function(prefix) grepl(paste0("^", prefix), el))]
  if (length(z) > 0) z[1] else NA_character_ # first matching prefix, or NA if none
})
#   value   class  value2  value3  class2   other
# "value" "class" "value" "value" "class"      NA

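Since the patterns here are plain prefixes, base R's startsWith (available since R 3.3.0) can replace the regular expression, which avoids having to escape special characters. This is a sketch with assumed inputs: x and type are reconstructed from the output above, and prefix_of is an illustrative name.

```r
x    <- c("value", "class", "value2", "value3", "class2", "other")  # assumed input
type <- c("value", "class")                                        # assumed prefixes

prefix_of <- vapply(x, function(el) {
  hit <- type[startsWith(el, type)]               # prefixes that el starts with
  if (length(hit) > 0) hit[1] else NA_character_  # first hit, or NA if none
}, character(1))

prefix_of
```

vapply is used instead of sapply so that the result is guaranteed to be a character vector, one entry per element of x.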
