Find matching strings between two vectors in R
Simple solution:
streets = c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen" , "Altostraße")
streets = tolower(streets) #Lowercase all
names = c("Berber", "Weg")
names = tolower(names)
sapply(names, function (y) sapply(streets, function (x) grepl(y, x)))
#                    berber   weg
#berberichweg          TRUE  TRUE
#otto-klemperer-weg   FALSE  TRUE
#feldmeierbogen       FALSE FALSE
#altostraße           FALSE FALSE
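Because grepl is vectorised over the strings it searches, the inner sapply is not strictly needed. The sketch below is one possible simplification, not the original answer's code; the variable names patterns and hits are introduced here for illustration, and fixed = TRUE is an added assumption that the search terms are literal text rather than regular expressions.

```r
streets  <- tolower(c("Berberichweg", "Otto-Klemperer-Weg", "Feldmeierbogen", "Altostraße"))
patterns <- tolower(c("Berber", "Weg"))   # hypothetical name, replaces `names` above

# One sapply() over the patterns is enough: grepl() checks a single pattern
# against every street at once. fixed = TRUE matches the pattern literally.
hits <- sapply(patterns, function(p) grepl(p, streets, fixed = TRUE))
rownames(hits) <- streets

hits            # logical matrix: one row per street, one column per pattern
colSums(hits)   # number of streets matched by each pattern
```

Dropping fixed = TRUE restores regular-expression matching, which is what the original call uses.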
Find partial matching strings between two vectors in R
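I would use the function adist, which lives in the utils package that ships with base R (the separate stringdist package offers similar functions but is not needed here).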
Minimal working example:
Create a vector of nonsense words and call the vector a:
a <- c("gkhk", "ololsol", "tyuil", "tyuio", "etytyuli")
Modify some of the words (with more or less degree of modification) and call that vector b:
b <- c("gwrwkhk", "olseotyuioplsol", "thsyuil", "tasyuio", "etytyuli")
Then calculate the distance between the elements:
yourdistance <- adist(x = a, y = b, ignore.case = TRUE)
yourdistance will be a matrix of edit (generalized Levenshtein) distances, with one row per element of a and one column per element of b:
     [,1] [,2] [,3] [,4] [,5]
[1,]    3   15    7    7    8
[2,]    7    8    6    7    7
[3,]    7   10    2    3    5
[4,]    7   10    3    2    5
[5,]    8   11    5    5    0
For example, the distance between "etytyuli" in a [5,] and "etytyuli" in b [,5] will be 0 because I did not modify that string from a to b.
Once you have this matrix you can decide what is "close enough" for you and select only those elements. You can also play with the costs argument, which lets you assign different costs to insertions, deletions and substitutions.
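As a rough sketch of the selection step, the code below picks the closest element of b for each element of a and then keeps only pairs under a cutoff. The cutoff of 3 edits and the names d, best, closest, max_dist and d_weighted are assumptions made for illustration; the weighting at the end is just one example of how the cost of each edit type could be changed.

```r
a <- c("gkhk", "ololsol", "tyuil", "tyuio", "etytyuli")
b <- c("gwrwkhk", "olseotyuioplsol", "thsyuil", "tasyuio", "etytyuli")

d <- adist(a, b, ignore.case = TRUE)

# For each element of a (each row), the column index of the nearest element of b.
best <- apply(d, 1, which.min)

# Pair every element of a with its nearest b and record the distance.
closest <- data.frame(a = a,
                      b = b[best],
                      distance = d[cbind(seq_along(a), best)])

# Keep only pairs within a hypothetical threshold of 3 edits.
max_dist <- 3
closest[closest$distance <= max_dist, ]

# Example weighting: substitutions cost twice as much as insertions or deletions.
d_weighted <- adist(a, b, ignore.case = TRUE,
                    costs = c(insertions = 1, deletions = 1, substitutions = 2))
```

which.min returns the first minimum in a row, so ties are resolved in favour of the earlier element of b.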
You might want to learn more about this at:
https://www.r-bloggers.com/fuzzy-string-matching-a-survival-skill-to-tackle-unstructured-information/
Hope it helps.
R partial string matching between two elements of two vectors, anywhere within element
In base R, use sapply and then use max.col to look at which value was matched:
max.col(sapply(a, grepl, b))
#[1] 4 3 2
This works because the core sapply part returns this matrix:
sapply(a, grepl, b)
# R2 R3 N_3 R1
#[1,] FALSE FALSE FALSE TRUE
#[2,] FALSE FALSE TRUE FALSE
#[3,] FALSE TRUE FALSE FALSE
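One caveat with this approach: max.col always returns a column, even for a row with no TRUE at all, and by default it breaks ties at random. The sketch below shows one way to guard against both. The vectors a and b here are made-up stand-ins (the original data is not shown above), and fixed = TRUE is an added assumption that the patterns are literal strings.

```r
# Hypothetical data: each element of a is a pattern searched for in b.
a <- c("R2", "R3", "N_3", "R1")
b <- c("xx_R1_yy", "N_3_zz", "aa_R3", "no_match")

# Rows follow b, columns follow a.
m <- sapply(a, grepl, b, fixed = TRUE)

# ties.method = "first" makes the result deterministic.
idx <- max.col(m, ties.method = "first")

# Rows without any TRUE would otherwise point at an arbitrary column.
idx[rowSums(m) == 0] <- NA

data.frame(b = b, matched = a[idx])
```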
Match two vectors by similar characters/strings in R
You could try:
s <- which(adist(v1,v2) <= 1, TRUE) # 1 is the maximum allowed change
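# start from a full-length NA vector so trailing unmatched elements of v1 are kept
data.frame(v1, v2 = replace(rep(NA, length(v1)), s[, 1], v2[s[, 2]]))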
      v1    v2
1 yellow  <NA>
2    red  redx
3 orange  <NA>
4   blue blues
5  green grean
Matching words from vectors of strings in R
Here is a highly flexible regex_join solution:
library( fuzzyjoin )
library( data.table )
#make data.frames
messy.df <- data.frame( messy ); approved.df <- data.frame( approved )
#create regexes
messy.df$regex <- gsub( " ", "|", messy.df$messy )
#regex join
ans <- regex_full_join( approved.df, messy.df, by = c("approved" = "regex") )
#cast to wide
dcast( setDT(ans), messy~approved, value.var = "messy")[, -1]
#    Cotswold Water Park Pit 14 Cotswold Water Park Pit 28 Robinswood Hill
# 1:                         14                       <NA>            <NA>
# 2:                       <NA>                         28            <NA>
# 3:                 CWP Pit 28                 CWP Pit 28            <NA>
# 4:                Cotswold 28                Cotswold 28            <NA>
# 5:                     Pit 28                     Pit 28            <NA>
# 6:                       <NA>                       <NA>      Robinswood
Compare two character vectors in R
Here are some basics to try out:
> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog" "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"
Similarly, you could get counts simply as:
> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2
How to find the exact match between 2 vectors?
which(y %in% to_find)
# [1] 4 9 10 15 18
which(to_find %in% y)
# [1] 1 2 3 4 5
Match vectors by pattern
How about using nested sapply:
sapply(x, function(x) type[sapply(type, function(y) grepl(paste0("^", y), x))])
value class value2 value3 class2
"value" "class" "value" "value" "class"
Or if you have unmatched classes:
sapply(x, function(x) { z <- type[sapply(type, function(y) grepl(paste0("^", y), x))]; ifelse(length(z) > 0, z, NA) })
value class value2 value3 class2 other
"value" "class" "value" "value" "class" NA