R Merge Data Frames, Allow Inexact ID Matching (e.g. with Additional Characters, 1234 Matches Ab1234)

How to join data frames based on condition between 2 columns

For stuff like this I usually turn to SQL:

library(sqldf)
# Join on equal hours, keeping pairs whose minutes differ by at most 1
x = sqldf("
  SELECT *
  FROM Data1 d1 JOIN Data2 d2
    ON d1.Hour = d2.Hour2
    AND ABS(d1.Minute - d2.Minute2) <= 1
")

Depending on the size of your data, you could also just join on Hour and then filter. Using dplyr:

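library(dplyr)
# The filter also drops unmatched rows (Minute2 is NA there),
# so the result is the same as with an inner join
x = Data1 %>%
  left_join(Data2, by = c("Hour" = "Hour2")) %>%
  filter(abs(Minute - Minute2) <= 1)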

though you could do the same thing with base functions.
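For reference, a minimal base R sketch of the same join-then-filter idea. The values in `Data1` and `Data2` are made up for illustration; only the column names come from the question.

```r
# Toy data (hypothetical values; column names as in the question)
Data1 <- data.frame(Hour = c(9, 9, 10), Minute = c(15, 40, 5))
Data2 <- data.frame(Hour2 = c(9, 10, 10), Minute2 = c(16, 5, 30))

# Join on the hour only; this yields every same-hour pair of rows
joined <- merge(Data1, Data2, by.x = "Hour", by.y = "Hour2")

# Keep only the pairs whose minutes differ by at most 1
x <- joined[abs(joined$Minute - joined$Minute2) <= 1, ]
```

As with the dplyr version, the intermediate `joined` holds all same-hour pairs, so this is only sensible when that intermediate result fits comfortably in memory.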

Check if data.frame is a subset of another data.frame

sapply(
  chk,
  function(v) {
    sum(
      rowSums(sapply(v$a, `==`, lkp$a) &
        sapply(v$b, grepl, x = lkp$b)) > 0
    ) >= nrow(v)
  }
)

or

sapply(
  chk,
  function(v) {
    sum(
      colSums(
        do.call(
          `&`,
          Map(
            function(x, y) outer(x, y, FUN = Vectorize(function(a, b) grepl(a, b))),
            v,
            lkp
          )
        )
      ) > 0
    ) >= nrow(v)
  }
)

which gives

   c1    c2    c3    c4 
 TRUE  TRUE FALSE FALSE 
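To make the idea easier to follow, here is a simpler per-row sketch of such a check. It is not the code above: `is_subset` is a hypothetical helper, the contents of `lkp` and `chk` are invented, and it assumes column `a` must match exactly while `b` only has to occur as a literal substring (`fixed = TRUE`, whereas the snippets above treat `b` as a regex).

```r
# Hypothetical helper: TRUE if every row of v has at least one row in lkp
# with the same `a` and a `b` that contains v's `b` as a literal substring
is_subset <- function(v, lkp) {
  matched <- vapply(seq_len(nrow(v)), function(i) {
    any(lkp$a == v$a[i] & grepl(v$b[i], lkp$b, fixed = TRUE))
  }, logical(1))
  all(matched)
}

# Invented lookup table and candidate data frames
lkp <- data.frame(a = c(1, 2, 3),
                  b = c("ab1234", "cd5678", "ef9012"),
                  stringsAsFactors = FALSE)
chk <- list(
  c1 = data.frame(a = c(1, 2), b = c("1234", "5678"), stringsAsFactors = FALSE),
  c2 = data.frame(a = c(1, 3), b = c("1234", "0000"), stringsAsFactors = FALSE)
)

# One logical per candidate data frame, named after the list elements
res <- vapply(chk, is_subset, logical(1), lkp = lkp)
```

Here `c1` passes because both of its rows find a partner in `lkp`, while `c2` fails on its second row.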

From string to regex to new string

I would use a for loop in this case, but loop over the rows of the countrycode_data data.frame rather than over your own data: it has only some 200 rows, whereas real-world data might be orders of magnitude larger.

Because the column names are long, I first extract the two relevant columns of the country code data:

patt <- countrycode_data$country.name.en.regex[!is.na(countrycode_data$country.name.en.regex)]
name <- countrycode_data$country.name.en[!is.na(countrycode_data$country.name.en.regex)]

Then we can loop to write the new column:

# Start with an all-NA column so that unmatched rows stay NA
df$country <- NA_character_
for (i in seq_along(patt)) {
  df$country[grepl(patt[i], df$string, ignore.case = TRUE, perl = TRUE)] <- name[i]
}

As others have pointed out, North Korea doesn't match with the regex specified in the country code data.
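A self-contained sketch of the same loop, using a small hand-written lookup table in place of the package data so it runs without countrycode installed. The patterns, names and strings below are invented for illustration and are not the package's regexes. As far as I know, recent countrycode releases ship the table as `codelist` rather than `countrycode_data`.

```r
# Hypothetical stand-in for the country code table: one regex per country
lookup <- data.frame(
  pattern = c("united.?states|\\busa\\b",
              "germany|deutschland",
              "^(?!.*north).*korea"),
  name = c("United States", "Germany", "South Korea"),
  stringsAsFactors = FALSE
)

# Invented free-text data to label
df <- data.frame(
  string = c("Trade with the United States",
             "exports to GERMANY rose",
             "Seoul, South Korea",
             "no country mentioned"),
  stringsAsFactors = FALSE
)

# Loop over the (short) lookup table, not over the (long) data
df$country <- NA_character_
for (i in seq_len(nrow(lookup))) {
  hit <- grepl(lookup$pattern[i], df$string, ignore.case = TRUE, perl = TRUE)
  df$country[hit] <- lookup$name[i]
}
```

If a string matches several patterns, the last matching row of the lookup table wins, because later iterations overwrite earlier ones.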

R- Subset a corpus by meta data (id) matching partial strings

It could work like this:

library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3

In your case you would search for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE, which matches the pattern as a literal string rather than as a regular expression).
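A short base R sketch of how the grep options change what gets matched. The ids here are invented; in the corpus case they would come from `meta(corp, "id")`.

```r
# Invented document ids
ids <- c("US-101", "DE-202", "US-303", "BUS-404")

# Literal substring match: "US" anywhere in the id (also hits "BUS-404")
anywhere <- grep("US", ids, value = TRUE, fixed = TRUE)

# Regex match anchored at the start: only ids beginning with "US"
prefix <- grep("^US", ids, value = TRUE)

# fixed = TRUE matters for regex metacharacters: "." is matched literally here
literal_dot <- grep(".", c("a.b", "ab"), value = TRUE, fixed = TRUE)
```

Which variant fits depends on whether the ids may contain the search string in the middle; an anchored regex is stricter than a fixed substring match.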
