R Merge Data Frames, Allow Inexact ID Matching (e.g. with Additional Characters 1234 Matches Ab1234)

How to join data frames based on condition between 2 columns

For stuff like this I usually turn to SQL:

library(sqldf)
x = sqldf("
  SELECT *
  FROM Data1 d1 JOIN Data2 d2
  ON d1.Hour = d2.Hour2
  AND ABS(d1.Minute - d2.Minute2) <= 1
")

Depending on the size of your data, you could also just join on Hour and then filter. Using dplyr:

library(dplyr)
x = Data1 %>%
  # rows without a partner are dropped by the filter anyway, so an inner join suffices
  inner_join(Data2, by = c("Hour" = "Hour2")) %>%
  filter(abs(Minute - Minute2) <= 1)

though you could do the same thing with base functions.
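A minimal base R sketch of the same join-then-filter idea, using merge() and logical indexing. The column names follow the answer above; the values in Data1 and Data2 are made up for illustration.

```r
# Toy data: column names as in the answer, values invented for this sketch.
Data1 <- data.frame(Hour = c(1, 1, 2), Minute = c(10, 30, 15))
Data2 <- data.frame(Hour2 = c(1, 2, 2), Minute2 = c(11, 15, 40))

# Join on the hour only (inner join by default) ...
joined <- merge(Data1, Data2, by.x = "Hour", by.y = "Hour2")

# ... then keep the pairs whose minutes differ by at most 1.
x <- joined[abs(joined$Minute - joined$Minute2) <= 1, ]
x
```

As with the dplyr version, the intermediate join can get large if many rows share the same hour, so this is mainly suitable for small to medium data.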

Check if data.frame is a subset of another data.frame

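# chk: list of data frames to test; lkp: lookup data frame (both with columns a and b).
# Column a must match exactly; each v$b is used as a regex against lkp$b.
sapply(
    chk,
    function(v) {
        sum(
            # TRUE for each row of lkp that is matched by at least one row of v
            rowSums(sapply(v$a, `==`, lkp$a) &
                sapply(v$b, grepl, x = lkp$b)) > 0
        ) >= nrow(v)  # at least as many matched lkp rows as v has rows
    }
)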

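# Generalisation to any number of columns: each column of v is matched as a
# regex against the column of lkp in the same position.
sapply(
    chk,
    function(v) {
        sum(
            colSums(
                # combine the per-column match matrices (rows of v x rows of lkp)
                do.call(
                    `&`,
                    Map(
                        function(x, y) outer(x, y, FUN = Vectorize(function(a, b) grepl(a, b))),
                        v,
                        lkp
                    )
                )
            ) > 0
        ) >= nrow(v)
    }
)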

which gives

   c1    c2    c3    c4 
 TRUE  TRUE FALSE FALSE

From string to regex to new string

I would use a for loop in this case, but loop over the rows of the countrycode_data data.frame rather than over your own data: it has only some 200 rows, whereas the real-world original data might be orders of magnitude larger.

Because the column names are long, I first extract the two relevant columns of the country code data, dropping entries that have no regex:

patt <- countrycode_data$country.name.en.regex[!is.na(countrycode_data$country.name.en.regex)]
name <- countrycode_data$country.name.en[!is.na(countrycode_data$country.name.en.regex)]

Then we can loop to write the new column:

for(i in seq_along(patt)) {
  df$country[grepl(patt[i], df$string, ignore.case=TRUE, perl=TRUE)] <- name[i]
}

As others have pointed out, North Korea doesn't match with the regex specified in the country code data.
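To try the loop without the country code data, here is a small self-contained sketch. The two patterns and names are hypothetical stand-ins for the extracted columns, and df with its string column is made-up example data.

```r
# Hypothetical stand-ins for the two columns extracted from countrycode_data.
patt <- c("united states|\\busa\\b", "germany|deutschland")
name <- c("United States", "Germany")

# Made-up input data with a free-text column.
df <- data.frame(
  string = c("Made in the USA", "Berlin, Germany", "somewhere else"),
  stringsAsFactors = FALSE
)

# Initialise the new column so that rows without any match are explicitly NA.
df$country <- NA_character_

# One pass per pattern; grepl() is vectorised over all rows of df.
for (i in seq_along(patt)) {
  hit <- grepl(patt[i], df$string, ignore.case = TRUE, perl = TRUE)
  df$country[hit] <- name[i]
}
df
```

If a string matches several patterns, the last matching pattern wins, because later iterations overwrite earlier ones.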

R- Subset a corpus by meta data (id) matching partial strings

It could work like this:

library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata:  corpus specific: 0, document level (indexed): 0
# Content:  documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
#   502   704   708 
# "502" "704" "708" 
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata:  corpus specific: 0, document level (indexed): 0
# Content:  documents: 3

In your case you would search for "US" instead of "0". Have a look at ?grep for details (e.g. fixed = TRUE, which treats the pattern as a literal string rather than a regular expression).
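The effect of value = TRUE and fixed = TRUE can be seen on a plain named character vector. The ids below are invented and only stand in for what meta(corp, "id") might return.

```r
# Made-up ids standing in for the document ids of a corpus.
ids <- c(a = "US-127", b = "UK-502", c = "US.704", d = "BUS-9")

# value = TRUE returns the matching elements (keeping their names)
# instead of their positions.
any_us <- grep("US", ids, value = TRUE, fixed = TRUE)

# Without fixed = TRUE the pattern is a regular expression,
# so "." matches any character.
regex_dot <- grep("US.", ids, value = TRUE)

# With fixed = TRUE the "." is matched literally.
literal_dot <- grep("US.", ids, value = TRUE, fixed = TRUE)

any_us
regex_dot
literal_dot
```

Note that a plain substring search for "US" also hits "BUS-9"; anchoring the regex (for example "^US") would be one way to restrict the match to ids that start with "US".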
