How to join data frames based on condition between 2 columns
For stuff like this I usually turn to SQL:
library(sqldf)
x = sqldf("
SELECT *
FROM Data1 d1 JOIN Data2 d2
ON d1.Hour = d2.Hour2
AND ABS(d1.Minute - d2.Minute2) <= 1
")
Depending on the size of your data, you could also just join on Hour and then filter. Using dplyr:
library(dplyr)
x = Data1 %>%
left_join(Data2, by = c("Hour" = "Hour2")) %>%
filter(abs(Minute - Minute2) <= 1)
though you could do the same thing with base functions.
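For reference, a minimal base R sketch of the same join-then-filter idea. The sample rows in Data1 and Data2 are made up for illustration; only the column names come from the answer above.

```r
# Hypothetical sample data; column names follow the answer above.
Data1 <- data.frame(Hour = c(1, 1, 2), Minute = c(10, 30, 5))
Data2 <- data.frame(Hour2 = c(1, 2, 2), Minute2 = c(11, 5, 50))

# Inner join on the hour columns, which have different names in each frame.
joined <- merge(Data1, Data2, by.x = "Hour", by.y = "Hour2")

# Keep only the pairs whose minutes differ by at most one.
x <- joined[abs(joined$Minute - joined$Minute2) <= 1, ]
```

Like the dplyr version, this builds every same-hour pair before filtering, so it can get memory-hungry when many rows share an hour.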
Check if data.frame is a subset of another data.frame
sapply(
chk,
function(v) {
sum(
rowSums(sapply(v$a, `==`, lkp$a) &
sapply(v$b, grepl, x = lkp$b)) > 0
) >= nrow(v)
}
)
or
sapply(
chk,
function(v) {
sum(
colSums(
do.call(
`&`,
Map(
function(x, y) outer(x, y, FUN = Vectorize(function(a, b) grepl(a, b))),
v,
lkp
)
)
) > 0
) >= nrow(v)
}
)
which gives
c1 c2 c3 c4
TRUE TRUE FALSE FALSE
From string to regex to new string
I would use a for loop in this case, but loop over the rows of the countrycode_data data.frame rather than over your own data: it has only some 200 rows, whereas the real-world original data might be orders of magnitude larger.
Because of the long names, I extract two columns of the country code data:
patt <- countrycode_data$country.name.en.regex[!is.na(countrycode_data$country.name.en.regex)]
name <- countrycode_data$country.name.en[!is.na(countrycode_data$country.name.en.regex)]
Then we can loop to write the new column:
for(i in seq_along(patt)) {
df$country[grepl(patt[i], df$string, ignore.case=TRUE, perl=TRUE)] <- name[i]
}
As others have pointed out, North Korea doesn't match with the regex specified in the country code data.
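To try the loop without the countrycode package, here is a small self-contained sketch. The lookup table, its two regexes, and the sample strings are all invented stand-ins for countrycode_data and your data, so treat it as an illustration of the pattern rather than the real lookup.

```r
# Hypothetical lookup table standing in for countrycode_data.
lookup <- data.frame(
  regex = c("united.?states|\\busa\\b", "germany|deutschland", NA),
  name  = c("United States", "Germany", "Nowhere"),
  stringsAsFactors = FALSE
)

# Hypothetical data with free-text strings to classify.
df <- data.frame(
  string = c("Made in Germany", "shipped to the USA", "origin unknown"),
  stringsAsFactors = FALSE
)

# Drop lookup rows that have no regex, as in the answer above.
keep <- !is.na(lookup$regex)
patt <- lookup$regex[keep]
name <- lookup$name[keep]

# Start with NA so unmatched strings stay visibly unmatched.
df$country <- NA_character_

# Loop over the (short) pattern list; grepl is vectorised over df$string.
for (i in seq_along(patt)) {
  hit <- grepl(patt[i], df$string, ignore.case = TRUE, perl = TRUE)
  df$country[hit] <- name[i]
}
```

If a string matches more than one pattern, the later pattern wins, because each iteration overwrites earlier assignments.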
R- Subset a corpus by meta data (id) matching partial strings
It could work like this:
library(tm)
reut21578 <- system.file("texts", "crude", package = "tm")
(corp <- VCorpus(DirSource(reut21578), list(reader = readReut21578XMLasPlain)))
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 20
(idx <- grep("0", sapply(meta(corp, "id"), paste0), value=TRUE))
# 502 704 708
# "502" "704" "708"
(corpsubset <- corp[idx] )
# <<VCorpus>>
# Metadata: corpus specific: 0, document level (indexed): 0
# Content: documents: 3
You are looking for "US" instead of "0". Have a look at ?grep for details (e.g. fixed=TRUE).
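A quick sketch of how fixed=TRUE and a regular expression differ when matching ids. The ids vector is invented for illustration; in the answer above the ids come from meta(corp, "id").

```r
# Hypothetical document ids used only to demonstrate the matching rules.
ids <- c("US-101", "DE-202", "us-303", "FR.US", "AUS-404")

# fixed = TRUE treats the pattern as a literal, case-sensitive substring;
# value = TRUE returns the matching strings instead of their positions.
fixed_hits <- grep("US", ids, value = TRUE, fixed = TRUE)

# A regular expression can anchor the match to the start of the id.
anchored_hits <- grep("^US", ids, value = TRUE)
```

Here fixed_hits also picks up ids that merely contain "US", such as "AUS-404", while anchored_hits keeps only ids that begin with it.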