Removing a Group of Words from a Character Vector

Removing a group of words from a character vector

gsub(x = dat, pattern = paste(car, collapse = "|"), replacement = "")
# [1] "Tony" "Dave" "Alex"
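The call above does not show how dat and car are defined. A minimal self-contained sketch, assuming hypothetical values for both vectors (the original data is not given), could look like this:

```r
# Hypothetical data: each element holds a name followed by a car make.
dat <- c("Tony Audi", "Dave BMW", "Alex Ford")
car <- c("Audi", "BMW", "Ford")

# Join the words with "|" so the regex matches any one of them.
pattern <- paste(car, collapse = "|")

# Remove every match, then trim the whitespace that is left behind.
cleaned <- trimws(gsub(pattern = pattern, replacement = "", x = dat))
print(cleaned)
```

Note that this only works as intended when the words contain no regex metacharacters; otherwise they need to be escaped first.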

Removing words featured in character vector from string

You could use the tm package for this:

library(tm)
removeWords(str,stopwords)
#[1] "I have "

Remove a list of whole words that may contain special chars from a character vector without matching parts of words

You can use a PCRE regex with the base R gsub function (it also works with the ICU regex engine in stringr's str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

Details

  • \s* - zero or more whitespace characters
  • (?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
  • (?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group holding the escaped items from the character vector of words to remove
  • (?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.

NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent: it only matches between a word character and a non-word character.
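To make the difference concrete, here is a small sketch comparing \b with the lookarounds on a term that starts with a non-word character. The input string is a made-up example:

```r
x <- "equal to -1.00."

# \b needs a word char on exactly one side of the position. The space and
# the "-" are both non-word chars, so there is no boundary before "-" and
# nothing is removed.
with_b <- gsub("\\b-1\\.00\\b", "", x, perl = TRUE)

# The lookarounds only require that no word char is adjacent, so they match.
with_lookaround <- gsub("\\s*(?<!\\w)-1\\.00(?!\\w)", "", x, perl = TRUE)

print(with_b)
print(with_lookaround)
```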

See an R demo online:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")  # print the assembled pattern
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

Details

  • escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
  • paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
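The two steps can also be tried in isolation. The sketch below uses a hypothetical helper, escape_regex, which is a variant written for this illustration (it runs with perl = TRUE and also escapes ] and }); it is not the escape_for_pcre function from the answer:

```r
# Hypothetical helper: backslash-escape PCRE metacharacters in each string.
escape_regex <- function(s) {
  gsub("([.|()\\\\^{}+$*?\\[\\]])", "\\\\\\1", s, perl = TRUE)
}

terms <- c("C100", "-1.00", "a|b")
escaped <- escape_regex(terms)

# Join the escaped terms into a single alternation.
alternation <- paste(escaped, collapse = "|")
cat(alternation, "\n")
```

Without escaping, the "." in "-1.00" would match any character and the "|" in "a|b" would split the term into two alternatives.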

R: Remove vector elements from vector elements

You can try:

`%notin%` <- function(x,y) !(x %in% y)
lapply(strsplit(text," "),function(x) paste(x[x %notin% pattern],collapse=" "))
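As a quick illustration, assuming hypothetical values for text and pattern (neither is defined in the snippet above), the approach behaves like this:

```r
`%notin%` <- function(x, y) !(x %in% y)

# Hypothetical inputs: sentences to clean and whole words to drop.
text <- c("the quick brown fox", "jumps over the lazy dog")
pattern <- c("the", "over")

# Split each string on spaces, keep the words not listed in pattern,
# and glue the remaining words back together.
res <- lapply(strsplit(text, " "),
              function(x) paste(x[x %notin% pattern], collapse = " "))

# lapply returns a list; unlist gives a character vector again.
print(unlist(res))
```

Because the comparison is on whole tokens, no regex escaping is needed, but punctuation attached to a word (for example "the,") prevents a match.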

remove/replace specific words or phrases from character strings - R

dataframename$varname <- gsub(" Parish","", dataframename$varname)
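A runnable version with a hypothetical data frame is sketched below. It adds fixed = TRUE, which is an optional choice made here so that the phrase is treated as a literal string rather than a regex:

```r
# Hypothetical data frame with one character column.
dataframename <- data.frame(
  varname = c("Orleans Parish", "Caddo Parish", "Baton Rouge"),
  stringsAsFactors = FALSE
)

# Remove the literal phrase " Parish" wherever it occurs.
dataframename$varname <- gsub(" Parish", "", dataframename$varname, fixed = TRUE)
print(dataframename$varname)
```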

Removing all words except for words in a vector

If you want x as a regex pattern for grep, use x <- paste(x, collapse = "|"), which lets you search for any of those words in text. Keep in mind that the resulting regex might still be too large. If you want to remove every word that is not in stopwords() (from the tm package), you can write your own function:

keep_stopwords <- function(text) {
  stop_regex <- paste(stopwords(), collapse = "\\b|\\b")
  stop_regex <- paste("\\b", stop_regex, "\\b", sep = "")
  tmp <- strsplit(text, " ")[[1]]
  idx <- grepl(stop_regex, tmp)
  txt <- paste(tmp[idx], collapse = " ")
  return(txt)
}

text = "How much wood would a woodchuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_stopwords(text)
# [1] "would a if a could than most would if could but than other"

Basically, we set up the stopwords() vector as a regex that looks for any of those words. To avoid partial matches, we wrap each stop word in \\b so that only whole words match. We then split the string on spaces, test each word individually to build a logical index of the words that are stop words, paste those words back together, and return the result as a single string. Matching is case-sensitive, so capitalized words such as "How" and "More" are not kept.
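The effect of the \\b wrapping can be seen without the tm package. The sketch below uses a small hypothetical keep list in place of stopwords():

```r
# Hypothetical stand-in for stopwords().
keep <- c("a", "if", "would")
words <- strsplit("would a woodchuck chuck if creatures", " ")[[1]]

# Plain alternation: "a" also matches inside "creatures".
loose <- grepl(paste(keep, collapse = "|"), words)

# Each word wrapped in \\b: only whole-word matches count.
strict_regex <- paste0("\\b", paste(keep, collapse = "\\b|\\b"), "\\b")
strict <- grepl(strict_regex, words)

print(words[loose])
print(words[strict])
```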

Edit

Here's another approach, which is simpler and easier to understand. It also doesn't rely on regular expressions, which can be expensive in large documents.

keep_words <- function(text, keep) {
  words <- strsplit(text, " ")[[1]]
  txt <- paste(words[words %in% keep], collapse = " ")
  return(txt)
}
x <- "How much wood would a woodchuck chuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_words(x, stopwords())
# [1] "would a if a could than most would if could but than other"
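One limitation of exact matching with %in% is that capitalized words and words with attached punctuation are never kept. A possible variant, sketched here under the assumption that such words should count as matches, normalizes each word before the lookup. The function name keep_words_normalized and the keep list are hypothetical:

```r
keep_words_normalized <- function(text, keep) {
  words <- strsplit(text, "\\s+")[[1]]
  # Compare on a lower-cased, punctuation-free copy of each word,
  # but return the words as they appeared in the input.
  key <- tolower(gsub("[[:punct:]]", "", words))
  paste(words[key %in% keep], collapse = " ")
}

res <- keep_words_normalized("How much wood? More wood than most.",
                             c("how", "more", "than", "most"))
print(res)
```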

how to remove ONLY a specific group of characters from both names and values of dataframe in R

We can match several characters at once by listing them inside [] in str_remove or gsub. We cannot pass a vector of patterns to gsub, because its pattern argument is not vectorized:
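If the characters to remove are stored in a vector, two base R workarounds are sketched below. The vectors vals and chars are made-up examples:

```r
vals <- c("x*", "y_", "z-+")
chars <- c("*", "_", "+")

# Option 1: collapse the characters into one character class, "[*_+]".
# Characters such as "^", "]" or "-" would need extra care inside a class.
char_class <- paste0("[", paste(chars, collapse = ""), "]")
out_class <- gsub(char_class, "", vals)

# Option 2: apply one literal pattern at a time, feeding each result
# into the next gsub call.
out_reduce <- Reduce(function(acc, p) gsub(p, "", acc, fixed = TRUE),
                     chars, vals)

print(out_class)
print(out_reduce)
```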

library(dplyr)
library(stringr)
# A lambda is used because newer dplyr versions deprecate passing
# extra arguments through across().
df <- df %>%
  transmute(across(everything(), ~ str_remove_all(.x, "[*_+]"),
                   .names = "{str_remove_all(.col, '[*_+]')}"))

Output:

df
# A tibble: 3 × 2
#   a     b
#   <chr> <chr>
# 1 x     x
# 2 y     y
# 3 z-    z-
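For comparison, the same cleanup of both column names and values can be sketched in base R. The input data frame below is hypothetical, since the original df is not shown:

```r
# Hypothetical input with special characters in names and values.
df <- data.frame("a*" = c("x_", "y+", "z-"),
                 "b_+" = c("*x", "y", "z-"),
                 check.names = FALSE, stringsAsFactors = FALSE)

# Clean the values column by column; df[] keeps the data frame structure.
df[] <- lapply(df, function(col) gsub("[*_+]", "", col))

# Clean the column names with the same character class.
names(df) <- gsub("[*_+]", "", names(df))
print(df)
```

The "-" is left in place because it is not listed in the character class.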

How to use R, stringr or other package to replace a group of words from a long string?

You can do:

gsub(paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "", a)
# [1] "fda afe faref fae faef afef absolute fgprg"

\\b matches a word boundary, and | separates the alternative words to match. ( )? optionally matches a space after the word so that it is removed as well.

So we are matching the following expression in gsub:

paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
# [1] "\\ba\\b( )?|\\babout\\b( )?|\\bacross\\b( )?"
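A self-contained run of this approach, assuming a hypothetical input string a and a hypothetical data frame b with a words column, might look like this:

```r
# Hypothetical inputs.
a <- "fda a afe about absolute across fgprg"
b <- data.frame(words = c("a", "about", "across"), stringsAsFactors = FALSE)

# Wrap every word in \\b and allow one optional trailing space.
pat <- paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
res <- gsub(pat, "", a)
print(res)
```

Words that merely contain one of the terms, such as "fda", "afe" and "absolute", are left untouched.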

Or with stringr:

library(stringr)
str_replace_all(a, str_c("\\b", str_c(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "")

