Removing a group of words from a character vector
gsub(x = dat, pattern = paste(car, collapse = "|"), replacement = "")
[1] "Tony" "Dave" "Alex"
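The original `dat` and `car` are not shown, so the following is only a minimal sketch with hypothetical data. It illustrates one caveat of the plain alternation: it also matches inside longer words, which wrapping the alternatives in `\\b` avoids.

```r
# Hypothetical data; the original `dat` and `car` are not shown
dat <- c("Tony Ford", "Dave Kia", "Alex Kiara")
car <- c("Ford", "Kia")

# Plain alternation: also strips "Kia" from inside "Kiara"
plain <- gsub(paste(car, collapse = "|"), "", dat)
# "Tony " "Dave " "Alex ra"

# Word boundaries restrict the match to whole words; trimws() drops leftover spaces
pat <- paste0("\\b(?:", paste(car, collapse = "|"), ")\\b")
whole <- trimws(gsub(pat, "", dat, perl = TRUE))
# "Tony" "Dave" "Alex Kiara"
```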
Removing words featured in character vector from string
You could use the tm library for this:
require("tm")
removeWords(str,stopwords)
#[1] "I have "
Remove a list of whole words that may contain special chars from a character vector without matching parts of words
You may use a PCRE regex with the base R gsub function (it will also work with the ICU regex engine in str_replace_all):
\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)
See the regex demo.
Details
\s* - 0 or more whitespaces
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items inside the character vector with the words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent.
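To make the note concrete, here is a small sketch (the sample string is hypothetical) comparing \b with the lookaround version for a term that starts with a non-word character:

```r
x <- "equal to -1.00 here"

# \b needs a word char on exactly one side of the position. Between " " and "-"
# there is none, so the pattern never matches and x comes back unchanged.
with_b <- gsub("\\b-1\\.00\\b", "", x, perl = TRUE)
# "equal to -1.00 here"

# The lookarounds only require that no word char is adjacent to the term.
with_look <- gsub("(?<!\\w)-1\\.00(?!\\w)", "", x, perl = TRUE)
# "equal to  here"
```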
See an R demo online:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
Details
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated alternative list from the search term vector.
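The steps above could be wrapped into one helper, roughly as sketched below. The name remove_whole_words is hypothetical, and sorting the terms by length is an addition not found in the original answer: it is there so that a short term (say "C") does not win over a longer one that starts the same way (say "C++").

```r
# Hypothetical wrapper around the escaping and pattern-building steps
remove_whole_words <- function(text, words) {
  # Escape regex metacharacters (same character class as escape_for_pcre)
  escaped <- gsub("([{[()|?$^*+.\\])", "\\\\\\1", words)
  # Addition: try longer terms first within the alternation
  escaped <- escaped[order(nchar(escaped), decreasing = TRUE)]
  pat <- paste0("\\s*(?<!\\w)(?:", paste(escaped, collapse = "|"), ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

remove_whole_words("Weight is 5 Kg at -1.00.", c("at", "Kg", "-1.00"))
# "Weight is 5."
```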
R: Remove vector elements from vector elements
You can try:
`%notin%` <- function(x,y) !(x %in% y)
lapply(strsplit(text," "),function(x) paste(x[x %notin% pattern],collapse=" "))
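A runnable sketch of the same idea, with hypothetical inputs since the original `text` and `pattern` are not shown. It uses vapply so the result is a character vector rather than the list that lapply returns:

```r
`%notin%` <- function(x, y) !(x %in% y)

# Hypothetical inputs; the original `text` and `pattern` are not shown
text <- c("the quick brown fox", "a lazy dog")
pattern <- c("the", "a", "lazy")

# Split on single spaces, drop the unwanted words, and glue the rest back together
cleaned <- vapply(
  strsplit(text, " ", fixed = TRUE),
  function(x) paste(x[x %notin% pattern], collapse = " "),
  character(1)
)
# "quick brown fox" "dog"
```

Because the comparison is exact, a token such as "the," (with trailing punctuation) would not be removed.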
remove/replace specific words or phrases from character strings - R
dataframename$varname <- gsub(" Parish","", dataframename$varname)
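For illustration, the same call on a hypothetical data frame. The fixed = TRUE argument is an optional addition that treats " Parish" as a literal string instead of a regex:

```r
# Hypothetical data frame standing in for dataframename
df <- data.frame(varname = c("Orleans Parish", "Parishville", "Caddo Parish"),
                 stringsAsFactors = FALSE)

# The leading space in " Parish" keeps "Parishville" untouched
df$varname <- gsub(" Parish", "", df$varname, fixed = TRUE)
# "Orleans" "Parishville" "Caddo"
```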
Removing all words except for words in a vector
If you want x as a regex pattern for grep, just use x <- paste(x, collapse = "|"), which will allow you to look for those words in text. But keep in mind that the regex might still be too large. If you want to remove any word that is not in stopwords(), you can create your own function:
library(tm)  # provides stopwords()
keep_stopwords <- function(text) {
  # Build one regex that matches any stop word as a whole word
  stop_regex <- paste(stopwords(), collapse = "\\b|\\b")
  stop_regex <- paste("\\b", stop_regex, "\\b", sep = "")
  tmp <- strsplit(text, " ")[[1]]
  idx <- grepl(stop_regex, tmp)
  txt <- paste(tmp[idx], collapse = " ")
  return(txt)
}
text = "How much wood would a woodchuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_stopwords(text)
# [1] "would a if a could than most would if could but than other"
Basically, we just set up stopwords() as a regex that will look for any of those words. But we have to be careful about partial matches, so we wrap each stop word in \\b to ensure it's a full match. Then we split the string so that we match each word individually and create an index of the words that are stop words. Then we paste those words together again and return the result as a single string.
Edit
Here's another approach, which is simpler and easier to understand. It also doesn't rely on regular expressions, which can be expensive in large documents.
keep_words <- function(text, keep) {
  words <- strsplit(text, " ")[[1]]
  txt <- paste(words[words %in% keep], collapse = " ")
  return(txt)
}
x <- "How much wood would a woodchuck chuck if a woodchuck could chuck wood? More wood than most woodchucks would chuck if woodchucks could chuck wood, but less wood than other creatures like termites."
keep_words(x, stopwords())
# [1] "would a if a could than most would if could but than other"
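Note that exact matching with %in% is case-sensitive and treats "wood?" and "wood" as different tokens. A possible variant that normalizes before comparing is sketched below. The names keep_words_norm and norm are hypothetical, and a small stand-in vector replaces stopwords() so the sketch runs without tm:

```r
# Hypothetical variant of keep_words(): ignores case and punctuation when comparing
keep_words_norm <- function(text, keep) {
  norm <- function(w) tolower(gsub("[[:punct:]]", "", w))
  words <- strsplit(text, "\\s+")[[1]]
  # Compare normalized copies, but return the original tokens
  paste(words[norm(words) %in% norm(keep)], collapse = " ")
}

# Small stand-in for stopwords()
keep <- c("how", "a", "if", "would")
keep_words_norm("How much wood would a woodchuck chuck, if a woodchuck would?", keep)
# "How would a if a would?"
```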
how to remove ONLY a specific group of characters from both names and values of dataframe in R
We can pass multiple characters to match within [] in str_remove or gsub, but not a vector of patterns in gsub, as pattern is not vectorized in gsub.
library(dplyr)
library(stringr)
df <- df %>%
  transmute(across(everything(), str_remove_all,
                   pattern = "[*_+]", .names = "{str_remove_all(.col, '[*_+]')}"))
-output
df
# A tibble: 3 × 2
  a     b
  <chr> <chr>
1 x     x
2 y     y
3 z-    z-
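For comparison, a base R version could look roughly like this. The input data frame is hypothetical (the original df is not shown). The individual characters are collapsed into a single bracket expression because gsub takes only one pattern:

```r
# Hypothetical input with the unwanted characters in both names and values
df <- data.frame(`a*` = c("x_", "y+", "z-"), `_b` = c("*x", "y", "z-+"),
                 check.names = FALSE, stringsAsFactors = FALSE)

# Collapse the individual characters into one bracket expression: "[*_+]"
chars <- c("*", "_", "+")
pat <- paste0("[", paste(chars, collapse = ""), "]")

df[] <- lapply(df, function(col) gsub(pat, "", col))  # clean the values
names(df) <- gsub(pat, "", names(df))                 # clean the column names
```

Characters that are special inside a bracket expression (such as ^, ] or -) would need extra care before being pasted in this way.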
How to use R, stringr or other package to replace a group of words from a long string?
You can do:
gsub(paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "", a)
# [1] "fda afe faref fae faef afef absolute fgprg"
\\b indicates the word boundary and with | we match several possible words. ( )? checks whether there is a space afterwards and removes that as well.
So we are matching the following expression in gsub:
paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
# [1] "\\ba\\b( )?|\\babout\\b( )?|\\bacross\\b( )?"
Or with stringr:
library(stringr)
str_replace_all(a, str_c("\\b", str_c(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "")
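A self-contained sketch of the gsub version, with hypothetical inputs since the original `a` and `b` are not shown. Passing perl = TRUE is an addition here; the original call relies on the default regex engine.

```r
# Hypothetical inputs; the original `a` and `b` are not shown
a <- "fda about afe a faref across fae roundabout"
b <- data.frame(words = c("a", "about", "across"), stringsAsFactors = FALSE)

# Each word gets \\b on both sides, plus an optional trailing space
pat <- paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
res <- gsub(pat, "", a, perl = TRUE)
# "fda afe faref fae roundabout"
```

The leading \\b on every alternative is what keeps "about" from being removed out of "roundabout".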