Remove a List of Whole Words That May Contain Special Chars from a Character Vector Without Matching Parts of Words

You may use a PCRE regex with a gsub base R function (it will also work with ICU regex in str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

Details

\s* - 0 or more whitespaces
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items inside the character vector with the words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.

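NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent: it only matches between a word char and a non-word char (or the start/end of the string), so it fails, for example, between a whitespace and a leading hyphen.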
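
To illustrate the note above, here is a minimal sketch comparing \b with the lookarounds for an item that starts with a hyphen. The input string x is a made-up example, not taken from the question:

```r
# Hypothetical input: the item to remove starts with a non-word char ("-")
x <- "not equal to -1.00."

# \b needs a word char on exactly one side; between " " and "-" there is none,
# so the pattern never matches and the string comes back unchanged
with_b <- gsub("\\b-1\\.00\\b", "", x, perl = TRUE)

# The lookarounds only require that no word char is adjacent, so this matches
with_lookarounds <- gsub("(?<!\\w)-1\\.00(?!\\w)", "", x, perl = TRUE)

print(with_b)           # "not equal to -1.00."
print(with_lookarounds) # "not equal to ."
```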

R demo:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n") # print the assembled pattern (cat has no collapse argument)
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

Details

escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated alternation list from the search term vector.
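
The two steps described above can be wrapped into one reusable function. This is only a sketch: the name remove_whole_words is hypothetical, and sorting the terms longest-first is an added precaution (not part of the answer above) for lists where one term is a prefix of another and ends in a non-word char, such as "-1" and "-1.00":

```r
# Hypothetical helper, not from the original answer
remove_whole_words <- function(text, words) {
  # Escape the chars that are special in a PCRE pattern (same class as above)
  escaped <- gsub("([{[()|?$^*+.\\])", "\\\\\\1", words)
  # Try longer terms first so "-1.00" wins over "-1" (added precaution)
  escaped <- escaped[order(nchar(escaped), decreasing = TRUE)]
  pat <- paste0("\\s*(?<!\\w)(?:", paste(escaped, collapse = "|"), ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
print(remove_whole_words(myText, myList))
```

Because gsub is vectorized over its input, the same call also works when text is a character vector with several elements.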

remove all words from a character vector that are NOT certain words

Consider the following vector:

v <- c("Final data QAQC done on CSF  1-7561", 
       "CIF  1-229", 
       "SEEF  1-68", 
       "CRT  1-19",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched     area")

You could do:

## vector with words to match
cond <- c("CSF", "CIF", "SEEF", "CRT")
## regex that captures digits and tolerates dashes (-) 
reg  <- "(\\d+-?)+$"
## pattern to match either words or regex 
pattern <- paste(c(cond, reg), collapse = "|")

Then use stri_extract_all_regex() from the stringi package:

library(stringi)
stri_extract_all_regex(v, pattern)

Which gives:

#[[1]]
#[1] "CSF"    "1-7561"
#
#[[2]]
#[1] "CIF"   "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT"  "1-19"
#
#[[5]]
#[1] NA

As mentioned by @akrun, you could also do:

regmatches(v, gregexpr(pattern, v))

Which gives:

#[[1]]
#[1] "CSF"    "1-7561"
#
#[[2]]
#[1] "CIF"   "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT"  "1-19"
#
#[[5]]
#character(0)
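
If the goal is a cleaned character vector rather than a list of matches, the extracted pieces can be pasted back together. A minimal base R sketch, assuming a single space is an acceptable separator (kept is a name introduced here for illustration):

```r
v <- c("Final data QAQC done on CSF  1-7561",
       "CIF  1-229",
       "SEEF  1-68",
       "CRT  1-19",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched     area")
cond <- c("CSF", "CIF", "SEEF", "CRT")
reg  <- "(\\d+-?)+$"
pattern <- paste(c(cond, reg), collapse = "|")

# One string per input element; elements without any match become ""
kept <- vapply(regmatches(v, gregexpr(pattern, v)),
               paste, character(1), collapse = " ")
print(kept)
```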

Removing words featured in character vector from string

You could use the tm package for this (str is assumed to hold the input string and stopwords the character vector of words to remove):

library(tm)
removeWords(str,stopwords)
#[1] "I have   "
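
If installing tm is not an option, a rough base R approximation of the same idea can be sketched as follows. The input string and word list are made up for illustration (the question's str and stopwords are not shown above), and the whitespace clean-up at the end is an extra step, which is why the tm output above still contains trailing spaces:

```r
# Hypothetical data, for illustration only
txt   <- "I have an apple and a pear"
words <- c("an", "and", "a")

# Whole-word alternation; \b keeps "a" from matching inside "apple" or "have"
pat <- paste0("\\b(?:", paste(words, collapse = "|"), ")\\b")
removed <- gsub(pat, "", txt, perl = TRUE)

# Collapse the runs of spaces left behind and trim both ends
cleaned <- trimws(gsub("\\s+", " ", removed))
print(cleaned) # "I have apple pear"
```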

How to use R, stringr or other package to replace a group of words from a long string?

You can do:

gsub(paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "", a)
# [1] "fda afe faref fae faef afef absolute fgprg"

\\b indicates the word boundary and with | we match several possible words. ( )? checks whether there is a space afterwards and removes that as well.

So we are matching the following expression in gsub:

paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
# [1] "\\ba\\b( )?|\\babout\\b( )?|\\bacross\\b( )?"

Or with stringr:

library(stringr)
str_replace_all(a, str_c("\\b", str_c(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "")
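
Since a and b are not shown above, here is a self-contained base R run with made-up stand-ins (the sentence and the three words are assumptions chosen so that the pattern matches the one printed earlier):

```r
# Hypothetical stand-ins for the question's data
a <- "a quick note about walking across the bridge"
b <- data.frame(words = c("a", "about", "across"), stringsAsFactors = FALSE)

# Each word gets a leading and trailing \b plus an optional trailing space
pattern <- paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
print(pattern)

result <- gsub(pattern, "", a)
print(result) # "quick note walking the bridge"
```

Note that the "a" inside "walking" is left alone because no word boundary precedes it.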

Remove strings found in vector 1, from vector 2

Try this,

sample1 <- c(".aaa", ".aarp", ".abb", ".abbott", ".abogado")
sample2 <- c("try1.aarp", "www.tryagain.aaa", "255.255.255.255", "onemoretry.abb.abogado")
paste0("(",paste(sub("\\.", "\\\\.", sample1), collapse="|"),")\\b")
# [1] "(\\.aaa|\\.aarp|\\.abb|\\.abbott|\\.abogado)\\b"
gsub(paste0("(",paste(sub("\\.", "\\\\.", sample1), collapse="|"),")\\b"), "", sample2)
# [1] "try1"            "www.tryagain"    "255.255.255.255" "onemoretry"

Explanation:

sub("\\.", "\\\\.", sample1) escapes the dot in each element, since a dot is a special char in regex (sub only replaces the first dot, so use gsub if an element may contain several dots).
paste(sub("\\.", "\\\\.", sample1), collapse="|") combines all the elements with | as the delimiter.
paste0("(",paste(sub("\\.", "\\\\.", sample1), collapse="|"),")\\b") creates a regex in which all the elements sit inside a capturing group followed by a word boundary. The \\b is essential here: it ensures the matched element is not followed by another word char, so .abb does not match inside .abbott.
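
To see what the trailing \\b buys, the sketch below runs the pattern with and without it on a made-up input ("site.abbey" is an assumption, not part of sample2). It also uses gsub instead of sub for the escaping, as a precaution for elements with more than one dot:

```r
sample1 <- c(".aaa", ".aarp", ".abb", ".abbott", ".abogado")

# gsub (rather than sub) escapes every dot, in case an element has several
alts <- paste(gsub("\\.", "\\\\.", sample1), collapse = "|")

no_boundary   <- paste0("(", alts, ")")
with_boundary <- paste0("(", alts, ")\\b")

# Hypothetical input that merely starts like ".abb"
x <- "site.abbey"
print(gsub(no_boundary, "", x))   # "siteey": ".abb" is cut out of ".abbey"
print(gsub(with_boundary, "", x)) # "site.abbey": no boundary after ".abb"
```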

R: removing part of the word in a character string

You may use

[[:punct:]]*span[[:punct:]]*

Details

[[:punct:]]* - 0 or more punctuation chars
span - a literal substring
[[:punct:]]* - 0 or more punctuation chars

R Demo:

words <- c("somethingspan.", "..span?", "spanthank", "great to hear", "yourspan")
words <- gsub("[[:punct:]]*span[[:punct:]]*", "", words) # Remove spans
words <- words[words != ""] # Discard empty elements
paste(words, collapse=" ")  # Concat the elements
## => [1] "something thank great to hear your"

If whitespace-only elements remain after removing the unwanted strings, you may replace the second step with words <- words[trimws(words) != ""] (instead of words[words != ""]).
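
A small sketch of the difference between the two filters, using a made-up vector in which one item is surrounded by spaces:

```r
# Hypothetical input with spaces around one of the "span" items
words <- c(" span ", "great", "..span?")
words <- gsub("[[:punct:]]*span[[:punct:]]*", "", words)
# words is now c("  ", "great", "")

kept_strict <- words[words != ""]          # keeps the whitespace-only "  "
kept_trim   <- words[trimws(words) != ""]  # drops it as well
print(kept_strict)
print(kept_trim)
```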

r code removing words containing @

You can use the following:

gsub('\\S+@\\S+', 'email', data)

Explanation:

\S matches any non-whitespace character. So here we match one or more non-whitespace characters, followed by @, followed by one or more non-whitespace characters, i.e. the whole whitespace-delimited word that contains the @.
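
A runnable sketch with made-up input (the data vector and the address in it are assumptions). The second call is a variation, not part of the answer above, that drops the word together with the whitespace before it instead of substituting a placeholder:

```r
# Hypothetical input
data <- c("contact me at john.doe@example.com today", "no address here")

# Replace every word containing "@" with the placeholder "email"
replaced <- gsub("\\S+@\\S+", "email", data)
print(replaced)

# Variation: remove the word and any whitespace in front of it
removed <- gsub("\\s*\\S+@\\S+", "", data)
print(removed)
```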

How to remove Ã‚Â from scraped in text in R?

You could use library(stringr)

text <- "ALBANYÃ,Ã‚ OFFÃ,Ã‚ REBOUND BY"

library(stringr)
str_replace_all(text, "Ã,Ã‚", "")
#> [1] "ALBANY OFF REBOUND BY"

or with gsub :

gsub("Ã,Ã‚","",text)
#> [1] "ALBANY OFF REBOUND BY"

However, I think this is an encoding issue in the first place.
Moreover, the results of gsub or str_replace_all may differ with the encoding, which could be why your text <- gsub(",", "", text) does not work.

You could check the encoding with Encoding().
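
A stray "Â" in scraped text often comes from a non-breaking space (U+00A0, bytes C2 A0 in UTF-8) that was decoded with the wrong encoding. Whether that is the cause here is an assumption. Under that assumption, a minimal sketch of checking the declared encoding and replacing the non-breaking spaces in correctly decoded text looks like this (the text value is made up):

```r
# Hypothetical scraped text containing non-breaking spaces (U+00A0)
text <- "ALBANY\u00a0OFF\u00a0REBOUND BY"

# Declared encoding of each string: ASCII-only strings report "unknown"
print(Encoding(c("abc", text)))
# TRUE if the bytes form valid UTF-8
print(validUTF8(text))

# Replace the non-breaking spaces with regular spaces
clean <- gsub("\u00a0", " ", text, fixed = TRUE)
print(clean) # "ALBANY OFF REBOUND BY"
```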

how to delete words from a list in a column in R

First read the data:

dat <- c("Lorem ipsum dolor",
           "sit amet, consectetur adipiscing",
           "elit, sed do eiusmod tempor",
           "incididunt ut labore",
           "et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

We can avoid loops with a little smart pasting. The | acts as an OR (alternation) in a regex, so we can paste the words together with it and remove them all in a single gsub call:

gsub(paste0(todelete, collapse = "|"), "", dat)
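
The plain alternation also matches inside longer words. If only whole words should go, one option is to wrap the alternation in \\b word boundaries. The sketch below shows both; the "elite team" string is a made-up test input, and the names plain and whole are introduced here for illustration:

```r
dat <- c("Lorem ipsum dolor",
         "sit amet, consectetur adipiscing",
         "elit, sed do eiusmod tempor",
         "incididunt ut labore",
         "et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

plain <- paste0(todelete, collapse = "|")   # "Lorem|dolore|elit"
whole <- paste0("\\b(?:", plain, ")\\b")    # whole-word variant

print(gsub(plain, "", dat))

# Hypothetical input showing the difference between the two patterns
print(gsub(plain, "", "elite team"))               # "e team"
print(gsub(whole, "", "elite team", perl = TRUE))  # "elite team"
```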
