Remove a List of Whole Words That May Contain Special Chars from a Character Vector Without Matching Parts of Words

You may use a PCRE regex with the base R gsub function (it will also work with the ICU regex flavor in str_replace_all):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

Details

  • \s* - 0 or more whitespaces
  • (?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
  • (?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items inside the character vector with the words you need to remove
  • (?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.

NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent: it matches only where a word char sits on exactly one side of the current location.
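The difference is easy to see on the -1.00 term. A minimal sketch, with test strings made up for illustration:

```r
# Escaped form of the "-1.00" item
term <- "-1\\.00"

# Preceded by a space: \b cannot match between a space and "-" (both non-word chars)
gsub(paste0("\\b", term, "\\b"), "", "equal to -1.00.", perl = TRUE)
## => [1] "equal to -1.00."

# Preceded by a word char: \b now matches, so part of "x-1.00" is removed
gsub(paste0("\\b", term, "\\b"), "", "x-1.00", perl = TRUE)
## => [1] "x"

# Lookarounds only ask "is there a word char next to me?", whatever the term starts with
res <- gsub(paste0("(?<!\\w)", term, "(?!\\w)"), "", c("equal to -1.00.", "x-1.00"), perl = TRUE)
res
# res is c("equal to .", "x-1.00")
```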

R demo:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n") # print the generated pattern (cat() has no collapse argument)
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

Details

  • escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
  • paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
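The two steps can be folded into one reusable function. This is only a sketch: the name remove_whole_words and its arguments are hypothetical, and the sample sentence is made up.

```r
# Hypothetical helper wrapping the escaping and pattern-building steps
remove_whole_words <- function(text, words) {
  # Escape regex metacharacters so that every term is matched literally
  escaped <- gsub("([{[()|?$^*+.\\])", "\\\\\\1", words)
  # Optional leading whitespace, then any term that is not glued to a word char
  pat <- paste0("\\s*(?<!\\w)(?:", paste(escaped, collapse = "|"), ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

out <- remove_whole_words("Weight in Kg; 5Kg stays, C100 goes.", c("Kg", "C100", "-1.00"))
out
## => [1] "Weight in; 5Kg stays, goes."
```

Since gsub is vectorized over both words (in the escaping step) and text, no sapply call is needed here.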

Remove all words from a character vector that are NOT certain words

Consider the following vector:

v <- c("Final data QAQC done on CSF  1-7561", 
"CIF 1-229",
"SEEF 1-68",
"CRT 1-19",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")

You could do:

## vector with words to match
cond <- c("CSF", "CIF", "SEEF", "CRT")
## regex that captures digits and tolerates dashes (-)
reg <- "(\\d+-?)+$"
## pattern to match either words or regex
pattern <- paste(c(cond, reg), collapse = "|")

Then use stri_extract_all_regex() from the stringi package:

library(stringi)
stri_extract_all_regex(v, pattern)

Which gives:

#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#[1] NA

As mentioned by @akrun, you could also do this in base R:

regmatches(v, gregexpr(pattern, v))

Which gives:

#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#character(0)
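If the goal is a cleaned character vector rather than a list, the extracted pieces can be pasted back together. A small sketch in base R, using a shortened copy of v:

```r
v <- c("Final data QAQC done on CSF  1-7561", "CIF 1-229",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")
pattern <- paste(c("CSF", "CIF", "SEEF", "CRT", "(\\d+-?)+$"), collapse = "|")

# Keep only the matched pieces and glue them into one string per element;
# elements without any match become empty strings
kept <- sapply(regmatches(v, gregexpr(pattern, v)), paste, collapse = " ")
kept
# kept is c("CSF 1-7561", "CIF 1-229", "")
```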

Removing words featured in character vector from string

You could use removeWords() from the tm package for this:

library(tm)
removeWords(str,stopwords)
#[1] "I have "
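The same idea can be sketched in base R without tm. The values of str and stopwords below are assumptions made for illustration, and the words are assumed to contain no regex metacharacters:

```r
# Assumed inputs; the original str and stopwords are not shown here
str <- "I have zero a accordance"
stopwords <- c("a", "accordance", "zero")

# Whole-word alternation; \s* also drops the whitespace after each removed word
pat <- paste0("\\b(?:", paste(stopwords, collapse = "|"), ")\\b\\s*")
cleaned <- gsub(pat, "", str, perl = TRUE)
cleaned
## => [1] "I have "
```

The "a" inside "have" is left alone because there is no word boundary around it.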

How to use R, stringr or other package to replace a group of words from a long string?

You can do:

gsub(paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "", a)
# [1] "fda afe faref fae faef afef absolute fgprg"

\\b matches a word boundary, and | separates the alternative words. ( )? matches an optional space after the word, so that it is removed as well.

So we are matching the following expression in gsub:

paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
# [1] "\\ba\\b( )?|\\babout\\b( )?|\\bacross\\b( )?"

Or with stringr:

library(stringr)
str_replace_all(a, str_c("\\b", str_c(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "")
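To try the pattern without the original data, here is a self-contained sketch. The contents of a and b are made up, and the grouped pattern at the end is an alternative spelling, not part of the answer above:

```r
# Assumed inputs: a is the text, b holds the words to drop
a <- "a quick walk about town across the bridge absolutely"
b <- data.frame(words = c("a", "about", "across"), stringsAsFactors = FALSE)

pat <- paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
res1 <- gsub(pat, "", a)
res1
## => [1] "quick walk town the bridge absolutely"

# Shorter equivalent: group the alternatives once instead of repeating \b( )? per word
pat2 <- paste0("\\b(?:", paste0(b$words, collapse = "|"), ")\\b ?")
res2 <- gsub(pat2, "", a, perl = TRUE)
```

Both patterns leave "walk" and "absolutely" intact, because the "a" in those words is not surrounded by word boundaries.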

Remove strings found in vector 1, from vector 2

Try this:

sample1 <- c(".aaa", ".aarp", ".abb", ".abbott", ".abogado")
sample2 <- c("try1.aarp", "www.tryagain.aaa", "255.255.255.255", "onemoretry.abb.abogado")
paste0("(",paste(sub("\\.", "\\\\.", sample1), collapse="|"),")\\b")
# [1] "(\\.aaa|\\.aarp|\\.abb|\\.abbott|\\.abogado)\\b"
gsub(paste0("(",paste(sub("\\.", "\\\\.", sample1), collapse="|"),")\\b"), "", sample2)
# [1] "try1" "www.tryagain" "255.255.255.255" "onemoretry"

Explanation:

  • sub("\\.", "\\\\.", sample1) escapes the dot in each element, since a dot is a special char in regex. Note that sub() replaces only the first dot; use gsub() if an element may contain several.

  • paste(sub("\\.", "\\\\.", sample1), collapse="|") combines all the elements with | as the delimiter.

  • paste0("(",paste(sub("\\.", "\\\\.", sample1), collapse="|"),")\\b") wraps all the alternatives in a capturing group followed by a word boundary. The \\b is essential here: it forces an exact match, so that a short item such as .abb cannot match the beginning of a longer string such as .abbott.
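A quick check of both points, using made-up test values (".abbey" and ".co.uk" are not in the sample vectors):

```r
# ".abb" is a listed item, ".abbey" is not
x <- "www.example.abbey"
no_boundary   <- gsub("(\\.abb)", "", x)      # removes part of ".abbey"
with_boundary <- gsub("(\\.abb)\\b", "", x)   # leaves the string untouched
no_boundary
## => [1] "www.exampleey"
with_boundary
## => [1] "www.example.abbey"

# sub() escapes only the first dot; gsub() escapes every dot
first_only <- sub("\\.", "\\\\.", ".co.uk")
all_dots   <- gsub("\\.", "\\\\.", ".co.uk")
cat(first_only, all_dots, sep = "\n")
## \.co.uk
## \.co\.uk
```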

R: removing part of the word in a character string

You may use

[[:punct:]]*span[[:punct:]]*

Details

  • [[:punct:]]* - 0+ punctuation chars
  • span - a literal substring
  • [[:punct:]]* - 0+ punctuation chars

R Demo:

words <- c("somethingspan.", "..span?", "spanthank", "great to hear", "yourspan")
words <- gsub("[[:punct:]]*span[[:punct:]]*", "", words) # Remove spans
words <- words[words != ""] # Discard empty elements
paste(words, collapse=" ") # Concat the elements
## => [1] "something thank great to hear your"

If removing the unwanted strings leaves whitespace-only elements, replace the second step with words <- words[trimws(words) != ""] (instead of words[words != ""]).

R code for removing words containing @

You can use the following:

gsub('\\S+@\\S+', 'email', data)

Explanation:

\S matches any non-whitespace character. So here we match one or more non-whitespace characters, followed by @, followed by one or more non-whitespace characters.
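A short run on made-up sample data shows the replacement, plus a variant that deletes the token instead of replacing it (the variant is an addition, not part of the answer above):

```r
# Made-up sample data
data <- c("contact john.doe@example.com today", "no address here")

replaced <- gsub("\\S+@\\S+", "email", data)
replaced
# replaced is c("contact email today", "no address here")

# To drop the token entirely, also consume the whitespace in front of it
removed <- gsub("\\s*\\S+@\\S+", "", data)
removed
# removed is c("contact today", "no address here")
```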

How to remove Ã,Â from scraped text in R?

You could use str_replace_all() from the stringr package:

text <- "ALBANYÃ,Â OFFÃ,Â REBOUND BY"

library(stringr)
str_replace_all(text, "Ã,Â", "")
#> [1] "ALBANY OFF REBOUND BY"

or with gsub:

gsub("Ã,Â","",text)
#> [1] "ALBANY OFF REBOUND BY"

However, I think it is an encoding issue in the first place.
Moreover, the results of gsub or str_replace_all may differ with the encoding, which could be why your text <- gsub(",", "", text) does not work.

You could check the encoding with Encoding().
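A minimal sketch of inspecting and converting an encoding in base R. The sample string is made up, and whether the scraped text is really Latin-1 is an assumption that has to be checked against the actual source:

```r
# Latin-1 bytes for "façile", written with a byte escape
x <- "fa\xE7ile"
Encoding(x) <- "latin1"   # declare how the bytes should be interpreted
Encoding(x)
## => [1] "latin1"

# Convert to UTF-8 so that patterns typed in a UTF-8 session match reliably
y <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(y)
## => [1] "UTF-8"
```

Once the text is in a known encoding, a literal pattern passed to gsub or str_replace_all is compared against the characters you expect rather than against stray bytes.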

how to delete words from a list in a column in R

First read the data:

dat <- c("Lorem ipsum dolor",
"sit amet, consectetur adipiscing",
"elit, sed do eiusmod tempor",
"incididunt ut labore",
"et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")

We can avoid loops with a little smart pasting. The | means "or" in a regex, so pasting the words together with | gives a single pattern that handles all of them in one gsub call:

gsub(paste0(todelete, collapse = "|"), "", dat)
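Note that a plain alternation also matches inside longer words. A sketch of the difference, where the second test string is made up and the \b-wrapped pattern is a suggested variant rather than part of the answer above:

```r
todelete <- c("Lorem", "dolore", "elit")
x <- c("elit, sed do", "an elite team")   # second element is a made-up test case

# Plain alternation also hits "elit" inside "elite"
plain <- gsub(paste0(todelete, collapse = "|"), "", x)
plain
# plain is c(", sed do", "an e team")

# Wrapping the alternation in \b...\b restricts it to whole words
whole <- gsub(paste0("\\b(?:", paste0(todelete, collapse = "|"), ")\\b"), "", x, perl = TRUE)
whole
# whole is c(", sed do", "an elite team")
```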

