Remove a list of whole words that may contain special chars from a character vector without matching parts of words
You may use a PCRE regex with the base R gsub function (it will also work with the ICU regex flavor in str_replace_all):
\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)
See the regex demo.
Details
\s* - 0 or more whitespace characters
(?<!\w) - a negative lookbehind that ensures there is no word char immediately before the current location
(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items from the character vector of words you need to remove
(?!\w) - a negative lookahead that ensures there is no word char immediately after the current location.
NOTE: We cannot use the \b word boundary here because the items in the myList character vector may start or end with non-word characters, and the meaning of \b is context-dependent.
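To make the NOTE concrete, here is a minimal sketch comparing \b with the lookarounds for the "-1.00" item. The input string x is a made-up example, not taken from the question.

```r
# Hypothetical input; "-1.00" starts with a non-word character
x <- "equal to -1.00."

# \b needs a word char on exactly one side, so it fails between " " and "-"
with_b <- gsub("\\s*\\b-1\\.00\\b", "", x, perl = TRUE)

# The lookarounds only require that no word char is adjacent
with_lookarounds <- gsub("\\s*(?<!\\w)-1\\.00(?!\\w)", "", x, perl = TRUE)

print(with_b)            # unchanged: "equal to -1.00."
print(with_lookarounds)  # "equal to."
```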
See an R demo online:
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."
Details
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special chars that need escaping in a PCRE pattern
paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - creates a |-separated list of alternatives from the search term vector.
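The steps above can be wrapped in a small function. This is only a sketch: the name remove_whole_words and the test sentence are assumptions made for illustration, and the escaping uses the same character class as escape_for_pcre.

```r
# Hypothetical helper wrapping the steps above; the name is an assumption
remove_whole_words <- function(text, words) {
  # Escape PCRE metacharacters, same character class as escape_for_pcre
  escaped <- gsub("([{[()|?$^*+.\\])", "\\\\\\1", words)
  pat <- paste0("\\s*(?<!\\w)(?:", paste(escaped, collapse = "|"), ")(?!\\w)")
  gsub(pat, "", text, perl = TRUE)
}

# "at" is removed as a whole word but kept inside "cat", "sat" and "gate"
res <- remove_whole_words("The cat sat at the gate", c("at", "-1.00"))
print(res)  # "The cat sat the gate"
```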
remove all words from a character vector that are NOT certain words
Consider the following vector:
v <- c("Final data QAQC done on CSF 1-7561",
"CIF 1-229",
"SEEF 1-68",
"CRT 1-19",
"082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")
You could do:
## vector with words to match
cond <- c("CSF", "CIF", "SEEF", "CRT")
## regex that captures digits and tolerates dashes (-)
reg <- "(\\d+-?)+$"
## pattern to match either words or regex
pattern <- paste(c(cond, reg), collapse = "|")
Then use stri_extract_all_regex() from the stringi package:
library(stringi)
stri_extract_all_regex(v, pattern)
Which gives:
#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#[1] NA
As mentioned by @akrun, you could also do:
regmatches(v, gregexpr(pattern, v))
Which gives:
#[[1]]
#[1] "CSF" "1-7561"
#
#[[2]]
#[1] "CIF" "1-229"
#
#[[3]]
#[1] "SEEF" "1-68"
#
#[[4]]
#[1] "CRT" "1-19"
#
#[[5]]
#character(0)
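If the goal is a character vector again rather than a list, the matched pieces can be pasted back together. The following is a sketch in base R on a shortened copy of v; using perl = TRUE and collapsing with a single space are assumptions made here, not part of the original answer.

```r
v <- c("Final data QAQC done on CSF 1-7561",
       "CIF 1-229",
       "082015-HOBA-G17-1 changed to offPlot based on GIS review of searched area")
pattern <- paste(c("CSF", "CIF", "SEEF", "CRT", "(\\d+-?)+$"), collapse = "|")

# perl = TRUE is an assumption made here so that \d is handled by PCRE
matches <- regmatches(v, gregexpr(pattern, v, perl = TRUE))

# Paste the pieces of each element together; an element with no match gives ""
kept <- vapply(matches, paste, character(1), collapse = " ")
print(kept)  # "CSF 1-7561" "CIF 1-229" ""
```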
Removing words featured in character vector from string
You could use the tm package for this:
require("tm")
removeWords(str,stopwords)
#[1] "I have "
How to use R, stringr or other package to replace a group of words from a long string?
You can do:
gsub(paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "", a)
# [1] "fda afe faref fae faef afef absolute fgprg"
\\b indicates a word boundary, and | separates the alternative words. ( )? matches an optional space after the word so that it is removed as well.
So we are matching the following expression in gsub:
paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
# [1] "\\ba\\b( )?|\\babout\\b( )?|\\bacross\\b( )?"
Or with stringr:
library(stringr)
str_replace_all(a, str_c("\\b", str_c(b$words, collapse = "\\b( )?|\\b"), "\\b( )?"), "")
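Since a and b are not shown in this answer, here is a self-contained sketch of the gsub variant with made-up inputs. The sentence, the data frame contents and the use of perl = TRUE are all assumptions for illustration.

```r
# Hypothetical inputs standing in for `a` and `b$words`
a <- "a quick look about the area across the river"
b <- data.frame(words = c("a", "about", "across"), stringsAsFactors = FALSE)

# Same pattern construction as above: \b on both sides, optional trailing space
pat <- paste0("\\b", paste0(b$words, collapse = "\\b( )?|\\b"), "\\b( )?")
res <- gsub(pat, "", a, perl = TRUE)

# "area" survives because neither of its "a"s is a whole word
print(res)  # "quick look the area the river"
```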
Remove strings found in vector 1, from vector 2
Try this,
sample1 <- c(".aaa", ".aarp", ".abb", ".abbott", ".abogado")
sample2 <- c("try1.aarp", "www.tryagain.aaa", "255.255.255.255", "onemoretry.abb.abogado")
paste0("(",paste(sub("\\.", "\\\\.", sample1), collapse="|"),")\\b")
# [1] "(\\.aaa|\\.aarp|\\.abb|\\.abbott|\\.abogado)\\b"
gsub(paste0("(",paste(sub("\\.", "\\\\.", sample1), collapse="|"),")\\b"), "", sample2)
# [1] "try1" "www.tryagain" "255.255.255.255" "onemoretry"
Explanation:
sub("\\.", "\\\\.", sample1) escapes the first dot in each element, since a dot is a special character in regex (use gsub instead if an element may contain more than one dot).
paste(sub("\\.", "\\\\.", sample1), collapse="|") combines all the elements with | as the delimiter.
paste0("(",paste(sub("\\.", "\\\\.", sample1), collapse="|"),")\\b") creates a regex with all the elements inside a capturing group followed by a word boundary. The \\b is essential here: an element only matches when it ends at a word boundary, so a shorter element cannot match the beginning of a longer one.
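A short sketch of what the trailing \\b prevents. The input "example.abbott" is hypothetical (it is not in sample2), and perl = TRUE is used here on purpose because PCRE tries the alternatives from left to right, which makes the difference visible.

```r
sample1 <- c(".aaa", ".aarp", ".abb", ".abbott", ".abogado")
x <- "example.abbott"  # hypothetical input, not in sample2

# Escape every dot, then join the elements with |
alts <- paste(gsub("\\.", "\\\\.", sample1), collapse = "|")

no_boundary   <- gsub(paste0("(", alts, ")"), "", x, perl = TRUE)
with_boundary <- gsub(paste0("(", alts, ")\\b"), "", x, perl = TRUE)

print(no_boundary)    # "exampleott": ".abb" matched inside ".abbott"
print(with_boundary)  # "example"
```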
R: removing part of the word in a character string
You may use
[[:punct:]]*span[[:punct:]]*
See the regex demo.
Details
[[:punct:]]* - 0+ punctuation chars
span - a literal substring
[[:punct:]]* - 0+ punctuation chars
R Demo:
words <- c("somethingspan.", "..span?", "spanthank", "great to hear", "yourspan")
words <- gsub("[[:punct:]]*span[[:punct:]]*", "", words) # Remove spans
words <- words[words != ""] # Discard empty elements
paste(words, collapse=" ") # Concat the elements
## => [1] "something thank great to hear your"
If whitespace-only elements remain after removing the unwanted strings, you may replace the second step with words <- words[trimws(words) != ""] (instead of words[words != ""]).
r code removing words containing @
You can use the following:
gsub('\\S+@\\S+', 'email', data)
Explanation:
\S matches any non-whitespace character. So here we match one or more non-whitespace characters, followed by @, followed by one or more non-whitespace characters.
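A small sketch with made-up data, showing both the replacement from the answer and a variant that removes the word entirely. The data vector and the leading \\s* in the second pattern are assumptions for illustration.

```r
# Hypothetical data
data <- c("write to jane@example.com today", "no address here")

# Replace every token containing @ with the word "email"
replaced <- gsub("\\S+@\\S+", "email", data)

# Or drop the token together with the whitespace in front of it
removed <- gsub("\\s*\\S+@\\S+", "", data)

print(replaced)  # "write to email today" "no address here"
print(removed)   # "write to today"       "no address here"
```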
How to remove  from scraped in text in R?
You could use library(stringr)
text <- "ALBANYÃ, OFFÃ, REBOUND BY"
library(stringr)
str_replace_all(text, "Ã,Â", "")
#> [1] "ALBANY OFF REBOUND BY"
or with gsub:
gsub("Ã,Â","",text)
#> [1] "ALBANY OFF REBOUND BY"
However, I think it is an encoding issue in the first place.
Moreover, the results of gsub or str_replace_all may differ with the encoding, which could be why your text <- gsub(",", "", text) does not work.
You could check the encoding with Encoding().
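A minimal sketch of checking the encoding before replacing. The string below is a hypothetical piece of mojibake written with \u escapes so that the example does not depend on how this page itself is encoded; it is not the asker's actual data.

```r
# Hypothetical mojibake: U+00C2 followed by a non-breaking space (U+00A0)
text <- "ALBANY\u00c2\u00a0OFF"

print(Encoding(text))   # "UTF-8"
print(validUTF8(text))  # TRUE

# fixed = TRUE treats the pattern as a literal string, not a regex
cleaned <- gsub("\u00c2\u00a0", " ", text, fixed = TRUE)
print(cleaned)  # "ALBANY OFF"
```

If Encoding() reports "unknown" or "latin1" for scraped text, the literal pattern may not match until the input is converted, for example with enc2utf8().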
how to delete words from a list in a column in R
First read the data:
dat <- c("Lorem ipsum dolor",
"sit amet, consectetur adipiscing",
"elit, sed do eiusmod tempor",
"incididunt ut labore",
"et dolore magna aliqua.")
todelete <- c("Lorem", "dolore", "elit")
We can avoid loops with a little smart pasting. The | is a regex "or", so pasting it between the words lets us remove all of them in a single call:
gsub(paste0(todelete, collapse = "|"), "", dat)
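The call above removes the words wherever they occur, including inside longer words, and leaves the surrounding spaces behind. If only whole words should go, a sketch along these lines may help. The word boundaries, the whitespace cleanup and the extra element "elite squad" are assumptions added for illustration.

```r
# Shortened copy of dat plus one hypothetical element ("elite squad")
dat <- c("Lorem ipsum dolor", "elit, sed do eiusmod tempor",
         "et dolore magna aliqua.", "elite squad")
todelete <- c("Lorem", "dolore", "elit")

# \b on both sides restricts the match to whole words
pat <- paste0("\\b(?:", paste0(todelete, collapse = "|"), ")\\b")
res <- gsub(pat, "", dat, perl = TRUE)

# Squeeze repeated whitespace and trim the ends
res <- trimws(gsub("\\s+", " ", res))
print(res)
```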