Removing multiple words from a string using a vector instead of regexp in R
We can use | to combine the words into a single regex, where | acts as OR:
library(stringr)
library(magrittr)
pat <- str_c(words, collapse="|")
"hello how are you" %>%
str_remove_all(pat) %>%
trimws
#[1] "are you"
data
words <- c("hello", "how")
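One caveat with a plain alternation is that it also matches inside longer words. Here is a minimal base R sketch, assuming whole-word removal is what is wanted. The second test string is made up for illustration, and the words are assumed to contain no regex metacharacters.

```r
words <- c("hello", "how")

# Wrap the alternation in \\b ... \\b so that "how" does not match inside "however".
pat <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# The second element is a made-up example, not part of the original question.
x <- c("hello how are you", "however you like")

# perl = TRUE is optional here; it just pins down the regex engine.
res <- trimws(gsub(pat, "", x, perl = TRUE))
print(res)
```

Without the word boundaries, "however you like" would lose its leading "how".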
How to delete multiple values from a vector?
The %in% operator tells you which elements are among the numbers to remove:
> a <- sample (1 : 10)
> remove <- c (2, 3, 5)
> a
[1] 10 5 2 7 1 6 3 4 8 9
> a %in% remove
[1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
> a [! a %in% remove]
[1] 10 7 1 6 4 8 9
Note that incomparables in a (values like NA or Inf) are removed only if they are also listed in remove; otherwise they are kept, as are duplicate values in a that are not listed in remove.
If a can contain incomparables but remove will not, we can use match, telling it to return 0 for non-matches and incomparables (%in% is a convenient shortcut for match):
> a <- c (a, NA, Inf)
> a
[1] 10 5 2 7 1 6 3 4 8 9 NA Inf
> match (a, remove, nomatch = 0L, incomparables = 0L)
[1] 0 3 1 0 0 0 2 0 0 0 0 0
> a [match (a, remove, nomatch = 0L, incomparables = 0L) == 0L]
[1] 10 7 1 6 4 8 9 NA Inf
incomparables = 0 is not needed, as incomparables will not match anyway, but I'd include it for the sake of readability.
This is, by the way, what setdiff does internally (but without the unique that throws away duplicates in a which are not in remove).
If remove contains incomparables, you'll have to check for them individually, e.g.
if (any (is.na (remove)))
  a <- a [! is.na (a)]
(This does not distinguish NA from NaN, but the R manual warns anyway that one should not rely on there being a difference between them.)
For Inf/-Inf you'll have to check both sign and is.finite.
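As a quick check of how %in% and setdiff treat duplicates, NA and Inf, here is a small sketch. The vectors are made up for illustration and differ from the sample above so that the result is deterministic.

```r
a      <- c(10, 10, 5, 2, NA, Inf)   # made-up input with a duplicate, NA and Inf
remove <- c(2, 5)

# %in% only flags values that are listed in remove.
kept <- a[!a %in% remove]

# setdiff applies the same filter but also de-duplicates the result.
kept_sd <- setdiff(a, remove)

# Once NA is listed among the values to remove, %in% matches it like any other value.
kept_na <- a[!a %in% c(remove, NA)]

print(kept)
print(kept_sd)
print(kept_na)
```

The difference between kept and kept_sd is the reason to prefer the %in% form when duplicates in a must survive.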
How to remove multiple rows that match more than 1 pattern in R?
grep/grepl is not vectorized over pattern. Use | to combine the patterns into a single string.
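To see what the combined pattern and the resulting row mask look like, here is a sketch on a made-up stand-in for custom_BGCs. The BGC column and all row values are assumptions for illustration only.

```r
# Made-up stand-in for the data frame in the question.
custom_BGCs <- data.frame(
  BGC    = c("bgc1", "bgc2", "bgc3", "bgc4"),
  Phylum = c("Chloroflexota", "Proteobacteria",
             "Gemmatimonadota", "Desulfobacterota_D"),
  stringsAsFactors = FALSE
)
drop <- c("Chloroflexota", "Desulfobacterota_D", "Gemmatimonadota")

# One regex with | between the alternatives.
pattern <- paste(drop, collapse = "|")
hit     <- grepl(pattern, custom_BGCs$Phylum)
kept    <- custom_BGCs[!hit, ]

# If the values must match exactly (not as substrings), %in% avoids regex entirely.
kept_exact <- custom_BGCs[!custom_BGCs$Phylum %in% drop, ]

print(pattern)
print(kept)
```

The regex version also drops rows where a pattern appears only as part of a longer value; the %in% version does not.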
custom_BGCs[!grepl(paste(c("Chloroflexota","Desulfobacterota_D",
"Gemmatimonadota"), collapse = "|"),custom_BGCs$Phylum),]
Looping over patterns list to remove them for a string column in R
You can collapse the patterns into one regex pattern and use str_remove_all to remove all occurrences of it.
library(dplyr)
library(stringr)
ptrn <- paste0(patterns, collapse = '|')
df <- df %>% mutate(client_name = str_remove_all(client_name, ptrn))
df
# client_id client_name
#1 1 name
#2 2 name
#3 3 name
#4 4 name
#5 5 name
#6 6 name
#7 7 name
#8 8 name
#9 9 name
data
client_id <- 1:9
client_name <- c("name5", "-name", "name--", "name-µ", "name²", "name31", "7name8", "name514", "²name8")
df <- data.frame(client_id, client_name)
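The definition of patterns is not shown above, so the vector in the sketch below is an assumption, and the names are ASCII-only stand-ins for the data. It compares the collapsed pattern with an explicit loop over the patterns, written in base R with Reduce.

```r
# ASCII-only stand-ins for client_name; `patterns` is assumed, not from the original post.
client_name <- c("name5", "-name", "name--", "name31", "7name8")
patterns    <- c("[0-9]+", "-")

# One pass with a single collapsed regex.
collapsed <- gsub(paste(patterns, collapse = "|"), "", client_name)

# Equivalent loop: apply each pattern in turn to the running result.
looped <- Reduce(function(s, p) gsub(p, "", s), patterns, client_name)

print(collapsed)
print(looped)
```

Both give the same result here; the collapsed form scans each string once, while the loop scans it once per pattern.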
Regex operator to remove multiple strings
Here is another regex:
gsub("^.*?(: |\\| )", "", x)
or
gsub("^.*?(:|\\|) ", "", x)
or
gsub("^.*?(:|\\|) ?", "", x) #if the vector contains mixed `:text`, `| text` without and with spaces
#output
[1] "AGE"
[2] "COUNTRY"
[3] "STATE, PROVINCE, COUNTY, ETC"
[4] "100 Grand Bar"
[5] "Anonymous brown globs that come in black and \norange wrappers\t(a.k.a. Mary Janes)"
[6] "Any full-sized candy bar"
[7] "Black Jacks"
^.*? - match the least amount of characters from the start of the string
(: |\\| ) - ": " or "| "
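The input vector x is not shown above, so the strings below are assumed shapes chosen to illustrate the pattern. The sketch also contrasts the lazy quantifier with a greedy one; perl = TRUE is added only to pin down the regex engine.

```r
# Assumed input shapes; the original x is not shown.
x <- c("Q1: AGE", "Q2: COUNTRY", "Q6 | 100 Grand Bar")
res <- gsub("^.*?(: |\\| )", "", x, perl = TRUE)
print(res)

# Lazy vs. greedy on a string that contains both separators.
y <- "Q6 | Mary Janes: the best"
lazy   <- gsub("^.*?(: |\\| )", "", y, perl = TRUE)  # stops at the first separator
greedy <- gsub("^.*(: |\\| )",  "", y, perl = TRUE)  # runs to the last separator
print(lazy)
print(greedy)
```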
R, stringr - replace multiple characters from all elements of a vector with a single command
str_replace_all can take a named vector of patterns and their replacements:
str_replace_all(vec, c("X" = "", "Y" = "-"))
[1] "abc-def" "abc-def" "abc-def" "ghi-jkl" "ghi-jkl" "ghi-jkl"
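The input vec is not shown above. As a self-contained sketch, assuming strings in which X should be dropped and Y turned into a hyphen:

```r
library(stringr)

# Made-up input; the original vec is not shown.
vec <- c("abcYdXef", "ghiYjXkl")

# Names are the patterns, values are the replacements.
res <- str_replace_all(vec, c("X" = "", "Y" = "-"))
print(res)
```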
Removing words featured in character vector from string
You could use the tm package for this:
library(tm)
removeWords(str,stopwords)
#[1] "I have "
Removing regular expressions from text string in a data-frame in R
The regex is failing because you need to escape all special characters. See the differences here:
# orig delimiters1=c('"', "\r\n", '-', '=', ';')
delimiters1=c('\\"', "\r\n", '-', '\\=', ';')
# orig delimiters2=c('*', ',', ':')
delimiters2=c('\\*', ',', '\\:')
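Escaping by hand is easy to get wrong. One alternative, sketched here under the assumption that the delimiters should be matched literally and that perl = TRUE is acceptable, is to quote each delimiter with \Q...\E before collapsing. The vector and test string below are made up for illustration.

```r
# Made-up delimiters and input for illustration.
delimiters <- c("*", ",", ":", "=", ";")

# \\Q ... \\E tells PCRE to treat everything in between as literal text.
pat <- paste0("\\Q", delimiters, "\\E", collapse = "|")

res <- gsub(pat, " ", "a*b,c:d=e;f", perl = TRUE)
print(res)
```

This breaks down only if a delimiter itself contains the sequence \E.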
For str_replace_all() you need the words to be a single string separated by | rather than a vector of 12 strings:
wordstoreplace <-
c('HAVELLS','Havells','Bajaj','BAJAJGrade A','PHILIPS',
'Philips',"MAKEBAJAJ/CG","philips","Philips/Grade A/Grade A/CG/GEPurchase","CG","Bajaj",
"BAJAJ") %>%
paste0(collapse = "|")
# "HAVELLS|Havells|Bajaj|BAJAJGrade A|PHILIPS|Philips|MAKEBAJAJ/CG|philips|Philips/Grade A/Grade A/CG/GEPurchase|CG|Bajaj|BAJAJ"
This then runs without throwing an error:
dat1 <-
dat %>%
mutate(
x1 =
str_remove_all(x1, regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"), ignore_case = TRUE)),
x1 = str_replace_all(x1, wordstoreplace, "Grade A")
)