Remove Multiple Patterns from Text Vector R

Removing multiple words from a string using a vector instead of regexp in R

Collapsing the words with | gives a single regex in which | acts as OR:

library(stringr)
library(magrittr)
pat <- str_c(words, collapse="|")
"hello how are you" %>%
str_remove_all(pat) %>%
trimws
#[1] "are you"

data

words <- c("hello", "how")
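
The same idea also works without stringr. A minimal base R sketch, assuming the words contain no regex metacharacters (otherwise they would need escaping first):

```r
words <- c("hello", "how")

# Collapse the words into one alternation pattern: "hello|how"
pat <- paste(words, collapse = "|")

# Remove every match, then trim the whitespace left at the ends
result <- trimws(gsub(pat, "", "hello how are you"))
result
#[1] "are you"
```

Because the pattern is matched as a substring, "how" would also be removed from inside a longer word such as "however"; wrapping each word in \\b avoids that.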

How to delete multiple values from a vector?

The %in% operator tells you which elements are among the numbers to remove:

> a <- sample (1 : 10)
> remove <- c (2, 3, 5)
> a
[1] 10 5 2 7 1 6 3 4 8 9
> a %in% remove
[1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
> a [! a %in% remove]
[1] 10 7 1 6 4 8 9

Note that this keeps NA and Inf in a, because neither of them is listed in remove. It also keeps duplicate values in a as long as they are not listed in remove.

  • %in% is a convenient shortcut for match. If a can contain NA or Inf but remove does not, the explicit match call below gives the same result, returning 0 for every element of a that has no match in remove:

    > a <- c (a, NA, Inf)
    > a
    [1] 10 5 2 7 1 6 3 4 8 9 NA Inf
    > match (a, remove, nomatch = 0L, incomparables = 0L)
    [1] 0 3 1 0 0 0 2 0 0 0 0 0
    > a [match (a, remove, nomatch = 0L, incomparables = 0L) == 0L]
    [1] 10 7 1 6 4 8 9 NA Inf

    incomparables = 0L is not needed here, because NA and Inf do not match anything in remove anyway (the argument lists values that are never allowed to match), so it can be left out.

    This is, incidentally, what setdiff does internally, except that setdiff also applies unique and would therefore throw away duplicates in a that are not in remove.

  • If remove contains NA, %in% and match treat it as matching NA in a, so those elements are removed along with the rest. You can also check for them explicitly, e.g.

    if (any (is.na (remove)))
      a <- a [! is.na (a)]

    (is.na does not distinguish NA from NaN, but the R manual warns anyway that one should not rely on a difference between them.)

    Inf and -Inf are ordinary values for %in%. To test for them explicitly, check both is.infinite and the sign.
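
A quick way to see how %in% treats NA, Inf and duplicates is to try it on a small vector. The sketch below uses made-up values (not the sampled vector from above), so the inputs are an assumption for illustration only:

```r
# Hypothetical input with a duplicate (5), an NA and an Inf
a <- c(10, 5, 2, 7, 5, NA, Inf)
remove <- c(2, 5)

# NA and Inf are not listed in remove, so they stay; both 5s are dropped
kept <- a[!a %in% remove]

# Listing NA in the removal set drops the NA as well
kept_no_na <- a[!a %in% c(remove, NA)]

# setdiff() additionally applies unique(), so repeated values collapse
dedup <- setdiff(c(1, 1, 2, 3), 3)
```

Here kept holds 10, 7, NA and Inf, kept_no_na holds 10, 7 and Inf, and dedup holds 1 and 2.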

How to remove multiple rows that match more than 1 pattern in R?

grep and grepl are not vectorized over pattern. Use | to combine the patterns into a single regex string:

pattern <- paste(c("Chloroflexota", "Desulfobacterota_D", "Gemmatimonadota"),
                 collapse = "|")
custom_BGCs[!grepl(pattern, custom_BGCs$Phylum), ]

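To make the filtering step reproducible, here is a small sketch with a stand-in data frame. The rows and the BGC column are invented for illustration; only the name custom_BGCs, the Phylum column and the three phylum names come from the answer above.

```r
# Hypothetical stand-in for the custom_BGCs data frame
custom_BGCs <- data.frame(
  BGC    = c("bgc1", "bgc2", "bgc3", "bgc4"),
  Phylum = c("Chloroflexota", "Proteobacteria",
             "Gemmatimonadota", "Actinobacteriota")
)

drop <- c("Chloroflexota", "Desulfobacterota_D", "Gemmatimonadota")

# One regex with | between the alternatives
pattern <- paste(drop, collapse = "|")

# Keep only the rows whose Phylum matches none of the patterns
filtered <- custom_BGCs[!grepl(pattern, custom_BGCs$Phylum), ]
filtered$Phylum
#[1] "Proteobacteria"   "Actinobacteriota"
```

If the phylum names have to match exactly rather than as substrings, custom_BGCs[!custom_BGCs$Phylum %in% drop, ] avoids accidental partial matches.
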
Looping over patterns list to remove them for a string column in R

You can collapse the patterns into one regex pattern and use str_remove_all to remove every occurrence of it.

library(dplyr)
library(stringr)

# patterns is the character vector of regex patterns to remove
ptrn <- paste0(patterns, collapse = '|')

df <- df %>% mutate(client_name = str_remove_all(client_name, ptrn))
df

#  client_id client_name
#1         1        name
#2         2        name
#3         3        name
#4         4        name
#5         5        name
#6         6        name
#7         7        name
#8         8        name
#9         9        name

data

client_id <- 1:9 
client_name <- c("name5", "-name", "name--", "name-µ", "name²", "name31", "7name8", "name514", "²name8")
df <- data.frame(client_id, client_name)
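
The patterns vector itself is not shown above. The sketch below assumes one possible set (digits, the hyphen, µ and ²) that produces the output shown, and uses base R gsub in place of str_remove_all so it runs without extra packages. The contents of patterns are a guess, not the original question's vector.

```r
client_id <- 1:9
client_name <- c("name5", "-name", "name--", "name-µ", "name²",
                 "name31", "7name8", "name514", "²name8")
df <- data.frame(client_id, client_name)

# Assumed pattern vector (hypothetical, chosen to fit the data)
patterns <- c("[0-9]", "-", "µ", "²")

# Collapse into a single alternation: "[0-9]|-|µ|²"
ptrn <- paste0(patterns, collapse = "|")

# Remove every occurrence of any pattern from the column
df$client_name <- gsub(ptrn, "", df$client_name)
```

After this, every entry of df$client_name is "name".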

Regex operator to remove multiple strings

Here is another regex:

gsub("^.*?(: |\\| )", "", x)

or

gsub("^.*?(:|\\|) ", "", x)

or

gsub("^.*?(:|\\|) ?", "", x) #if the vector contains mixed `:text`, `| text` without and with spaces
#output
[1] "AGE"
[2] "COUNTRY"
[3] "STATE, PROVINCE, COUNTY, ETC"
[4] "100 Grand Bar"
[5] "Anonymous brown globs that come in black and \norange wrappers\t(a.k.a. Mary Janes)"
[6] "Any full-sized candy bar"
[7] "Black Jacks"

^.*? - lazily matches as few characters as possible from the start of the string

(: |\\| ) - a colon or a literal | (written \\| in an R string), each followed by a space
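
The original vector x is not shown, so here is a small sketch on invented strings that follow the two shapes the pattern targets ("prefix: text" and "prefix | text"). The input values are assumptions for illustration.

```r
# Hypothetical input: a prefix, then ": " or "| ", then the text to keep
x <- c("Q1: AGE", "Q2: COUNTRY", "Q6 | 100 Grand Bar")

# Strip everything up to and including the first ": " or "| "
cleaned <- gsub("^.*?(: |\\| )", "", x)
cleaned
```

Because the match is anchored at ^ and lazy, only the first separator is consumed, so a colon later in the text would be left alone.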

R, stringr - replace multiple characters from all elements of a vector with a single command

str_replace_all can take a named vector that maps each pattern to its replacement:

str_replace_all(vec, c("X" = "", "Y" = "-"))
[1] "abc-def" "abc-def" "abc-def" "ghi-jkl" "ghi-jkl" "ghi-jkl"
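
Since vec is not defined above, the following sketch uses an invented input to show how the named vector is applied. The values in vec are hypothetical.

```r
library(stringr)

# Hypothetical input containing the characters to replace
vec <- c("abcXYdef", "ghiXYjkl")

# Names are the patterns, values are their replacements:
# "X" is removed, "Y" becomes "-"
out <- str_replace_all(vec, c("X" = "", "Y" = "-"))
out
#[1] "abc-def" "ghi-jkl"
```

The replacements are applied one after another in the order given, so a later pattern sees the result of the earlier ones.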

Removing words featured in character vector from string

You could use the tm package for this:

library(tm)
removeWords(str,stopwords)
#[1] "I have "
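
If installing tm is not an option, whole-word removal can be approximated in base R. This is a sketch of the general technique with invented input, not tm's own implementation. The names text and words_to_remove are hypothetical.

```r
# Hypothetical input and word list
text <- "I have a red apple"
words_to_remove <- c("a", "red", "apple")

# \\b anchors each alternative at word boundaries, so the "a" inside
# "have" is left untouched
pattern <- paste0("\\b(", paste(words_to_remove, collapse = "|"), ")\\b")

# Remove the words, then trim the spaces they leave behind
result <- trimws(gsub(pattern, "", text, perl = TRUE))
result
#[1] "I have"
```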

Removing regular expressions from text string in a data-frame in R

The regex is failing because you need to escape all special characters. See the differences here:

# orig delimiters1=c('"', "\r\n", '-', '=', ';')
delimiters1=c('\\"', "\r\n", '-', '\\=', ';')

# orig delimiters2=c('*', ',', ':')
delimiters2=c('\\*', ',', '\\:')
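
An alternative that sidesteps escaping altogether is to remove each delimiter as a literal string with fixed = TRUE. The sketch below is a base R variation under that assumption, not the approach used in this answer, and the sample text is invented.

```r
# The unescaped delimiters from both original vectors
delimiters <- c('"', "\r\n", "-", "=", ";", "*", ",", ":")

# Hypothetical input containing several of the delimiters
text <- 'a"b-c=d;e*f,g:h'

# Remove the delimiters one at a time; fixed = TRUE treats each one
# as a literal string, so no regex escaping is needed
cleaned <- Reduce(function(s, d) gsub(d, "", s, fixed = TRUE),
                  delimiters, text)
cleaned
#[1] "abcdefgh"
```

This makes one pass per delimiter instead of a single regex pass, which is usually fine for short delimiter lists.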

For str_replace_all() the words need to be a single string separated by | rather than a vector of 12 strings:

wordstoreplace <-
  c('HAVELLS', 'Havells', 'Bajaj', 'BAJAJGrade A', 'PHILIPS',
    'Philips', "MAKEBAJAJ/CG", "philips", "Philips/Grade A/Grade A/CG/GEPurchase", "CG", "Bajaj",
    "BAJAJ") %>%
  paste0(collapse = "|")
# "HAVELLS|Havells|Bajaj|BAJAJGrade A|PHILIPS|Philips|MAKEBAJAJ/CG|philips|Philips/Grade A/Grade A/CG/GEPurchase|CG|Bajaj|BAJAJ"

This then runs without throwing an error:

dat1 <-
  dat %>%
  mutate(
    x1 =
      str_remove_all(x1, regex(str_c("\\b", wordstoremove, "\\b", collapse = "|"), ignore_case = TRUE)),
    x1 = str_replace_all(x1, wordstoreplace, "Grade A")
  )

