Replace Multiple Letters With Accents With Gsub

Replace multiple letters with accents with gsub

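Use the character translation function, chartr, which replaces each character of its first argument with the character at the same position in its second: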

chartr("áéó", "aeo", mydata)
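A minimal sketch of that call on sample data. The vector contents are hypothetical (the original does not show `mydata`), and the accented letters are written as Unicode escapes only so the example does not depend on the file encoding.

```r
# Hypothetical sample data standing in for `mydata`.
mydata <- c("canci\u00F3n", "Jos\u00E9", "\u00E1\u00E9\u00F3")  # canción, José, áéó

# Each character of the first string is replaced by the character
# at the same position in the second string.
cleaned <- chartr("\u00E1\u00E9\u00F3", "aeo", mydata)
print(cleaned)
```

chartr returns a new vector, so the result has to be assigned if it is to be kept.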

How to replace special characters with gsub in R?

If writing the characters as-is does not work, you can try their Unicode escapes instead.
Here are the Unicode code points of the relevant letters, from Wikipedia.

ş U+015F (351) https://en.wikipedia.org/wiki/%C5%9E
ţ U+0163 (355) https://en.wikipedia.org/wiki/%C5%A2

ș U+0219 (537) https://en.wikipedia.org/wiki/S-comma
ț U+021B (539) https://en.wikipedia.org/wiki/T-comma

You can do the conversion in R as below.
utf8ToInt is a convenient way to verify that the letters were converted as intended.

x <- "ş__ţ"
utf8ToInt(x)
# 351 95 95 355

x2 <- gsub("\u015F", "\u0219", x)
utf8ToInt(x2)
# 537 95 95 355

x3 <- gsub("\u0163", "\u021B", x)
utf8ToInt(x3)
# 351 95 95 539

By the way, since this is a letter-to-letter conversion, chartr is a better fit than gsub here, because it converts multiple pairs of letters in a single call, as below.

x4 <- chartr("\u015F\u0163", "\u0219\u021B", x)
utf8ToInt(x4)
# 537 95 95 539
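As a quick check of that claim, the sketch below chains the two gsub calls and compares the code points with the single chartr call. The variable names `via_gsub` and `via_chartr` are introduced here for illustration only.

```r
x <- "\u015F__\u0163"  # s-cedilla, two underscores, t-cedilla

# Two chained gsub calls, one per letter pair.
via_gsub <- gsub("\u0163", "\u021B", gsub("\u015F", "\u0219", x))

# One chartr call that handles both pairs at once.
via_chartr <- chartr("\u015F\u0163", "\u0219\u021B", x)

print(utf8ToInt(via_gsub))
print(all(utf8ToInt(via_gsub) == utf8ToInt(via_chartr)))
```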

gsub() not recognizing and replacing certain accented characters

Use stringi::stri_trans_general:

library(stringi)
df <- data.frame(Name = c("Stipe Miočić", "Duško Todorović", "Michał Oleksiejczuk",
                          "Jiři Prochazka", "Bartosz Fabiński", "Damir Hadžović",
                          "Ľudovit Klein", "Diana Belbiţă", "Joanna Jędrzejczyk"))
stri_trans_general(df$Name, "Latin-ASCII")

Results:

[1] "Stipe Miocic"        "Dusko Todorovic"     "Michal Oleksiejczuk"
[4] "Jiri Prochazka"      "Bartosz Fabinski"    "Damir Hadzovic"
[7] "Ludovit Klein"       "Diana Belbita"       "Joanna Jedrzejczyk"

Using gsub with multiple conditions to substitute a list of words in a PDF

If it is a substring replacement, one option is a loop with gsub. Create two vectors of the same length, one for the patterns and one for the replacements, then loop over the sequence of the pattern vector, do each replacement with gsub, and assign the result back to the same object.

pat <- c("apple", "banana", "squash")
replace <- c("fruit", "fruit", "vegetable")
for (i in seq_along(pat)) x <- gsub(pat[i], replace[i], x)
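The loop above assumes that `x` already exists. A self-contained sketch, with a hypothetical input vector, might look like this:

```r
pat <- c("apple", "banana", "squash")
replace <- c("fruit", "fruit", "vegetable")

# Hypothetical input; the original does not show the contents of `x`.
x <- c("apple pie", "banana split with squash")

# Apply each pattern/replacement pair in turn, reusing `x` as the accumulator.
for (i in seq_along(pat)) {
  x <- gsub(pat[i], replace[i], x)
}
print(x)
```

Because the pairs are applied one after another, a later pattern also sees text produced by an earlier replacement, so the order of the vectors can matter.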

If it is a fixed match on the whole string, we don't need gsub, as we can use a named vector to do the match and replace

x <- c("apple", "apple", "banana", "squash", "banana")
unname(setNames(replace, pat)[x])
#[1] "fruit" "fruit" "fruit" "vegetable" "fruit"
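One thing to watch with the named-vector lookup is that values without an entry come back as NA. The sketch below shows one possible way to keep such values unchanged; the input with "kiwi" and the names `lookup`, `mapped` and `result` are hypothetical.

```r
pat <- c("apple", "banana", "squash")
replace <- c("fruit", "fruit", "vegetable")
lookup <- setNames(replace, pat)

# Hypothetical input: "kiwi" has no entry in the lookup vector.
x <- c("apple", "kiwi", "squash")

mapped <- unname(lookup[x])                  # NA where no name matches
result <- ifelse(is.na(mapped), x, mapped)   # fall back to the original value
print(result)
```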

GSUB replace 3 or more repeating characters

When you enclose the whole pattern inside square brackets, you make it match a single char.

Your regexps mean:

  • [\/\\1{3,}] - a single char: /, \, 1, {, 3, a comma, or }
  • [\/\2+] - a single char: /, the \u0002 char, or +
  • [\/{3,}] - a single char: /, {, 3, a comma, or }

You can use

s.gsub(/\/{3,}/, '//')

Selectively replace - by . using gsub()

We can change the pattern so that it replaces the - only when it sits between letters

df[] <- lapply(df, gsub, pattern = '([[:alpha:]]+)-([[:alpha:]]+)',  replacement ="\\1.\\2")
df
#    x     y
#1 a.b [1-2]
#2 c.d (3-4)
#3 e.f [5-6)
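The input data frame is not shown above. The sketch below reconstructs a plausible one from the printed output (an assumption) so that the call can be run end to end.

```r
# Input reconstructed from the printed result; the original `df` is not shown.
df <- data.frame(x = c("a-b", "c-d", "e-f"),
                 y = c("[1-2]", "(3-4)", "[5-6)"),
                 stringsAsFactors = FALSE)

# Only a hyphen with letters on both sides is replaced,
# so the digit ranges in column y are left alone.
df[] <- lapply(df, gsub, pattern = "([[:alpha:]]+)-([[:alpha:]]+)",
               replacement = "\\1.\\2")
print(df)
```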

Clean string using gsub and multiple conditions

You can use

trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))

Or, to also replace multiple whitespaces with a single space, use

trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

Details

  • (?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F): match either of the two patterns:
    • \w+::\w+(?:\(\))? - 1+ word chars, ::, 1+ word chars and an optional () substring
    • | - or
    • \p{L}+ - one or more Unicode letters
    • (?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ and then 1+ Unicode letters
  • (*SKIP)(*F) - discards the text matched so far and resumes the search right after it, so that text is never replaced
  • | - or
  • [^\p{L}\s] - any char but a Unicode letter and whitespace
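The same (*SKIP)(*F) idiom can be seen in isolation on a smaller case. In this sketch, which uses a hypothetical input string, dots inside decimal numbers are protected and every other dot is removed.

```r
# Hypothetical input: keep the dot in "1.5", drop the sentence-ending dots.
txt <- "v1.5 is out. ok."

# Left alternative: match a decimal number, then skip past it and fail.
# Right alternative: any remaining dot, which is replaced with "".
cleaned <- gsub("\\d+\\.\\d+(*SKIP)(*F)|\\.", "", txt, perl = TRUE)
print(cleaned)
```

The pattern to keep goes on the left of the alternation and the pattern to remove goes on the right; perl = TRUE is required because the verbs are PCRE features.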

See the R demo:

myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))

Output:

[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"
[3] "Update href of toc anchors use instead"
[4] "Keep something$col or here_you::must_stay"

Replace with gsub a regexp with accents

You could do something like the following:

def bold_string(str, search)
  h = { "e" => "[eéê]", "a" => "[aáâ]" }
  regex = search.gsub(/./) { |s| h.fetch(s, s) }
  str.gsub(/(#{regex})/i, '<b>\1</b>')
end

This only shows how to get started; you will need to fill h with the accented versions of any other characters you want to match.

Example: http://ideone.com/KukiKc


