Replace multiple letters with accents with gsub
Use the character translation function, chartr:
chartr("áéó", "aeo", mydata)
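As a minimal sketch of the call above, the following runs chartr on a small made-up vector. The sample values in mydata are hypothetical (the question's data is not shown), and a UTF-8 session is assumed.

```r
# Hypothetical sample data standing in for the question's `mydata`
mydata <- c("José", "María", "canción")

# chartr maps the i-th character of `old` to the i-th character of `new`,
# so both strings must list the characters in matching order
cleaned <- chartr("áéíó", "aeio", mydata)
print(cleaned)
```

Note that chartr only handles one-to-one character mappings; replacing one character with several (for example ß with ss) still needs gsub.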
How to replace special characters with gsub in R?
If writing the characters as-is does not work, you can try using their Unicode escapes.
Here are the Unicode code points of the relevant letters, from Wikipedia:
ş U+015F (351) https://en.wikipedia.org/wiki/%C5%9E
ţ U+0163 (355) https://en.wikipedia.org/wiki/%C5%A2
ș U+0219 (537) https://en.wikipedia.org/wiki/S-comma
ț U+021B (539) https://en.wikipedia.org/wiki/T-comma
You can do the conversion in R as shown below. utf8ToInt is convenient for verifying that the letters are converted as intended.
x <- "ş__ţ"
utf8ToInt(x)
# 351 95 95 355
x2 <- gsub("\u015F", "\u0219", x)
utf8ToInt(x2)
# 537 95 95 355
x3 <- gsub("\u0163", "\u021B", x)
utf8ToInt(x3)
# 351 95 95 539
By the way, since this is a letter-to-letter conversion, the chartr function is more efficient than gsub because you can convert multiple pairs of letters at once, as below.
x4 <- chartr("\u015F\u0163", "\u0219\u021B", x)
utf8ToInt(x4)
# 537 95 95 539
gsub() not recognizing and replacing certain accented characters
Use stringi::stri_trans_general:
library(stringi)
df<-data.frame(Name=c("Stipe Miočić","Duško Todorović","Michał Oleksiejczuk","Jiři Prochazka","Bartosz Fabiński","Damir Hadžović","Ľudovit Klein","Diana Belbiţă","Joanna Jędrzejczyk" ))
stri_trans_general(df$Name, "Latin-ASCII")
Results:
[1] "Stipe Miocic" "Dusko Todorovic" "Michal Oleksiejczuk"
[4] "Jiri Prochazka" "Bartosz Fabinski" "Damir Hadzovic"
[7] "Ludovit Klein" "Diana Belbita" "Joanna Jedrzejczyk"
See R proof.
Using gsub with multiple conditions to substitute a list of words in a PDF
If it is a substring replacement, one option is a loop with gsub. Create two vectors of the same length for the patterns and the replacements, then loop over the sequence of the vector, do each replacement with gsub, and assign the result back to the same object:
pat <- c("apple", "banana", "squash")
replace <- c("fruit", "fruit", "vegetable")
for (i in seq_along(pat)) x <- gsub(pat[i], replace[i], x)
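The loop above assumes x already exists. Here is a self-contained sketch with a made-up input vector (the strings in x are hypothetical, not from the question's PDF). It also passes fixed = TRUE, an optional choice that treats each pattern as literal text rather than a regular expression.

```r
pat <- c("apple", "banana", "squash")
replace <- c("fruit", "fruit", "vegetable")

# Hypothetical input: each element contains at most one of the words
x <- c("apple pie", "banana bread", "squash soup", "plain rice")

# Apply each pattern/replacement pair in turn, overwriting x each time
for (i in seq_along(pat)) {
  x <- gsub(pat[i], replace[i], x, fixed = TRUE)
}
print(x)
```

Because the replacements are applied sequentially, a later pattern can also match text produced by an earlier replacement, so the order of pat matters when the words overlap.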
If it is a fixed match on the whole string, we don't need gsub, as we can use a named vector to do the match and replace:
x <- c("apple", "apple", "banana", "squash", "banana")
unname(setNames(replace, pat)[x])
#[1] "fruit" "fruit" "fruit" "vegetable" "fruit"
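One caveat with the named-vector lookup is that values with no entry in pat come back as NA. The sketch below shows one way to keep such values unchanged; the extra value "kiwi" and the names lookup and out are made up for illustration.

```r
pat <- c("apple", "banana", "squash")
replace <- c("fruit", "fruit", "vegetable")
lookup <- setNames(replace, pat)

# "kiwi" is a hypothetical value with no entry in the lookup table
x <- c("apple", "kiwi", "squash")

out <- unname(lookup[x])           # unmatched names yield NA
out <- ifelse(is.na(out), x, out)  # fall back to the original value
print(out)
```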
GSUB replace 3 or more repeating characters
When you enclose the whole pattern inside square brackets, you make it match a single char.
Your regexps mean:
[\/\\1{3,}] - a single char: /, \, 1, {, 3, a comma, or }
[\/\2+] - a single char: /, the \u0002 char, or +
[\/{3,}] - a single char: /, {, 3, a comma, or }
You can use
s.gsub(/\/{3,}/, '//')
See the Ruby demo online.
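The answer above is in Ruby. For readers following the R examples elsewhere on this page, the same quantifier-outside-the-brackets idea can be sketched in R as follows; the sample string s is made up.

```r
# Hypothetical input with runs of 2, 3 and 4 slashes
s <- "a///b////c//d"

# "/{3,}" matches a run of three or more slashes; no character class needed
collapsed <- gsub("/{3,}", "//", s)
print(collapsed)
```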
Selectively replace - by . using gsub()
We can change the pattern to replace only the - between letters:
df[] <- lapply(df, gsub, pattern = '([[:alpha:]]+)-([[:alpha:]]+)', replacement ="\\1.\\2")
df
# x y
#1 a.b [1-2]
#2 c.d (3-4)
#3 e.f [5-6)
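The snippet above relies on an existing df. A self-contained sketch follows, with the data frame reconstructed from the printed output (the original input is an assumption: it is taken to contain - where the output shows . in column x).

```r
# Assumed input, inferred from the output shown above
df <- data.frame(x = c("a-b", "c-d", "e-f"),
                 y = c("[1-2]", "(3-4)", "[5-6)"),
                 stringsAsFactors = FALSE)

# Replace "-" only when it sits between runs of letters;
# df[] keeps the data.frame structure while replacing every column
df[] <- lapply(df, gsub,
               pattern = "([[:alpha:]]+)-([[:alpha:]]+)",
               replacement = "\\1.\\2")
print(df)
```

The digits in column y are not matched by [[:alpha:]], so the hyphens there are left alone.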
Clean string using gsub and multiple conditions
You can use
trimws(gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE))
See the regex demo. Or, to also replace multiple whitespaces with a single space, use
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
Details
(?:\w+::\w+(?:\(\))?|\p{L}+(?:[-'_$]\p{L}+)*)(*SKIP)(*F) : match either of the two patterns:
  \w+::\w+(?:\(\))? - 1+ word chars, ::, 1+ word chars and an optional () substring
  | - or
  \p{L}+ - one or more Unicode letters
  (?:[-'_$]\p{L}+)* - 0+ repetitions of -, ', _ or $ and then 1+ Unicode letters
(*SKIP)(*F) - omits and skips the match
| - or
[^\p{L}\s] - any char but a Unicode letter and whitespace
See the R demo:
myText <- c("# Re-loading pkgdown while it's running causes weird behaviour with # the context cache don't",
"# Needs to handle NA for desc::desc_get()",
'# Update href of toc anchors , use "-" instead "."',
"# Keep something$col or here_you::must_stay")
trimws(gsub("\\s{2,}", " ", gsub("(?:\\w+::\\w+(?:\\(\\))?|\\p{L}+(?:[-'_$]\\p{L}+)*)(*SKIP)(*F)|[^\\p{L}\\s]", "", myText, perl=TRUE)))
Output:
[1] "Re-loading pkgdown while it's running causes weird behaviour with the context cache don't"
[2] "Needs to handle NA for desc::desc_get()"
[3] "Update href of toc anchors use instead"
[4] "Keep something$col or here_you::must_stay"
Replace with gsub a regexp with accents
You could do something like the following:
def bold_string(str, search)
h = { "e" => "[eéê]", "a" => "[aáâ]" }
regex = search.gsub(/./) {|s| h.fetch(s, s)}
str.gsub(/(#{regex})/i, '<b>\1</b>')
end
Obviously, this just shows you how to get started; you will need to fill h with additional accented versions of characters.
Example: http://ideone.com/KukiKc