Replace accented characters in R with non-accented counterpart (UTF-8 encoding)
The answers below are largely adapted from elsewhere. The key is getting your unwanted_array into the right format. You might want it as a named list:
unwanted_array = list( 'Š'='S', 'š'='s', 'Ž'='Z', 'ž'='z', 'À'='A', 'Á'='A', 'Â'='A', 'Ã'='A', 'Ä'='A', 'Å'='A', 'Æ'='A', 'Ç'='C', 'È'='E', 'É'='E',
'Ê'='E', 'Ë'='E', 'Ì'='I', 'Í'='I', 'Î'='I', 'Ï'='I', 'Ñ'='N', 'Ò'='O', 'Ó'='O', 'Ô'='O', 'Õ'='O', 'Ö'='O', 'Ø'='O', 'Ù'='U',
'Ú'='U', 'Û'='U', 'Ü'='U', 'Ý'='Y', 'Þ'='B', 'ß'='Ss', 'à'='a', 'á'='a', 'â'='a', 'ã'='a', 'ä'='a', 'å'='a', 'æ'='a', 'ç'='c',
'è'='e', 'é'='e', 'ê'='e', 'ë'='e', 'ì'='i', 'í'='i', 'î'='i', 'ï'='i', 'ð'='o', 'ñ'='n', 'ò'='o', 'ó'='o', 'ô'='o', 'õ'='o',
'ö'='o', 'ø'='o', 'ù'='u', 'ú'='u', 'û'='u', 'ý'='y', 'þ'='b', 'ÿ'='y' )
You can do this easily with iconv or chartr. For illustration, assume string holds an accented name such as "Holmér" (the original input is not shown):
> iconv(string, to='ASCII//TRANSLIT')
[1] "Holmer"
> chartr(paste(names(unwanted_array), collapse=''),
         paste(unwanted_array, collapse=''),
         string)
[1] "Holmer"
Note that chartr translates character for character, so the multi-character replacement 'ß'='Ss' in the list above will silently shift every mapping that follows it; apply entries like that separately (e.g. with gsub) before calling chartr.
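One way to split the work is a sketch like the following; the small stand-in map and the example input here are made up for illustration, not part of the original answer:

```r
# chartr translates character for character, so multi-character
# replacements such as 'ß' = 'Ss' must be applied with gsub first.
unwanted_array <- list('ß' = 'Ss', 'é' = 'e', 'ö' = 'o')  # small stand-in map
string <- "Schößé"                                        # made-up example

map    <- unlist(unwanted_array)   # named character vector, names preserved
multi  <- map[nchar(map) > 1]
single <- map[nchar(map) == 1]

out <- string
for (i in seq_along(multi)) {
  out <- gsub(names(multi)[i], multi[[i]], out, fixed = TRUE)
}
out <- chartr(paste(names(single), collapse = ''),
              paste(single, collapse = ''),
              out)
out  # "SchoSse"
```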
Otherwise you have to loop through all of the replacements, because mapply or similar would not account for symbols already replaced by earlier gsub calls:
# the loop:
out <- string
for (i in seq_along(unwanted_array))
  out <- gsub(names(unwanted_array)[i], unwanted_array[i], out)
The result:
> out
[1] "Holmer"
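For reuse, the loop can be wrapped in a small helper. This is just a sketch: the function name remove_accents and the fixed = TRUE argument are my additions, not part of the original answer:

```r
# Wraps the gsub loop above; `map` is a named list or vector such as
# unwanted_array. fixed = TRUE stops gsub from treating the accented
# characters as regular expressions.
remove_accents <- function(x, map) {
  out <- x
  for (i in seq_along(map)) {
    out <- gsub(names(map)[i], map[[i]], out, fixed = TRUE)
  }
  out
}

remove_accents("Holmér", list('é' = 'e'))  # "Holmer"
```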
Replacing special characters from different encodings in R
This is an encoding problem. You may be able to fix it, but it's hard to know without the file. readBin is a good bet if you can't force the proper encoding. Here is a summary of what I found:
I tried iconv for the example string:
iconv(a, "UTF-8", "WINDOWS-1252")
#[1] "Geidorf: Grabengürtel"
And it works, but you are right that something is up with "Ã":
iconv("Geidorf: Grabengürtel Ã", "UTF-8", "WINDOWS-1252")
#[1] NA
We can see which letters are problematic (here ist holds the mangled strings and soll their intended counterparts):
ist[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))]
[1] "Ã" "Ã" "Ã" "Ã" "Ã" "Ã"
# corresponding characters
paste(soll[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))])
[1] "Á" "Í" "Ï" "Ð" "Ý" "à"
The site you linked to has a relevant page, which spells out what the issue is:
Encoding Problem: Double Mis-Conversion
Symptom
With this particular double conversion, most characters display
correctly. Only characters with a second UTF-8 byte of 0x81, 0x8D,
0x8F, 0x90, 0x9D fail. In Windows-1252, the following characters with
the Unicode code points: U+00C1, U+00CD, U+00CF, U+00D0, and U+00DD
will show the problem. If you look at the I18nQA Encoding Debug Table
you can see that these characters in UTF-8 have second bytes ending in
one of the Unassigned Windows code points: Á Í Ï Ð Ý
"à" is a different case. You have mapped it to "Ã" when it should be "Ã\u00A0" or "Ã\xA0" or "Ã " (note that the space there is a non-breaking space, not a normal one). So fixing that in ist takes care of one letter.
As for the other characters ("Á", "Í", "Ï", "Ð", and "Ý"): as is, they are all mapped to "Ã" in ist, and you'll never be able to do the appropriate substitutions as long as that's true.
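The mangling described above can be reproduced directly in R. A sketch, with Latin-1 used as a stand-in for Windows-1252 (both map 0xC3 to "Ã" and 0xA0 to a non-breaking space):

```r
# "à" encoded as UTF-8 is the byte pair c3 a0.
bytes <- charToRaw(enc2utf8("à"))
as.character(bytes)   # "c3" "a0"

# Reading those two bytes as single-byte Latin-1 text yields "Ã"
# (0xC3) followed by a non-breaking space (0xA0) -- the "Ã\u00A0"
# pair discussed above.
mangled <- iconv(rawToChar(bytes), from = "latin1", to = "UTF-8")

# Converting back, as in the earlier iconv() call, undoes the damage:
# the result is the byte pair c3 a0 again, i.e. "à" when read as UTF-8.
iconv(mangled, from = "UTF-8", to = "latin1")
```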
How can I easily replace special characters with rvest-friendly UTF-8 (hex.)
How about URLencode in utils? Here is how it works on your example:
> library(utils)
> URLencode("glaçage")
[1] "gla%E7age"
> z <- URLencode("glaçage")
> URLdecode(z)
[1] "glaçage"
Second example:
> URLencode("café")
[1] "caf%E9"
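Note that the %E7 and %E9 escapes above are Latin-1 single bytes; URLencode simply percent-encodes the bytes of the string as stored, so a string stored as UTF-8 produces two-byte escapes instead, and reserved = TRUE is needed if you also want characters like "/" and "?" escaped:

```r
# A string stored as UTF-8 is escaped byte by byte as UTF-8.
URLencode("glaçage")                 # "gla%C3%A7age" for a UTF-8 string
URLencode("a/b?c", reserved = TRUE)  # "a%2Fb%3Fc"
```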
How do I remove diacritics (accents) from a string in .NET?
I've not used this method, but Michael Kaplan describes one in his blog post (with a confusing title) about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)
using System.Globalization;
using System.Text;

static string RemoveDiacritics(string text)
{
    // FormD decomposes each accented character into its base character
    // followed by combining marks (e.g. "é" becomes "e" + U+0301).
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder(capacity: normalizedString.Length);

    for (int i = 0; i < normalizedString.Length; i++)
    {
        char c = normalizedString[i];
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        // Keep everything except the combining marks (category Mn).
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    // Recompose the remaining characters into canonical form.
    return stringBuilder
        .ToString()
        .Normalize(NormalizationForm.FormC);
}
Note that this is a followup to his earlier post: Stripping diacritics....
The approach uses String.Normalize to split the input string into constituent glyphs (basically separating the "base" characters from the diacritics) and then scans the result and retains only the base characters. It's just a little complicated, but really you're looking at a complicated problem.
Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.
How do I type accented characters in ASCII-encoded Rd files?
Rd syntax is only LaTeX-like: it supports a limited set of macros, but these are not guaranteed to behave like their LaTeX counterparts, if any exist. Conversely, very few LaTeX macros have Rd equivalents.
This section of the Writing R Extensions manual describes most of the built-in Rd macros. Tables 1, 2, and 3 of this technical paper provide a complete list.
This section of WRE addresses encoding specifically.
Use of non-ASCII characters in Rd files is supported, provided that you declare an appropriate encoding, either in the files themselves (via \enc and \encoding) or in DESCRIPTION (via the Encoding field). However, restriction to ASCII characters is encouraged:
Wherever possible, avoid non-ASCII chars in Rd files, and even symbols such as ‘<’, ‘>’, ‘$’, ‘^’, ‘&’, ‘|’, ‘@’, ‘~’, and ‘*’ outside ‘verbatim’ environments (since they may disappear in fonts designed to render text).
The recommended way to obtain non-ASCII characters in rendered help without including non-ASCII characters in your Rd files is to use conditional text. \ifelse allows you to provide raw LaTeX for PDF help pages, raw HTML for HTML help pages, and verbatim text for plain text help pages:
\ifelse{latex}{\out{\'{e}}}{\ifelse{html}{\out{é}}{e}}
That is quite verbose, so I would suggest defining your own macro(s) in man/macros/macros.Rd. You can do this with \newcommand and \renewcommand:
\newcommand{\eacute}{\ifelse{latex}{\out{\'{e}}}{\ifelse{html}{\out{é}}{e}}}
Then you can use \eacute{} freely in all of your Rd files. To check that text is rendered the way you want in all formats, install the package and run help(topic, help_type=), with help_type equal to "text", "html", or "pdf".
Replacing accented characters in PHP
I tried all sorts of variations listed in the answers, but the following worked:
$unwanted_array = array( 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );