Replace Accented Characters in R with Non-Accented Counterpart (UTF-8 Encoding)

Replace accented characters in R with non-accented counterpart (UTF-8 encoding)

The answers below are largely adapted from elsewhere. The key is getting your unwanted_array into the right format. You might want it as a list:

unwanted_array <- list('Š'='S', 'š'='s', 'Ž'='Z', 'ž'='z', 'À'='A', 'Á'='A', 'Â'='A', 'Ã'='A', 'Ä'='A', 'Å'='A', 'Æ'='A', 'Ç'='C', 'È'='E', 'É'='E',
                       'Ê'='E', 'Ë'='E', 'Ì'='I', 'Í'='I', 'Î'='I', 'Ï'='I', 'Ñ'='N', 'Ò'='O', 'Ó'='O', 'Ô'='O', 'Õ'='O', 'Ö'='O', 'Ø'='O', 'Ù'='U',
                       'Ú'='U', 'Û'='U', 'Ü'='U', 'Ý'='Y', 'Þ'='B', 'ß'='Ss', 'à'='a', 'á'='a', 'â'='a', 'ã'='a', 'ä'='a', 'å'='a', 'æ'='a', 'ç'='c',
                       'è'='e', 'é'='e', 'ê'='e', 'ë'='e', 'ì'='i', 'í'='i', 'î'='i', 'ï'='i', 'ð'='o', 'ñ'='n', 'ò'='o', 'ó'='o', 'ô'='o', 'õ'='o',
                       'ö'='o', 'ø'='o', 'ù'='u', 'ú'='u', 'û'='u', 'ü'='u', 'ý'='y', 'þ'='b', 'ÿ'='y')
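
The snippets below also assume a variable string, which isn't defined in this excerpt; a hypothetical input such as the following reproduces the output shown:

# hypothetical input (the original question's value isn't shown here):
string <- "H\u00f8lmer"   # prints as "Hølmer"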

You can do this easily with iconv or chartr:

> iconv(string, to='ASCII//TRANSLIT')
[1] "Holmer"

> # chartr needs a one-to-one character mapping, so the two-character
> # replacement 'ß' -> 'Ss' must be dropped here and handled separately (e.g. with gsub):
> one_to_one <- unwanted_array[nchar(unlist(unwanted_array)) == 1]
> chartr(paste(names(one_to_one), collapse=''),
+        paste(unlist(one_to_one), collapse=''),
+        string)
[1] "Holmer"

Otherwise you have to loop through all of the replacements, because mapply or similar wouldn't account for symbols already replaced by previous gsub operations:

# the loop (fixed = TRUE treats each pattern as a literal character, not a regex):
out <- string
for (i in seq_along(unwanted_array)) {
  out <- gsub(names(unwanted_array)[i], unwanted_array[i], out, fixed = TRUE)
}

The result:

> out
[1] "Holmer"

Replacing special characters from different encodings in R

This is an encoding problem. You may be able to fix it, but it's hard to know without the file. readBin is a good bet if you can't force the proper encoding. Here is a summary of what I found:

I tried iconv for the example string

iconv(a, "UTF-8", "WINDOWS-1252")
#[1] "Geidorf: Grabengürtel"

And it works, but you are right that something is up with "Ã":

iconv("Geidorf: Grabengürtel Ã", "UTF-8", "WINDOWS-1252")
#[1] NA

We can see which letters are problematic (ist and soll here are the question's vectors of mis-encoded characters and their intended counterparts):

ist[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))]
[1] "Ã" "Ã" "Ã" "Ã" "Ã" "Ã"

# corresponding characters
paste(soll[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))])
[1] "Á" "Í" "Ï" "Ð" "Ý" "à"

The site you linked to has a relevant page, which spells out what the issue is:

Encoding Problem: Double Mis-Conversion

Symptom

With this particular double conversion, most characters display correctly. Only characters with a second UTF-8 byte of 0x81, 0x8D, 0x8F, 0x90, or 0x9D fail. In Windows-1252, the characters with the Unicode code points U+00C1, U+00CD, U+00CF, U+00D0, and U+00DD will show the problem. If you look at the I18nQA Encoding Debug Table you can see that these characters in UTF-8 have second bytes ending in one of the unassigned Windows code points.

Á Í Ï Ð Ý


"à" is a different case. You have mapped it to "Ã" when it should be "Ã\u00A0" or "Ã\xA0" or "à" (note that the space is not a normal space; it's a non-breaking space). So, fixing that in ist takes care of one letter.

As for the other characters ("Á", "Í", "Ï", "Ð", and "Ý"): as it stands, they are all mapped to "Ã" in ist, and you'll never be able to make the appropriate substitutions as long as that's true.

How can I easily replace special characters with rvest-friendly UTF-8 (hex.)

How about URLencode in utils? Here is how it works on your example:

> library(utils)
> URLencode("glaçage")
[1] "gla%E7age"
> z <- URLencode("glaçage")
> URLdecode(z)
[1] "glaçage"

Second example:

> URLencode("café")
[1] "caf%E9"

How do I remove diacritics (accents) from a string in .NET?

I've not used this method myself, but Michael Kaplan describes a method for doing so in his blog post (with a confusing title) that talks about stripping diacritics: Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

// Requires: using System.Globalization; and using System.Text;
static string RemoveDiacritics(string text)
{
    // Decompose each character into its base character plus combining marks
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder(capacity: normalizedString.Length);

    for (int i = 0; i < normalizedString.Length; i++)
    {
        char c = normalizedString[i];
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        // Keep everything except the combining (non-spacing) marks
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    // Recompose what is left into canonical composed form
    return stringBuilder
        .ToString()
        .Normalize(NormalizationForm.FormC);
}
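
For example, RemoveDiacritics("crème brûlée") should return "creme brulee".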

Note that this is a followup to his earlier post: Stripping diacritics....

The approach uses String.Normalize to split the input string into its constituent glyphs (basically separating the "base" characters from the diacritics) and then scans the result, retaining only the base characters. It's a little involved, but really you're looking at a complicated problem.

Of course, if you're limiting yourself to French, you could probably get away with the simple table-based approach in How to remove accents and tilde in a C++ std::string, as recommended by @David Dibben.

How do I type accented characters in ASCII-encoded Rd files?

Rd syntax is only LaTeX-like: it supports a limited set of macros, but these are not guaranteed to behave like their LaTeX counterparts, if any exist. Conversely, very few LaTeX macros have Rd equivalents.

This section of the Writing R Extensions manual describes most of the built-in Rd macros. Tables 1, 2, and 3 of this technical paper provide a complete list.

This section of WRE addresses encoding specifically.
Use of non-ASCII characters in Rd files is supported, provided that you declare an appropriate encoding, either in the files themselves (via \enc and \encoding) or in DESCRIPTION (via the Encoding field). However, restriction to ASCII characters is encouraged:

Wherever possible, avoid non-ASCII chars in Rd files, and even symbols such as ‘<’, ‘>’, ‘$’, ‘^’, ‘&’, ‘|’, ‘@’, ‘~’, and ‘*’ outside ‘verbatim’ environments (since they may disappear in fonts designed to render text).

The recommended way to obtain non-ASCII characters in rendered help without including non-ASCII characters in your Rd files is to use conditional text. \ifelse allows you to provide raw LaTeX for PDF help pages, raw HTML for HTML help pages, and verbatim text for plain text help pages:

\ifelse{latex}{\out{\'{e}}}{\ifelse{html}{\out{&eacute;}}{e}}

That is quite verbose, so I would suggest defining your own macro(s) in man/macros/macros.Rd. You can do this with \newcommand and \renewcommand:

\newcommand{\eacute}{\ifelse{latex}{\out{\'{e}}}{\ifelse{html}{\out{&eacute;}}{e}}}

Then you can use \eacute{} freely in all of your Rd files. To check that text is rendered the way you want in all formats, install the package and run help(topic, help_type=), with help_type equal to "text", "html", or "pdf".
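
For example, with a hypothetical package mypkg documenting a topic foo:

help("foo", package = "mypkg", help_type = "html")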

Replacing accented characters in PHP

I tried all sorts of things based on the variations listed in the other answers, but the following worked:

$unwanted_array = array( 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
                         'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
                         'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss', 'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c',
                         'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
                         'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b', 'ÿ'=>'y' );
$str = strtr( $str, $unwanted_array );
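
Note that lowercase 'ü' is missing from this table even though uppercase 'Ü' is present; add 'ü'=>'u' if your input may contain it.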

