Converting Symbols, Accent Letters to English Alphabet

Converting Symbols, Accent Letters to English Alphabet

Reposting my post from How do I remove diacritics (accents) from a string in .NET?

This method works fine in java (purely for the purpose of removing diacritical marks aka accents).

It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
return pattern.matcher(nfdNormalizedString).replaceAll("");
}

Replacing Accented Characters With Plain Alphabet Characters

It was an encoding issue. If I change the .java source file's encoding to UTF-8 instead of windows-1252 the code examples all work properly by outputting the expected text.

Is there a way to get rid of accents and convert a whole string to regular letters?

Use java.text.Normalizer to handle this for you.

string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction

This will separate all of the accent marks from the characters. Then, you just need to compare each character against being a letter and throw out the ones that aren't.

string = string.replaceAll("[^\\p{ASCII}]", "");

If your text is in unicode, you should use this instead:

string = string.replaceAll("\\p{M}", "");

For unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.

Thanks to GarretWilson for the pointer and regular-expressions.info for the great unicode guide.

Convert special letters to english letters in R

You can use chartr

x <- "ØxxÅxx"
chartr("ØÅ", "OA", x)
[1] "OxxAxx"

And/or gsub

y <- "Æabc"
gsub("Æ", "AE", y)
[1] "AEabc"

How to convert accented letters to regular char in Java

Look at icu4j or the JDK 1.6 Normalizer:

public String removeAccents(String text) {
return Normalizer.normalize(text, Normalizer.Form.NFD)
.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Converting A+ COMBINING ACUTE ACCENT to Á

>>> unicodedata.normalize('NFC', 'a\u0301')
'á'
>>> unicodedata.normalize('NFC', 'a\u0301').encode('unicode-escape')
b'\\xe1'

Change all accented letters to normal letters in C++

You should first define what you mean by "accented letters" what has to be done is largely different if what you have is say some extended 8 bits ASCII with a national codepage for codes above 128, or say some utf8 encoded string.

However you should have a look at libicu which provide what is necessary for good unicode based accented letters manipulation.

But it won't solve all problems for you. For instance what should you do if you get some chinese or russian letter ? What should you do if you get the Turkish uppercase I with point ? Remove the point on this "I" ? Doing so it would change the meaning of the text... etc. This kind of problems are endless with unicode. Even conventional sorting order depends of the country...



Related Topics



Leave a reply



Submit