Java. Ignore Accents When Comparing Strings

Compare strings ignoring accented characters

You can use a java.text.Collator to compare strings while ignoring accents. Here is a simple example:

import java.text.Collator;

/**
 * @author Kennedy
 */
public class SimpleTest
{

    public static void main(String[] args)
    {
        String a = "nocao";
        String b = "noção";

        final Collator instance = Collator.getInstance();

        // PRIMARY strength compares base letters only, so accents are ignored.
        // (NO_DECOMPOSITION is a decomposition mode, not a strength.)
        instance.setStrength(Collator.PRIMARY);

        // Prints 0 because the strings are considered equal
        System.out.println(instance.compare(a, b));
    }
}
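
To make the effect of the strength setting visible, here is a minimal sketch that compares the same pair at two strengths. The class name, the helper compareAt, and the choice of Locale.ENGLISH are assumptions made for illustration; the accented word is written with Unicode escapes so the snippet does not depend on the source file encoding.

```java
import java.text.Collator;
import java.util.Locale;

class CollatorStrengthDemo {

    // Compares a and b with an English collator at the given strength.
    // Locale.ENGLISH is an assumption for this sketch; use the locale of your data.
    static int compareAt(int strength, String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(strength);
        // Decomposition is a separate setting from strength; canonical
        // decomposition also covers input written with combining marks.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        String plain = "nocao";
        String accented = "no\u00e7\u00e3o"; // same word with cedilla and tilde

        // PRIMARY: only base letters count, so this prints 0
        System.out.println(compareAt(Collator.PRIMARY, plain, accented));
        // SECONDARY: accents count, so this prints a non-zero value
        System.out.println(compareAt(Collator.SECONDARY, plain, accented));
        // PRIMARY also ignores case, so this prints 0
        System.out.println(compareAt(Collator.PRIMARY, plain, "NOCAO"));
    }
}
```

Note that PRIMARY ignores case as well as accents, which may or may not be what you want.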

Documentation: JavaDoc

I won't explain it in detail because I have only used Collators a little and I'm not an expert, but you can search for it; there are several articles on the subject.

Java. Ignore accents when comparing strings

I think you should be using the Collator class. It allows you to set a strength and locale and it will compare characters appropriately.

From the Java 1.6 API:

You can set a Collator's strength
property to determine the level of
difference considered significant in
comparisons. Four strengths are
provided: PRIMARY, SECONDARY,
TERTIARY, and IDENTICAL. The exact
assignment of strengths to language
features is locale dependant. For
example, in Czech, "e" and "f" are
considered primary differences, while
"e" and "ě" are secondary differences,
"e" and "E" are tertiary differences
and "e" and "e" are identical.

I think the important point here (which people are trying to make) is that "Joao" and "João" should never be considered equal, but when sorting you don't want them compared by their character code values, because then you would get something like Joao, John, João, which is not good. The Collator class handles this correctly.
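
A small sketch of that sorting difference, assuming an English collator (the class and method names are made up for this example, and the accented name is written with a Unicode escape):

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class CollatorSortDemo {

    // Sorts by raw UTF-16 code unit values (String's natural order).
    static List<String> sortByCodeUnit(List<String> names) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Sorts with a locale-aware Collator at its default strength.
    static List<String> sortWithCollator(List<String> names, Locale locale) {
        List<String> copy = new ArrayList<>(names);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("John", "Jo\u00e3o", "Joao");

        // Code unit order puts the accented name last: Joao, John, Joao-with-tilde
        System.out.println(sortByCodeUnit(names));
        // Collator order keeps both spellings together, ahead of John
        System.out.println(sortWithCollator(names, Locale.ENGLISH));
    }
}
```

At the default strength the two spellings are still different strings to the collator; they just sort next to each other.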

Java string searching ignoring accents

Make use of java.text.Normalizer and a shot of regex to get rid of the diacritics.

import java.text.Normalizer;
import java.text.Normalizer.Form;

public static String removeDiacriticalMarks(String string) {
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

Which you can use as follows:

String value = "Joáo";
String comparisonMaterial = removeDiacriticalMarks(value); // Joao
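
Since the question is about searching, here is one way the same idea could be wrapped into an accent-insensitive "contains" check. This is only a sketch: the names fold and containsIgnoringAccents are hypothetical, and lower-casing with Locale.ROOT is an extra assumption so that the search ignores case as well.

```java
import java.text.Normalizer;
import java.util.Locale;

class AccentInsensitiveSearch {

    // Decomposes the text, drops combining diacritical marks, and lower-cases it.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "")
                .toLowerCase(Locale.ROOT);
    }

    // True if needle occurs in haystack once both have been folded.
    static boolean containsIgnoringAccents(String haystack, String needle) {
        return fold(haystack).contains(fold(needle));
    }

    public static void main(String[] args) {
        // The haystack contains "Joao" spelled with a tilde on the a
        String haystack = "S\u00e3o Jo\u00e3o da Madeira";
        System.out.println(containsIgnoringAccents(haystack, "joao"));  // true
        System.out.println(containsIgnoringAccents(haystack, "maria")); // false
    }
}
```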

Ignoring diacritic characters when comparing words with special characters (é, è, ...)

Check out this method in Java

private static final String PLAIN_ASCII = "AaEeIiOoUu" // grave
    + "AaEeIiOoUuYy" // acute
    + "AaEeIiOoUuYy" // circumflex
    + "AaOoNn" // tilde
    + "AaEeIiOoUuYy" // umlaut
    + "Aa" // ring
    + "Cc" // cedilla
    + "OoUu" // double acute
    ;

private static final String UNICODE = "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
    + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
    + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
    + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
    + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
    + "\u00C5\u00E5" + "\u00C7\u00E7" + "\u0150\u0151\u0170\u0171";

/**
 * Remove accents from a string, replacing each accented character
 * with its ASCII equivalent.
 */
public static String removeAccents(String s) {
    if (s == null)
        return null;
    StringBuilder sb = new StringBuilder(s.length());
    int n = s.length();
    boolean found = false;
    for (int i = 0; i < n; i++) {
        char c = s.charAt(i);
        // Characters up to 126 are plain ASCII and never need a lookup
        int pos = (c <= 126) ? -1 : UNICODE.indexOf(c);
        if (pos > -1) {
            found = true;
            sb.append(PLAIN_ASCII.charAt(pos));
        } else {
            sb.append(c);
        }
    }
    // Return the original instance when nothing was replaced
    return found ? sb.toString() : s;
}

Ignoring accented letters in string comparison

FWIW, knightfor's answer below (as of this writing) should be the accepted answer.

Here's a function that strips diacritics from a string:

static string RemoveDiacritics(string text)
{
    string formD = text.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();

    foreach (char ch in formD)
    {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
        if (uc != UnicodeCategory.NonSpacingMark)
        {
            sb.Append(ch);
        }
    }

    return sb.ToString().Normalize(NormalizationForm.FormC);
}

More details on MichKap's blog (RIP...).

The principle is that it turns 'é' into two successive chars: 'e' followed by a combining acute accent.
It then iterates through the chars and skips the diacritics.

"héllo" becomes "he<acute>llo", which in turn becomes "hello".

Debug.Assert("hello"==RemoveDiacritics("héllo"));

Note: Here's a more compact .NET4+ friendly version of the same function:

static string RemoveDiacritics(string text)
{
    return string.Concat(
        text.Normalize(NormalizationForm.FormD)
            .Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch) !=
                         UnicodeCategory.NonSpacingMark)
    ).Normalize(NormalizationForm.FormC);
}
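
For readers who came here from the Java questions, the same category-based filter can be sketched in Java. This is a rough port under my own naming, not something from the answer above: Character.getType and Character.NON_SPACING_MARK stand in for CharUnicodeInfo.GetUnicodeCategory and UnicodeCategory.NonSpacingMark.

```java
import java.text.Normalizer;

class CategoryFilter {

    // Decomposes to NFD, drops every code point whose general category is
    // "non-spacing mark", then recomposes whatever is left to NFC.
    static String removeNonSpacingMarks(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        decomposed.codePoints()
                .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
                .forEach(sb::appendCodePoint);
        return Normalizer.normalize(sb, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "hello" with an acute accent on the e
        System.out.println(removeNonSpacingMarks("h\u00e9llo")); // hello
    }
}
```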

Is there a way to get rid of accents and convert a whole string to regular letters?

Start with java.text.Normalizer.

string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction

This will separate the accent marks from most characters. Then you just need to throw out every character that is not plain ASCII, which removes the separated marks:

string = string.replaceAll("[^\\p{ASCII}]", "");

If the text also contains non-ASCII characters that you want to keep (for example non-Latin letters), remove only the combining marks instead:

string = string.replaceAll("\\p{M}", "");

Here \\P{M} (uppercase) matches any character that is not a combining mark, i.e. the base glyph, and \\p{M} (lowercase) matches each accent.
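
The difference between the two replacements shows up on characters that have no decomposition, such as the German sharp s. A minimal sketch, with class and method names invented for the comparison:

```java
import java.text.Normalizer;

class FilterComparison {

    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Keeps ASCII only: removes the accents, but also any other non-ASCII character.
    static String keepAsciiOnly(String s) {
        return decompose(s).replaceAll("[^\\p{ASCII}]", "");
    }

    // Removes combining marks only: other non-ASCII characters survive.
    static String dropMarksOnly(String s) {
        return decompose(s).replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // "Strasse cafe" spelled with a sharp s (U+00DF) and an accented e (U+00E9)
        String text = "Stra\u00dfe caf\u00e9";
        System.out.println(keepAsciiOnly(text)); // Strae cafe (the sharp s is lost)
        System.out.println(dropMarksOnly(text)); // sharp s kept, accent removed
    }
}
```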

Thanks to GarretWilson for the pointer and regular-expressions.info for the great Unicode guide.


It is important to note that Normalizer by itself is insufficient to remove diacritics. For example, the following will not replace the accented é with an unaccented e:

import static java.text.Normalizer.normalize;
import static java.text.Normalizer.Form.*;

public class T {
    public static void main( final String[] args ) {
        final var text = "Brévis";

        System.out.println(
            normalize( text, NFD ) + " " +
            normalize( text, NFC ) + " " +
            normalize( text, NFKD ) + " " +
            normalize( text, NFKC )
        );
    }
}
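
To see what normalization actually does, it can help to print the code points. The sketch below (helper names are my own) shows that NFD only splits the accented letter into a base letter plus a combining mark; the mark is still there until it is removed in a second step.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

class NormalizeThenStrip {

    // Renders a string as a space-separated list of code points, e.g. "U+0065 U+0301".
    static String codePointsOf(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    // The second step that Normalizer alone does not perform: drop the combining marks.
    static String stripMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String text = "Br\u00e9vis"; // same word as above, accent written as an escape
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);

        System.out.println(codePointsOf(text)); // U+0042 U+0072 U+00E9 U+0076 U+0069 U+0073
        System.out.println(codePointsOf(nfd));  // U+0042 U+0072 U+0065 U+0301 U+0076 U+0069 U+0073
        System.out.println(stripMarks(text));   // Brevis
    }
}
```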

Easy way to remove accents from a Unicode string?

Finally, I've solved it by using the Normalizer class.

import java.text.Normalizer;

public static String stripAccents(String s)
{
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    return s;
}

Comparing strings that contain accents in SQLite for Java doesn't work

UPPER handles only ASCII characters.

You should ensure that player names have consistent capitalization when searching them.
(Your code already assumes that spelling, name order, initials etc. are consistent.)

(And what happens when two players have the same name?)


