Compare strings ignoring accented characters
You can use Java Collators to compare texts while ignoring accents. Here is a simple example:
import java.text.Collator;

/**
 * @author Kennedy
 */
public class SimpleTest
{
    public static void main(String[] args)
    {
        String a = "nocao";
        String b = "noção";
        final Collator instance = Collator.getInstance();
        // PRIMARY strength treats accent (and case) differences as insignificant.
        // NO_DECOMPOSITION is a decomposition mode, not a strength; it only
        // happens to share the same numeric value as PRIMARY.
        instance.setStrength(Collator.PRIMARY);
        // Prints 0 because the strings compare as equal
        System.out.println(instance.compare(a, b));
    }
}
Documentation: JavaDoc
I won't explain this in detail because I have only used Collators a little and I'm not an expert in them, but you can search for articles about them.
Java. Ignore accents when comparing strings
I think you should be using the Collator class. It allows you to set a strength and locale and it will compare characters appropriately.
From the Java 1.6 API:
You can set a Collator's strength
property to determine the level of
difference considered significant in
comparisons. Four strengths are
provided: PRIMARY, SECONDARY,
TERTIARY, and IDENTICAL. The exact
assignment of strengths to language
features is locale dependant. For
example, in Czech, "e" and "f" are
considered primary differences, while
"e" and "ě" are secondary differences,
"e" and "E" are tertiary differences
and "e" and "e" are identical.
I think the important point here (which people are trying to make) is that "Joao" and "João" should never be considered equal, but when sorting you don't want them compared by their raw character values, because then you would get something like Joao, John, João, which is not good. Using the Collator class handles this correctly.
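As a rough sketch of both points, the snippet below sorts three names with a Collator and separately checks equality at PRIMARY strength. The class and method names are made up for illustration, and Locale.ENGLISH is only an assumption; the exact behavior depends on the locale's collation rules.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class CollatorSortDemo {

    // Sorts with a locale-aware Collator instead of raw char values.
    // Collator implements Comparator<Object>, so it can be passed to sort().
    static List<String> sortNames(List<String> names, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(collator);
        return sorted;
    }

    // True when the strings differ at most by accents or case
    // (PRIMARY strength ignores secondary and tertiary differences).
    static boolean equalIgnoringAccents(String a, String b, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "Jo\u00e3o" is Joao written with a-tilde
        List<String> names = Arrays.asList("John", "Jo\u00e3o", "Joao");
        System.out.println(sortNames(names, Locale.ENGLISH));
        System.out.println(equalIgnoringAccents("Joao", "Jo\u00e3o", Locale.ENGLISH));
    }
}
```

With the default strength the accented name should land next to its unaccented twin rather than after John; lowering the strength to PRIMARY is a separate decision you make only when you really want the two to count as the same.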
Java string searching ignoring accents
Make use of java.text.Normalizer and a bit of regex to get rid of the diacritics.
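import java.text.Normalizer;
import java.text.Normalizer.Form;

public static String removeDiacriticalMarks(String string) {
    // NFD splits each accented character into base letter + combining mark
    return Normalizer.normalize(string, Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}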
Which you can use as follows:
String value = "Joáo";
String comparisonMaterial = removeDiacriticalMarks(value); // Joao
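For the actual search, one possible approach is to strip both the text and the search term before calling contains. This is only a sketch: containsIgnoringAccents is a hypothetical helper name, and lowercasing with Locale.ROOT is my own assumption to make the search case-insensitive as well.

```java
import java.text.Normalizer;
import java.util.Locale;

class AccentInsensitiveSearch {

    // Same approach as removeDiacriticalMarks above: decompose, then drop marks.
    static String removeDiacriticalMarks(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper: substring search that ignores accents and case.
    static boolean containsIgnoringAccents(String haystack, String needle) {
        String h = removeDiacriticalMarks(haystack).toLowerCase(Locale.ROOT);
        String n = removeDiacriticalMarks(needle).toLowerCase(Locale.ROOT);
        return h.contains(n);
    }

    public static void main(String[] args) {
        // The address is "Rua Sao Joao, 10" written with a-tilde twice
        String address = "Rua S\u00e3o Jo\u00e3o, 10";
        System.out.println(containsIgnoringAccents(address, "sao joao"));
        System.out.println(containsIgnoringAccents(address, "jose"));
    }
}
```

If you search the same text many times, it is probably worth stripping it once and keeping the result instead of normalizing on every call.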
Ignoring diacritic characters when comparing words with special characters (é, è, ...)
Check out this method in Java
private static final String PLAIN_ASCII = "AaEeIiOoUu" // grave
+ "AaEeIiOoUuYy" // acute
+ "AaEeIiOoUuYy" // circumflex
+ "AaOoNn" // tilde
+ "AaEeIiOoUuYy" // umlaut
+ "Aa" // ring
+ "Cc" // cedilla
+ "OoUu" // double acute
;
private static final String UNICODE = "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"
+ "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD"
+ "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177"
+ "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
+ "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF"
+ "\u00C5\u00E5" + "\u00C7\u00E7" + "\u0150\u0151\u0170\u0171";
/**
 * Removes accents from a string, replacing each accented character
 * with its ASCII equivalent.
 */
public static String removeAccents(String s) {
    if (s == null)
        return null;
    StringBuilder sb = new StringBuilder(s.length());
    boolean found = false;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // Plain ASCII characters (<= 126) are never in the lookup table
        int pos = (c <= 126) ? -1 : UNICODE.indexOf(c);
        if (pos > -1) {
            found = true;
            sb.append(PLAIN_ASCII.charAt(pos));
        } else {
            sb.append(c);
        }
    }
    // Return the original instance when nothing was replaced
    return found ? sb.toString() : s;
}
Ignoring accented letters in string comparison
FWIW, knightfor's answer below (as of this writing) should be the accepted answer.
Here's a function that strips diacritics from a string:
using System.Globalization;
using System.Text;

static string RemoveDiacritics(string text)
{
    // FormD splits each accented character into base letter + combining mark
    string formD = text.Normalize(NormalizationForm.FormD);
    StringBuilder sb = new StringBuilder();
    foreach (char ch in formD)
    {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
        if (uc != UnicodeCategory.NonSpacingMark)
        {
            sb.Append(ch);
        }
    }
    return sb.ToString().Normalize(NormalizationForm.FormC);
}
More details on MichKap's blog (RIP...).
The principle is that it turns 'é' into two successive chars: 'e' and a combining acute accent.
It then iterates through the chars and skips the diacritics.
"héllo" becomes "he<acute>llo", which in turn becomes "hello".
Debug.Assert("hello"==RemoveDiacritics("héllo"));
Note: Here's a more compact .NET4+ friendly version of the same function:
using System.Globalization;
using System.Linq;
using System.Text;

static string RemoveDiacritics(string text)
{
    return string.Concat(
        text.Normalize(NormalizationForm.FormD)
            .Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch) !=
                         UnicodeCategory.NonSpacingMark)
    ).Normalize(NormalizationForm.FormC);
}
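Since the rest of this page is about Java, here is a rough Java counterpart of the same category-based filtering. It is a sketch, not a port of anyone's library: the class name is made up, and it relies on Character.getType and Character.NON_SPACING_MARK as the closest match to UnicodeCategory.NonSpacingMark.

```java
import java.text.Normalizer;

class CategoryStripDemo {

    static String removeDiacritics(String text) {
        // NFD splits each accented character into base letter + combining mark
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder(decomposed.length());
        // Keep every code point that is not a non-spacing mark
        decomposed.codePoints()
                .filter(cp -> Character.getType(cp) != Character.NON_SPACING_MARK)
                .forEach(sb::appendCodePoint);
        // Recompose whatever is left, as the C# version does with FormC
        return Normalizer.normalize(sb, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is hello written with e-acute
        System.out.println(removeDiacritics("h\u00e9llo"));
    }
}
```

Iterating over code points rather than chars is a small precaution so that characters outside the Basic Multilingual Plane are not split in half.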
Is there a way to get rid of accents and convert a whole string to regular letters?
Start with java.text.Normalizer.
string = Normalizer.normalize(string, Normalizer.Form.NFD);
// or Normalizer.Form.NFKD for a more "compatible" deconstruction
This separates the accent marks from most characters. Then you just need to check whether each character is a plain letter and throw out the ones that aren't.
string = string.replaceAll("[^\\p{ASCII}]", "");
If your text contains characters outside ASCII that should be kept, strip only the combining marks instead:
string = string.replaceAll("\\p{M}", "");
For Unicode, \\P{M} matches the base glyph and \\p{M} (lowercase) matches each accent.
Thanks to GarretWilson for the pointer and regular-expressions.info for the great Unicode guide.
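The difference between the two patterns shows up with letters that have no decomposition. The sketch below uses the Polish city name Lodz (with a stroked L, o-acute and z-acute) as an arbitrary example; the class and method names are hypothetical.

```java
import java.text.Normalizer;

class RegexChoiceDemo {

    // Deletes everything that is not ASCII after decomposition.
    static String keepAsciiOnly(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    // Deletes only the combining marks after decomposition.
    static String dropMarksOnly(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        // Stroked L (U+0141) has no canonical decomposition;
        // o-acute and z-acute decompose into letter + U+0301.
        String city = "\u0141\u00f3d\u017a";
        System.out.println(keepAsciiOnly(city)); // prints "odz"
        System.out.println(dropMarksOnly(city)); // keeps the stroked L, then "odz"
    }
}
```

So the ASCII pattern silently drops whole letters it cannot reduce, while the \\p{M} pattern leaves them in place; which one you want depends on whether the result has to be pure ASCII.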
It is important to note that Normalizer by itself is insufficient to remove diacritics. For example, the following will not replace the accented é with the unaccented e:
import static java.text.Normalizer.normalize;
import static java.text.Normalizer.Form.*;

public class T {
    public static void main( final String[] args ) {
        final var text = "Brévis";
        System.out.println(
            normalize( text, NFD ) + " " +
            normalize( text, NFC ) + " " +
            normalize( text, NFKD ) + " " +
            normalize( text, NFKC )
        );
    }
}
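What normalization does change is the underlying character sequence, which is why a stripping step has to follow it. A minimal sketch, with a made-up class name, that makes this visible through the string lengths:

```java
import java.text.Normalizer;

class NormalizeThenStrip {

    // Decompose first, then remove the combining marks that NFD exposed.
    static String strip(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Br\u00e9vis" is Brevis written with a precomposed e-acute
        String text = "Br\u00e9vis";
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        System.out.println(text.length()); // 6: e-acute is one char
        System.out.println(nfd.length());  // 7: 'e' plus combining acute
        System.out.println(strip(text));   // Brevis
    }
}
```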
Easy way to remove accents from a Unicode string?
Finally, I've solved it by using the Normalizer class.
import java.text.Normalizer;

public static String stripAccents(String s)
{
    s = Normalizer.normalize(s, Normalizer.Form.NFD);
    s = s.replaceAll("[\\p{InCombiningDiacriticalMarks}]", "");
    return s;
}
Comparing strings that contain accents in SQLite for Java doesn't work
SQLite's built-in UPPER handles only ASCII characters.
You should ensure that player names have consistent capitalization when searching them.
(Your code already assumes that spelling, name order, initials etc. are consistent.)
(And what happens when two players have the same name?)
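One way around the ASCII-only UPPER is to build the comparison key in Java and compare keys instead of raw names. This is just a sketch under my own assumptions: searchKey is a hypothetical helper, and storing its result in an extra column next to the player name is a design choice the question does not mention.

```java
import java.text.Normalizer;
import java.util.Locale;

class PlayerNameKey {

    // Hypothetical helper: builds an accent- and case-insensitive lookup key.
    static String searchKey(String name) {
        String stripped = Normalizer.normalize(name, Normalizer.Form.NFD)
                .replaceAll("\\p{M}+", "");
        // Locale.ROOT avoids locale-specific case rules (e.g. Turkish dotless i)
        return stripped.toUpperCase(Locale.ROOT).trim();
    }

    public static void main(String[] args) {
        // "Jose Martinez" written with e-acute and i-acute
        String stored = searchKey("Jos\u00e9 Mart\u00ednez");
        String typed = searchKey("jose martinez");
        System.out.println(stored.equals(typed));
    }
}
```

The key would then be bound as a query parameter and compared with plain equality, so the database never has to change the case of non-ASCII text.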