Ignoring accented letters in string comparison
EDIT 2012-01-20: Oh boy! The solution was so much simpler and has been in the framework nearly forever. As pointed out by knightpfhor:
string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace);
Here's a function that strips diacritics from a string:
// Requires: using System.Globalization; using System.Text;
static string RemoveDiacritics(string text)
{
string formD = text.Normalize(NormalizationForm.FormD);
StringBuilder sb = new StringBuilder();
foreach (char ch in formD)
{
UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
if (uc != UnicodeCategory.NonSpacingMark)
{
sb.Append(ch);
}
}
return sb.ToString().Normalize(NormalizationForm.FormC);
}
More details on MichKap's blog (RIP...).
The principle is that normalizing to form D decomposes 'é' into 2 successive chars: 'e' followed by a combining acute accent.
The function then iterates through the chars and skips the diacritics (the nonspacing marks).
"héllo" becomes "he<acute>llo", which in turn becomes "hello".
Debug.Assert("hello"==RemoveDiacritics("héllo"));
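For comparison, here is a minimal sketch of the same decompose-then-skip principle in Java, using java.text.Normalizer. The class and method names are made up for illustration, and the regex \p{Mn} is assumed to be an acceptable stand-in for the UnicodeCategory.NonSpacingMark check above:

```java
import java.text.Normalizer;

// Hypothetical helper class, named for illustration only.
final class DiacriticStripper {

    // Decompose to NFD, drop nonspacing marks (Unicode category Mn),
    // then recompose to NFC, mirroring the C# function above.
    static String removeDiacritics(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is "héllo"; the escape avoids source-encoding problems.
        System.out.println(removeDiacritics("h\u00e9llo")); // prints "hello"
    }
}
```

Letters that have no canonical decomposition (for example 'ø' or 'ł') are left untouched by this approach, in Java as in .NET.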
Note: Here's a more compact .NET4+ friendly version of the same function:
// Requires: using System.Globalization; using System.Linq;
static string RemoveDiacritics(string text)
{
return string.Concat(
text.Normalize(NormalizationForm.FormD)
.Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch)!=
UnicodeCategory.NonSpacingMark)
).Normalize(NormalizationForm.FormC);
}
Compare strings ignoring accented characters
You can use Java Collators to compare texts while ignoring accents. Here is a simple example:
import java.text.Collator;
/**
* @author Kennedy
*/
public class SimpleTest
{
public static void main(String[] args)
{
String a = "nocao";
String b = "noção";
final Collator instance = Collator.getInstance();
// PRIMARY strength ignores accent (and case) differences
instance.setStrength(Collator.PRIMARY);
// Prints 0 because the strings compare as equal
System.out.println(instance.compare(a, b));
}
}
Documentation: JavaDoc
I won't explain in detail because I have only used Collators a little and I'm not an expert in them, but you can find articles about them online.
Ignore accent letters in string comparison in Visual Studio
One of the possible answers is to use the RemoveDiacritics approach.
// Requires: using System.Globalization; using System.Text;
static string RemoveDiacritics(string text)
{
var normalizedString = text.Normalize(NormalizationForm.FormD);
var stringBuilder = new StringBuilder();
foreach (var c in normalizedString)
{
var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
if (unicodeCategory != UnicodeCategory.NonSpacingMark)
{
stringBuilder.Append(c);
}
}
return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}
More info here: How do I remove diacritics (accents) from a string in .NET?
How can I do String.StartsWith ignoring diacritics/accents/tildes?
You could try using the System.Globalization.CompareInfo.IsPrefix method, which accepts a CompareOptions enum. See the docs here.
Besides this, you can implement it manually yourself, based for example on this answer, which deals with removing diacritics from a string.
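The manual route could look roughly like the following Java sketch: strip the diacritics from both the text and the prefix, then do an ordinary prefix test. The class and method names are hypothetical, and this is a simplification rather than an equivalent of CompareInfo.IsPrefix, which applies full culture-aware comparison rules:

```java
import java.text.Normalizer;

// Hypothetical helper class, named for illustration only.
final class AccentInsensitivePrefix {

    // Decompose to NFD and drop the combining marks (category Mn).
    static String strip(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{Mn}+", "");
    }

    // True when text begins with prefix once accents are removed from both.
    // Case is still significant here.
    static boolean startsWithIgnoringAccents(String text, String prefix) {
        return strip(text).startsWith(strip(prefix));
    }

    public static void main(String[] args) {
        // "caf\u00e9 au lait" is "café au lait".
        System.out.println(startsWithIgnoringAccents("caf\u00e9 au lait", "cafe")); // prints "true"
    }
}
```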
Compare two string and ignore (but not replace) accents. PHP
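Just convert the accented characters to their non-accented counterparts and then compare the strings. The function in my answer will remove the accents for you. Note that it also lowercases the result and replaces every run of non-alphanumeric characters with a hyphen, so the comparison below ignores case and punctuation as well.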
function removeAccents($string) {
return strtolower(trim(preg_replace('~[^0-9a-z]+~i', '-', preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8'))), ' '));
}
$a = "joaoaaeeA";
$b = "joãoâàéèÀ";
var_dump(removeAccents($a) === removeAccents($b));
Output:
bool(true)
Demo
How to compare strings with case insensitive and accent insensitive
To ignore both case AND accents, you can use string.Compare() with both the IgnoreNonSpace AND the IgnoreCase options, like so:
string s1 = "http://www.buroteknik.com/metylan-c387c4b0ft-tarafli-bant-12cm-x25mt_154202.html";
string s2 = "http://www.buroteknik.com/METYLAN-C387C4B0FT-TARAFLI-BANT-12cm-x25mt_154202.html";
string s3 = "http://www.buroteknik.com/METYLAN-C387C4B0FT-TARAFLı-BANT-12cm-x25mt_154202.html";
Console.WriteLine(string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase));
Console.WriteLine(string.Compare(s2, s3, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase));
In response to your comments below, this works for tarafli and TARAFLİ too.
The following code prints 0, meaning the strings are equal:
string s1 = "tarafli";
string s2 = "TARAFLİ";
Console.WriteLine(string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase));
And here it is using the Turkish culture (I'm guessing at what the correct culture is).
This also prints 0:
string s1 = "tarafli";
string s2 = "TARAFLİ";
var trlocale = CultureInfo.GetCultureInfo("tr-TR");
Console.WriteLine(string.Compare(s1, s2, trlocale, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase));
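For readers on the JVM, a rough counterpart of combining IgnoreNonSpace and IgnoreCase is a java.text.Collator set to PRIMARY strength, which treats both accent and case differences as insignificant. The sketch below assumes an English collator and uses a hypothetical class name; it does not attempt to reproduce the Turkish-specific behaviour discussed above, which would need a Turkish locale and separate checking:

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper class, named for illustration only.
final class LooseEquality {

    // PRIMARY strength compares base letters only, so accents and case
    // are both ignored. Locale.ENGLISH is an assumption of this sketch.
    static boolean equalsIgnoringCaseAndAccents(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "R\u00e9sum\u00e9" is "Résumé".
        System.out.println(equalsIgnoringCaseAndAccents("R\u00e9sum\u00e9", "RESUME")); // prints "true"
    }
}
```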
Java. Ignore accents when comparing strings
I think you should be using the Collator class. It allows you to set a strength and locale and it will compare characters appropriately.
From the Java 1.6 API:
You can set a Collator's strength
property to determine the level of
difference considered significant in
comparisons. Four strengths are
provided: PRIMARY, SECONDARY,
TERTIARY, and IDENTICAL. The exact
assignment of strengths to language
features is locale dependant. For
example, in Czech, "e" and "f" are
considered primary differences, while
"e" and "ě" are secondary differences,
"e" and "E" are tertiary differences
and "e" and "e" are identical.
I think the important point here (which people are trying to make) is that "Joao" and "João" should never be considered equal, but when sorting you don't want them compared by their raw character values, because then you would get something like Joao, John, João, which is not good. Using the Collator class handles this correctly.
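The sorting point can be sketched as follows. This assumes an English collator left at its default strength, so "Joao" and "João" stay distinct but sort next to each other; the class and method names are made up for illustration:

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, named for illustration only.
final class NameSorting {

    // Sort with locale-aware collation rules (Locale.ENGLISH is an assumption).
    static List<String> sortWithCollator(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Collator.getInstance(Locale.ENGLISH));
        return sorted;
    }

    // Sort by raw char values, which is what String.compareTo does.
    static List<String> sortByCharValue(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(String::compareTo);
        return sorted;
    }

    public static void main(String[] args) {
        // "Jo\u00e3o" is "João".
        List<String> names = Arrays.asList("John", "Jo\u00e3o", "Joao");
        System.out.println(sortByCharValue(names));  // the accented name ends up last
        System.out.println(sortWithCollator(names)); // the accented name follows "Joao"
    }
}
```

At the default strength the collator still reports a nonzero difference between "Joao" and "João", so the two names are ordered consistently without being treated as equal.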
Linq Contains without considering accents
Borrowing a similar solution from here:
string[] result = {"hello there", "héllo there","goodbye"};
string word = "héllo";
var compareInfo = CultureInfo.InvariantCulture.CompareInfo;
var filtered = result.Where(
p => compareInfo.IndexOf(p, word, CompareOptions.IgnoreNonSpace) > -1);
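A comparable filter in Java could be sketched with streams, stripping diacritics from both the candidate and the search word before a plain contains check. This is a hand-rolled approximation under that assumption, not an equivalent of CompareInfo.IndexOf, and the names are hypothetical:

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, named for illustration only.
final class AccentInsensitiveFilter {

    // Decompose to NFD and drop the combining marks (category Mn).
    static String strip(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                         .replaceAll("\\p{Mn}+", "");
    }

    // Keep the items that contain the word once accents are removed from both.
    static List<String> filterContaining(List<String> items, String word) {
        String needle = strip(word);
        return items.stream()
                    .filter(item -> strip(item).contains(needle))
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is "héllo".
        List<String> items = Arrays.asList("hello there", "h\u00e9llo there", "goodbye");
        // Keeps the first two entries and drops "goodbye".
        System.out.println(filterContaining(items, "h\u00e9llo"));
    }
}
```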