Ignoring Accented Letters in String Comparison

Ignoring accented letters in string comparison

EDIT 2012-01-20: Oh boy! The solution was so much simpler and has been in the framework nearly forever. As pointed out by knightpfhor :

string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace);

Here's a function that strips diacritics from a string:

static string RemoveDiacritics(string text)
{
  string formD = text.Normalize(NormalizationForm.FormD);
  StringBuilder sb = new StringBuilder();

  foreach (char ch in formD)
  {
    UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
    if (uc != UnicodeCategory.NonSpacingMark)
    {
      sb.Append(ch);
    }
  }

  return sb.ToString().Normalize(NormalizationForm.FormC);
}

More details on MichKap's blog (RIP...).

The principle is that is it turns 'é' into 2 successive chars 'e', acute.
It then iterates through the chars and skips the diacritics.

"héllo" becomes "he<acute>llo", which in turn becomes "hello".

Debug.Assert("hello"==RemoveDiacritics("héllo"));

Note: Here's a more compact .NET4+ friendly version of the same function:

static string RemoveDiacritics(string text)
{
  return string.Concat( 
      text.Normalize(NormalizationForm.FormD)
      .Where(ch => CharUnicodeInfo.GetUnicodeCategory(ch)!=
                                    UnicodeCategory.NonSpacingMark)
    ).Normalize(NormalizationForm.FormC);
}

Compare strings ignoring accented characters

You can use java Collators for comparing the tests ignoring the accent, see a simple example:

import java.text.Collator;

/**
 * @author Kennedy
 */
public class SimpleTest
{

  public static void main(String[] args)
  {
    String a = "nocao";
    String b = "noção";

    final Collator instance = Collator.getInstance();

    // This strategy mean it'll ignore the accents
    instance.setStrength(Collator.NO_DECOMPOSITION);

    // Will print 0 because its EQUAL
    System.out.println(instance.compare(a, b));
  }
}

Documentation: JavaDoc

I'll not explain in details because i used just a little of Collators and i'm not a expert in it, but you can google there's some articles about it.

Ignore accent letters in string comparison in Visual Studio

One of the possible answers is to use the RemoveDiacritcs approach.

static string RemoveDiacritics(string text) 
{
    var normalizedString = text.Normalize(NormalizationForm.FormD);
    var stringBuilder = new StringBuilder();

    foreach (var c in normalizedString)
    {
        var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
        if (unicodeCategory != UnicodeCategory.NonSpacingMark)
        {
            stringBuilder.Append(c);
        }
    }

    return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
}

More info here : How do I remove diacritics (accents) from a string in .NET?

How can I do String.StartsWith ignoring diacritics/accents/tildes?

You could try utilizing the System.Globalization.CompareInfo.IsPrefix method, which accepts a CompareOptions enum. See the docs here.

Besides this, you can try to manually implement it for youself based on for example this answer, which deals with removing diacritics from a string.

Compare two string and ignore (but not replace) accents. PHP

Just convert the accents to their non-accented counter part and then compare strings. The function in my answer will remove the accents for you.

function removeAccents($string) {
    return strtolower(trim(preg_replace('~[^0-9a-z]+~i', '-', preg_replace('~&([a-z]{1,2})(acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1', htmlentities($string, ENT_QUOTES, 'UTF-8'))), ' '));
}

$a = "joaoaaeeA";
$b = "joãoâàéèÀ";

var_dump(removeAccents($a) === removeAccents($b));

Output:

bool(true)

Demo

How to compare strings with case insensitive and accent insensitive

To ignore both case AND accents, you can use string.Compare() with both the IgnoreNonSpace AND the IgnoreCase options, like so:

string s1 = "http://www.buroteknik.com/metylan-c387c4b0ft-tarafli-bant-12cm-x25mt_154202.html";
string s2 = "http://www.buroteknik.com/METYLAN-C387C4B0FT-TARAFLI-BANT-12cm-x25mt_154202.html";
string s3 = "http://www.buroteknik.com/METYLAN-C387C4B0FT-TARAFLı-BANT-12cm-x25mt_154202.html";

Console.WriteLine(string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase));
Console.WriteLine(string.Compare(s2, s3, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase));

In response to your comments below, this works for tarafli and TARAFLİ too.

The following code prints 0, meaning the strings are equal:

string s1 = "tarafli";
string s2 = "TARAFLİ";
Console.WriteLine(string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase));

And here it is using the Turkish culture (I'm guessing at what the correct culture is).
This also prints 0:

string s1 = "tarafli";
string s2 = "TARAFLİ";

var trlocale = CultureInfo.GetCultureInfo("tr-TR");
Console.WriteLine(string.Compare(s1, s2, trlocale, CompareOptions.IgnoreNonSpace | CompareOptions.IgnoreCase));

Java. Ignore accents when comparing strings

I think you should be using the Collator class. It allows you to set a strength and locale and it will compare characters appropriately.

From the Java 1.6 API:

You can set a Collator's strength
property to determine the level of
difference considered significant in
comparisons. Four strengths are
provided: PRIMARY, SECONDARY,
TERTIARY, and IDENTICAL. The exact
assignment of strengths to language
features is locale dependant. For
example, in Czech, "e" and "f" are
considered primary differences, while
"e" and "ě" are secondary differences,
"e" and "E" are tertiary differences
and "e" and "e" are identical.

I think the important point here (which people are trying to make) is that "Joao"and "João" should never be considered as equal, but if you are doing sorting you don't want them to be compared based on their ASCII value because then you would have something like Joao, John, João, which is not good. Using the collator class definitely handles this correctly.

Linq Contains without considering accents

Borrowing a similar solution form here:

string[] result = {"hello there", "héllo there","goodbye"};

string word = "héllo";

var compareInfo = CultureInfo.InvariantCulture.CompareInfo;

var filtered = result.Where(
      p => compareInfo.IndexOf(p, word, CompareOptions.IgnoreNonSpace) > -1);

Ignoring Accented Letters in String Comparison