How to Strip Non-ASCII Characters from a String in C#

How can you strip non-ASCII characters from a string? (in C#)


string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);

Remove all non-ASCII characters from string


// Note: Encoding.ASCII substitutes '?' for each non-ASCII character rather than dropping it.
string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
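
If the characters should be dropped rather than turned into '?', one option is an ASCII encoding whose encoder fallback emits nothing. The following is a minimal sketch, assuming C# 9 or later (top-level statements); the helper name StripNonAscii is hypothetical and not part of the original answer.

```csharp
using System;
using System.Text;

// Hypothetical helper: an ASCII encoding whose encoder fallback is the empty
// string, so characters that cannot be encoded are dropped instead of becoming '?'.
static string StripNonAscii(string input)
{
    Encoding asciiDropping = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());
    return asciiDropping.GetString(asciiDropping.GetBytes(input));
}

Console.WriteLine(StripNonAscii("søme string"));
// sme string
Console.WriteLine(Encoding.ASCII.GetString(Encoding.ASCII.GetBytes("søme string")));
// s?me string
```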

Replace all non-ASCII characters with a code

This will help:

var source = "Helloµ±";
var sb = new StringBuilder();
foreach (char c in source)
{
    if (c == '_')
    {
        // special case: Replace _ With _5f
        sb.Append("_5f");
    }
    else if (c < 32 || c > 127)
    {
        // handle non-ascii by using hex representation of bytes
        // TODO: check whether "surrogate pairs" are handled correctly (if required)
        var ba = Encoding.UTF8.GetBytes(new[] { c });
        foreach (byte b in ba)
        {
            sb.AppendFormat("_{0:x2}", b);
        }
    }
    else
    {
        // in printable ASCII range, so just copy
        sb.Append(c);
    }
}

Console.WriteLine(sb.ToString());

This results in "Hello_c2_b5_c2_b1".

It is up to you to wrap this in a nice method.


Late addition: The first two tests can be combined, as _ just has to be replaced by its byte representation, to avoid confusion about what an _ means in the result:

if (c == '_' || c < 32 || c > 127)
{
    var ba = Encoding.UTF8.GetBytes(new[] { c });
    foreach (byte b in ba)
    {
        sb.AppendFormat("_{0:x2}", b);
    }
}
else
{
    sb.Append(c);
}
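
Regarding the surrogate-pair TODO: encoding one char at a time turns each half of a surrogate pair into the UTF-8 replacement bytes ef bf bd, so characters outside the Basic Multilingual Plane are not preserved. Below is a minimal sketch of the combined test wrapped in a method that encodes a surrogate pair as one unit. It assumes C# 9 or later (top-level statements), and the method name EscapeNonAscii is hypothetical.

```csharp
using System;
using System.Text;

// Hypothetical wrapper: '_' and anything outside the 32..127 range used above
// is written as _xx for each of its UTF-8 bytes; everything else is copied.
static string EscapeNonAscii(string source)
{
    var sb = new StringBuilder();
    for (int i = 0; i < source.Length; i++)
    {
        char c = source[i];
        if (c != '_' && c >= 32 && c <= 127)
        {
            sb.Append(c);
            continue;
        }
        // Encode a surrogate pair together so it yields its real 4-byte UTF-8 form.
        int length = char.IsSurrogatePair(source, i) ? 2 : 1;
        foreach (byte b in Encoding.UTF8.GetBytes(source.Substring(i, length)))
        {
            sb.AppendFormat("_{0:x2}", b);
        }
        i += length - 1; // skip the low surrogate that was already consumed
    }
    return sb.ToString();
}

Console.WriteLine(EscapeNonAscii("Helloµ±"));
// Hello_c2_b5_c2_b1
Console.WriteLine(EscapeNonAscii("a_\U0001F600"));
// a_5f_f0_9f_98_80
```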

How to remove non-ASCII characters from a string in C#

I compared several options I found/thought of:

string text = "Hello world ☀⛿END";

Console.WriteLine(text);
Console.WriteLine(Regex.Replace(text, @"\p{Cs}", ""));
Console.WriteLine(Regex.Replace(text, @"[^\u0000-\u007F]+", ""));
Console.WriteLine(text.Where(c => !Char.IsSurrogate(c)).ToArray());

And this is the outcome:

Hello world ??????END
Hello world ??END
Hello world END
Hello world ??END

I am not sure whether your input string was modified along the way (copied, pasted here, copied again, and pasted into Visual Studio), but from what I see, the second replacement (the [^\u0000-\u007F]+ pattern, third line of output) works best: it is the only one that leaves nothing but ASCII behind.
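
The difference comes from what each pattern matches: \p{Cs} and Char.IsSurrogate only catch the UTF-16 surrogate halves used for characters above U+FFFF (most emoji), while symbols such as U+2600 are ordinary single chars and pass through. A minimal sketch of that behaviour, assuming C# 9 or later and a made-up sample string (not the asker's input):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Assumed sample: A, U+2600 (one char), B, U+1F600 (a surrogate pair), C.
string sample = "A\u2600B\U0001F600C";

string noSurrogates = Regex.Replace(sample, @"\p{Cs}", "");
string asciiOnly = Regex.Replace(sample, @"[^\u0000-\u007F]+", "");
string linqFiltered = new string(sample.Where(c => !char.IsSurrogate(c)).ToArray());

Console.WriteLine(sample.Length);       // 6 (the emoji counts as two chars)
Console.WriteLine(noSurrogates.Length); // 4 (U+2600 is still there)
Console.WriteLine(asciiOnly);           // ABC
Console.WriteLine(noSurrogates == linqFiltered); // True
```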

Do you want to remove all special characters or only emoji?

Remove unwanted Unicode characters from string

testString = Regex.Replace(testString, @"[\u0000-\u0008\u000A-\u001F\u0100-\uFFFF]", "");

or, to keep only tab, carriage return, line feed, and printable ASCII (space through ~):

testString = Regex.Replace(testString, @"[^\t\r\n -~]", "");
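
The two patterns are not interchangeable: the first keeps U+0080 to U+00FF (for example é) and strips CR and LF, while the second strips everything above ~ and keeps CR and LF. A minimal sketch comparing them, assuming C# 9 or later and a made-up sample string:

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed sample: "café", a tab, a euro sign, "1", then CR LF.
string sample = "caf\u00E9\t\u20AC1\r\n";

string first = Regex.Replace(sample, @"[\u0000-\u0008\u000A-\u001F\u0100-\uFFFF]", "");
string second = Regex.Replace(sample, @"[^\t\r\n -~]", "");

Console.WriteLine(first == "caf\u00E9\t1");  // True: é and tab kept, euro sign and CR LF removed
Console.WriteLine(second == "caf\t1\r\n");   // True: é and euro sign removed, tab and CR LF kept
```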

How to remove non-ASCII word from a string in C#

You could use a regular expression to filter out non-ASCII characters:

string input = "AB £ CD";
string result = Regex.Replace(input, @"[^\x0d\x0a\x20-\x7e\t]", "");

Strip non ascii chars but allow currency symbols

Yes, your regex is correct.

What your code does is replace every character matched by your regular expression with an empty string.

Now, what characters does your regular expression match?

Anything except:

  • The range you specified: 0000-007F
  • Currency symbol characters: \p{Sc}. See http://regular-expressions.info/unicode.html#prop

If you just want to keep allowing some other characters, yes, you can add them too (exactly like you did with \p{Sc}).

Edit:

Be careful when doing it in the future. The regex would really be [^\u0000-\u007F\p{Sc}] (no space), although in this case it doesn't matter since the space character was already in the ASCII range.
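
As an illustration of the pattern above, here is a minimal sketch, assuming C# 9 or later; the helper name KeepAsciiAndCurrency and the sample text are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes everything that is neither ASCII
// nor in the Unicode currency-symbol category (Sc).
static string KeepAsciiAndCurrency(string input) =>
    Regex.Replace(input, @"[^\u0000-\u007F\p{Sc}]", string.Empty);

string result = KeepAsciiAndCurrency("Prix: 5€ (é) £3");
Console.WriteLine(result == "Prix: 5€ () £3"); // True: é is removed, € and £ are kept
```

Other Unicode categories can be added to the character class in the same way if more symbols need to survive.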


