How can you strip non-ASCII characters from a string? (in C#)
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]+", string.Empty);
Remove all non-ASCII characters from string
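string sOut = Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s));
Note that this round-trip does not actually remove anything: each non-ASCII character is replaced with ? in the result.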
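If the goal is to drop the characters rather than turn them into ?, one option is an ASCII encoding configured with an empty replacement fallback. The following is a minimal sketch, not a definitive implementation; the helper name StripNonAscii is hypothetical and not taken from the answers above.

```csharp
using System;
using System.Text;

// Hypothetical helper: encode to ASCII with an empty replacement fallback,
// so characters that cannot be encoded are dropped instead of becoming '?'.
static string StripNonAscii(string input)
{
    Encoding ascii = Encoding.GetEncoding(
        "us-ascii",
        new EncoderReplacementFallback(string.Empty),
        new DecoderExceptionFallback());
    return ascii.GetString(ascii.GetBytes(input));
}

string s = "s\u00f8me string"; // "søme string"

// Default ASCII encoding: the non-ASCII character becomes '?'.
Console.WriteLine(Encoding.ASCII.GetString(Encoding.ASCII.GetBytes(s))); // s?me string

// Empty fallback: the non-ASCII character is removed.
Console.WriteLine(StripNonAscii(s)); // sme string
```

The regex version shown earlier gives the same result for this input; the encoder-based variant mainly avoids the dependency on Regex.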
Replace all non-ASCII characters with a code
This will help:
var source = "Helloµ±";
var sb = new StringBuilder();
foreach (char c in source)
{
    if (c == '_')
    {
        // special case: replace _ with _5f
        sb.Append("_5f");
    }
    else if (c < 32 || c > 127)
    {
        // handle non-ASCII by using the hex representation of its UTF-8 bytes
        // NOTE: surrogate pairs are not handled: each half is encoded on its own,
        // so the encoder's replacement fallback is used instead of the real code point
        var ba = Encoding.UTF8.GetBytes(new[] { c });
        foreach (byte b in ba)
        {
            sb.AppendFormat("_{0:x2}", b);
        }
    }
    else
    {
        // ASCII 32..127 (printable characters plus DEL), so just copy
        sb.Append(c);
    }
}
Console.WriteLine(sb.ToString());
This results in "Hello_c2_b5_c2_b1"
It is up to you to wrap this in a nice method.
Late addition: The first two tests can be combined, as _ just has to be replaced by its byte representation (_5f), to avoid confusion about what an _ means in the result:
if (c == '_' || c < 32 || c > 127)
{
    var ba = Encoding.UTF8.GetBytes(new[] { c });
    foreach (byte b in ba)
    {
        sb.AppendFormat("_{0:x2}", b);
    }
}
else
{
    sb.Append(c);
}
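One way to wrap the combined test in a method is sketched below. The name EscapeNonAscii is hypothetical, and keeping a valid surrogate pair together before encoding is an assumption added here to address the surrogate question; it is not part of the original answer.

```csharp
using System;
using System.Text;

// Hypothetical helper: escapes '_' and every char outside 32..127
// as one "_xx" group per UTF-8 byte, and copies everything else.
static string EscapeNonAscii(string source)
{
    var sb = new StringBuilder();
    for (int i = 0; i < source.Length; i++)
    {
        char c = source[i];
        if (c != '_' && c >= 32 && c <= 127)
        {
            sb.Append(c);
            continue;
        }

        // Assumption: encode a valid surrogate pair as a single code point
        // by passing both UTF-16 code units to the encoder together.
        bool isPair = char.IsHighSurrogate(c)
                      && i + 1 < source.Length
                      && char.IsLowSurrogate(source[i + 1]);
        int length = isPair ? 2 : 1;

        foreach (byte b in Encoding.UTF8.GetBytes(source.Substring(i, length)))
        {
            sb.AppendFormat("_{0:x2}", b);
        }
        i += length - 1; // skip the low surrogate if it was consumed
    }
    return sb.ToString();
}

Console.WriteLine(EscapeNonAscii("Hello\u00b5\u00b1")); // Hello_c2_b5_c2_b1
Console.WriteLine(EscapeNonAscii("a_b"));               // a_5fb
```

Because _ itself is always escaped, every _ in the output starts a two-digit hex byte, which keeps the encoding unambiguous.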
How to remove non-ASCII characters from a string in C#
I compared several options I found/thought of:
string text = "Hello world ☀⛿END";
Console.WriteLine(text);
Console.WriteLine(Regex.Replace(text, @"\p{Cs}", ""));
Console.WriteLine(Regex.Replace(text, @"[^\u0000-\u007F]+", ""));
Console.WriteLine(text.Where(c => !Char.IsSurrogate(c)).ToArray());
And this is the outcome:
Hello world ??????END
Hello world ??END
Hello world END
Hello world ??END
I am not sure whether your input string was modified along the way (copied, pasted here, copied again and pasted into Visual Studio), but from what I see, the second option, the [^\u0000-\u007F]+ regex, works best: it is the only one whose output contains no stray characters.
Do you want to remove all special characters or only emoji?
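To sidestep the copy-and-paste problem, the comparison can be repeated with the characters written as escapes. This is a small sketch with a made-up input: U+2600 fits in a single UTF-16 code unit, while U+1F600 is stored as a surrogate pair.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// U+2600 (sun) is one UTF-16 code unit; U+1F600 (emoji) is a surrogate pair.
string text = "Hi \u2600\U0001F600 END";

// Removes only surrogate code units, so the emoji goes but U+2600 stays.
string noSurrogatesRegex = Regex.Replace(text, @"\p{Cs}", "");

// Removes everything outside the ASCII range, so both symbols go.
string asciiOnly = Regex.Replace(text, @"[^\u0000-\u007F]+", "");

// LINQ equivalent of the \p{Cs} regex.
string noSurrogatesLinq = new string(text.Where(c => !char.IsSurrogate(c)).ToArray());

Console.WriteLine(noSurrogatesRegex == "Hi \u2600 END");    // True
Console.WriteLine(asciiOnly == "Hi  END");                  // True
Console.WriteLine(noSurrogatesLinq == noSurrogatesRegex);   // True
```

So the surrogate-based options are only suitable when the intent is to drop characters outside the Basic Multilingual Plane (most emoji); they leave other non-ASCII symbols in place.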
Remove unwanted Unicode characters from a string
testString = Regex.Replace(testString, @"[\u0000-\u0008\u000A-\u001F\u0100-\uFFFF]", "");
or
testString = Regex.Replace(testString, @"[^\t\r\n -~]", "");
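The two patterns are not equivalent: the first is a blacklist that still lets U+0080 to U+00FF through (and keeps tab but not CR/LF), while the second is a whitelist of tab, CR, LF and printable ASCII. A short sketch on a made-up sample string illustrates the difference:

```csharp
using System;
using System.Text.RegularExpressions;

// Sample: 'A', tab, 'B', U+0001 (control), U+00E9 (é), U+0142 (ł), CR, LF.
string sample = "A\tB\u0001\u00e9\u0142\r\n";

// Blacklist: strips control chars except tab, and everything from U+0100 up.
string blacklisted = Regex.Replace(
    sample, @"[\u0000-\u0008\u000A-\u001F\u0100-\uFFFF]", "");

// Whitelist: keeps only tab, CR, LF and printable ASCII (space through ~).
string whitelisted = Regex.Replace(sample, @"[^\t\r\n -~]", "");

Console.WriteLine(blacklisted == "A\tB\u00e9"); // True: é survives, CR/LF do not
Console.WriteLine(whitelisted == "A\tB\r\n");   // True: CR/LF survive, é does not
```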
How to remove non-ASCII word from a string in C#
You could use a regular expression to filter out non-ASCII characters:
string input = "AB £ CD";
string result = Regex.Replace(input, "[^\x0d\x0a\x20-\x7e\t]", "");
Strip non ascii chars but allow currency symbols
Yes, your regex is correct.
What your code does is replace every character matched by your regular expression with an empty string.
Now, what characters does your regular expression match?
Anything except:
- The range you specified: 0000-007F
- Currency symbol characters: \p{Sc}. See http://regular-expressions.info/unicode.html#prop
If you just want to keep allowing some other characters, yes, you can add them too (exactly like you did with \p{Sc}).
Edit:
Be careful when doing it in the future. The regex should really be [^\u0000-\u007F\p{Sc}] (no space), although in this case it doesn't matter since the space character was already in the ASCII range.
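As an illustration of that pattern, here is a minimal sketch; the helper name KeepAsciiAndCurrency and the sample input are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes everything that is neither ASCII
// nor in the Unicode "Currency Symbol" category (Sc).
static string KeepAsciiAndCurrency(string input) =>
    Regex.Replace(input, @"[^\u0000-\u007F\p{Sc}]", string.Empty);

// é (U+00E9) is removed; £ (U+00A3) and € (U+20AC) are currency symbols and stay.
string cleaned = KeepAsciiAndCurrency("Caf\u00e9: \u00a35 or \u20ac6");
Console.WriteLine(cleaned == "Caf: \u00a35 or \u20ac6"); // True
```

Further categories or ranges can be appended inside the brackets in the same way to allow more characters.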