Escape invalid XML characters in C#
To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It was added in .NET Framework 4 and is also available in Silverlight. Here is a small sample:
// Requires: using System; using System.Linq; using System.Xml;
void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False
    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}

// Keeps only the characters that XmlConvert.IsXmlChar accepts.
static string RemoveInvalidXmlChars(string text) {
    var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}

// Returns true when XmlConvert.VerifyXmlChars accepts the whole string.
static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
To escape invalid XML characters instead of removing them, I suggest using the XmlConvert.EncodeName method. Here is a small sample:
void Main() {
    const string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False
    string encoded = XmlConvert.EncodeName(content);
    Console.WriteLine(IsValidXmlString(encoded)); // True
    string decoded = XmlConvert.DecodeName(encoded);
    Console.WriteLine(content == decoded); // True
}

static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
Update:
Note that the encoding operation produces a string whose length is greater than or equal to the length of the source string. This matters when you store the encoded string in a database column with a length limit but validate only the source string's length in your app.
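To make the length growth concrete, here is a minimal sketch. The column limit of 10 characters and the names MaxColumnLength and FitsColumn are hypothetical, chosen only for illustration; the point is that the check runs against the encoded string rather than the source string.

```csharp
using System;
using System.Xml;

static class EncodedLengthDemo
{
    // Hypothetical column size, used only for this illustration.
    const int MaxColumnLength = 10;

    // Returns true when the encoded form of the text fits the assumed column.
    public static bool FitsColumn(string text)
    {
        return XmlConvert.EncodeName(text).Length <= MaxColumnLength;
    }

    static void Main()
    {
        const string content = "\v\f\0";
        string encoded = XmlConvert.EncodeName(content);

        Console.WriteLine(content.Length);      // 3
        Console.WriteLine(encoded.Length);      // 21: each character became a 7-character _xHHHH_ escape
        Console.WriteLine(FitsColumn(content)); // False, although the source string is only 3 characters long
    }
}
```

Validating the encoded length just before the write is one way to avoid truncation surprises; whether that belongs in the app or in the data layer depends on your design.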
Need to remove illegal characters in XML string
There is no general solution to this, because you have no way of determining whether:
<xml>You can use <b></b> to highlight stuff in HTML.</xml>
is a "mistake" and should actually be encoded as:
<xml>You can use &lt;b&gt;&lt;/b&gt; to highlight stuff in HTML.</xml>
or not.
Thus, since there is no general solution, you can only use imperfect heuristics to detect such issues.
There is no built-in heuristic in the .NET BCL, so you will have to roll your own or find an external library. A simple heuristic, for example, would be to find every < that is not followed by [/a-zA-Z0-9]+> and escape it.
Heuristics are intrinsically imperfect, so if you have the opportunity to fix the system creating those broken looks-like-XML-but-isn't files, this would be a much better solution.
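The simple heuristic described above could be sketched as follows. This is only an illustration under that one rule, not a robust repair tool, and the names StrayAngleBracketDemo and EscapeStrayLessThan are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class StrayAngleBracketDemo
{
    // A "<" is treated as stray when it is NOT followed by
    // one or more of [/a-zA-Z0-9] and then a closing ">".
    static readonly Regex StrayLessThan = new Regex(@"<(?![/a-zA-Z0-9]+>)");

    public static string EscapeStrayLessThan(string text)
    {
        return StrayLessThan.Replace(text, "&lt;");
    }

    static void Main()
    {
        Console.WriteLine(EscapeStrayLessThan("<xml>1 < 2 and <b>bold</b></xml>"));
        // <xml>1 &lt; 2 and <b>bold</b></xml>
    }
}
```

As expected of a heuristic, it misfires on legitimate markup that the pattern does not anticipate: a tag with attributes, such as <a href="x">, does not match [/a-zA-Z0-9]+> and would be escaped as well.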
Why is the ampersand an invalid character in XML?
Because the & is used to denote an XML entity. It's used as the "escape" character for other invalid characters (e.g. &lt; meaning <), so it can't itself be a valid character in XML. How could you tell whether & was an ampersand, or the beginning of &gt;?
In order to express an ampersand in XML, you need to use &amp;.
This is similar to C (and similar languages), where \ is used to escape \n (newline), \t (tab), etc., and must itself be escaped as \\.
C# Escape illegal xml characters from node text only
You can use a regex similar to the one in this related question. It essentially matches all unescaped ampersands (i.e. it will match &, but not &something;).
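// Requires: using System.Text.RegularExpressions;
var xml = @"<node>Something & something < annoying</node>";
// Escape every & that does not already start a named entity such as &amp;
var result = Regex.Replace(xml, @"&(?!\w*;)", "&amp;");
// output: <node>Something &amp; something < annoying</node>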
What are invalid characters in XML
Apart from certain control characters, the only illegal characters are &, < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed').
They're escaped using XML entities; in this case you want &amp; for &.
Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away, so you don't have to worry about it.
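As one possible illustration of letting a library do the escaping, here is a small sketch using LINQ to XML (XElement). The class and method names are invented for the example; other writers such as XmlWriter would serve equally well.

```csharp
using System;
using System.Xml.Linq;

static class LibraryEscapingDemo
{
    // Builds a <node> element; XElement escapes the text when serializing.
    public static string BuildNode(string text)
    {
        return new XElement("node", text).ToString();
    }

    static void Main()
    {
        Console.WriteLine(BuildNode("Something & something < annoying"));
        // <node>Something &amp; something &lt; annoying</node>
    }
}
```

The raw text never needs to be escaped by hand, and reading the element's Value back returns the original, unescaped string.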
How to escape invalid characters inside XML string in C#
I guess the simplest solution is to load the whole XML document into memory as an XmlDocument, then go through the elements and replace their values with the HTML-encoded form.
Illegal characters in XML
I think your best bet is to "SafeEncode" the string entered by the user. This link, http://msdn.microsoft.com/en-us/library/system.security.securityelement.escape(VS.80).aspx, shows how to do it easily with one call to the SecurityElement.Escape(string s) method.
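A minimal sketch of that call, with a made-up input string standing in for the user's text:

```csharp
using System;
using System.Security;

static class SecurityElementEscapeDemo
{
    // Thin wrapper so the call is easy to reuse; the name is illustrative only.
    public static string SafeEncode(string userInput)
    {
        return SecurityElement.Escape(userInput);
    }

    static void Main()
    {
        Console.WriteLine(SafeEncode("Tom & Jerry <script>"));
        // Tom &amp; Jerry &lt;script&gt;
    }
}
```

SecurityElement.Escape replaces the markup characters <, >, &, " and ' with their entities. As far as I know it does not strip control characters, so it may need to be combined with a filter such as XmlConvert.IsXmlChar.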
Replace character references for invalid XML characters
I made another go at it using regular expressions. This should handle both decimal and hex character codes. Also, this will not affect anything but numerically encoded characters.
// Requires: using System.Text.RegularExpressions;
public string ReplaceXMLEncodedCharacters(string input)
{
    // Matches numeric character references: decimal (&#11;) or hex (&#xB;).
    const string pattern = @"&#(x?)([A-Fa-f0-9]+);";
    MatchCollection matches = Regex.Matches(input, pattern);
    // Tracks how much earlier replacements have shortened the string.
    int offset = 0;
    foreach (Match match in matches)
    {
        int charCode = 0;
        if (string.IsNullOrEmpty(match.Groups[1].Value))
            charCode = int.Parse(match.Groups[2].Value);
        else
            charCode = int.Parse(match.Groups[2].Value, System.Globalization.NumberStyles.HexNumber);
        char character = (char)charCode;
        // Swap the whole reference for the single character it encodes.
        input = input.Remove(match.Index - offset, match.Length).Insert(match.Index - offset, character.ToString());
        offset += match.Length - 1;
    }
    return input;
}