Escape Invalid Xml Characters in C#

To remove invalid XML characters, I suggest using the XmlConvert.IsXmlChar method. It has been available since .NET Framework 4 and is present in Silverlight too. Here is a small sample:

// Requires: using System; using System.Linq; using System.Xml;
void Main() {
    string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    content = RemoveInvalidXmlChars(content);
    Console.WriteLine(IsValidXmlString(content)); // True
}

// Keeps only the characters for which XmlConvert.IsXmlChar returns true.
static string RemoveInvalidXmlChars(string text) {
    var validXmlChars = text.Where(ch => XmlConvert.IsXmlChar(ch)).ToArray();
    return new string(validXmlChars);
}

// VerifyXmlChars throws if the string contains a character XML does not allow.
static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}
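
One caveat with the per-character filter above: XmlConvert.IsXmlChar returns false for each half of a surrogate pair, so characters outside the Basic Multilingual Plane (emoji, for example) get stripped along with the truly invalid ones. Below is a minimal sketch that keeps valid pairs. The class name XmlSanitizer is made up for illustration; XmlConvert.IsXmlSurrogatePair takes the low surrogate first and the high surrogate second.

```csharp
using System.Text;
using System.Xml;

static class XmlSanitizer
{
    // Removes characters that XML does not allow, but keeps valid surrogate pairs.
    public static string RemoveInvalidXmlChars(string text)
    {
        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char ch = text[i];
            if (XmlConvert.IsXmlChar(ch))
            {
                sb.Append(ch);
            }
            else if (i + 1 < text.Length && XmlConvert.IsXmlSurrogatePair(text[i + 1], ch))
            {
                // ch is a high surrogate followed by a matching low surrogate: keep both.
                sb.Append(ch).Append(text[i + 1]);
                i++;
            }
            // Anything else (control characters, lone surrogates) is dropped.
        }
        return sb.ToString();
    }
}
```

Calling XmlSanitizer.RemoveInvalidXmlChars("a\v\U0001F600\0b") should drop the \v and \0 while leaving the emoji intact.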

To escape invalid XML characters instead, I suggest using the XmlConvert.EncodeName method. Here is a small sample:

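// Requires: using System; using System.Xml;
void Main() {
    const string content = "\v\f\0";
    Console.WriteLine(IsValidXmlString(content)); // False

    // EncodeName replaces each invalid character with an _xHHHH_ escape sequence.
    string encoded = XmlConvert.EncodeName(content);
    Console.WriteLine(IsValidXmlString(encoded)); // True

    // DecodeName reverses the encoding.
    string decoded = XmlConvert.DecodeName(encoded);
    Console.WriteLine(content == decoded); // True
}

// VerifyXmlChars throws if the string contains a character XML does not allow.
static bool IsValidXmlString(string text) {
    try {
        XmlConvert.VerifyXmlChars(text);
        return true;
    } catch {
        return false;
    }
}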

Update:
It should be mentioned that the encoding operation produces a string whose length is greater than or equal to the length of the source string. This can matter when you store the encoded string in a database column with a length limit but validate only the length of the source string in your app.
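
To make the length issue concrete, here is a small sketch that validates the encoded length rather than the source length. The class and method names (EncodedLengthCheck, FitsColumn) and the maxLength parameter are hypothetical and stand in for whatever column limit you actually have.

```csharp
using System.Xml;

static class EncodedLengthCheck
{
    // Returns true when the encoded form of 'value' fits in a column of 'maxLength' characters.
    public static bool FitsColumn(string value, int maxLength)
    {
        string encoded = XmlConvert.EncodeName(value);
        return encoded.Length <= maxLength;
    }
}
```

With this check, a three-character source string such as "\v\f\0" no longer passes for a three-character column, because each control character expands into a longer escape sequence.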

Need to remove illegal characters in XML string

There is no general solution to this, because you have no way of determining whether:

<xml>You can use <b></b> to highlight stuff in HTML.</xml>.

is a "mistake" and should actually be encoded:

<xml>You can use &lt;b&gt;&lt;/b&gt; to highlight stuff in HTML.</xml>.

or not.

Thus, since there is no general solution, you can only use imperfect heuristics to detect such issues.

There is no built-in heuristic in the C# BCL; you will have to roll your own or find an external library. A simple heuristic, for example, would be to find every < that is not followed by [/a-zA-Z0-9]+> and escape it.
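
The simple heuristic just described could be sketched roughly as follows. The class and method names are made up for illustration, and the usual caveats apply: a tag with attributes (such as <a href="...">) does not match [/a-zA-Z0-9]+> and would be escaped too, while a stray <b></b> inside text would be left alone.

```csharp
using System.Text.RegularExpressions;

static class AngleBracketHeuristic
{
    // Escapes every '<' that is not followed by something shaped like "name>" or "/name>".
    public static string EscapeStrayLessThan(string text)
    {
        return Regex.Replace(text, @"<(?![/a-zA-Z0-9]+>)", "&lt;");
    }
}
```

For example, EscapeStrayLessThan("<xml>a < b</xml>") leaves the <xml> and </xml> tags untouched and rewrites only the middle < as &lt;.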

Heuristics are intrinsically imperfect, so if you have the opportunity to fix the system creating those broken looks-like-XML-but-isn't files, this would be a much better solution.

Why is the ampersand an invalid character in XML?

Because the & is used to denote an XML entity. It's used as the "escape" character for other invalid characters (e.g. &lt; meaning <), so it can't itself be a valid character in XML. How could you tell whether & was an ampersand, or the beginning of &gt;?

In order to express an ampersand in XML, you need to use &amp;.


This is similar to how, in C (and similar languages), \ is used to introduce escapes such as \n (newline) and \t (tab), and must itself be escaped as \\.

C# Escape illegal xml characters from node text only

You can use a regex similar to the one in this related question. It essentially matches all unescaped ampersands (i.e. it will match &, but not &something;).

var xml = @"<node>Something & something < annoying</node>";

var result = Regex.Replace(xml, @"&(?!\w*;)", "&amp;");

// output: <node>Something &amp; something < annoying</node>
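
One thing the pattern above does not account for is numeric character references: # is not a word character, so the & in &#38; would be escaped a second time. The following is a sketch of a stricter lookahead that recognizes named, decimal and hexadecimal references. The class and method names are hypothetical, and like the original it only deals with ampersands, not with <.

```csharp
using System.Text.RegularExpressions;

static class AmpersandEscaper
{
    // Matches an '&' that does not begin a named (&amp;), decimal (&#38;)
    // or hexadecimal (&#x26;) reference.
    private const string UnescapedAmpersand =
        @"&(?!(?:#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z_][\w.-]*);)";

    public static string EscapeBareAmpersands(string xml)
    {
        return Regex.Replace(xml, UnescapedAmpersand, "&amp;");
    }
}
```

Given "a & b &amp; &#38; &#x26;", only the first ampersand is rewritten; the three existing references are left as they are.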

What are invalid characters in XML

The only illegal characters are &, < and > (as well as " or ' in attributes, depending on which character is used to delimit the attribute value: attr="must use &quot; here, ' is allowed" and attr='must use &apos; here, " is allowed').

They're escaped using XML entities; in this case you want &amp; for &.

Really, though, you should use a tool or library that writes XML for you and abstracts this kind of thing away, so you don't have to worry about it.
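
As one illustration of letting a library do the escaping, LINQ to XML (System.Xml.Linq) escapes markup characters when it serializes text content. This is only a sketch; the class and method names and the element name "node" are arbitrary choices for the example.

```csharp
using System.Xml.Linq;

static class XmlBuilderDemo
{
    // Wraps raw text in a <node> element; XElement escapes & and < on output.
    public static string BuildNode(string text)
    {
        return new XElement("node", text).ToString();
    }
}
```

BuildNode("Something & something < annoying") returns <node>Something &amp; something &lt; annoying</node>, with no manual entity handling.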

How to escape invalid characters inside XML string in C#

I guess the simplest solution is to load the whole XML document into memory as an XmlDocument, then go through the elements and replace their values with the HTML-encoded form.

Illegal characters in XML

I think your best bet is to "SafeEncode" the string entered by the user. This link http://msdn.microsoft.com/en-us/library/system.security.securityelement.escape(VS.80).aspx shows how to do it easily with one call to the SecurityElement.Escape(string s) method.
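
A minimal sketch of that call, with a hypothetical wrapper name. Note that SecurityElement.Escape only replaces the five markup characters (<, >, ", ' and &); it does not remove control characters that XML disallows.

```csharp
using System.Security;

static class UserInputEncoder
{
    // Escapes <, >, ", ' and & so the text can be placed in element or attribute content.
    public static string ToXmlText(string userInput)
    {
        return SecurityElement.Escape(userInput);
    }
}
```

For instance, ToXmlText("Tom & \"Jerry\" <3") yields Tom &amp; &quot;Jerry&quot; &lt;3.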

Replace character references for invalid XML characters

I made another go at it using regular expressions. This should handle both decimal and hex character codes. Also, this will not affect anything but numerically encoded characters.

// Requires: using System.Globalization; using System.Text.RegularExpressions;
public string ReplaceXMLEncodedCharacters(string input)
{
    // Group 1 is "x" for hexadecimal references (&#x41;) and empty for decimal ones (&#65;).
    const string pattern = @"&#(x?)([A-Fa-f0-9]+);";
    return Regex.Replace(input, pattern, match =>
    {
        bool isHex = match.Groups[1].Value.Length > 0;
        int charCode;
        bool parsed = int.TryParse(
            match.Groups[2].Value,
            isHex ? NumberStyles.AllowHexSpecifier : NumberStyles.Integer,
            CultureInfo.InvariantCulture,
            out charCode);

        // Leave the reference as it is when it does not name a valid Unicode scalar value.
        if (!parsed || charCode < 0 || charCode > 0x10FFFF
            || (charCode >= 0xD800 && charCode <= 0xDFFF))
            return match.Value;

        // ConvertFromUtf32 also handles code points above U+FFFF (as a surrogate pair).
        return char.ConvertFromUtf32(charCode);
    });
}

