How to Remove Invalid Hexadecimal Characters from an XML-Based Data Source Prior to Constructing an XmlReader or XPathDocument That Uses the Data

How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?

It may not be perfect (emphasis added because people keep missing this disclaimer), but what I've done in that case is below. You can adapt it to work with a stream.

using System.Text;
using System.Xml;

/// <summary>
/// Removes characters that are not allowed in XML 1.0
/// </summary>
/// <param name="inString">The string to process</param>
/// <returns>The input with every character that XML 1.0 disallows removed</returns>
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null) return null;

    StringBuilder newString = new StringBuilder(inString.Length);

    for (int i = 0; i < inString.Length; i++)
    {
        char ch = inString[i];
        // keep only characters in the XML 1.0 Char ranges (tab, LF and CR are the only control characters allowed)
        // if using a .NET version prior to 4, test the ranges by hand instead:
        //if ((ch >= 0x0020 && ch <= 0xD7FF) || (ch >= 0xE000 && ch <= 0xFFFD) || ch == '\t' || ch == '\n' || ch == '\r')
        if (XmlConvert.IsXmlChar(ch)) // this method is new in .NET 4
        {
            newString.Append(ch);
        }
        // IsXmlChar rejects a lone surrogate, so keep valid surrogate pairs explicitly
        else if (i + 1 < inString.Length && XmlConvert.IsXmlSurrogatePair(inString[i + 1], ch))
        {
            newString.Append(ch).Append(inString[++i]);
        }
    }
    return newString.ToString();
}
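
For the stream case mentioned above, one possible approach is to wrap the underlying TextReader so the filtering happens while the data is being read, without loading everything into a string first. This is only a sketch, not part of the original answer: the class name XmlSanitizingReader and its fields are made up for illustration, and it assumes .NET 4 or later for XmlConvert.IsXmlChar.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper (not from the original answer): a TextReader that
// silently drops characters XML 1.0 does not allow.
public sealed class XmlSanitizingReader : TextReader
{
    private readonly TextReader _inner;
    private int _pendingLow = -1;  // low surrogate of an accepted pair, returned next
    private int _pushedBack = -1;  // raw character read ahead that still needs checking

    public XmlSanitizingReader(TextReader inner)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public override int Read()
    {
        if (_pendingLow >= 0)
        {
            int low = _pendingLow;
            _pendingLow = -1;
            return low;
        }

        while (true)
        {
            int next = _pushedBack >= 0 ? _pushedBack : _inner.Read();
            _pushedBack = -1;
            if (next < 0) return -1; // end of input

            char ch = (char)next;
            if (XmlConvert.IsXmlChar(ch)) return next;

            // A high surrogate is kept only when a low surrogate follows it.
            if (char.IsHighSurrogate(ch))
            {
                int following = _inner.Read();
                if (following >= 0 && char.IsLowSurrogate((char)following))
                {
                    _pendingLow = following;
                    return next;
                }
                _pushedBack = following; // re-check it on the next loop pass
            }
            // anything else is invalid and is skipped
        }
    }

    // Read(char[], int, int) and ReadToEnd in the base class are built on Read(),
    // so they are filtered too. Peek() is left at the base default (-1).

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

It could then be passed wherever a TextReader is accepted, for example XmlReader.Create(new XmlSanitizingReader(new StreamReader(path))) or new XPathDocument(new XmlSanitizingReader(new StreamReader(path))), where path stands for your own file.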

Remove all hexadecimal characters before loading string into XML Document Object?

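Here is an example that strips invalid XML characters using Regex:

using System.Text.RegularExpressions;
using System.Xml;

xmlString = CleanInvalidXmlChars(xmlString);
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlString);

public static string CleanInvalidXmlChars(string text)
{
    // .NET's \x escape takes exactly two hex digits, so anything above 0xFF needs \uXXXX.
    // First alternative: characters outside tab, LF, CR and 0x20-0xFFFD.
    // Second and third: unpaired surrogates (valid pairs, i.e. U+10000-U+10FFFF, are kept).
    string re = @"[^\x09\x0A\x0D\x20-\uFFFD]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
    return Regex.Replace(text, re, "");
}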

How to stop XMLReader throwing Invalid XML Character Exception

The problem is that you don't have XML -- you have some string that sure looks like XML but unfortunately doesn't really qualify. Fortunately you can tell XmlReader to be more lenient:

using (XmlReader reader = XmlReader.Create(new StringReader(xml), new XmlReaderSettings { CheckCharacters = false }))
{
    while (reader.Read())
    {
        // do my thing
    }
}

Note that you will still end up with XML that, when serialized, might produce problems further down the line, so you may wish to filter the characters out afterwards anyway as you're reading it.
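
To make the "filter as you're reading" part concrete, here is a small sketch. The class and method names (LenientReadExample, ReadCleanText) are made up for illustration. It assumes the offending characters arrive as character references such as &#x1;, which as far as I know is the case CheckCharacters = false is documented to relax.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;

public static class LenientReadExample
{
    // Hypothetical helper: collects the text of every Text/CDATA node,
    // with characters that XML 1.0 disallows removed.
    public static List<string> ReadCleanText(string xml)
    {
        var result = new List<string>();
        var settings = new XmlReaderSettings { CheckCharacters = false };

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text || reader.NodeType == XmlNodeType.CDATA)
                {
                    // IsXmlChar works per UTF-16 code unit, so surrogate pairs
                    // are dropped as well; handle them separately if you need them.
                    char[] kept = reader.Value.Where(c => XmlConvert.IsXmlChar(c)).ToArray();
                    result.Add(new string(kept));
                }
            }
        }
        return result;
    }
}
```

Attribute values would need the same treatment if they can contain such references.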

C# remove not UTF-8 supported values from XML for XslCompiledTransform.Transform

If you look at https://www.w3.org/TR/xml/#charsets you will find that the allowed characters include the range [#xE000-#xFFFD], which clearly does not include #xFFFE. So this character cannot be part of a well-formed XML 1.0 document. In your code sample it is not XslCompiledTransform or XSLT rejecting it; it is simply the underlying parser, XmlReader.

If you want to process such malformed input with XmlReader, you can use XmlReaderSettings with CheckCharacters = false and then, I think, eliminate such characters by checking each one with e.g. XmlConvert.IsXmlChar.

With the help of XmlWrappingReader from the MvpXml library (https://github.com/keimpema/Mvp.Xml.NetStandard) you could implement a filtering XmlReader:

using System.Linq;
using System.Xml;

public class MyWrappingReader : XmlWrappingReader
{
    public MyWrappingReader(XmlReader baseReader) : base(baseReader) { }

    public override string Value => base.NodeType == XmlNodeType.Text || base.NodeType == XmlNodeType.CDATA || base.NodeType == XmlNodeType.Attribute ? CleanString(base.Value) : base.Value;

    public override string ReadString()
    {
        if (base.NodeType == XmlNodeType.Text || base.NodeType == XmlNodeType.CDATA || base.NodeType == XmlNodeType.Attribute)
        {
            return CleanString(base.ReadString());
        }
        else
        {
            return base.ReadString();
        }
    }

    public override string GetAttribute(int i)
    {
        return CleanString(base.GetAttribute(i));
    }

    public override string GetAttribute(string localName, string namespaceUri)
    {
        return CleanString(base.GetAttribute(localName, namespaceUri));
    }

    public override string GetAttribute(string name)
    {
        return CleanString(base.GetAttribute(name));
    }

    private string CleanString(string input)
    {
        // GetAttribute returns null when the attribute does not exist
        if (input == null) return null;
        return string.Join("", input.ToCharArray().Where(c => XmlConvert.IsXmlChar(c)));
    }
}

Then use that reader to filter your input and XslCompiledTransform should work on the cleaned XML e.g. the following runs fine:

string document = "<?xml version=\"1.0\" encoding=\"utf-8\"?><FirstTag><Second att1='value&#xFFFE;'><Third>a&#xFFFE;</Third></Second></FirstTag>";

string xsltIdentity = @"<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'><xsl:template match='@* | node()'><xsl:copy><xsl:apply-templates select='@* | node()'/></xsl:copy></xsl:template></xsl:stylesheet>";

using (StringReader sr = new StringReader(document))
{
    using (XmlReader xr = new MyWrappingReader(XmlReader.Create(sr, new XmlReaderSettings() { CheckCharacters = false })))
    {
        using (StringReader xsltSrReader = new StringReader(xsltIdentity))
        {
            using (XmlReader xsltReader = XmlReader.Create(xsltSrReader))
            {
                XslCompiledTransform processor = new XslCompiledTransform();
                processor.Load(xsltReader);
                processor.Transform(xr, null, Console.Out);
                Console.WriteLine();
            }
        }
    }
}

Loading XML without invalid character errors

Not sure exactly how it works in C#, but if you can somehow get it to parse XML from a 'string' rather than a file, you could first load the file into a 'string', filter the string yourself, then send it off for XML parsing.

Edit:

Maybe 'string' is a poor choice, as the corrupt data might have NULs in it, etc. A byte array or some other generic memory stream structure?
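
A rough sketch of that idea in C#: read the raw bytes, decode them, drop whatever XML does not allow (NULs included), then parse. The names RawXmlLoader and LoadFromBytes are hypothetical, and the code assumes the data is UTF-8 and that .NET 4's XmlConvert.IsXmlChar is available.

```csharp
using System.IO;
using System.Linq;
using System.Text;
using System.Xml;

public static class RawXmlLoader
{
    // Hypothetical helper: decode raw bytes, drop invalid characters, then parse.
    public static XmlDocument LoadFromBytes(byte[] data)
    {
        string text;
        // StreamReader skips a leading byte order mark; UTF-8 is an assumption here.
        using (var reader = new StreamReader(new MemoryStream(data), Encoding.UTF8))
        {
            text = reader.ReadToEnd();
        }

        // Keep only what XmlConvert.IsXmlChar accepts. This removes NULs and other
        // control characters and, as a side effect, surrogate pairs.
        string clean = new string(text.Where(c => XmlConvert.IsXmlChar(c)).ToArray());

        var doc = new XmlDocument();
        doc.LoadXml(clean);
        return doc;
    }
}
```

A call might look like RawXmlLoader.LoadFromBytes(File.ReadAllBytes(path)), with path being your own file. A .NET string can hold NUL characters without trouble, so decoding to a string before filtering should be fine.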


