How do you remove invalid hexadecimal characters from an XML-based data source prior to constructing an XmlReader or XPathDocument that uses the data?
It may not be perfect (emphasis added since people keep missing this disclaimer), but here is what I've done in that case. You can adapt it to work with a stream.
using System.Text;
using System.Xml;

/// <summary>
/// Removes characters that are not allowed in XML 1.0 (most control characters, among others)
/// </summary>
/// <param name="inString">The string to process</param>
/// <returns>A string containing only characters accepted by XmlConvert.IsXmlChar</returns>
public static string RemoveTroublesomeCharacters(string inString)
{
    if (inString == null) return null;

    StringBuilder newString = new StringBuilder(inString.Length);

    foreach (char ch in inString)
    {
        // If using a .NET version prior to 4, XmlConvert.IsXmlChar is not available; use this
        // test instead. It keeps 0x0020-0x00FC plus tabs, line feeds and carriage returns:
        //if ((ch < 0x00FD && ch > 0x001F) || ch == '\t' || ch == '\n' || ch == '\r')

        // IsXmlChar tests a single UTF-16 code unit, so surrogate halves
        // (characters outside the Basic Multilingual Plane) are removed as well.
        if (XmlConvert.IsXmlChar(ch)) // this method is new in .NET 4
        {
            newString.Append(ch);
        }
    }
    return newString.ToString();
}
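To adapt the idea to a stream, one option is to wrap the TextReader that feeds XmlReader. The following is only a minimal sketch: XmlCharFilteringReader is a hypothetical name, not a framework class, and it relies on the base TextReader methods (ReadToEnd, the buffer-based Read) routing through the single-character Read() override, which is simple but not fast. Like the method above, it tests one UTF-16 code unit at a time.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper (not part of the .NET framework): wraps any TextReader
// and drops every character that XmlConvert.IsXmlChar rejects.
public sealed class XmlCharFilteringReader : TextReader
{
    private readonly TextReader _inner;

    public XmlCharFilteringReader(TextReader inner)
    {
        _inner = inner;
    }

    // Discards invalid characters so that Peek reports the next valid one.
    public override int Peek()
    {
        int c;
        while ((c = _inner.Peek()) != -1 && !XmlConvert.IsXmlChar((char)c))
        {
            _inner.Read();
        }
        return c;
    }

    // Returns the next valid character, or -1 at the end of the input.
    public override int Read()
    {
        int c;
        while ((c = _inner.Read()) != -1 && !XmlConvert.IsXmlChar((char)c))
        {
            // keep reading until a valid character or the end is reached
        }
        return c;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}
```

It could then be passed along as XmlReader.Create(new XmlCharFilteringReader(new StreamReader(stream))), where stream is whatever source you already have.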
Remove all hexadecimal characters before loading string into XML Document Object?
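Here is an example that removes invalid XML characters using Regex:
xmlString = CleanInvalidXmlChars(xmlString);
XmlDocument xmlDoc = new XmlDocument();
xmlDoc.LoadXml(xmlString);

public static string CleanInvalidXmlChars(string text)
{
    // .NET regexes work on UTF-16 code units: \x takes exactly two hex digits, so \uXXXX is
    // used here, and characters above U+FFFF have to be treated as surrogate pairs.
    // Removes disallowed control characters, U+FFFE/U+FFFF and unpaired surrogates.
    string re = @"[\x00-\x08\x0B\x0C\x0E-\x1F\uFFFE\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]";
    return Regex.Replace(text, re, "");
}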
How to stop XMLReader throwing Invalid XML Character Exception
The problem is that you don't have XML -- you have some string that sure looks like XML but unfortunately doesn't really qualify. Fortunately you can tell XmlReader to be more lenient:
using (XmlReader reader = XmlReader.Create(new StringReader(xml), new XmlReaderSettings { CheckCharacters = false }))
{
    while (reader.Read())
    {
        //do my thing
    }
}
Note that you will still end up with XML that, when serialized, might produce problems further down the line, so you may wish to filter the characters out afterwards anyway as you're reading it.
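One caveat, as far as I understand the documentation for XmlReaderSettings.CheckCharacters: setting it to false relaxes the check for character entity references such as &#x1;, while raw text content is still validated. Below is a rough sketch of filtering while reading. LenientXmlText, Clean and ReadCleanText are hypothetical names chosen for illustration, and the sketch only collects the cleaned text and CDATA values.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;

public static class LenientXmlText
{
    // Hypothetical helper: keeps only characters that XmlConvert.IsXmlChar accepts.
    public static string Clean(string value)
    {
        if (value == null) return null;
        return new string(value.Where(c => XmlConvert.IsXmlChar(c)).ToArray());
    }

    // Reads leniently and returns the cleaned value of every text and CDATA node.
    public static List<string> ReadCleanText(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        var result = new List<string>();

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Text || reader.NodeType == XmlNodeType.CDATA)
                {
                    // The raw value may still contain characters that came from
                    // entity references; strip them before using the value.
                    result.Add(Clean(reader.Value));
                }
            }
        }
        return result;
    }
}
```

For example, LenientXmlText.ReadCleanText("<a>x&#x1;y</a>") is meant to yield a single entry with the control character removed.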
C# remove not UTF-8 supported values from XML for XslCompiledTransform.Transform
If you look at https://www.w3.org/TR/xml/#charsets you will find the allowed characters with the range [#xE000-#xFFFD], clearly not including #xFFFE. So this character is not part of a well-formed XML 1.0 document; in your code sample it is not XslCompiledTransform or XSLT rejecting it, it is simply the underlying parser, XmlReader.
If you want to process such malformed input with XmlReader you can use the XmlReaderSettings with CheckCharacters = false and eliminate such characters, I think, by checking each with e.g. XmlConvert.IsXmlChar.
With the help of XmlWrappingReader from the MvpXml library (https://github.com/keimpema/Mvp.Xml.NetStandard) you could implement a filtering XmlReader:
// Requires using System.Linq; and using System.Xml; XmlWrappingReader comes from the MvpXml library.
public class MyWrappingReader : XmlWrappingReader
{
    public MyWrappingReader(XmlReader baseReader) : base(baseReader) { }

    // Only text-like nodes carry character data that needs cleaning.
    private bool IsTextLikeNode =>
        base.NodeType == XmlNodeType.Text ||
        base.NodeType == XmlNodeType.CDATA ||
        base.NodeType == XmlNodeType.Attribute;

    public override string Value => IsTextLikeNode ? CleanString(base.Value) : base.Value;

    public override string ReadString()
    {
        if (IsTextLikeNode)
        {
            return CleanString(base.ReadString());
        }
        else
        {
            return base.ReadString();
        }
    }

    public override string GetAttribute(int i)
    {
        return CleanString(base.GetAttribute(i));
    }

    public override string GetAttribute(string localName, string namespaceUri)
    {
        return CleanString(base.GetAttribute(localName, namespaceUri));
    }

    public override string GetAttribute(string name)
    {
        return CleanString(base.GetAttribute(name));
    }

    private string CleanString(string input)
    {
        // GetAttribute returns null when the attribute does not exist
        if (input == null) return null;
        return string.Join("", input.Where(c => XmlConvert.IsXmlChar(c)));
    }
}
Then use that reader to filter your input, and XslCompiledTransform should work on the cleaned XML; for example, the following runs fine:
string document = "<?xml version=\"1.0\" encoding=\"utf-8\"?><FirstTag><Second att1='value'><Third>a</Third></Second></FirstTag>";
string xsltIdentity = @"<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'><xsl:template match='@* | node()'><xsl:copy><xsl:apply-templates select='@* | node()'/></xsl:copy></xsl:template></xsl:stylesheet>";
using (StringReader sr = new StringReader(document))
{
    using (XmlReader xr = new MyWrappingReader(XmlReader.Create(sr, new XmlReaderSettings() { CheckCharacters = false })))
    {
        using (StringReader xsltSrReader = new StringReader(xsltIdentity))
        {
            using (XmlReader xsltReader = XmlReader.Create(xsltSrReader))
            {
                XslCompiledTransform processor = new XslCompiledTransform();
                processor.Load(xsltReader);
                processor.Transform(xr, null, Console.Out);
                Console.WriteLine();
            }
        }
    }
}
Loading XML without invalid character errors
I'm not sure exactly how it works in C#, but if you can get it to parse XML from a string rather than a file, you could first load the file into a string, filter the string yourself, and then send it off for XML parsing.
Edit:
Maybe a string is a poor choice, as the corrupt data might have NULs in it, etc. A byte array or some other generic memory stream structure might work better.
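In C# that suggestion could look roughly like the sketch below. A .NET string can hold NUL characters, so filtering on a string works once the bytes have been decoded. FilteredXmlLoader and its methods are hypothetical names, and UTF-8 is only an assumed encoding for the file.

```csharp
using System.IO;
using System.Text;
using System.Xml;

public static class FilteredXmlLoader
{
    // Drops every UTF-16 code unit that XmlConvert.IsXmlChar rejects (NUL included).
    public static string Filter(string raw)
    {
        var sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            if (XmlConvert.IsXmlChar(c)) sb.Append(c);
        }
        return sb.ToString();
    }

    // Filters the raw text first, then hands the result to the XML parser.
    public static XmlDocument LoadFromString(string raw)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Filter(raw));
        return doc;
    }

    // Assumption: the file is UTF-8; pass the text through another Encoding if it is not.
    public static XmlDocument LoadFromFile(string path)
    {
        return LoadFromString(File.ReadAllText(path, Encoding.UTF8));
    }
}
```

This only removes stray characters; markup that is broken in other ways will still make LoadXml throw.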