Using StringWriter for XML Serialization

TL;DR: The problem is actually rather simple: you are not matching the declared encoding (in the XML declaration) with the datatype of the input parameter. If you manually added <?xml version="1.0" encoding="utf-8"?><test/> to the string, then declaring the SqlParameter to be of type SqlDbType.Xml or SqlDbType.NVarChar would give you the "unable to switch the encoding" error. Then, when inserting manually via T-SQL, since you switched the declared encoding to utf-16, you were clearly inserting a VARCHAR string (not prefixed with an upper-case "N", hence an 8-bit encoding such as UTF-8) and not an NVARCHAR string (prefixed with an upper-case "N", hence the 16-bit UTF-16 LE encoding).

The fix should have been as simple as:

  1. In the first case, when adding the declaration stating encoding="utf-8": simply don't add the XML declaration.
  2. In the second case, when adding the declaration stating encoding="utf-16": either

    1. simply don't add the XML declaration, OR
    2. simply add an "N" to the input parameter type: SqlDbType.NVarChar instead of SqlDbType.VarChar :-) (or possibly even switch to using SqlDbType.Xml)

(Detailed response is below)


All of the answers here are over-complicated and unnecessary (regardless of the 121 and 184 up-votes for Christian's and Jon's answers, respectively). They might provide working code, but none of them actually answer the question. The issue is that nobody truly understood the question, which ultimately is about how the XML datatype in SQL Server works. Nothing against those two clearly intelligent people, but this question has little to nothing to do with serializing to XML. Saving XML data into SQL Server is much easier than what is being implied here.

It doesn't really matter how the XML is produced as long as you follow the rules of how to create XML data in SQL Server. I have a more thorough explanation (including working example code to illustrate the points outlined below) in an answer on this question: How to solve “unable to switch the encoding” error when inserting XML into SQL Server, but the basics are:

  1. The XML declaration is optional
  2. The XML datatype stores strings always as UCS-2 / UTF-16 LE
  3. If your XML is UCS-2 / UTF-16 LE, then you:

    1. pass in the data as either NVARCHAR(MAX) or XML / SqlDbType.NVarChar (maxsize = -1) or SqlDbType.Xml, or if using a string literal then it must be prefixed with an upper-case "N".
    2. if specifying the XML declaration, it must be either "UCS-2" or "UTF-16" (no real difference here)
  4. If your XML is 8-bit encoded (e.g. "UTF-8" / "iso-8859-1" / "Windows-1252"), then you:

    1. need to specify the XML declaration IF the encoding is different than the code page specified by the default Collation of the database
    2. you must pass in the data as VARCHAR(MAX) / SqlDbType.VarChar (maxsize = -1), or if using a string literal then it must not be prefixed with an upper-case "N".
    3. Whatever 8-bit encoding is used, the "encoding" noted in the XML declaration must match the actual encoding of the bytes.
    4. The 8-bit encoding will be converted into UTF-16 LE by the XML datatype

With the points outlined above in mind, and given that strings in .NET are always UTF-16 LE / UCS-2 LE (there is no difference between those in terms of encoding), we can answer your questions:

Is there a reason why I shouldn't use StringWriter to serialize an Object when I need it as a string afterwards?

No, your StringWriter code appears to be just fine (at least I see no issues in my limited testing using the 2nd code block from the question).

Wouldn't setting the encoding to UTF-16 (in the xml tag) work then?

It isn't necessary to provide the XML declaration. When it is missing, the encoding is assumed to be UTF-16 LE if you pass the string into SQL Server as NVARCHAR (i.e. SqlDbType.NVarChar) or XML (i.e. SqlDbType.Xml). The encoding is assumed to be the default 8-bit Code Page if passing in as VARCHAR (i.e. SqlDbType.VarChar). If you have any non-standard-ASCII characters (i.e. values 128 and above) and are passing in as VARCHAR, then you will likely see "?" for BMP characters and "??" for Supplementary Characters as SQL Server will convert the UTF-16 string from .NET into an 8-bit string of the current Database's Code Page before converting it back into UTF-16 / UCS-2. But you shouldn't get any errors.

On the other hand, if you do specify the XML declaration, then you must pass into SQL Server using the matching 8-bit or 16-bit datatype. So if you have a declaration stating that the encoding is either UCS-2 or UTF-16, then you must pass in as SqlDbType.NVarChar or SqlDbType.Xml. Or, if you have a declaration stating that the encoding is one of the 8-bit options (i.e. UTF-8, Windows-1252, iso-8859-1, etc), then you must pass in as SqlDbType.VarChar. Failure to match the declared encoding with the proper 8 or 16 -bit SQL Server datatype will result in the "unable to switch the encoding" error that you were getting.

For example, using your StringWriter-based serialization code, I simply printed the resulting string of the XML and used it in SSMS. As you can see below, the XML declaration is included (because StringWriter does not have an option to OmitXmlDeclaration like XmlWriter does), which poses no problem so long as you pass the string in as the correct SQL Server datatype:

-- Upper-case "N" prefix == NVARCHAR, hence no error:
DECLARE @Xml XML = N'<?xml version="1.0" encoding="utf-16"?>
<string>Test ሴ😸</string>';
SELECT @Xml;
-- <string>Test ሴ😸</string>

As you can see, it even handles characters beyond standard ASCII, given that ሴ is BMP Code Point U+1234, and 😸 is Supplementary Character Code Point U+1F638. However, the following:

-- No upper-case "N" prefix on the string literal, hence VARCHAR:
DECLARE @Xml XML = '<?xml version="1.0" encoding="utf-16"?>
<string>Test ሴ😸</string>';

results in the following error:

Msg 9402, Level 16, State 1, Line XXXXX
XML parsing: line 1, character 39, unable to switch the encoding

Ergo, all of that explanation aside, the full solution to your original question is:

You were clearly passing the string in as SqlDbType.VarChar. Switch to SqlDbType.NVarChar and it will work without needing to go through the extra step of removing the XML declaration. This is preferred over keeping SqlDbType.VarChar and removing the XML declaration because this solution will prevent data loss when the XML includes non-standard-ASCII characters. For example:

-- No upper-case "N" prefix on the string literal == VARCHAR, and no XML declaration:
DECLARE @Xml2 XML = '<string>Test ሴ😸</string>';
SELECT @Xml2;
-- <string>Test ???</string>

As you can see, there is no error this time, but now there is data-loss 🙀.
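In C#, that fix is a one-line change on the parameter declaration. A minimal sketch (the connection string variable, table, and column names here are hypothetical, and xmlString is assumed to hold the output of your StringWriter code):

using System.Data;            // SqlDbType
using System.Data.SqlClient;  // SqlConnection, SqlCommand

// Pass the serialized XML, declaration and all, as NVARCHAR so it matches
// the encoding="utf-16" declaration that StringWriter produces.
using (var conn = new SqlConnection(connectionString))
using (var cmd = new SqlCommand("INSERT INTO dbo.Docs (Payload) VALUES (@Xml);", conn))
{
    // SqlDbType.NVarChar with size -1 maps to NVARCHAR(MAX); SqlDbType.Xml also works:
    cmd.Parameters.Add("@Xml", SqlDbType.NVarChar, -1).Value = xmlString;
    conn.Open();
    cmd.ExecuteNonQuery();
}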

How to return XML as UTF-8 instead of UTF-16


Encoding of the Response

I am not too familiar with this part of the framework, but according to MSDN you can set the content encoding of an HttpResponse like this:

httpContextBase.Response.ContentEncoding = Encoding.UTF8;

Encoding as seen by the XmlSerializer

After reading your question again I see that this is the tough part. The problem lies in the use of the StringWriter. Because .NET strings are always stored as UTF-16 (citation needed ^^), the StringWriter reports UTF-16 as its encoding. Thus the XmlSerializer writes the XML declaration as

<?xml version="1.0" encoding="utf-16"?>
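You can verify this in a couple of lines (a minimal sketch):

using System;
using System.IO;
using System.Xml.Serialization;

// Serializing through a plain StringWriter always yields an encoding="utf-16"
// declaration, because StringWriter.Encoding returns Encoding.Unicode (UTF-16 LE).
using (var sw = new StringWriter())
{
    new XmlSerializer(typeof(string)).Serialize(sw, "test");
    Console.WriteLine(sw.ToString());
    // Prints something like:
    // <?xml version="1.0" encoding="utf-16"?>
    // <string ...>test</string>
}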

To work around that you can write into a MemoryStream like this:

using (MemoryStream stream = new MemoryStream())
using (StreamWriter writer = new StreamWriter(stream, Encoding.UTF8))
{
    XmlSerializer xml = new XmlSerializer(typeof(T));
    xml.Serialize(writer, Data);
    writer.Flush(); // defensive: make sure any buffered output has reached the stream

    // I am not 100% sure if this can be optimized
    httpContextBase.Response.BinaryWrite(stream.ToArray());
}

Other approaches

Another edit: I just noticed this SO answer linked by jtm001. Condensed, the solution there is to provide the XmlSerializer with a custom XmlWriter that is configured to use UTF-8 as the encoding.
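A condensed sketch of that approach (not the linked answer's exact code; T and Data are assumed from the serialization code above):

using System.Text;
using System.Xml;
using System.Xml.Serialization;

// Configure the XmlWriter itself to declare and emit UTF-8:
var settings = new XmlWriterSettings { Encoding = Encoding.UTF8 };
using (XmlWriter xw = XmlWriter.Create(httpContextBase.Response.OutputStream, settings))
{
    new XmlSerializer(typeof(T)).Serialize(xw, Data);
}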

Athari proposes deriving from StringWriter and advertising the encoding as UTF-8.
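That one is short enough to sketch in full (a minimal version of the idea):

using System.IO;
using System.Text;

// A StringWriter whose Encoding property reports UTF-8, so the XmlSerializer
// writes encoding="utf-8" in the XML declaration.
public class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding
    {
        get { return Encoding.UTF8; }
    }
}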

To my understanding both solutions should work as well. I think the take-away here is that you will need one kind of boilerplate code or another...

Can I Serialize XML straight to a string instead of a Stream with C#?

Fun with extension methods...

var ret = john.ToXmlString();

public static class XmlTools
{
    public static string ToXmlString<T>(this T input)
    {
        using (var writer = new StringWriter())
        {
            input.ToXml(writer);
            return writer.ToString();
        }
    }

    public static void ToXml<T>(this T objectToSerialize, Stream stream)
    {
        new XmlSerializer(typeof(T)).Serialize(stream, objectToSerialize);
    }

    public static void ToXml<T>(this T objectToSerialize, StringWriter writer)
    {
        new XmlSerializer(typeof(T)).Serialize(writer, objectToSerialize);
    }
}

XmlSerializer change encoding

Here is code that takes the encoding as a parameter. Please read the comments for an explanation of why there is a SuppressMessage attribute for code analysis.

/// <summary>
/// Serialize an object into an XML string
/// </summary>
/// <typeparam name="T">Type of object to serialize.</typeparam>
/// <param name="obj">Object to serialize.</param>
/// <param name="enc">Encoding of the serialized output.</param>
/// <returns>Serialized (xml) object.</returns>
[System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Usage", "CA2202:Do not dispose objects multiple times")]
internal static String SerializeObject<T>(T obj, Encoding enc)
{
    using (MemoryStream ms = new MemoryStream())
    {
        XmlWriterSettings xmlWriterSettings = new System.Xml.XmlWriterSettings()
        {
            // If set to true XmlWriter would close MemoryStream automatically and using would then do double dispose
            // Code analysis does not understand that. That's why there is a suppress message.
            CloseOutput = false,
            Encoding = enc,
            OmitXmlDeclaration = false,
            Indent = true
        };
        using (System.Xml.XmlWriter xw = System.Xml.XmlWriter.Create(ms, xmlWriterSettings))
        {
            XmlSerializer s = new XmlSerializer(typeof(T));
            s.Serialize(xw, obj);
        }

        return enc.GetString(ms.ToArray());
    }
}
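For illustration, a hypothetical call (the Person class and the XmlHelper container name are inventions for this example). One caveat worth knowing: if the encoding emits a preamble (e.g. Encoding.UTF8), enc.GetString(ms.ToArray()) will put a BOM character at the start of the returned string; passing new UTF8Encoding(false) avoids that:

using System.Text;

// Hypothetical usage; "Person" and "XmlHelper" are illustrative names only.
var person = new Person { Name = "Test" };
string xml = XmlHelper.SerializeObject(person, new UTF8Encoding(false)); // no BOM
// The declaration now reads: <?xml version="1.0" encoding="utf-8"?>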

