Serializing an object as UTF-8 XML in .NET

Your code doesn't get the UTF-8 into memory, because you read it back into a string again, so it's no longer in UTF-8 but back in UTF-16 (though ideally it's best to consider strings at a higher level than any encoding, except when forced to do so).

To get the actual UTF-8 octets you could use:

var serializer = new XmlSerializer(typeof(SomeSerializableObject));

var memoryStream = new MemoryStream();
var streamWriter = new StreamWriter(memoryStream, System.Text.Encoding.UTF8);

serializer.Serialize(streamWriter, entry);

byte[] utf8EncodedXml = memoryStream.ToArray();

I've left out the same disposal you've left out. I slightly favour the following (with normal disposal left in):

var serializer = new XmlSerializer(typeof(SomeSerializableObject));
using(var memStm = new MemoryStream())
using(var xw = XmlWriter.Create(memStm))
{
    serializer.Serialize(xw, entry);
    var utf8 = memStm.ToArray();
}

This is much the same amount of complexity, but it does show that at every stage there is a reasonable choice to do something else, the most pressing of which is to serialise to somewhere other than memory, such as a file, TCP/IP stream, database, etc. All in all, it's not really that verbose.

How to return xml as UTF-8 instead of UTF-16

Encoding of the Response

I am not very familiar with this part of the framework, but according to MSDN you can set the content encoding of an HttpResponse like this:

httpContextBase.Response.ContentEncoding = Encoding.UTF8;

Encoding as seen by the XmlSerializer

After reading your question again, I see that this is the tough part. The problem lies in the use of the StringWriter. Because .NET strings are always stored as UTF-16 (citation needed ^^), the StringWriter reports UTF-16 as its encoding. Thus the XmlSerializer writes the XML declaration as

<?xml version="1.0" encoding="utf-16"?>
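A quick way to check this is to ask a StringWriter which encoding it reports. The following is only a minimal sketch; the class and method names are made up for illustration.

```csharp
using System;
using System.IO;

public static class StringWriterEncodingDemo
{
    // Returns the name of the encoding a plain StringWriter advertises.
    public static string ReportedEncodingName()
    {
        using (var writer = new StringWriter())
        {
            // The writer targets an in-memory .NET string, so it reports UTF-16.
            return writer.Encoding.WebName;
        }
    }

    public static void Main()
    {
        Console.WriteLine(ReportedEncodingName()); // utf-16
    }
}
```

The XmlSerializer takes the name for the XML declaration from this property, which is why the declaration says utf-16 regardless of what you do with the string afterwards.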

To work around that you can write into a MemoryStream like this:

using (MemoryStream stream = new MemoryStream())
using (StreamWriter writer = new StreamWriter(stream, Encoding.UTF8))
{
    XmlSerializer xml = new XmlSerializer(typeof(T));
    xml.Serialize(writer, Data);

    // I am not 100% sure if this can be optimized
    httpContextBase.Response.BinaryWrite(stream.ToArray());
}

Other approaches

Another edit: I just noticed this SO answer linked by jtm001. Condensed, the solution there is to provide the XmlSerializer with a custom XmlWriter that is configured to use UTF-8 as its encoding.

Athari proposes to derive from the StringWriter and advertise the encoding as UTF8.

To my understanding both solutions should work as well. I think the take-away here is that you will need one kind of boilerplate code or another...
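The StringWriter approach could look roughly like the sketch below. The names Utf8StringWriter and Utf8StringWriterDemo are hypothetical, and the int array is just sample data; the only real change is the overridden Encoding property.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical helper: a StringWriter that advertises UTF-8 instead of UTF-16.
public sealed class Utf8StringWriter : StringWriter
{
    public override Encoding Encoding => Encoding.UTF8;
}

public static class Utf8StringWriterDemo
{
    // Serializes value to an XML string whose declaration names utf-8.
    public static string Serialize<T>(T value)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var writer = new Utf8StringWriter())
        {
            serializer.Serialize(writer, value);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Sample data only; any XML-serializable type would do.
        Console.WriteLine(Serialize(new[] { 1, 2, 3 }));
    }
}
```

Note that this only changes the declaration. The result is still a .NET string, so it has to be encoded as UTF-8 when it is finally written out as bytes.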

How to Specify XML Encoding when Serializing an Object in C#

If I change from Encoding.Unicode to Encoding.UTF8, the file is generated properly. Perhaps you're looking at an old version of your file?

In an unrelated bit, you should use using for deterministic disposal of objects which implement IDisposable:

XmlSerializer xmlSerializer = new XmlSerializer(typeof(MyObject));

using (Stream stream = new FileStream(@".\doc.xml", FileMode.Create))
using (XmlWriter xmlWriter = new XmlTextWriter(stream, Encoding.UTF8))
{
    xmlSerializer.Serialize(xmlWriter, myObject);
}

XmlSerializer change encoding

Here is code that takes the encoding as a parameter. Please read the comments to see why there is a SuppressMessage for code analysis.

/// <summary>
/// Serialize an object into an XML string
/// </summary>
/// <typeparam name="T">Type of object to serialize.</typeparam>
/// <param name="obj">Object to serialize.</param>
/// <param name="enc">Encoding of the serialized output.</param>
/// <returns>Serialized (xml) object.</returns>
[System.Diagnostics.CodeAnalysis.SuppressMessage("Microsoft.Usage", "CA2202:Do not dispose objects multiple times")]
internal static String SerializeObject<T>(T obj, Encoding enc)
{
    using (MemoryStream ms = new MemoryStream())
    {
        XmlWriterSettings xmlWriterSettings = new System.Xml.XmlWriterSettings()
        {
            // If set to true XmlWriter would close MemoryStream automatically and using would then do double dispose
            // Code analysis does not understand that. That's why there is a suppress message.
            CloseOutput = false,
            Encoding = enc,
            OmitXmlDeclaration = false,
            Indent = true
        };
        using (System.Xml.XmlWriter xw = System.Xml.XmlWriter.Create(ms, xmlWriterSettings))
        {
            XmlSerializer s = new XmlSerializer(typeof(T));
            s.Serialize(xw, obj);
        }

        // Note: if enc emits a byte order mark (as Encoding.UTF8 does),
        // the decoded string starts with the character U+FEFF.
        return enc.GetString(ms.ToArray());
    }
}

Easier way to serialize C# class as XML text

A little shorter :-)

var yourList = new List<int>() { 1, 2, 3 };
using (var writer = new StringWriter())
{
    new XmlSerializer(yourList.GetType()).Serialize(writer, yourList);
    var xmlEncodedList = writer.GetStringBuilder().ToString();
}

There is a flaw with the previous approach that's worth pointing out, though. It will generate a UTF-16 header because we use a StringWriter, so it is not exactly equivalent to your code. To get a UTF-8 header we should use a MemoryStream and an XmlWriter, which costs an additional line of code:

var yourList = new List<int>() { 1, 2, 3 };
using (var stream = new MemoryStream())
{
    using (var writer = XmlWriter.Create(stream))
    {
        new XmlSerializer(yourList.GetType()).Serialize(writer, yourList);
        var xmlEncodedList = Encoding.UTF8.GetString(stream.ToArray());
    }
}

Objects not serialising to XML (UTF-8) as expected .net?

What you're seeing is the byte order mark (BOM) that is often used at the start of text files or streams to indicate the byte order and the Unicode variant.

Your serializer is very strange. If you encode a string with some encoding such as UTF-8, you have to return it as an array of bytes. By first encoding the XML in UTF-8 and then decoding the UTF-8 stream back to a string, you gain nothing (except introducing the problematic BOM).

Either go with UTF-16 only or return a byte array. As the function is now, the encoding just introduces problems.
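For the byte array route, a minimal C# sketch might look like this. The names Utf8BytesDemo, SerializeToUtf8Bytes and emitBom are invented for the example; the point is that the UTF8Encoding constructor argument decides whether the BOM bytes (EF BB BF) are written at the start.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public static class Utf8BytesDemo
{
    // Serializes value to UTF-8 bytes; emitBom controls whether a BOM is prepended.
    public static byte[] SerializeToUtf8Bytes<T>(T value, bool emitBom)
    {
        var settings = new XmlWriterSettings
        {
            Encoding = new UTF8Encoding(emitBom)
        };

        using (var stream = new MemoryStream())
        {
            // Dispose the writer first so everything is flushed to the stream.
            using (var writer = XmlWriter.Create(stream, settings))
            {
                new XmlSerializer(typeof(T)).Serialize(writer, value);
            }
            return stream.ToArray();
        }
    }

    public static void Main()
    {
        // Sample data only; any XML-serializable type would do.
        byte[] withBom = SerializeToUtf8Bytes(new[] { 1, 2, 3 }, emitBom: true);
        byte[] withoutBom = SerializeToUtf8Bytes(new[] { 1, 2, 3 }, emitBom: false);

        Console.WriteLine(withBom[0] == 0xEF);          // True: starts with the BOM
        Console.WriteLine(withoutBom[0] == (byte)'<');  // True: starts with the XML declaration
    }
}
```

The caller never sees a string here, so there is no round trip through UTF-16 and no stray U+FEFF character to trip over.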

Update:

Based on the code in the comment below, I see two approaches:

Approach 1: Create a string with the serialized data and convert it to UTF-8 late

Public Shared Function SerializeObject(ByVal obj As Object) As String

    Dim serializer As New XmlSerializer(obj.GetType)

    Using strWriter As New IO.StringWriter()
        serializer.Serialize(strWriter, obj)
        Return strWriter.ToString
    End Using

End Function

....

Dim serialisedObject As String = SerializeObject(myObject)
Dim postData As Byte() = New Text.UTF8Encoding(True).GetBytes(serialisedObject)

If you need a different encoding, change the last line. If you want to omit the byte order mark, pass False to UTF8Encoding().

Approach 2: Create the properly encoded data in the first place and continue with a byte array

Public Shared Function SerializeObject(ByVal obj As Object, ByVal encoding As Text.Encoding) As Byte()

    Dim serializer As New XmlSerializer(obj.GetType)

    ' Fall back to UTF-16 when no encoding is supplied
    If encoding Is Nothing Then
        encoding = Text.Encoding.Unicode
    End If

    Using stream As New IO.MemoryStream, xtWriter As New Xml.XmlTextWriter(stream, encoding)
        serializer.Serialize(xtWriter, obj)
        Return stream.ToArray()
    End Using

End Function

....

Dim postData As Byte() = SerializeObject(myObject, Text.Encoding.UTF8)

In this case, the XmlTextWriter directly encodes the data with the correct encoding. And since we already have a byte array, the last step is shorter: we directly have the data to send to the client.

Setting StandAlone = Yes in .Net when serializing an object

If you want to do this, you'll need to use the WriteProcessingInstruction method and write the declaration out manually.

Using stream As New IO.MemoryStream, xtWriter As Xml.XmlWriter = Xml.XmlWriter.Create(stream, settings)
    xtWriter.WriteProcessingInstruction("xml", "version=""1.0"" encoding=""UTF-8"" standalone=""yes""")
    serializer.Serialize(xtWriter, obj)
    Return encoding.GetString(stream.ToArray())
End Using
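As a possible alternative (a different technique from the one above, not what the answer itself uses), XmlWriter also has a WriteStartDocument(bool standalone) overload that writes the declaration for you. The C# sketch below assumes a BOM-free UTF-8 writer; StandaloneDemo and SerializeStandalone are hypothetical names.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Serialization;

public static class StandaloneDemo
{
    // Serializes value with an XML declaration that carries the standalone attribute.
    public static string SerializeStandalone<T>(T value)
    {
        // Assumption: UTF-8 without a BOM, so the decoded string has no leading U+FEFF.
        var settings = new XmlWriterSettings { Encoding = new UTF8Encoding(false) };

        using (var stream = new MemoryStream())
        {
            using (var writer = XmlWriter.Create(stream, settings))
            {
                // Write the declaration ourselves before the serializer starts.
                writer.WriteStartDocument(true);
                new XmlSerializer(typeof(T)).Serialize(writer, value);
            }
            return Encoding.UTF8.GetString(stream.ToArray());
        }
    }

    public static void Main()
    {
        // Sample data only; any XML-serializable type would do.
        Console.WriteLine(SerializeStandalone(new[] { 1, 2, 3 }));
    }
}
```

This keeps the encoding name in the declaration tied to the writer settings instead of hard-coding it in a string.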

