Converting escaped XML entities back into UTF-8

Well, since it's XML encoded I'd go for an XML parser:

require 'nokogiri'

frag = 'Horrible place. &#x2620;&#x2620;&#x2620;'
doc = Nokogiri::XML.fragment(frag)
puts doc.text
# >> Horrible place. ☠☠☠
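The same idea can be sketched in Python with the standard library's xml.etree.ElementTree. This is only a rough equivalent, not part of the original answer: the function name decode_xml_text and the throwaway wrapper element r are made up for illustration, and it assumes the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

def decode_xml_text(fragment):
    # Wrap the fragment in a throwaway root element (hypothetical name "r")
    # so the parser sees a complete document and resolves the references.
    root = ET.fromstring("<r>" + fragment + "</r>")
    # itertext() collects text from the root and from any nested elements.
    return "".join(root.itertext())

if __name__ == "__main__":
    print(decode_xml_text("Horrible place. &#x2620;&#x2620;&#x2620;"))
```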

Convert escaped Unicode character back to actual character

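Try StringEscapeUtils.unescapeJava from Apache Commons Lang:

str = org.apache.commons.lang3.StringEscapeUtils.unescapeJava(str);

(As of Commons Lang 3.6 this class is deprecated in favor of the StringEscapeUtils in Apache Commons Text.)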
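For readers without Commons Lang at hand, here is a rough Python sketch of the core of that conversion. It is not what unescapeJava does internally and covers only \uXXXX escapes (not \n, \t, octal escapes, or an escaped backslash before a u); the function name unescape_unicode is made up for illustration.

```python
import re

_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")

def unescape_unicode(text):
    # Replace each \uXXXX escape with the UTF-16 code unit it names.
    units = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Characters above U+FFFF are written as two escapes (a surrogate pair);
    # a UTF-16 round trip joins each pair into a single code point.
    # "surrogatepass" lets unpaired surrogates through instead of raising.
    raw = units.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")

if __name__ == "__main__":
    print(unescape_unicode("caf\\u00e9") == "caf\u00e9")
    print(len(unescape_unicode("\\ud83d\\ude00")))
```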

How can I convert a Java string to xml entities for versions of Unicode beyond 3.0?

Either you are not using correct terminology, or there is a great deal of confusion here.

The &#x character reference notation just specifies a numeric codepoint; it is independent of the version of Unicode used by any reader or parser.

Your code is actually only compatible with Unicode 1.x, because it assumes a character's numeric value is less than 2^16 (65,536). As of Unicode 2.0 that is not a correct assumption. Some characters are represented by a single Java char, while other characters are represented by two Java chars (known as surrogates).
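To see the one-char versus two-char split concretely, here is a small Python sketch that lists the UTF-16 code units of a character, which is what a Java char holds. The helper name utf16_units is made up for illustration.

```python
def utf16_units(ch):
    # Encode as big-endian UTF-16 and split the bytes into 16-bit code units.
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

if __name__ == "__main__":
    # U+0041 fits in one code unit.
    print([hex(u) for u in utf16_units("A")])           # ['0x41']
    # U+1F600 is above 0xFFFF and needs a surrogate pair.
    print([hex(u) for u in utf16_units("\U0001F600")])  # ['0xd83d', '0xde00']
```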

I'm not sure what a "UTF-8 Reader" is. A Reader just reads char values, and does not know about UTF-8 or any other charset, except for InputStreamReader, which uses a CharsetDecoder to translate bytes to chars using the UTF-8 encoding (or whatever encoding a particular CharsetDecoder uses).

In any event, no Reader will parse the XML &#x character reference notation. You must use an XML parser for that.

No Reader or XML parser is affected by the Unicode version known to Java, because no Reader or XML parser consults a Unicode database in any way. The characters are just treated as numeric values as they are parsed. Whether they correspond to assigned codepoints in any Unicode version is never considered.
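A quick way to check this is to feed a parser a reference to a code point that has no assigned character. The sketch below uses Python's xml.etree.ElementTree rather than a Java parser, and assumes U+0378 is still unassigned in the Unicode version your Python ships with; either way the parser hands back the numeric value unchanged.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The parser resolves the reference purely by its numeric value.
root = ET.fromstring("<r>&#x378;</r>")
decoded = root.text

if __name__ == "__main__":
    print(len(decoded), hex(ord(decoded)))
    # Prints a placeholder if the Unicode database has no name for it.
    print(unicodedata.name(decoded, "<no name>"))
```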

Finally, to write out a String as XML, you can use a Formatter:

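import java.util.Formatter;

static String toXML(String s) {
    Formatter formatter = new Formatter();
    int len = s.length();
    // Advance by code point, so a surrogate pair is handled as one character.
    for (int i = 0; i < len; i = s.offsetByCodePoints(i, 1)) {
        int c = s.codePointAt(i);
        if (c < 32 || c > 126 || c == '&' || c == '<' || c == '>') {
            // Control, non-ASCII, and markup characters become hex references.
            formatter.format("&#x%x;", c);
        } else {
            formatter.format("%c", c);
        }
    }
    return formatter.toString();
}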

As you can see, there is no code that depends on the Unicode version, because the characters are just numeric values. Whether each numeric value is an assigned Unicode codepoint is not relevant.
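For comparison, the same loop can be sketched in Python, where strings are already sequences of code points and no surrogate handling is needed. The name to_xml is made up for illustration; like the Java version, it does not escape quotes, so the result is meant for text content rather than attribute values.

```python
def to_xml(s):
    parts = []
    for ch in s:  # iterating a str yields whole code points
        cp = ord(ch)
        if cp < 32 or cp > 126 or ch in "&<>":
            # Same rule as the Java version: emit a hex character reference.
            parts.append("&#x%x;" % cp)
        else:
            parts.append(ch)
    return "".join(parts)

if __name__ == "__main__":
    print(to_xml("a < b & \u00e9 \U0001F600"))
```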

(My first inclination was to use the XMLStreamWriter class, but it turns out an XMLStreamWriter that uses a non-Unicode encoding such as ISO-8859-1 or US-ASCII does not properly output surrogate pairs as single character entities, as of Java 1.8.0_05.)

What characters do I need to escape in XML documents?

If you use an appropriate class or library, they will do the escaping for you. Many XML issues are caused by string concatenation.
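As an illustration of letting a library do the work, here is a minimal Python sketch using xml.etree.ElementTree. The element name note, the attribute author, and the sample values are made up; the point is only that no manual escaping appears in the code.

```python
import xml.etree.ElementTree as ET

# Build the element from raw values; the serializer escapes what it must.
elem = ET.Element("note", {"author": 'A & B "Co"'})
elem.text = "x < y && y > z"
xml_text = ET.tostring(elem, encoding="unicode")

# Parsing the result gives back the original, unescaped values.
parsed = ET.fromstring(xml_text)

if __name__ == "__main__":
    print(xml_text)
    print(parsed.text == elem.text, parsed.get("author") == elem.get("author"))
```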

XML escape characters

There are only five:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;
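If you do need to escape by hand, Python's xml.sax.saxutils covers these five. By default escape() handles only &, < and >, so the sketch below passes the two quote entities explicitly. The helper names escape_all_five and unescape_all_five are made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

# escape() always handles &, < and >; the quotes are supplied as extras.
QUOTES = {'"': "&quot;", "'": "&apos;"}
REVERSE = {entity: char for char, entity in QUOTES.items()}

def escape_all_five(text):
    return escape(text, QUOTES)

def unescape_all_five(text):
    return unescape(text, REVERSE)

if __name__ == "__main__":
    escaped = escape_all_five("<a href=\"x\">Tom & Jerry's</a>")
    print(escaped)
    print(unescape_all_five(escaped))
```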

Escaping characters depends on where the special character is used.

The examples can be validated at the W3C Markup Validation Service.

Text

The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text:

<?xml version="1.0"?>
<valid>"'></valid>

Attributes

The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:

<?xml version="1.0"?>
<valid attribute=">"/>

The ' character needn't be escaped in attributes if the quotes are ":

<?xml version="1.0"?>
<valid attribute="'"/>

Likewise, the " needn't be escaped in attributes if the quotes are ':

<?xml version="1.0"?>
<valid attribute='"'/>
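Python's xml.sax.saxutils.quoteattr applies the same quote-selection rule: it picks whichever quote style avoids escaping, and falls back to &quot; only when the value contains both kinds. A short sketch, with made-up sample values:

```python
from xml.sax.saxutils import quoteattr

# quoteattr returns the value already wrapped in suitable quotes.
samples = ["it's", 'say "hi"', "both ' and \""]
quoted = [quoteattr(value) for value in samples]

if __name__ == "__main__":
    for q in quoted:
        print("<valid attribute=%s/>" % q)
```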

Comments

None of the five special characters should be escaped in comments, because entity references are not expanded there:

<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>

CDATA

None of the five special characters should be escaped in CDATA sections, because entity references are not expanded there:

<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>
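The CDATA behavior is easy to confirm with a parser. This Python sketch uses xml.etree.ElementTree and shows that the raw characters come through as-is, while an entity written inside CDATA stays as literal text instead of being decoded.

```python
import xml.etree.ElementTree as ET

# Raw special characters inside CDATA are passed through unchanged.
raw = ET.fromstring("<valid><![CDATA[\"'<>&]]></valid>").text

# An entity reference inside CDATA is NOT expanded; it stays literal.
literal = ET.fromstring("<valid><![CDATA[&lt;]]></valid>").text

if __name__ == "__main__":
    print(raw)
    print(literal)
```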

Processing instructions

None of the five special characters should be escaped in XML processing instructions:

<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>

XML vs. HTML

HTML has its own set of escape codes which cover a lot more characters.

XML escape special characters

The file that you are creating is not being saved as UTF-8; it's probably ASCII. You can check this by opening it in Notepad or any other text editor that can save files in UTF-8 encoding. In Notepad's "Save as..." dialog there is a drop-down box for the encoding, and its default value is the encoding the file is already in.

You do not need to escape the yen character at all. If the file is converted to UTF-8, Firefox or any XML interpreter should have no issue with it.
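The effect of a mismatched encoding can be reproduced in a few lines. This Python sketch (the price element is made up, and it uses ElementTree rather than the .NET classes from the question) parses the same document twice: once as real UTF-8 bytes, and once as Latin-1 bytes that still claim to be UTF-8.

```python
import xml.etree.ElementTree as ET

doc = '<?xml version="1.0" encoding="UTF-8"?><price>\u00a5100</price>'

# Bytes that match the declared encoding parse fine, yen sign included.
utf8_ok = ET.fromstring(doc.encode("utf-8")).text == "\u00a5100"

# The same text saved as Latin-1 is not valid UTF-8, so the parser rejects it.
try:
    ET.fromstring(doc.encode("latin-1"))
    mismatch_detected = False
except ET.ParseError:
    mismatch_detected = True

if __name__ == "__main__":
    print(utf8_ok, mismatch_detected)
```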

Your error messages lead me to believe that the yen character is a red herring.

expansion character (code 0xb) not a valid XML character

This is a vertical tab character in UTF-8. It sounds like there is some corruption in an encoding conversion. I'm not sure what encoding your SolrRecordCollection object is returning, but I'm guessing it's UTF-8. If you can, find out what encoding the XmlDocument method is returning.
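If stray control characters like this do end up in the data, one workaround is to strip everything outside the XML 1.0 Char range before building the document. The Python sketch below does that; the function name strip_invalid_xml_chars is made up, and removing characters silently is an assumption that may not suit every data set. Fixing the encoding problem at its source is still the better cure.

```python
import re

# Complement of the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text):
    # Drop every character an XML 1.0 parser would reject.
    return _INVALID_XML_CHARS.sub("", text)

if __name__ == "__main__":
    print(strip_invalid_xml_chars("a\x0bb"))  # ab
```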

The WebClient.UploadString Method does an encoding conversion:

Before uploading the string, this method converts it to a Byte array
using the encoding specified in the Encoding property.

So I'm guessing that it takes a UTF-8 string, interprets it as a standard .NET UTF-16 string, and then converts this misinterpreted data to UTF-8. If you convert your XML string variable to UTF-16 before sending it to the method, that might fix your problem. Here's a question that answers how to do that:

How do you convert an xml string with UTF-8 encoding UTF-16?

FYI, this article is an easy read that helps in understanding text encodings:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Best way to encode text data for XML in Java?

Very simply: use an XML library. That way it will actually be right instead of requiring detailed knowledge of bits of the XML spec.

Convert XML/HTML Entities into Unicode String in Python

The standard lib’s very own HTMLParser has an undocumented function unescape() which does exactly what you think it does:

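Python 2 (in Python 3 releases before 3.4, import HTMLParser from html.parser instead; the method was removed in Python 3.9):

import HTMLParser
h = HTMLParser.HTMLParser()
h.unescape('&copy; 2010') # u'\xa9 2010'
h.unescape('&#169; 2010') # u'\xa9 2010'

Python 3.4+:

import html
html.unescape('&copy; 2010') # '\xa9 2010'
html.unescape('&#169; 2010') # '\xa9 2010'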

