String Escape into XML

What characters do I need to escape in XML documents?

If you use an appropriate class or library, it will do the escaping for you. Many XML issues are caused by building documents through string concatenation.
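A minimal sketch of that advice in Python, assuming the standard-library xml.etree.ElementTree module fits the task. The element name "note", the attribute "title" and the sample strings are made up for illustration. The raw strings are handed to the library unescaped, and parsing the output gives them back unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data containing all five special characters.
raw_title = 'Tom & "Jerry"'
raw_text = "1 < 2 & it's > 0"

# Hand the raw strings to the library; it escapes them on serialization.
elem = ET.Element("note", {"title": raw_title})
elem.text = raw_text
xml_text = ET.tostring(elem, encoding="unicode")
print(xml_text)

# Parsing the result gives back the original, unescaped strings.
parsed = ET.fromstring(xml_text)
assert parsed.get("title") == raw_title
assert parsed.text == raw_text
```

Building the same document with string concatenation would require remembering every one of these rules by hand.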

XML escape characters

There are only five:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

Escaping characters depends on where the special character is used.

The examples can be validated at the W3C Markup Validation Service.

Text

The safe way is to escape all five characters in text. However, the three characters ", ' and > needn't be escaped in text:

<?xml version="1.0"?>
<valid>"'></valid>

Attributes

The safe way is to escape all five characters in attributes. However, the > character needn't be escaped in attributes:

<?xml version="1.0"?>
<valid attribute=">"/>

The ' character needn't be escaped in attributes if the quotes are ":

<?xml version="1.0"?>
<valid attribute="'"/>

Likewise, the " needn't be escaped in attributes if the quotes are ':

<?xml version="1.0"?>
<valid attribute='"'/>
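The quoting rules above are what xml.sax.saxutils.quoteattr() in the Python standard library applies. The sketch below only illustrates that one helper; the sample values are invented. It returns the value already wrapped in quotes and picks the quote style that needs the least escaping.

```python
from xml.sax.saxutils import quoteattr

# quoteattr() returns the value including its surrounding quotes.
only_apostrophe = quoteattr("it's")       # double quotes, ' left as-is
only_double = quoteattr('say "hi"')       # single quotes, " left as-is
both = quoteattr('say "hi", it\'s')       # double quotes, inner " escaped
markup = quoteattr("a < b & c")           # < and & are always escaped

for value in (only_apostrophe, only_double, both, markup):
    print("<valid attribute=%s/>" % value)
```

Note that this helper also escapes >, which is the "safe way" described above rather than the minimum required.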

Comments

None of the five special characters should be escaped in comments, where escape sequences are not recognized:

<?xml version="1.0"?>
<valid>
<!-- "'<>& -->
</valid>

CDATA

None of the five special characters should be escaped in CDATA sections, where escape sequences are not recognized:

<?xml version="1.0"?>
<valid>
<![CDATA["'<>&]]>
</valid>
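A short sketch of why escaping is wrong in comments and CDATA sections, using only the Python standard library. The parser returns the content character for character, so an escape sequence written there stays an escape sequence. The sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# The five special characters inside CDATA come back untouched.
literal = ET.fromstring('<valid><![CDATA["\'<>&]]></valid>')
cdata_text = literal.text

# An escape sequence inside CDATA is NOT decoded: it stays five characters.
escaped = ET.fromstring("<valid><![CDATA[&amp;]]></valid>")
cdata_escaped_text = escaped.text

# The same holds for comments (minidom keeps comment nodes in the tree).
doc = minidom.parseString("<valid><!-- &amp; --></valid>")
comment_text = doc.documentElement.firstChild.data

print(repr(cdata_text), repr(cdata_escaped_text), repr(comment_text))
```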

Processing instructions

None of the five special characters should be escaped in XML processing instructions, where escape sequences are not recognized:

<?xml version="1.0"?>
<?process <"'&> ?>
<valid/>

XML vs. HTML

HTML has its own set of escape codes which cover a lot more characters.

How can I escape & in XML?

Use &amp; in place of &.

Change it to:

<string name="magazine">Newspaper &amp; Magazines</string>

String escape into XML

using System.Xml;

public static string XmlEscape(string unescaped)
{
    // Let XmlDocument do the escaping: set the raw text, read back the markup.
    XmlDocument doc = new XmlDocument();
    XmlNode node = doc.CreateElement("root");
    node.InnerText = unescaped;
    return node.InnerXml;
}

public static string XmlUnescape(string escaped)
{
    // Reverse direction: set the markup, read back the plain text.
    XmlDocument doc = new XmlDocument();
    XmlNode node = doc.CreateElement("root");
    node.InnerXml = escaped;
    return node.InnerText;
}
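For comparison, a similar escape/unescape pair can be sketched in Python with xml.sax.saxutils. This is not a port of the C# code above, only the same idea. The function names xml_escape and xml_unescape and the two mapping tables are made up here. By default escape() handles only &, < and >, so the quote characters are added through the optional entities argument.

```python
from xml.sax.saxutils import escape, unescape

# Extra mappings so that all five special characters are covered.
QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}


def xml_escape(unescaped: str) -> str:
    """Escape &, <, >, " and ' for use in XML text or attribute values."""
    return escape(unescaped, QUOTES)


def xml_unescape(escaped: str) -> str:
    """Reverse xml_escape(); numeric character references are not handled."""
    return unescape(escaped, UNQUOTES)


sample = 'Newspaper & "Magazines" <b>it\'s</b>'
encoded = xml_escape(sample)
print(encoded)
assert xml_unescape(encoded) == sample
```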

How to escape the and ≤ symbols in xml?

You will have to escape those symbols as follows:

  • < will be &lt;
  • > will be &gt;

The ≤ sign doesn't need escaping.
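A small sketch of how a serializer treats such a character, assuming Python's xml.etree.ElementTree. The element name "condition" and the sample text are invented. The < is escaped, while ≤ passes through untouched when the output encoding can represent it. It becomes a numeric character reference only when the output is restricted to ASCII.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("condition")
elem.text = "a < b and b \u2264 c"   # \u2264 is the ≤ sign

# As a Unicode string: only < is escaped, ≤ is written literally.
as_text = ET.tostring(elem, encoding="unicode")

# Default output is ASCII bytes, so ≤ becomes a numeric character reference.
as_ascii = ET.tostring(elem)

print(as_text)
print(as_ascii)
```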

Escaping strings for use in XML

Do you mean you do something like this:

from xml.dom.minidom import Text, Element

t = Text()
e = Element('p')

t.data = '<bar><a/><baz spam="eggs"> & blabla &entity;</>'
e.appendChild(t)

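Then you will get a nicely escaped XML string (depending on the Python version, the double quotes may be left as " rather than &quot;, which is equally valid in text):

>>> e.toxml()
'<p>&lt;bar&gt;&lt;a/&gt;&lt;baz spam=&quot;eggs&quot;&gt; &amp; blabla &amp;entity;&lt;/&gt;</p>'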

string escape into XML-Attribute

Modifying the solution you referenced, how about

public static string XmlEscape(string unescaped)
{
    XmlDocument doc = new XmlDocument();
    var node = doc.CreateAttribute("foo");
    node.InnerText = unescaped;
    return node.InnerXml;
}

All I did was change CreateElement() to CreateAttribute().
The attribute node type does have InnerText and InnerXml properties.

I don't have the environment to test this in, but I'd be curious to know if it works.

Update: Or more simply, use SecurityElement.Escape() as suggested in another answer to the question you linked to. This will escape quotation marks, so it's suitable for using for attribute text.

Update 2: Please note that carriage returns and line feeds do not need to be escaped in an attribute value for the XML to be well-formed. If you want them to be escaped for other reasons, you can do it using String.Replace(), e.g.

SecurityElement.Escape(unescaped).Replace("\r", "&#xD;").Replace("\n", "&#xA;");

or

return node.InnerXml.Replace("\r", "&#xD;").Replace("\n", "&#xA;");

Escape xml characters within nodes of string xml in java

You could use regular expression matching to find all the strings between angled brackets, and loop through/process each of those. In this example I've used the Apache Commons Lang to do the XML escaping.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang3.StringEscapeUtils;

public String sanitiseXml(String xml)
{
    // Match the pattern <something>text</something>
    Pattern xmlCleanerPattern = Pattern.compile("(<[^/<>]*>)([^<>]*)(</[^<>]*>)");

    StringBuilder xmlStringBuilder = new StringBuilder();

    Matcher matcher = xmlCleanerPattern.matcher(xml);
    int lastEnd = 0;
    while (matcher.find())
    {
        // Include any non-matching text between this result and the previous result
        if (matcher.start() > lastEnd) {
            xmlStringBuilder.append(xml.substring(lastEnd, matcher.start()));
        }
        lastEnd = matcher.end();

        // Sanitise the characters inside the tags and append the sanitised version
        String cleanText = StringEscapeUtils.escapeXml10(matcher.group(2));
        xmlStringBuilder.append(matcher.group(1)).append(cleanText).append(matcher.group(3));
    }
    // Include any leftover text after the last result
    xmlStringBuilder.append(xml.substring(lastEnd));

    return xmlStringBuilder.toString();
}

This looks for matches of <something>text</something>, captures the tag names and contained text, sanitises the contained text, and then puts it back together.

Escaping string to use in xml tags

As you anticipated, there is no existing way in Python to map from the characters allowed in a filename (for whatever OS) to characters allowed in an XML element name. To be able to do so reversibly would be additionally challenging.

As you also acknowledge, the XML design is unconventional and problematic, for reasons that only begin with the trouble you're currently having regarding allowed characters.

Recommendations, best first:

  1. Fix the problematic design, even if this means fixing upstream and downstream dependencies.

  2. Pre- and/or post-process to map filenames to legal XML element names.

  3. Design and implement the sort of reversible name mapping scheme you have in mind. The level of effort here, combined with the regrettable perpetuation of previous design mistakes, makes this approach unattractive.

See also

  • Allowed symbols in XML element name

Java escape XML token strings

There is no string escaping mechanism for the XML element tag. Some APIs will even reject the name for the new element when it doesn't match the specification for element names. There are at least two possible solutions to your problem:

  1. You can define your own escape mechanism which you use to encode and decode the element name. As an example you could use _ as the escape character. The sequence __ (two underscores) will be a literal _ and the sequence _XX or _uXXXX will be the ASCII/Unicode character you want to write.

  2. You save the column name in an attribute. This way you can save every value in it and even use the XML API of your choice to save the value with the proper encoding.
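The first option can be sketched roughly as follows, here in Python. This is only one possible reading of the scheme. The function names encode_name and decode_name are made up, the set of characters passed through unescaped is deliberately restricted to ASCII letters, digits, "-" and ".", and the _UXXXXXXXX form for characters beyond U+FFFF is an assumption added so that the mapping stays reversible.

```python
import re
import string

_LETTERS = set(string.ascii_letters)
_NAME_CHARS = _LETTERS | set(string.digits) | {"-", "."}
_ESCAPE = re.compile(r"_(_|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})")


def encode_name(raw: str) -> str:
    """Map an arbitrary non-empty string to a legal XML element name."""
    if not raw:
        raise ValueError("an element name cannot be empty")
    parts = []
    for index, ch in enumerate(raw):
        # The first character of a name may not be a digit, "-" or ".".
        allowed = _LETTERS if index == 0 else _NAME_CHARS
        if ch == "_":
            parts.append("__")                  # literal underscore
        elif ch in allowed:
            parts.append(ch)                    # already legal at this position
        elif ord(ch) <= 0xFFFF:
            parts.append("_u%04X" % ord(ch))    # _uXXXX, as in the answer
        else:
            parts.append("_U%08X" % ord(ch))    # assumed extension, see above
    return "".join(parts)


def decode_name(encoded: str) -> str:
    """Reverse encode_name()."""
    def expand(match):
        token = match.group(1)
        return "_" if token == "_" else chr(int(token[1:], 16))
    return _ESCAPE.sub(expand, encoded)


column = "unit price (€)"
element_name = encode_name(column)
print(element_name)
assert decode_name(element_name) == column
```

The second option, storing the column name in an attribute, avoids all of this because attribute values are escaped by the XML API itself.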

Escape double quote character in XML

Try this:

&quot;

