Removing invalid XML characters from a string in Java
Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.
Here is the pattern for removing characters that are illegal in XML 1.0:
// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
+ "\u0009\r\n"
+ "\u0020-\uD7FF"
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]";
Most people will want the XML 1.0 version.
Here is the pattern for removing characters that are illegal in XML 1.1:
// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
+ "\u0001-\uD7FF"
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]+";
You will need to use String.replaceAll(...) and not String.replace(...), because only replaceAll treats its first argument as a regular expression.
String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(xml10pattern, "");
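The snippets above can be combined into one small helper. This is a minimal sketch, assuming you want the XML 1.0 rules and call the method often enough that precompiling the pattern is worthwhile; the class name Xml10Stripper and the method name strip are illustrative, not part of any library.

```java
import java.util.regex.Pattern;

class Xml10Stripper {
    // Matches any run of characters outside the XML 1.0 Char production.
    // The last range is U+10000 to U+10FFFF, written as surrogate pairs.
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^"
            + "\u0009\r\n"
            + "\u0020-\uD7FF"
            + "\uE000-\uFFFD"
            + "\ud800\udc00-\udbff\udfff"
            + "]+");

    // Returns the input with every character that is illegal in XML 1.0 removed.
    static String strip(String input) {
        return INVALID_XML10.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // NUL and vertical tab are both illegal in XML 1.0.
        String cleaned = strip("Hello,\u0000 World!\u000B");
        System.out.println(cleaned.equals("Hello, World!"));
    }
}
```

Compiling the pattern once avoids recompiling it on every call, which is what String.replaceAll does internally.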
Stripping Invalid XML characters in Java
I haven't used this personally but Atlassian made a command line XML cleaner that may suit your needs (it was made mainly for JIRA but XML is XML):
Download atlassian-xml-cleaner-0.1.jar
Open a DOS console or shell, and locate the XML or ZIP backup file on your computer, here assumed to be called data.xml
Run:
java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml
This will write a copy of data.xml to data-clean.xml, with invalid characters removed.
Detect non valid XML characters
1) It works both ways: \u0009 is a Java escape sequence, \\u0009 is a regex escape sequence.
2) A Java String is UTF-16 encoded, so U+10000 is encoded with two 16-bit chars, \ud800\udc00. See "Unicode Character Representations" in the Character API documentation.
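Both points can be checked with a few lines of code. The sketch below is only an illustration; the class and method names (EscapeDemo, viaJavaEscape, viaRegexEscape, toUtf16Escapes) are made up for this example.

```java
class EscapeDemo {
    // The compiler turns the escape into a literal tab before the regex engine sees it.
    static String viaJavaEscape(String s) {
        return s.replaceAll("\u0009", "");
    }

    // The regex engine receives backslash-u-0009 and interprets the escape itself.
    static String viaRegexEscape(String s) {
        return s.replaceAll("\\u0009", "");
    }

    // Formats the UTF-16 code units of a code point as lowercase hex escapes.
    static String toUtf16Escapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04x", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(viaJavaEscape("a\tb").equals(viaRegexEscape("a\tb")));
        System.out.println(toUtf16Escapes(0x10000));
        System.out.println(toUtf16Escapes(0x10FFFF));
    }
}
```

The two values printed by toUtf16Escapes are the surrogate pairs that bound the last range in the patterns above.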
How to replace invalid characters using java
Use this:
String replaced = your_original_string.replaceAll("\\x10", "");
- \xhh is the Java regex syntax for matching a single character by its two-digit hexadecimal code.
- Your error said Unicode: 0x10, so the character to remove is U+0010.
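If more than one control character turns up, the same \xhh syntax can be used in a character class. The following is a sketch under the assumption that you want to drop every control character below U+0020 that XML 1.0 forbids; ControlCharRemover and its method names are hypothetical.

```java
import java.util.regex.Pattern;

class ControlCharRemover {
    // Control characters that XML 1.0 forbids: everything below U+0020
    // except tab (0x09), line feed (0x0A) and carriage return (0x0D).
    private static final Pattern FORBIDDEN_CONTROLS =
            Pattern.compile("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]");

    // Removes only U+0010, as in the answer above.
    static String removeSingle(String s) {
        return s.replaceAll("\\x10", "");
    }

    // Removes every forbidden control character in one pass.
    static String removeAllControls(String s) {
        return FORBIDDEN_CONTROLS.matcher(s).replaceAll("");
    }
}
```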
Remove illegal xml characters from UTF-16LE encoded file
Certainly one problem is that readLine throws away the line ending.
You would need to do something like:
fileText += line + "\r\n";
Otherwise XML attributes, DTD entities, or something else could get glued together where at least a space was required. Also you do not want the text content to be altered when it contains a line break.
Performance (speed and memory) can be improved by using a StringBuilder:
StringBuilder fileText = new StringBuilder();
... fileText.append(line).append("\n");
... fileText.toString();
Then there might be a problem with the first character of the file, where a redundant BOM char is sometimes added:
line = line.replace("\uFEFF", "");
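Put together, the reading loop might look like the sketch below. It assumes the whole file fits in memory and that "\n" is an acceptable line ending for the result; LineJoiner and readAll are names invented for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

class LineJoiner {
    // Reads all lines, re-adds the line endings that readLine drops,
    // and strips a leading BOM from the first line only.
    static String readAll(Reader source) {
        StringBuilder fileText = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            boolean first = true;
            while ((line = reader.readLine()) != null) {
                if (first && line.startsWith("\uFEFF")) {
                    line = line.substring(1);
                }
                first = false;
                fileText.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return fileText.toString();
    }
}
```

For a UTF-16LE file, the reader could come from Files.newBufferedReader(path, StandardCharsets.UTF_16LE). Checking the BOM only on the first line avoids touching a U+FEFF that legitimately appears later in the text.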
Java DOM transforming and parsing arbitrary strings with invalid XML characters?
As @VGR and @kjhughes have pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now also have a further solution to my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used in the following way.
String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element element = document.createElement("element");
element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
document.appendChild(element);
TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
// creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text&lt;text&amp;text##</element>
document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
// prints true
escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):
/**
* Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
* DOM API already escapes predefined entities, such as {@code "}, {@code &},
* {@code '}, {@code <} and {@code >} for
* <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
* code points are ignored by this function. However, there are some other
* invalid XML Unicode code points, such as {@code '\u0000'}, which are even
* invalid in their escaped form, such as {@code "&#0;"}.
* <p>
* This function replaces all {@code '#'} by {@code "##"} and all Unicode code
* points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
* [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
* {@code "#c;"}, where <code>c</code> is the Unicode code point.
*
* @param string the <code>{@link String}</code> to be escaped
* @return the escaped <code>{@link String}</code>
* @see <code>{@link #unescapeInvalidXmlCharacters(String)}</code>
*/
public static String escapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);
        if (codePoint == '#') {
            stringBuilder.append("##");
        } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || codePoint >= 0x20 && codePoint <= 0xD7FF
                || codePoint >= 0xE000 && codePoint <= 0xFFFD
                || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
            stringBuilder.appendCodePoint(codePoint);
        } else {
            stringBuilder.append("#" + codePoint + ";");
        }
    }
    return stringBuilder.toString();
}
/**
* Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
* Makes <code>{@link #escapeInvalidXmlCharacters(String)}</code> undone.
*
* @param string the <code>{@link String}</code> to be unescaped
* @return the unescaped <code>{@link String}</code>
* @see <code>{@link #escapeInvalidXmlCharacters(String)}</code>
*/
public static String unescapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    boolean escaped = false;
    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);
        if (escaped) {
            stringBuilder.appendCodePoint(codePoint);
            escaped = false;
        } else if (codePoint == '#') {
            StringBuilder intBuilder = new StringBuilder();
            int j;
            for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(j);
                if (codePoint == ';') {
                    escaped = true;
                    break;
                }
                if (codePoint >= 48 && codePoint <= 57) {
                    intBuilder.appendCodePoint(codePoint);
                } else {
                    break;
                }
            }
            if (escaped) {
                try {
                    codePoint = Integer.parseInt(intBuilder.toString());
                    stringBuilder.appendCodePoint(codePoint);
                    escaped = false;
                    i = j;
                } catch (IllegalArgumentException e) {
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                codePoint = '#';
                escaped = true;
            }
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }
    return stringBuilder.toString();
}
Note that these functions are probably very inefficient and can be written in a better way. Feel free to post suggestions to improve the code in the comments.
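For comparison, the Base64 alternative mentioned at the start of this answer could look roughly like the sketch below. It assumes the text is encoded as UTF-8 before Base64 encoding; the class name Base64Text and its methods are made up for this example. One caveat of this assumption is that unpaired surrogates do not survive the UTF-8 step.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Text {
    // Encodes arbitrary text so that only XML-safe ASCII characters remain.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Reverses encode(String).
    static String decode(String encoded) {
        return new String(Base64.getDecoder().decode(encoded), StandardCharsets.UTF_8);
    }
}
```

The trade-off is that the stored text is no longer human-readable in the XML file, whereas the escaping functions above leave legal characters untouched.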
Filtering illegal XML characters in Java
It's not trivial to find out all the invalid chars for XML. You need to call or reimplement XMLChar.isInvalid() from Xerces:
http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm
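If pulling in Xerces is not an option, a hand-written check is short. The sketch below is not the Xerces implementation; it is a minimal reimplementation that assumes the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]), and the names XmlChars, isValidXml10 and filter are hypothetical.

```java
class XmlChars {
    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies only the valid code points; unpaired surrogates are dropped
    // because they fall in the excluded range 0xD800 to 0xDFFF.
    static String filter(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(XmlChars::isValidXml10)
             .forEach(out::appendCodePoint);
        return out.toString();
    }
}
```

Iterating over code points rather than chars keeps supplementary characters intact without needing surrogate ranges in a regex.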