Removing Invalid Xml Characters from a String in Java

removing invalid XML characters from a string in java

Java's regex supports supplementary characters, so you can specify those high ranges with two UTF-16 encoded chars.

Here is the pattern for removing characters that are illegal in XML 1.0:

// XML 1.0
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml10pattern = "[^"
                    + "\u0009\r\n"
                    + "\u0020-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]";

Most people will want the XML 1.0 version.

Here is the pattern for removing characters that are illegal in XML 1.1:

// XML 1.1
// [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
String xml11pattern = "[^"
                    + "\u0001-\uD7FF"
                    + "\uE000-\uFFFD"
                    + "\ud800\udc00-\udbff\udfff"
                    + "]+";

You will need to use String.replaceAll(...) and not String.replace(...).

String illegal = "Hello, World!\0";
String legal = illegal.replaceAll(pattern, "");

Stripping Invalid XML characters in Java

I haven't used this personally but Atlassian made a command line XML cleaner that may suit your needs (it was made mainly for JIRA but XML is XML):

Download atlassian-xml-cleaner-0.1.jar

Open a DOS console or shell, and locate the XML or ZIP backup file on your computer, here assumed to be called data.xml

Run:
java -jar atlassian-xml-cleaner-0.1.jar data.xml > data-clean.xml

This will write a copy of data.xml to data-clean.xml, with invalid characters removed.

Detect non valid XML characters

1) it works both ways, \u0009 is java escape sequence, \\u0009 is regex escape sequence

2) Java String is UTF-16 encoded, U+10000 is encoded with 2 16-bit characters \ud800\udc00, see Character API Unicode Character Representations

How to replace invalid characters using java

Use this:

String replaced = your_original_string.replaceAll("\\x10", "");

The xdd... is the Java syntax to match a single unicode character
Your error said Unicode: 0x10

Remove illegal xml characters from UTF-16LE encoded file

Certainly one problem is that readLine throws away the line ending.

You would need to do something like:

       fileText += line + "\r\n";

Otherwise XML attributes, DTD entities, or something else could get glued together where at least a space was required. Also you do not want the text content to be altered when it contains a line break.

Performance (speed and memory) can be improved using a

StringBuilder fileText = new StringBuilder();
... fileText.append(line).append("\n");
... fileText.toString();

Then there might be a problem with the first character of the file, which
sometimes redundantly is added: a BOM char.

line = line.replace("\uFEFF", "");

Java DOM transforming and parsing arbitrary strings with invalid XML characters?

As @VGR and @kjhughes have pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I do now have a further solution for my problem, which is based on escaping. I have written 2 functions escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string) which can be used in the following way.

    String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";
    Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element element = document.createElement("element");
    element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
    document.appendChild(element);
    TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
    // creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text<text&text##</element>
    document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
    System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
    // prints true

escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):

/**
 * Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
 * DOM API already escapes predefined entities, such as {@code "}, {@code &},
 * {@code '}, {@code <} and {@code >} for
 * <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
 * code points are ignored by this function. However, there are some other
 * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
 * invalid in their escaped form, such as {@code "�"}.
 * <p>
 * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
 * points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
 * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
 * {@code "#c;"}, where <code>c</code> is the Unicode code point.
 * 
 * @param string the <code>{@link String}</code> to be escaped
 * @return the escaped <code>{@link String}</code>
 * @see <code>{@link #unescapeInvalidXmlCharacters(String)}</code>
 */
public static String escapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (codePoint == '#') {
            stringBuilder.append("##");
        } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD || codePoint >= 0x20 && codePoint <= 0xD7FF || codePoint >= 0xE000 && codePoint <= 0xFFFD || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
            stringBuilder.appendCodePoint(codePoint);
        } else {
            stringBuilder.append("#" + codePoint + ";");
        }
    }

    return stringBuilder.toString();
}

/**
 * Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
 * Makes <code>{@link #escapeInvalidXmlCharacters(String)}</code> undone.
 * 
 * @param string the <code>{@link String}</code> to be unescaped
 * @return the unescaped <code>{@link String}</code>
 * @see <code>{@link #escapeInvalidXmlCharacters(String)}</code>
 */
public static String unescapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    boolean escaped = false;

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (escaped) {
            stringBuilder.appendCodePoint(codePoint);
            escaped = false;
        } else if (codePoint == '#') {
            StringBuilder intBuilder = new StringBuilder();
            int j;

            for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(j);

                if (codePoint == ';') {
                    escaped = true;
                    break;
                }

                if (codePoint >= 48 && codePoint <= 57) {
                    intBuilder.appendCodePoint(codePoint);
                } else {
                    break;
                }
            }

            if (escaped) {
                try {
                    codePoint = Integer.parseInt(intBuilder.toString());
                    stringBuilder.appendCodePoint(codePoint);
                    escaped = false;
                    i = j;
                } catch (IllegalArgumentException e) {
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                codePoint = '#';
                escaped = true;
            }
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }

    return stringBuilder.toString();
}

Note that these functions are probably very inefficient and can be written in a better way. Feel free to post suggestions to improve the code in the comments.

Filtering illegal XML characters in Java

It's not trivial to find out all the invalid chars for XML. You need to call or reimplement the XMLChar.isInvalid() from Xerces,

http://kickjava.com/src/org/apache/xerces/util/XMLChar.java.htm

Removing Invalid Xml Characters from a String in Java