How to Parse Invalid (Bad/Not Well-Formed) XML

How to parse invalid (bad / not well-formed) XML?

That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.

An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
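The point that conformant tools reject such text can be checked mechanically. Below is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree, which is backed by a non-validating parser and therefore tests well-formedness only, not validity against a schema. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if a conformant XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # Any well-formedness violation is a fatal error for the parser.
        return False
    return True


print(is_well_formed("<a><b/></a>"))    # properly nested
print(is_well_formed("<a><b></a>"))     # <b> is never closed
print(is_well_formed("<a>AT&T</a>"))    # bare ampersand
```

Only the first input is accepted; the other two fail before any tree is produced, which is why the options below focus on repairing the text first.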

Options, most desirable first:

1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
- Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):
```
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
```
- Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
- Python: Beautiful Soup is Python-based. See the notes in the Differences between parsers section of its documentation. Other answers on dealing with not-well-formed markup in Python
  especially recommend lxml's recover=True option.
  codecs.EncodedFile() can be used to clean up illegal characters.
- Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
- .NET:
  - XmlReaderSettings.CheckCharacters can
    be disabled to get past illegal XML character problems.
  - @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to
    ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
  - @jdweng also reports that XmlReader.ReadToFollowing() can sometimes
    be used to work around XML syntactical issues, but note the
    rule-breaking warning in #3 below.
  - Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
- Go: Set Decoder.Strict to false (example credited to @chuckx).
- PHP: See DOMDocument::$recover and libxml_use_internal_errors(true).
- Ruby: Nokogiri supports “Gentle Well-Formedness”.
- R: See htmlTreeParse() for fault-tolerant markup parsing in R.
- Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
3. Process the data as text, manually using a text editor or
programmatically using character/string functions. Doing this
programmatically can range from tricky to impossible, as
what appears to be predictable often is not:
rule breaking is rarely bound by rules.
- For invalid character errors, use regex to remove/replace invalid characters:
  - PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
  - Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
  - JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
- For ampersands, use regex to replace matches with &amp; (credit: blhsin):
```
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
```

Note that the above regular expressions won't take comments or CDATA
sections into account.
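The character and ampersand regular expressions above can be combined into one text-level repair pass. This is a rough sketch in Python, assuming only the standard-library re and xml.etree.ElementTree modules; repair_xml_text is a hypothetical helper name. The same caveat applies: comments and CDATA sections are not considered, and references to undefined entities (for example &foo;) are left untouched and will still fail to parse.

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 Char production.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# An ampersand that does not start a character or entity reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:#\d+|#x[0-9a-f]+|\w+);)", re.IGNORECASE)


def repair_xml_text(text: str) -> str:
    """Replace illegal characters with a space, then escape bare ampersands."""
    text = INVALID_XML_CHARS.sub(" ", text)
    return BARE_AMPERSAND.sub("&amp;", text)


broken = "<item>AT&T\x00phones &amp; more</item>"
repaired = repair_xml_text(broken)
print(repaired)                       # <item>AT&amp;T phones &amp; more</item>
print(ET.fromstring(repaired).text)   # AT&T phones & more
```

Replacing illegal characters with a space rather than deleting them is a design choice made here to keep adjacent words separate; deleting them, as the JavaScript variant does, is equally plausible.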

Parsing malformed/incomplete/invalid XML files

You could try using JSoup to parse the invalid XML. By definition, XML must be well-formed; text that is not well-formed is not XML, and conformant XML parsers will reject it.

UPDATE - example:

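// jsoup imports needed by this snippet (wrap the two methods in a class to compile)
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

public static void main(String[] args) {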
    for (Node node : Parser.parseFragment("<test><author name=\"Vlad\"><book name=\"SO\"/>" ,
            new Element(Tag.valueOf("p"), ""),
            "")) {
        print(node, 0);
    }
}

public static void print(Node node, int offset) {
    for (int i = 0; i < offset; i++) {
        System.out.print(" ");
    }
    System.out.print(node.nodeName());
    for (Attribute attribute: node.attributes()) {
        System.out.print(", ");
        System.out.print(attribute.getKey() + "=" + attribute.getValue());
    }
    System.out.println();
    for (Node child : node.childNodes()) {
        print(child, offset + 4);
    }
}

Escaping bad XML while parsing

If you can, try using lxml.html. You should be careful though; it ignores namespaces so you need to be sure you're selecting what you intend to select.

Example...

sitemap_products_1.xml (Shortened version of the one you linked to. Notice the second url has a bad loc value.)

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
 xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
 <url>
  <loc>https://www.samsclub.com/sams/mirror-convex/prod13760282.ip</loc>
  <image:image>
   <image:title>See All 160 Degree Convex Security Mirror - 24" w x 15" h</image:title>
   <image:loc>https://scene7.samsclub.com/is/image/samsclub/0003308171524_A</image:loc>
  </image:image>
 </url>
 <url>
  <loc>https://www.samsclub.com/sams/at&t-3-handset-cordless-phone/prod21064454.ip</loc>
  <image:image>
   <image:title>AT&T 3 Handset Cordless Phone</image:title>
   <image:loc>https://scene7.samsclub.com/is/image/samsclub/0065053003067_A</image:loc>
  </image:image>
 </url>
 <url>
  <loc>https://www.samsclub.com/sams/premium-free-flow-waterbed-mattress-kit-queen/104864.ip</loc>
  <image:image>
   <image:title>Premium Free Flow Waterbed Mattress Kit- Queen</image:title>
   <image:loc>https://scene7.samsclub.com/is/image/samsclub/0040649555859_A</image:loc>
  </image:image>
 </url>
</urlset>

Python 3.x

from lxml import html

tree = html.parse("sitemap_products_1.xml")

for elem in tree.findall(".//url/loc"):
    print(elem.text)

Output (Notice the second url is printed in its entirety.)

https://www.samsclub.com/sams/mirror-convex/prod13760282.ip
https://www.samsclub.com/sams/at&t-3-handset-cordless-phone/prod21064454.ip
https://www.samsclub.com/sams/premium-free-flow-waterbed-mattress-kit-queen/104864.ip
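lxml is a third-party package. As a dependency-free illustration of the same idea (let a tolerant HTML parser read the broken file), here is a sketch that swaps in Python's standard-library html.parser. It is not the lxml.html API, and LocCollector and extract_locs are hypothetical names. It also makes the namespace caveat visible: tag names arrive as plain strings such as "image:loc", with no namespace resolution. HTML entity decoding is switched on, so text that happens to look like an HTML entity would be decoded.

```python
from html.parser import HTMLParser


class LocCollector(HTMLParser):
    """Collects the text of <loc> elements, comparing tag names as plain strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.locs = []
        self._parts = None  # text chunks while inside <loc>, otherwise None

    def handle_starttag(self, tag, attrs):
        # "image:loc" is a different literal name, so it is not collected.
        if tag == "loc":
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "loc" and self._parts is not None:
            self.locs.append("".join(self._parts).strip())
            self._parts = None


def extract_locs(markup: str) -> list:
    parser = LocCollector()
    parser.feed(markup)
    parser.close()
    return parser.locs


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
 xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
 <url>
  <loc>https://www.samsclub.com/sams/at&t-3-handset-cordless-phone/prod21064454.ip</loc>
  <image:image>
   <image:loc>https://scene7.samsclub.com/is/image/samsclub/0065053003067_A</image:loc>
  </image:image>
 </url>
</urlset>"""

for loc in extract_locs(SAMPLE):
    print(loc)
```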

Java DOM transforming and parsing arbitrary strings with invalid XML characters?

As @VGR and @kjhughes have pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I do now have a further solution for my problem, which is based on escaping. I have written 2 functions escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string) which can be used in the following way.

    String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";
    Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    Element element = document.createElement("element");
    element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
    document.appendChild(element);
    TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
    // creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text&lt;text&amp;text##</element>
    document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
    System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
    // prints true

escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):

/**
 * Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
 * DOM API already escapes predefined entities, such as {@code "}, {@code &},
 * {@code '}, {@code <} and {@code >} for
 * <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
 * code points are ignored by this function. However, there are some other
 * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
 * invalid in their escaped form, such as {@code "&#0;"}.
 * <p>
 * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
 * points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
 * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
 * {@code "#c;"}, where <code>c</code> is the Unicode code point.
 * 
 * @param string the <code>{@link String}</code> to be escaped
 * @return the escaped <code>{@link String}</code>
 * @see <code>{@link #unescapeInvalidXmlCharacters(String)}</code>
 */
public static String escapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (codePoint == '#') {
            stringBuilder.append("##");
        } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD || codePoint >= 0x20 && codePoint <= 0xD7FF || codePoint >= 0xE000 && codePoint <= 0xFFFD || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
            stringBuilder.appendCodePoint(codePoint);
        } else {
            stringBuilder.append("#" + codePoint + ";");
        }
    }

    return stringBuilder.toString();
}

/**
 * Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
 * Reverses <code>{@link #escapeInvalidXmlCharacters(String)}</code>.
 * 
 * @param string the <code>{@link String}</code> to be unescaped
 * @return the unescaped <code>{@link String}</code>
 * @see <code>{@link #escapeInvalidXmlCharacters(String)}</code>
 */
public static String unescapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    boolean escaped = false;

    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);

        if (escaped) {
            stringBuilder.appendCodePoint(codePoint);
            escaped = false;
        } else if (codePoint == '#') {
            StringBuilder intBuilder = new StringBuilder();
            int j;

            for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(j);

                if (codePoint == ';') {
                    escaped = true;
                    break;
                }

                if (codePoint >= 48 && codePoint <= 57) {
                    intBuilder.appendCodePoint(codePoint);
                } else {
                    break;
                }
            }

            if (escaped) {
                try {
                    codePoint = Integer.parseInt(intBuilder.toString());
                    stringBuilder.appendCodePoint(codePoint);
                    escaped = false;
                    i = j;
                } catch (IllegalArgumentException e) {
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                codePoint = '#';
                escaped = true;
            }
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }

    return stringBuilder.toString();
}

Note that these functions are probably very inefficient and can be written in a better way. Feel free to post suggestions to improve the code in the comments.
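For comparison, the same escaping scheme ('#' becomes "##", and each code point outside the XML Char ranges becomes "#c;" with c in decimal) can be sketched more compactly in Python. This is an independent sketch of the scheme described in the Javadoc, not a line-by-line port of the Java code; the function names are hypothetical, and the unescape step assumes its input was produced by the escape step.

```python
import re
import xml.etree.ElementTree as ET


def _is_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD) or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD or 0x10000 <= cp <= 0x10FFFF)


def escape_invalid_xml_chars(s: str) -> str:
    out = []
    for ch in s:
        if ch == "#":
            out.append("##")              # literal '#' is doubled
        elif _is_xml_char(ord(ch)):
            out.append(ch)
        else:
            out.append(f"#{ord(ch)};")    # e.g. U+0000 becomes "#0;"
    return "".join(out)


# In escaped text every '#' starts either "##" or "#<digits>;",
# so a left-to-right scan is unambiguous.
_TOKEN = re.compile(r"##|#([0-9]+);")


def unescape_invalid_xml_chars(s: str) -> str:
    def replace(match):
        digits = match.group(1)
        return "#" if digits is None else chr(int(digits))
    return _TOKEN.sub(replace, s)


original = "text#text##text#0;text" + "\u0000" + "text<text&text#"
element = ET.Element("element")
element.text = escape_invalid_xml_chars(original)
xml_text = ET.tostring(element, encoding="unicode")  # '<' and '&' are escaped here
restored = unescape_invalid_xml_chars(ET.fromstring(xml_text).text)
print(restored == original)
```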
