How to parse invalid (bad / not well-formed) XML?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
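To make "not well-formed" concrete, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. The choice of library, the helper name is_well_formed, and the sample strings are assumptions for illustration only. It shows a conformant parser rejecting the text outright rather than guessing at what was meant:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if a conformant XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative samples: the second one contains a bare ampersand.
good = "<phone><name>AT&amp;T 3 Handset</name></phone>"
bad = "<phone><name>AT&T 3 Handset</name></phone>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

A check like this only says whether the text is XML; it does nothing to repair it.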
Options, most desirable first:
1. Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase "well-formed XML" is redundant but may be useful for emphasis.)
2. Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
- Standalone: xmlstarlet has robust recovery and repair capabilities (credit: RomanPerekhrest):
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
- Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
- Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
- Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
- .NET:
  - XmlReaderSettings.CheckCharacters can be disabled to get past illegal XML character problems.
  - @jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML Well-Formed Parsed Entities lacking a root element.
  - @jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
  - Microsoft.Language.Xml.XMLParser is said to be "error-tolerant".
- Go: Set Decoder.Strict to false as shown in this example by @chuckx.
- PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
- Ruby: Nokogiri supports "Gentle Well-Formedness".
- R: See htmlTreeParse() for fault-tolerant markup parsing in R.
- Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
3. Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible as what appears to be predictable often is not -- rule breaking is rarely bound by rules.
- For invalid character errors, use regex to remove/replace invalid characters:
  - PHP:
    preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
  - Ruby:
    string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
  - JavaScript:
    inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
- For ampersands, use regex to replace matches with &amp; (credit: blhsin, demo):
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA sections into account.
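The text-level clean-up described above can be sketched in Python with only the standard library. The function name repair and the sample string are hypothetical. The character class follows the XML 1.0 Char production, and the ampersand pattern is the one quoted above, extended to accept upper-case hex digits. Treat it as a rough sketch under those assumptions: it ignores comments and CDATA sections, and it leaves references to undefined entities such as &foo; untouched.

```python
import re
import xml.etree.ElementTree as ET

# Any code point outside the XML 1.0 Char production is illegal in a document.
INVALID_XML_CHARS = re.compile(
    r"[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

# An ampersand that does not start a numeric or named reference.
BARE_AMPERSAND = re.compile(r"&(?!(?:#\d+|#x[0-9a-fA-F]+|\w+);)")


def repair(text):
    """Replace illegal characters with a space and escape bare ampersands."""
    text = INVALID_XML_CHARS.sub(" ", text)
    return BARE_AMPERSAND.sub("&amp;", text)


# Illustrative input: a bare '&' and a vertical tab (U+000B), both illegal.
broken = "<item>AT&T\x0b phone &lt; &#169; 2020</item>"
fixed = repair(broken)

print(fixed)
print(ET.fromstring(fixed).text)
```

Replacing illegal characters with a space rather than deleting them mirrors the PHP and Ruby one-liners above; whether to drop or replace them depends on the data.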
Parsing malformed/incomplete/invalid XML files
You could try to use JSoup to parse the invalid XML. By definition XML should be well-formed, otherwise it's invalid and should not be used.
UPDATE - example:
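import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

// The class name is arbitrary; it only makes the snippet compile.
public class JsoupFragmentDemo {
    public static void main(String[] args) {
        // The fragment is deliberately incomplete: <test> and <author> are never closed.
        for (Node node : Parser.parseFragment("<test><author name=\"Vlad\"><book name=\"SO\"/>",
                new Element(Tag.valueOf("p"), ""),
                "")) {
            print(node, 0);
        }
    }

    // Prints the node name and its attributes, then recurses into the children.
    public static void print(Node node, int offset) {
        for (int i = 0; i < offset; i++) {
            System.out.print(" ");
        }
        System.out.print(node.nodeName());
        for (Attribute attribute : node.attributes()) {
            System.out.print(", ");
            System.out.print(attribute.getKey() + "=" + attribute.getValue());
        }
        System.out.println();
        for (Node child : node.childNodes()) {
            print(child, offset + 4);
        }
    }
}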
Escaping bad XML while parsing
If you can, try using lxml.html. You should be careful though; it ignores namespaces, so you need to be sure you're selecting what you intend to select.
Example...
sitemap_products_1.xml (Shortened version of the one you linked to. Notice the second url has a bad loc value.)
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://www.samsclub.com/sams/mirror-convex/prod13760282.ip</loc>
    <image:image>
      <image:title>See All 160 Degree Convex Security Mirror - 24" w x 15" h</image:title>
      <image:loc>https://scene7.samsclub.com/is/image/samsclub/0003308171524_A</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://www.samsclub.com/sams/at&t-3-handset-cordless-phone/prod21064454.ip</loc>
    <image:image>
      <image:title>AT&T 3 Handset Cordless Phone</image:title>
      <image:loc>https://scene7.samsclub.com/is/image/samsclub/0065053003067_A</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://www.samsclub.com/sams/premium-free-flow-waterbed-mattress-kit-queen/104864.ip</loc>
    <image:image>
      <image:title>Premium Free Flow Waterbed Mattress Kit- Queen</image:title>
      <image:loc>https://scene7.samsclub.com/is/image/samsclub/0040649555859_A</image:loc>
    </image:image>
  </url>
</urlset>
Python 3.x
from lxml import html

# The HTML parser is lenient, so the bare "&" in the second <loc> does not abort parsing.
tree = html.parse("sitemap_products_1.xml")
for elem in tree.findall(".//url/loc"):
    print(elem.text)
Output (Notice the second url is printed in its entirety.)
https://www.samsclub.com/sams/mirror-convex/prod13760282.ip
https://www.samsclub.com/sams/at&t-3-handset-cordless-phone/prod21064454.ip
https://www.samsclub.com/sams/premium-free-flow-waterbed-mattress-kit-queen/104864.ip
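If lxml is not available, the same idea (run the text through a lenient HTML-style parser instead of an XML parser) can be approximated with Python's standard-library html.parser. This is a swapped-in technique, not the lxml approach shown above, and the class name LocCollector and the shortened sample are assumptions for illustration. As with lxml.html, there is no namespace handling: image:loc is simply a different tag name from loc.

```python
from html.parser import HTMLParser


class LocCollector(HTMLParser):
    """Collect the text of every <loc> element without requiring well-formed XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_loc = False
        self.parts = []
        self.locs = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; "image:loc" does not match "loc".
        if tag == "loc":
            self.in_loc = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "loc" and self.in_loc:
            self.locs.append("".join(self.parts).strip())
            self.in_loc = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self.in_loc:
            self.parts.append(data)


# Shortened, illustrative sample with the same bare "&" problem.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>https://www.samsclub.com/sams/mirror-convex/prod13760282.ip</loc></url>
<url><loc>https://www.samsclub.com/sams/at&t-3-handset-cordless-phone/prod21064454.ip</loc>
<image:image><image:loc>https://scene7.samsclub.com/is/image/samsclub/0065053003067_A</image:loc></image:image></url>
</urlset>"""

collector = LocCollector()
collector.feed(sample)
collector.close()
for loc in collector.locs:
    print(loc)
```

An event-based parser like this keeps no tree, so anything beyond simple extraction (for example, pairing each loc with its image:title) needs extra state in the handler methods.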
Java DOM transforming and parsing arbitrary strings with invalid XML characters?
As @VGR and @kjhughes have pointed out in the comments below the question, Base64 is indeed a possible answer to my question. I now have a further solution to my problem, which is based on escaping. I have written two functions, escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string), which can be used in the following way.
String string = "text#text##text#0;text" + '\u0000' + "text<text&text#";
Document document = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
Element element = document.createElement("element");
element.appendChild(document.createTextNode(escapeInvalidXmlCharacters(string)));
document.appendChild(element);
TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(new File("test.xml")));
// creates <?xml version="1.0" encoding="UTF-8" standalone="no"?><element>text##text####text##0;text#0;text&lt;text&amp;text##</element>
document = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new File("test.xml"));
System.out.println(unescapeInvalidXmlCharacters(document.getDocumentElement().getTextContent()).equals(string));
// prints true
escapeInvalidXmlCharacters(String string) and unescapeInvalidXmlCharacters(String string):
/**
 * Escapes invalid XML Unicode code points in a <code>{@link String}</code>. The
 * DOM API already escapes predefined entities, such as {@code &quot;}, {@code &amp;},
 * {@code &apos;}, {@code &lt;} and {@code &gt;} for
 * <code>{@link org.w3c.dom.Text Text}</code> nodes. Therefore, these Unicode
 * code points are ignored by this function. However, there are some other
 * invalid XML Unicode code points, such as {@code '\u0000'}, which are even
 * invalid in their escaped form, such as {@code "&#0;"}.
 * <p>
 * This function replaces all {@code '#'} by {@code "##"} and all Unicode code
 * points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] |
 * [#xE000-#xFFFD] | [#x10000-#x10FFFF] by the <code>{@link String}</code>
 * {@code "#c;"}, where <code>c</code> is the Unicode code point.
 *
 * @param string the <code>{@link String}</code> to be escaped
 * @return the escaped <code>{@link String}</code>
 * @see <code>{@link #unescapeInvalidXmlCharacters(String)}</code>
 */
public static String escapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);
        if (codePoint == '#') {
            // '#' is the escape character, so it is doubled.
            stringBuilder.append("##");
        } else if (codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || codePoint >= 0x20 && codePoint <= 0xD7FF
                || codePoint >= 0xE000 && codePoint <= 0xFFFD
                || codePoint >= 0x10000 && codePoint <= 0x10FFFF) {
            // Valid XML character: copy it unchanged.
            stringBuilder.appendCodePoint(codePoint);
        } else {
            // Invalid XML character: write it as "#<decimal code point>;".
            stringBuilder.append("#" + codePoint + ";");
        }
    }
    return stringBuilder.toString();
}
/**
 * Unescapes invalid XML Unicode code points in a <code>{@link String}</code>.
 * Makes <code>{@link #escapeInvalidXmlCharacters(String)}</code> undone.
 *
 * @param string the <code>{@link String}</code> to be unescaped
 * @return the unescaped <code>{@link String}</code>
 * @see <code>{@link #escapeInvalidXmlCharacters(String)}</code>
 */
public static String unescapeInvalidXmlCharacters(String string) {
    StringBuilder stringBuilder = new StringBuilder();
    boolean escaped = false;
    for (int i = 0, codePoint = 0; i < string.length(); i += Character.charCount(codePoint)) {
        codePoint = string.codePointAt(i);
        if (escaped) {
            // The previous '#' was not a numeric escape: copy this code point literally.
            stringBuilder.appendCodePoint(codePoint);
            escaped = false;
        } else if (codePoint == '#') {
            // Look ahead for "<digits>;" following the '#'.
            StringBuilder intBuilder = new StringBuilder();
            int j;
            for (j = i + 1; j < string.length(); j += Character.charCount(codePoint)) {
                codePoint = string.codePointAt(j);
                if (codePoint == ';') {
                    escaped = true;
                    break;
                }
                if (codePoint >= 48 && codePoint <= 57) {
                    intBuilder.appendCodePoint(codePoint);
                } else {
                    break;
                }
            }
            if (escaped) {
                try {
                    codePoint = Integer.parseInt(intBuilder.toString());
                    stringBuilder.appendCodePoint(codePoint);
                    escaped = false;
                    i = j;
                } catch (IllegalArgumentException e) {
                    codePoint = '#';
                    escaped = true;
                }
            } else {
                codePoint = '#';
                escaped = true;
            }
        } else {
            stringBuilder.appendCodePoint(codePoint);
        }
    }
    return stringBuilder.toString();
}
Note that these functions are probably very inefficient and can be written in a better way. Feel free to post suggestions to improve the code in the comments.
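For comparison, the same escaping scheme ('#' doubled, invalid code points written as "#<decimal>;") can be sketched much more compactly with regular expressions. The version below is in Python rather than Java, the function names are hypothetical, and it is meant as an illustration of the idea rather than a drop-in port: in particular, the unescape step assumes its input was produced by the matching escape function.

```python
import re

# '#' itself, or any code point outside the XML 1.0 Char production.
_NEEDS_ESCAPE = re.compile(
    r"#|[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)
# An escaped '#' ("##") or an escaped code point ("#<decimal>;").
_ESCAPED = re.compile(r"#(?:#|([0-9]+);)")


def escape_invalid_xml_characters(text):
    """Double every '#' and write invalid code points as '#<decimal>;'."""
    def repl(match):
        ch = match.group(0)
        return "##" if ch == "#" else f"#{ord(ch)};"
    return _NEEDS_ESCAPE.sub(repl, text)


def unescape_invalid_xml_characters(text):
    """Reverse escape_invalid_xml_characters()."""
    def repl(match):
        digits = match.group(1)
        return "#" if digits is None else chr(int(digits))
    return _ESCAPED.sub(repl, text)


# Same test string as in the Java usage example above.
original = "text#text##text#0;text\u0000text<text&text#"
escaped = escape_invalid_xml_characters(original)

print(escaped)
print(unescape_invalid_xml_characters(escaped) == original)
```

Because the regex scans left to right and consumes "##" as a unit, a literal "#0;" in the input (escaped as "##0;") is not confused with an escaped NUL ("#0;").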