XML splitting of BIG file using Java
Assuming a flat structure where the root element of the document R has a large number of children named X, the following XSLT 2.0 transformation will split the file every Nth X element.
<t:transform xmlns:t="http://www.w3.org/1999/XSL/Transform"
    version="2.0">
  <t:param name="N" select="100"/>
  <t:template match="/*">
    <t:for-each-group select="X"
        group-adjacent="(position()-1) idiv $N">
      <t:result-document href="{position()}.xml">
        <R>
          <t:copy-of select="current-group()"/>
        </R>
      </t:result-document>
    </t:for-each-group>
  </t:template>
</t:transform>
If you want to run this in streaming mode (without building the source tree in memory), then (a) add <t:mode streamable="yes"/> to the stylesheet (this is xsl:mode, written with the t: prefix declared above), and (b) run it using an XSLT 3.0 processor (Saxon-EE or Exselt).
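For readers who would rather stay in plain Java than run an XSLT 2.0 processor, the same every-N grouping can be sketched with the JDK's StAX event API. This is an alternative technique, not the stylesheet above. The class name ChunkSplitter and the in-memory string output are assumptions made for illustration; a real splitter would write each chunk to its own file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper, not part of the original answer.
public class ChunkSplitter {

    /** Copies the root's children into documents of at most n children, each wrapped in <R>. */
    public static List<String> split(String xml, int n) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();

        List<String> chunks = new ArrayList<>();
        StringWriter buffer = null;
        XMLEventWriter writer = null;
        int depth = 0;    // 1 = inside the root, 2 or more = inside a child
        int inChunk = 0;  // children copied into the current chunk

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
                if (depth == 2 && writer == null) {
                    // First child of a new chunk: open a fresh <R> wrapper.
                    buffer = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(buffer);
                    writer.add(events.createStartElement("", "", "R"));
                }
                if (depth >= 2) {
                    writer.add(event);
                }
            } else if (event.isEndElement()) {
                if (depth >= 2) {
                    writer.add(event);
                }
                depth--;
                if (depth == 1 && ++inChunk == n) {
                    chunks.add(finish(writer, events, buffer));
                    writer = null;
                    inChunk = 0;
                }
            } else if (depth >= 2) {
                writer.add(event); // text and comments inside a child
            }
        }
        if (writer != null) {
            chunks.add(finish(writer, events, buffer)); // last, possibly shorter, chunk
        }
        return chunks;
    }

    private static String finish(XMLEventWriter writer, XMLEventFactory events, StringWriter buffer)
            throws XMLStreamException {
        writer.add(events.createEndElement("", "", "R"));
        writer.flush();
        writer.close();
        return buffer.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<R><X>1</X><X>2</X><X>3</X><X>4</X><X>5</X></R>";
        List<String> chunks = split(xml, 2);
        System.out.println(chunks.size() + " chunks");
        chunks.forEach(System.out::println);
    }
}
```

With five X elements and n = 2, the sketch yields three chunks holding two, two, and one X element. Whitespace between the children of the root is dropped, and only one event is held in memory at a time.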
Splitting a larger size XML file using Java (Retaining Parent's attributes and Siblings)
Consider using XSLT, the declarative, special-purpose language for transforming XML documents, instead of XPath, since you need a whole-document transformation. For your purposes, a dynamically generated XSLT stylesheet, embedded in the Java code and run in a loop over the values, can output multiple XML files:
XSLT Script (embedded below; the example uses 'abc', which is replaced with each value in turn)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output version="1.0" encoding="UTF-8" indent="yes" />
  <xsl:strip-space elements="*"/>
  <!-- Identity Transform -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Drop every child whose value attribute is not 'abc' -->
  <xsl:template match="child[not(@value='abc')]"/>
</xsl:transform>
Java Code
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.*;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class XmlSplit {
    public static void main(String[] args) throws IOException, SAXException,
                                                  ParserConfigurationException,
                                                  TransformerException {
        // Load XML Source
        String inputXML = "/path/to/XMLSource.xml";
        // Declare XML Values Array
        String[] xmlVals = {"abc", "xyz"};
        // Iterate through Values running dynamic, embedded XSLT
        for (String s: xmlVals) {
            String outputXML = "/path/to/output_" + s + ".xml";
            String xslStr = String.join("\n",
                "<xsl:transform xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">",
                "<xsl:output version=\"1.0\" encoding=\"UTF-8\" indent=\"yes\" />",
                "<xsl:strip-space elements=\"*\"/>",
                "<xsl:template match=\"@*|node()\">",
                "<xsl:copy>",
                "<xsl:apply-templates select=\"@*|node()\"/>",
                "</xsl:copy>",
                "</xsl:template>",
                "<xsl:template match=\"child[not(@value='"+ s +"')]\"/>",
                "</xsl:transform>");
            Source xslt = new StreamSource(new StringReader(xslStr));
            DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
            Document doc = docBuilder.parse(new File(inputXML));
            // XSLT Transformation with pretty print
            TransformerFactory prettyPrint = TransformerFactory.newInstance();
            Transformer transformer = prettyPrint.newTransformer(xslt);
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            transformer.setOutputProperty(OutputKeys.STANDALONE, "yes");
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
            // Output Result to File
            DOMSource source = new DOMSource(doc);
            StreamResult result = new StreamResult(new File(outputXML));
            transformer.transform(source, result);
        }
    }
}
Java 8 - Split huge XML file using Stax gives unexpected results
Thanks to Andreas, this is the solution:
String testCars = "<root><car><name>car1</name></car><other><something>Unknown</something></other><car><name>car2</name></car></root>";
XMLInputFactory factory = XMLInputFactory.newInstance();
try {
    XMLStreamReader streamReader = factory.createXMLStreamReader(new StringReader(testCars));
    streamReader.nextTag(); // move to <root>
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer t = tf.newTransformer();
    streamReader.nextTag(); // move to the first child of <root>
    while (streamReader.hasNext()) {
        if (streamReader.isStartElement()) {
            // Copy the current element. With the JDK's built-in transformer,
            // the reader is left on the event after the element's end tag.
            StringWriter writer = new StringWriter();
            StreamResult result = new StreamResult(writer);
            t.transform(new StAXSource(streamReader), result);
            System.out.println("XmlElement: " + writer.toString());
        } else {
            // Skip whitespace between elements and the closing </root>
            streamReader.next();
        }
    }
} catch (Exception e) { ... }
Input is:
<root>
<car>
<name>car1</name>
</car>
<other>
<something>Unknown</something>
</other>
<car>
<name>car2</name>
</car>
</root>
Output is:
XmlElement: <?xml version="1.0" encoding="UTF-8"?><car><name>car1</name></car>
XmlElement: <?xml version="1.0" encoding="UTF-8"?><other><something>Unknown</something></other>
XmlElement: <?xml version="1.0" encoding="UTF-8"?><car><name>car2</name></car>
Is there a Java library for splitting a large XML file into smaller valid XML files with a max KB size?
There is no automatic way to split a big XML document into several smaller XML documents.
As an extreme simplification, a single XML document represents a single object with properties.
Splitting it into different XML documents means splitting a single object into multiple objects. This is not something that can be done automatically.
Let me show a simple example. Imagine you have this XML:
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
How do you split it? Is the following a valid way to split it? (How to split and recombine it is a business decision.)
<note>
<to>Tove</to>
<from>Jani</from>
</note>
<note>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
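If the problem is not about splitting a big XML document into smaller XML documents, but about splitting a single big file into smaller files, you can split it as follows (the pieces are not well-formed XML on their own and must be concatenated again before parsing):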
<note>
<to>Tove</to>
<from>Jani</from>
and
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
But if the problem is the size of the file when sending it over the internet or saving it to disk, also consider compressing it. Compressing an XML file usually produces a much smaller result. If needed, you can then split the compressed file.
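As a rough illustration of the compression suggestion, the following sketch gzips an XML string with the JDK's java.util.zip classes and restores it again. The class name XmlGzip and the repeated note document are assumptions made for the example; the actual ratio depends on how repetitive the data is.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only.
public class XmlGzip {

    /** Compresses the UTF-8 bytes of the given XML text with gzip. */
    public static byte[] compress(String xml) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    /** Restores the original XML text from gzip-compressed bytes. */
    public static String decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] block = new byte[8192];
            int read;
            while ((read = gzip.read(block)) != -1) {
                bytes.write(block, 0, read);
            }
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Build a repetitive document, which is typical for large XML exports.
        StringBuilder xml = new StringBuilder("<notes>");
        for (int i = 0; i < 1000; i++) {
            xml.append("<note><to>Tove</to><from>Jani</from></note>");
        }
        xml.append("</notes>");

        byte[] compressed = compress(xml.toString());
        System.out.println("original bytes:   " + xml.length());
        System.out.println("compressed bytes: " + compressed.length);
    }
}
```

The compressed bytes are an opaque stream, so splitting them afterwards is a plain byte-level split: the parts have to be joined again before decompressing.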
If instead the problem is holding the whole file in memory, simply don't do that. Use a SAX parser instead of a DOM parser so that you hold only a small portion of the original XML in memory. A SAX parser is described as follows:
SAX (Simple API for XML) is an event-driven online algorithm for parsing XML documents, with an API developed by the XML-DEV mailing list. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM). Where the DOM operates on the document as a whole (building the full abstract syntax tree of an XML document for convenience of the user), SAX parsers operate on each piece of the XML document sequentially, issuing parsing events while making a single pass through the input stream.
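A minimal sketch of that event-driven style, using the SAX parser bundled with the JDK: the handler only counts element names as they stream past, so no document tree is built. The class name SaxElementCounter and the counting task are assumptions chosen for illustration.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not taken from the original answer.
public class SaxElementCounter {

    /** Counts how often each element name occurs, in a single pass over the input. */
    public static Map<String, Integer> count(String xml) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Called once per opening tag; nothing else is retained.
                counts.merge(qName, 1, Integer::sum);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return counts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<note><to>Tove</to><from>Jani</from>"
                + "<heading>Reminder</heading>"
                + "<body>Don't forget me this weekend!</body></note>";
        System.out.println(count(xml));
    }
}
```

For a real file, the InputSource would wrap a FileInputStream instead of a StringReader; the handler stays the same.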
Splitting Large XML Files Java (StAX)
The name of the method openOutputFileAndWriteHeader hints that it creates a new file, so it is not the proper place for the footer.
In BigXmlTestIteratorApi.java, starting at line 74, you can see this code:
xmlEventWriter.close(); // Also closes any open Element(s) and the document
xmlEventWriter = openOutputFileAndWriteHeader(++fileNumber); // Continue with next file
dataRepetitions = 0;
To add a footer, you need to write it before closing the file:
writeFooter(footer);
xmlEventWriter.close();
xmlEventWriter = openOutputFileAndWriteHeader(++fileNumber);
dataRepetitions = 0;
Please note that creating an instance of FooterType for each file may be superfluous. It could be created once outside the loop, for example at line 60.
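The ordering described above (footer first, then close, then open the next file) can be sketched with the JDK's XMLStreamWriter. This is not the code from BigXmlTestIteratorApi.java: the element names, the plain string footer, and the in-memory StringWriter targets are assumptions standing in for that project's files and its FooterType.

```java
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical sketch; all names below are illustrative.
public class FooterRotationSketch {
    private static final XMLOutputFactory FACTORY = XMLOutputFactory.newInstance();

    // Opens a new output and writes only the header, as the method name suggests.
    static XMLStreamWriter openOutputAndWriteHeader(StringWriter target) throws XMLStreamException {
        XMLStreamWriter writer = FACTORY.createXMLStreamWriter(target);
        writer.writeStartDocument();
        writer.writeStartElement("records");
        writer.writeStartElement("header");
        writer.writeCharacters("hypothetical header");
        writer.writeEndElement();
        return writer;
    }

    static void writeFooter(XMLStreamWriter writer, String footer) throws XMLStreamException {
        writer.writeStartElement("footer");
        writer.writeCharacters(footer);
        writer.writeEndElement();
    }

    // Writes the footer, closes the still-open <records> element, and closes the writer.
    static void finish(XMLStreamWriter writer, String footer) throws XMLStreamException {
        writeFooter(writer, footer);
        writer.writeEndDocument();
        writer.flush();
        writer.close();
    }

    public static List<String> write(List<String> data, int perFile) throws XMLStreamException {
        String footer = "end of file"; // created once, outside the loop
        List<StringWriter> outputs = new ArrayList<>();
        outputs.add(new StringWriter());
        XMLStreamWriter writer = openOutputAndWriteHeader(outputs.get(0));
        int dataRepetitions = 0;
        for (String item : data) {
            if (dataRepetitions == perFile) {
                finish(writer, footer); // footer goes in before the output is closed
                outputs.add(new StringWriter());
                writer = openOutputAndWriteHeader(outputs.get(outputs.size() - 1));
                dataRepetitions = 0;
            }
            writer.writeStartElement("data");
            writer.writeCharacters(item);
            writer.writeEndElement();
            dataRepetitions++;
        }
        finish(writer, footer); // the last output needs its footer too
        List<String> result = new ArrayList<>();
        for (StringWriter output : outputs) {
            result.add(output.toString());
        }
        return result;
    }
}
```

Calling write with three items and perFile = 2 produces two outputs, each ending with its own footer element just before the closing records tag.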
Splitting a large XML file with Apache Camel using split, stax, jaxb
Use PartRecord instead of PartRecords in your route:
from("sftp:localhost:22/in")
.split(stax(PartRecord.class)).streaming()
.marshal().json(JsonLibrary.Jackson, true)
.to("rabbitmq://rabbitmq:5672/myExchange?queue=partQueue&routingKey=queue.part")
.end();
Splitting of a large XML file into small Chunks based on repeated elements
It is a very simple modification to your existing code. There are multiple ways to do this; I will show just one of them: explicitly comparing the attribute value using VTDNav's getAttrVal() method.
import com.ximpleware.AutoPilot;
import com.ximpleware.NavException;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;
import java.io.FileOutputStream;

public static void main1(String args[]) {
    try {
        VTDGen vg = new VTDGen();
        if (vg.parseFile("C:\\..\\example.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/Parents/process");
            int chunk = 0;
            FileOutputStream fopsA = new FileOutputStream("C:\\....\\resultA" + chunk + ".xml");
            fopsA.write("<Parent>\n".getBytes());
            FileOutputStream fopsB = new FileOutputStream("C:\\....\\resultB" + chunk + ".xml");
            // resultB needs the opening tag as well, since it gets a closing tag below
            fopsB.write("<Parent>\n".getBytes());
            while (ap.evalXPath() != -1) {
                // Lower 32 bits hold the fragment offset, upper 32 bits its length
                long frag = vn.getElementFragment();
                int i = vn.getAttrVal("Child");
                if (i == -1) throw new NavException("unexpected result");
                if (vn.compareTokenString(i, "A") == 0) {
                    fopsA.write(vn.getXML().getBytes(), (int) frag,
                            (int) (frag >> 32));
                } else if (vn.compareTokenString(i, "B") == 0) {
                    fopsB.write(vn.getXML().getBytes(), (int) frag,
                            (int) (frag >> 32));
                }
                chunk++;
            }
            fopsA.write("</Parent>\n".getBytes());
            fopsB.write("</Parent>\n".getBytes());
            // Close both streams so the output is flushed to disk
            fopsA.close();
            fopsB.close();
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}