Split a 1 GB XML File Using Java

XML splitting of BIG file using Java

Assuming a flat structure where the root element of the document R has a large number of children named X, the following XSLT 2.0 transformation will split the file every Nth X element.

<t:transform xmlns:t="http://www.w3.org/1999/XSL/Transform"
version="2.0">
<t:param name="N" select="100"/>
<t:template match="/*">
<t:for-each-group select="X"
group-adjacent="(position()-1) idiv $N">
<t:result-document href="{position()}.xml">
<R>
<t:copy-of select="current-group()"/>
</R>
</t:result-document>
</t:for-each-group>
</t:template>
</t:transform>

If you want to run this in streaming mode (without building the source tree in memory), then (a) add <t:mode streamable="yes"/> as a top-level declaration (the stylesheet above binds the XSLT namespace to the prefix t, not xsl), and (b) run it using an XSLT 3.0 processor (Saxon-EE or Exselt).
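For readers without an XSLT 2.0 or 3.0 processor, the same every-N split can be approximated in plain Java with StAX, which also avoids building the tree in memory. This is a minimal sketch and not the stylesheet author's method: the class name StaxChunker, the split helper, and returning each chunk as a string (instead of writing 1.xml, 2.xml and so on) are assumptions made for illustration. It assumes a flat document without namespaces and, like select="X" above, drops root children that are not named X.

```java
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;

public class StaxChunker {

    // Hypothetical helper: returns one XML string per group of up to n child elements.
    static List<String> split(String xml, String childName, int n) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader(xml));
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        XMLEventFactory eventFactory = XMLEventFactory.newInstance();

        List<String> chunks = new ArrayList<>();
        StringWriter buffer = null;
        XMLEventWriter writer = null;
        String rootName = null;
        int depth = 0;           // current element nesting depth
        int inChunk = 0;         // matching children written to the current chunk
        boolean copying = false; // true while inside a matching child element

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                depth++;
                String name = event.asStartElement().getName().getLocalPart();
                if (depth == 1) {
                    rootName = name; // remember the root so each chunk can repeat it
                } else if (depth == 2 && name.equals(childName)) {
                    if (writer == null) {
                        // Start a new chunk wrapped in a copy of the root element
                        buffer = new StringWriter();
                        writer = outputFactory.createXMLEventWriter(buffer);
                        writer.add(eventFactory.createStartElement("", "", rootName));
                    }
                    copying = true;
                }
            }
            if (copying) {
                writer.add(event); // copy the child and everything nested inside it
            }
            if (event.isEndElement()) {
                if (copying && depth == 2) {
                    copying = false;
                    if (++inChunk == n) {
                        chunks.add(finish(writer, buffer, eventFactory, rootName));
                        writer = null;
                        inChunk = 0;
                    }
                }
                depth--;
            }
        }
        if (writer != null) {
            // Last chunk holds fewer than n children
            chunks.add(finish(writer, buffer, eventFactory, rootName));
        }
        return chunks;
    }

    // Closes the wrapper root element and returns the chunk text.
    private static String finish(XMLEventWriter writer, StringWriter buffer,
                                 XMLEventFactory eventFactory, String rootName)
            throws XMLStreamException {
        writer.add(eventFactory.createEndElement("", "", rootName));
        writer.flush();
        writer.close();
        return buffer.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<R><X>1</X><X>2</X><X>3</X><X>4</X><X>5</X></R>";
        for (String chunk : split(xml, "X", 2)) {
            System.out.println(chunk);
        }
    }
}
```

For a real 1 GB input, the StringReader would be replaced by a file stream and each chunk written to its own file instead of being collected in a list.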

Splitting a larger size XML file using Java (Retaining Parent's attributes and Siblings)

Consider using XSLT, the declarative, special-purpose language for transforming XML documents, rather than XPath alone, since you need to transform the whole document. For your purposes, an XSLT stylesheet embedded in the Java code and rebuilt inside a loop over the values can output multiple XML files:

XSLT script (embedded in the Java code below; this example uses 'abc', which the loop replaces with each value in turn)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output version="1.0" encoding="UTF-8" indent="yes" />
<xsl:strip-space elements="*"/>

<!-- Identity Transform -->
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>

<xsl:template match="child[not(@value='abc')]"/>

</xsl:transform>

Java code

import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

import java.io.*;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class XmlSplit {
    public static void main(String[] args) throws IOException, SAXException,
            ParserConfigurationException, TransformerException {

        // Load XML Source
        String inputXML = "/path/to/XMLSource.xml";

        // Declare XML Values Array
        String[] xmlVals = {"abc", "xyz"};

        // Parse the source once; the same DOM is reused for every value
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        Document doc = docBuilder.parse(new File(inputXML));
        TransformerFactory transformerFactory = TransformerFactory.newInstance();

        // Iterate through Values running dynamic, embedded XSLT
        for (String s : xmlVals) {
            String outputXML = "/path/to/output_" + s + ".xml";

            String xslStr = String.join("\n",
                "<xsl:transform xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">",
                "<xsl:output version=\"1.0\" encoding=\"UTF-8\" indent=\"yes\" />",
                "<xsl:strip-space elements=\"*\"/>",
                "<xsl:template match=\"@*|node()\">",
                "<xsl:copy>",
                "<xsl:apply-templates select=\"@*|node()\"/>",
                "</xsl:copy>",
                "</xsl:template>",
                "<xsl:template match=\"child[not(@value='" + s + "')]\"/>",
                "</xsl:transform>");

            // XSLT Transformation with pretty print
            Source xslt = new StreamSource(new StringReader(xslStr));
            Transformer transformer = transformerFactory.newTransformer(xslt);

            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            transformer.setOutputProperty(OutputKeys.STANDALONE, "yes");
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

            // Output Result to File
            DOMSource source = new DOMSource(doc);
            StreamResult result = new StreamResult(new File(outputXML));
            transformer.transform(source, result);
        }
    }
}

Java 8 - Split huge XML file using Stax gives unexpected results

Thanks to Andreas, this is the solution:

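import java.io.*;
import javax.xml.stream.*;
import javax.xml.transform.*;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamResult;

String testCars = "<root><car><name>car1</name></car><other><something>Unknown</something></other><car><name>car2</name></car></root>";
XMLInputFactory factory = XMLInputFactory.newInstance();
try {
    XMLStreamReader streamReader = factory.createXMLStreamReader(new StringReader(testCars));
    streamReader.nextTag(); // advance to <root>
    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer t = tf.newTransformer();
    streamReader.nextTag(); // advance to the first child of <root>
    while (true) {
        // With the JDK's built-in transformer, transform() leaves the reader on the event
        // after the copied element, so skip whitespace and comments up to the next tag.
        while (!streamReader.isStartElement() && !streamReader.isEndElement()) {
            streamReader.next();
        }
        if (streamReader.isEndElement()) break; // reached </root>
        StringWriter writer = new StringWriter();
        StreamResult result = new StreamResult(writer);
        t.transform(new StAXSource(streamReader), result);
        System.out.println("XmlElement: " + writer.toString());
    }
} catch (Exception e) { ... }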

Input is:

<root>
<car>
<name>car1</name>
</car>
<other>
<something>Unknown</something>
</other>
<car>
<name>car2</name>
</car>
</root>

Output is:

XmlElement: <?xml version="1.0" encoding="UTF-8"?><car><name>car1</name></car>
XmlElement: <?xml version="1.0" encoding="UTF-8"?><other><something>Unknown</something></other>
XmlElement: <?xml version="1.0" encoding="UTF-8"?><car><name>car2</name></car>

Is there a Java library for splitting a large XML file into smaller valid XML files with a max KB size?

There is no automatic way to split a big XML document into several smaller XML documents.

As an extreme simplification, a single XML document represents a single object with properties.
Splitting it into different XML documents means splitting a single object into multiple objects. This is not something that can be done automatically.

Here is a simple example. Imagine you have this XML:

<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

How do you split it? Is the following a valid way to split it? (How to split and recombine it is a business decision.)

<note>
<to>Tove</to>
<from>Jani</from>
</note>

<note>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

If the problem is not about splitting a big XML document into smaller XML documents, but about splitting a single big file into smaller files, you can split it as

<note>
<to>Tove</to>
<from>Jani</from>

and

  <heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

But if the problem is the size of the file when sending it over the internet or saving it to disk, consider compressing it as well. Compressing an XML file usually produces a much smaller result. You can then split the compressed file if needed.
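As a rough illustration of the compression suggestion, the following sketch gzips XML bytes with the JDK's java.util.zip classes and restores them again. The class name XmlGzip and its two helper methods are hypothetical, and the in-memory byte arrays stand in for the file streams a real large document would need.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class XmlGzip {

    // Hypothetical helper: gzip-compresses the given bytes.
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
            gzip.write(data);
        } // closing the stream writes the gzip trailer
        return bytes.toByteArray();
    }

    // Hypothetical helper: restores the original bytes from gzip data.
    static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                bytes.write(chunk, 0, read);
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Repetitive markup, as is typical for large XML exports
        StringBuilder xml = new StringBuilder("<notes>");
        for (int i = 0; i < 1000; i++) {
            xml.append("<note><to>Tove</to><from>Jani</from></note>");
        }
        xml.append("</notes>");

        byte[] original = xml.toString().getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(original);
        System.out.println(original.length + " bytes -> " + packed.length + " bytes");
    }
}
```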

If the problem is instead holding the whole file in memory, simply don't do that. Use a SAX parser instead of a DOM parser so that only a small portion of the original XML is held in memory at a time. A SAX parser is described as follows:

SAX (Simple API for XML) is an event-driven online algorithm for parsing XML documents, with an API developed by the XML-DEV mailing list. SAX provides a mechanism for reading data from an XML document that is an alternative to that provided by the Document Object Model (DOM). Where the DOM operates on the document as a whole, building the full abstract syntax tree of an XML document for convenience of the user, SAX parsers operate on each piece of the XML document sequentially, issuing parsing events while making a single pass through the input stream.
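To make the event-driven model concrete, here is a minimal sketch that counts elements with the JDK's SAX parser without ever building a tree. The class name SaxCounter and the countElements helper are made up for this example; a real splitter would write output in the callbacks instead of counting.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;

public class SaxCounter {

    // Hypothetical helper: counts elements with the given name in a single pass.
    static int countElements(String xml, String name) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attributes) {
                // Called once per start tag; only the current event is held in memory
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><car><name>car1</name></car><other/><car><name>car2</name></car></root>";
        System.out.println("car elements: " + countElements(xml, "car"));
    }
}
```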

Splitting Large XML Files Java (StAX)

The name of the method openOutputFileAndWriteHeader suggests that it creates a new file, so it is not the proper place to write the footer.

In BigXmlTestIteratorApi.java, starting at line 74, you can see this code:

xmlEventWriter.close(); // Also closes any open Element(s) and the document
xmlEventWriter = openOutputFileAndWriteHeader(++fileNumber); // Continue with next file
dataRepetitions = 0;

To add a footer, write it before closing the file:

writeFooter(footer);
xmlEventWriter.close();
xmlEventWriter = openOutputFileAndWriteHeader(++fileNumber);
dataRepetitions = 0;

Please note that creating an instance of FooterType for each file may be superfluous. It could be created outside the loop, for example at line 60.

Splitting a large XML file with Apache Camel using split, stax, jaxb

Use PartRecord instead of PartRecords in your route:

 from("sftp:localhost:22/in")
.split(stax(PartRecord.class)).streaming()
.marshal().json(JsonLibrary.Jackson, true)
.to("rabbitmq://rabbitmq:5672/myExchange?queue=partQueue&routingKey=queue.part")
.end();

Splitting of a large XML file into small Chunks based on repeated elements

It is a very simple modification to your existing code. There are multiple ways to do this; I will show just one of them: explicitly comparing the attribute value using VTDNav's getAttrVal() method.

public static void main1(String args[]) {
    try {
        VTDGen vg = new VTDGen();
        if (vg.parseFile("C:\\..\\example.xml", true)) {
            VTDNav vn = vg.getNav();
            AutoPilot ap = new AutoPilot(vn);
            ap.selectXPath("/Parents/process");
            int chunk = 0;
            FileOutputStream fopsA = new FileOutputStream("C:\\....\\resultA" + chunk + ".xml");
            fopsA.write("<Parent>\n".getBytes());
            FileOutputStream fopsB = new FileOutputStream("C:\\....\\resultB" + chunk + ".xml");
            // resultB needs the same opening tag as resultA, since both get a closing tag below
            fopsB.write("<Parent>\n".getBytes());
            while (ap.evalXPath() != -1) {
                // Lower 32 bits hold the fragment offset, upper 32 bits its length
                long frag = vn.getElementFragment();
                int i = vn.getAttrVal("Child");
                if (i == -1) throw new NavException("unexpected result");
                if (vn.compareTokenString(i, "A") == 0) {
                    fopsA.write(vn.getXML().getBytes(), (int) frag, (int) (frag >> 32));
                } else if (vn.compareTokenString(i, "B") == 0) {
                    fopsB.write(vn.getXML().getBytes(), (int) frag, (int) (frag >> 32));
                }
                chunk++;
            }
            fopsA.write("</Parent>\n".getBytes());
            fopsB.write("</Parent>\n".getBytes());
            fopsA.close();
            fopsB.close();
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}

