Java XML Parser for Huge Files

How to Parse Big (50 GB) XML Files in Java

Your parsing code is likely working fine, but the volume of data you're loading is probably just too large to hold in memory in that ArrayList.

You need some sort of pipeline to pass the data on to its actual destination without ever storing it all in memory at once.

What I've sometimes done for this sort of situation is similar to the following.

Create an interface for processing a single element:

public interface PageProcessor {
    void process(Page page);
}

Supply an implementation of this to the PageHandler through a constructor:

public class Read {

    public static void main(String[] args) {
        XMLManager.load(new PageProcessor() {
            @Override
            public void process(Page page) {
                // Obviously you want to do something other than just printing,
                // but I don't know what that is...
                System.out.println(page);
            }
        });
    }
}

import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

public class XMLManager {

    public static void load(PageProcessor processor) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            SAXParser parser = factory.newSAXParser();
            File file = new File("pages-articles.xml");
            PageHandler pageHandler = new PageHandler(processor);

            parser.parse(file, pageHandler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            e.printStackTrace();
        }
    }
}

Send data to this processor instead of putting it in the list:

public class PageHandler extends DefaultHandler {

    private final PageProcessor processor;
    private Page page;
    private StringBuilder stringBuilder;
    private boolean idSet = false;

    public PageHandler(PageProcessor processor) {
        this.processor = processor;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        // Unchanged from your implementation
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Unchanged from your implementation
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        // Elided: unchanged from your implementation up to this point

            } else if (qName.equals("page")) {
                // A complete <page> has been read: hand it off instead of adding it to a list
                processor.process(page);
                page = null;
            }
        } else {
            page = null;
        }
    }
}

Of course, you can make your interface handle chunks of multiple records rather than just one: have the PageHandler collect pages in a small local list, periodically send that list off for processing, and then clear it.

Or (perhaps better) you could implement the PageProcessor interface as defined here and build the buffering logic into it, so that it accumulates pages and passes them on for further handling in chunks.
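
As a rough sketch of that second option (the class name, the batch size, and handleBatch are all placeholders for whatever your downstream step really is):

import java.util.ArrayList;
import java.util.List;

// Buffers pages and forwards them downstream in fixed-size batches.
public class BatchingPageProcessor implements PageProcessor {

    private static final int BATCH_SIZE = 1000; // tune to taste
    private final List<Page> buffer = new ArrayList<>();

    @Override
    public void process(Page page) {
        buffer.add(page);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // Call once after parsing finishes so the last partial batch is not lost.
    public void flush() {
        if (!buffer.isEmpty()) {
            handleBatch(buffer);
            buffer.clear();
        }
    }

    private void handleBatch(List<Page> batch) {
        // stand-in: write to a database, a file, a queue, ...
    }
}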

Java XML Parser for huge files

Besides the recommended SAX parsing, you could use the StAX API (a kind of SAX evolution), which is included in the JDK (package javax.xml.stream); a minimal cursor-style example follows the links below.

  • StAX Project Home: http://stax.codehaus.org/Home
  • Brief introduction: http://www.xml.com/pub/a/2003/09/17/stax.html
  • Javadoc: https://docs.oracle.com/javase/8/docs/api/javax/xml/stream/package-summary.html
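
For illustration, a minimal sketch of the cursor-style StAX API (the file and element names here are just placeholders):

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxCursorExample {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (InputStream in = new FileInputStream("pages-articles.xml")) {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("page")) {
                    // pull further events here to read the element's content
                    System.out.println("found <" + reader.getLocalName() + ">");
                }
            }
            reader.close();
        }
    }
}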

Java parse large XML document

As people have mentioned in the comments, loading the whole DOM into memory can be very inefficient, especially for large XML files, so a better approach is to use a SAX parser, which consumes constant memory. The drawback is that you lose the fluent API of having the whole DOM in memory, and visibility is quite limited if you want to perform complicated callback logic in nested nodes.

If all you are interested in doing is parsing particular nodes and node families, rather than the whole XML, there is a better solution that gives you the best of both worlds; it has been blogged about and open-sourced. It's basically a very light wrapper on top of a SAX parser: you register the XML elements you are interested in, and when you get the callback you have their corresponding partial DOM at your disposal to run XPath against.

This way you can keep memory use roughly constant (scaling to XML files of over 1 GB, as documented in the blog mentioned above) while keeping the convenience of XPath-ing the DOM of just the XML elements you are interested in.
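
The library from that blog isn't named here, so purely as an illustration, the same idea can be sketched with nothing beyond the JDK: ignore everything until an element you care about starts, build a small detached DOM for just that subtree, and run XPath over it when the element ends. All names below (PartialDomHandler, the <page> and <title> elements) are made up for the example.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PartialDomHandler extends DefaultHandler {

    private final Document doc; // used only as a factory for detached nodes
    private Element root;       // root of the partial DOM currently being built
    private Element current;    // element we are currently appending to

    public PartialDomHandler() throws Exception {
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (root == null && !qName.equals("page")) {
            return; // not inside an element of interest: skip cheaply
        }
        Element e = doc.createElement(qName);
        for (int i = 0; i < attrs.getLength(); i++) {
            e.setAttribute(attrs.getQName(i), attrs.getValue(i));
        }
        if (root == null) {
            root = e; // entering a <page>: start a fresh partial DOM
        } else {
            current.appendChild(e);
        }
        current = e;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (root == null) {
            return;
        }
        if (current == root) { // the <page> is complete: XPath it, then throw it away
            try {
                String title = (String) XPathFactory.newInstance().newXPath()
                        .evaluate("title", root, XPathConstants.STRING);
                System.out.println(title);
            } catch (Exception ex) {
                throw new RuntimeException(ex);
            }
            root = null;
            current = null;
        } else {
            current = (Element) current.getParentNode();
        }
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("pages-articles.xml"), new PartialDomHandler());
    }
}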

Parsing large XML documents in JAVA

SAX (Simple API for XML) will help you here.

Unlike the DOM parser, the SAX parser does not create an in-memory representation of the XML document and so is faster and uses less memory. Instead, the SAX parser informs clients of the XML document structure by invoking callbacks, that is, by invoking methods on an org.xml.sax.helpers.DefaultHandler instance provided to the parser.

Here is an example implementation:

SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
DefaultHandler handler = new MyHandler();
parser.parse("file.xml", handler);

In MyHandler you define the actions to take when events such as the start/end of the document or of an element are generated.

class MyHandler extends DefaultHandler {

    @Override
    public void startDocument() throws SAXException {
    }

    @Override
    public void endDocument() throws SAXException {
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes attributes) throws SAXException {
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
    }

    // To take specific actions for each chunk of character data (such as
    // adding the data to a node or buffer, or printing it to a file).
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
    }
}
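
One caveat worth remembering with characters(): the parser may deliver the text of a single element across several calls, so accumulate it in a buffer and read it in endElement(). A minimal sketch of that pattern inside a handler:

private final StringBuilder text = new StringBuilder();

@Override
public void characters(char[] ch, int start, int length) throws SAXException {
    // May be called more than once per text node; append rather than assign.
    text.append(ch, start, length);
}

@Override
public void endElement(String uri, String localName, String qName) throws SAXException {
    String value = text.toString().trim(); // complete text of the element just closed
    // ... use value ...
    text.setLength(0); // reset for the next element
}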

Parsing a big xml file Java

Using StAX gives you more control when parsing XML, since you actively pull elements from the stream. This way you can pull the next event, handle it, and once you have found your data, simply terminate the loop (using a flag, or even a return statement if you must).

InputStream in = ...
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader eventReader = factory.createXMLEventReader(in);

boolean found = false;
while (!found && eventReader.hasNext()) {
    XMLEvent event = eventReader.nextEvent();
    switch (event.getEventType()) {
        case XMLStreamConstants.START_ELEMENT:
            // your logic here
            // once you have found your element, you can terminate the loop
            found = true;
            break;
        case XMLStreamConstants.END_ELEMENT:
            // your logic here
            break;
    }
}

(Exception and resource handling omitted for brevity.)
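
If you want the resource handling spelled out, one way to do it is sketched below; XMLEventReader is not AutoCloseable, so it is closed in a finally block (the file name is a placeholder):

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.events.XMLEvent;

public class StaxWithCleanup {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (InputStream in = new FileInputStream("big.xml")) {
            XMLEventReader eventReader = factory.createXMLEventReader(in);
            try {
                while (eventReader.hasNext()) {
                    XMLEvent event = eventReader.nextEvent();
                    // handle the event as in the loop above
                }
            } finally {
                eventReader.close(); // frees parser resources even if handling throws
            }
        }
    }
}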

On a side note, you will gain some performance by combining your repeated if ((i).equals(rsID) && ... conditions into a single outer check on rsID, with the detail checks in nested ifs:

if ((i).equals(rsID)) {
    if (qName.equalsIgnoreCase("GTypeFreq")) {
        ...
    }
}

Parsing very large XML documents (and a bit more) in java

StAX is the right way. I would recommend looking at Woodstox.
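
Woodstox is itself a StAX implementation, so it is used through the same javax.xml.stream API shown above: with the Woodstox jar on the classpath, XMLInputFactory.newInstance() will normally select it via the standard provider lookup, or you can instantiate its factory explicitly (the factory class name below is taken from the Woodstox distribution and should be treated as an assumption):

import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

public class WoodstoxUsage {
    public static void main(String[] args) throws Exception {
        // Explicitly pick the Woodstox implementation; otherwise
        // XMLInputFactory.newInstance() uses whatever provider is on the classpath.
        XMLInputFactory factory = new com.ctc.wstx.stax.WstxInputFactory();
        try (InputStream in = new FileInputStream("big.xml")) { // placeholder file
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            while (reader.hasNext()) {
                reader.next(); // process events exactly as with the built-in parser
            }
            reader.close();
        }
    }
}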


