StAX XMLStreamReader: Check for the Next Event Without Moving Ahead

Is there a way to store the XMLStreamReader values temporarily within another XMLStreamReader for validations?

Posting this answer in case it is useful to someone in the future. After a lot of trial and research, I found that the stream can be stored in a MutableXMLStreamBuffer (class com.sun.xml.stream.buffer.MutableXMLStreamBuffer from the XMLStreamBuffer library).

//Store the XMLStreamReader values within the buffer
final MutableXMLStreamBuffer buffer = new MutableXMLStreamBuffer();
buffer.createFromXMLStreamReader(xmlStreamReader);

You can then read the buffered content back as a fresh XMLStreamReader, as often as needed:

buffer.readAsXMLStreamReader()
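
If the goal is only to look at the upcoming event without consuming it, the event-based StAX API may be enough on its own: XMLEventReader.peek() returns the next event and leaves the reader where it is. This is a different technique from the MutableXMLStreamBuffer approach above, and it looks only one event ahead. A minimal sketch, where the class name PeekExample and the helper nextIsStartOf are made up for illustration:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class PeekExample {

    // Hypothetical helper: true if the NEXT event is a start element with the
    // given local name. The event is not consumed.
    static boolean nextIsStartOf(XMLEventReader reader, String expectedName)
            throws XMLStreamException {
        XMLEvent upcoming = reader.peek(); // null when no events are left
        return upcoming != null
                && upcoming.isStartElement()
                && upcoming.asStartElement().getName().getLocalPart().equals(expectedName);
    }

    public static void main(String[] args) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
                .createXMLEventReader(new StringReader("<order><item/></order>"));
        reader.nextEvent(); // consume the StartDocument event
        System.out.println(nextIsStartOf(reader, "order")); // look ahead only
        XMLEvent consumed = reader.nextEvent();             // now actually move ahead
        System.out.println(consumed.isStartElement());
        reader.close();
    }
}
```

Calling nextIsStartOf repeatedly gives the same answer until nextEvent() is called, so it can guard a validation step before the real processing starts.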

Using StAX to read all text elements

Firstly, if you filter to include only start and end element events then you won't see the text contained inside your leaf nodes at all. I would use a different approach, with an unfiltered stream, like this:

XMLEventReader eventReader = factory.createXMLEventReader(in);
StringBuilder content = null;
while(eventReader.hasNext()) {
  XMLEvent event = eventReader.nextEvent();
  if(event.isStartElement()) {
    // other start element processing here
    content = new StringBuilder();
  } else if(event.isEndElement()) {
    if(content != null) {
      // this was a leaf element
      String leafText = content.toString();
      // do something with the leaf node
    } else {
      // not a leaf
    }
    // in all cases, discard content
    content = null;
  } else if(event.isCharacters()) {
    if(content != null) {
      content.append(event.asCharacters().getData());
    }
  }
  // other event types here
}

The trick is the content = null at the end of the end-element branch. If content is non-null on entry to the if(event.isEndElement()) block, the most recent start or end event was this element's own start tag, with no other element opened or closed in between, i.e. it's a leaf node.
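
To see the leaf detection on a concrete input, here is the same loop wrapped in a method that returns the collected leaf texts. It is only a sketch: the class name LeafTextCollector and the method leafTexts are invented for this example, and whitespace-only text between elements is not treated specially.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class LeafTextCollector {

    // Hypothetical helper: returns the text of every leaf element, in document order.
    static List<String> leafTexts(Reader in) throws XMLStreamException {
        XMLEventReader eventReader = XMLInputFactory.newInstance().createXMLEventReader(in);
        List<String> leaves = new ArrayList<>();
        StringBuilder content = null;
        while (eventReader.hasNext()) {
            XMLEvent event = eventReader.nextEvent();
            if (event.isStartElement()) {
                // A new element opens: start collecting its text.
                content = new StringBuilder();
            } else if (event.isEndElement()) {
                // Non-null content means no child element was seen, so this is a leaf.
                if (content != null) {
                    leaves.add(content.toString());
                }
                content = null;
            } else if (event.isCharacters() && content != null) {
                // Text may arrive in several events; append them all.
                content.append(event.asCharacters().getData());
            }
        }
        eventReader.close();
        return leaves;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<a><b>one</b><c>two &amp; three</c><d/></a>";
        System.out.println(leafTexts(new StringReader(xml)));
    }
}
```

Note that an empty element such as <d/> counts as a leaf with empty text, while the enclosing <a> is skipped because content has already been reset by the time its end tag arrives.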

Using StAX to create index for XML for quick access

You could work with a generated XML parser using ANTLR4.

The following works very well on a ~17GB Wikipedia dump (/20170501/dewiki-20170501-pages-articles-multistream.xml.bz2), but I had to increase the heap size using -Xmx6g.

1. Get XML Grammar

cd /tmp
git clone https://github.com/antlr/grammars-v4

2. Generate Parser

cd /tmp/grammars-v4/xml/
mvn clean install

3. Copy Generated Java files to your Project

cp -r target/generated-sources/antlr4 /path/to/your/project/gen

4. Hook in with a Listener to collect character offsets

package stack43366566;

import java.util.ArrayList;
import java.util.List;

import org.antlr.v4.runtime.ANTLRFileStream;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.tree.ParseTreeWalker;

import stack43366566.gen.XMLLexer;
import stack43366566.gen.XMLParser;
import stack43366566.gen.XMLParser.DocumentContext;
import stack43366566.gen.XMLParserBaseListener;

public class FindXmlOffset {

    List<Integer> offsets = null;
    String searchForElement = null;

    public class MyXMLListener extends XMLParserBaseListener {
        public void enterElement(XMLParser.ElementContext ctx) {
            String name = ctx.Name().get(0).getText();
            if (searchForElement.equals(name)) {
                offsets.add(ctx.start.getStartIndex());
            }
        }
    }

    public List<Integer> createOffsets(String file, String elementName) {
        searchForElement = elementName;
        offsets = new ArrayList<>();
        try {
            XMLLexer lexer = new XMLLexer(new ANTLRFileStream(file));
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            XMLParser parser = new XMLParser(tokens);
            DocumentContext ctx = parser.document();
            ParseTreeWalker walker = new ParseTreeWalker();
            MyXMLListener listener = new MyXMLListener();
            walker.walk(listener, ctx);
            return offsets;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] arg) {
        System.out.println("Search for offsets.");
        List<Integer> offsets = new FindXmlOffset().createOffsets("/tmp/dewiki-20170501-pages-articles-multistream.xml",
                        "page");
        System.out.println("Offsets: " + offsets);
    }

}

5. Result

Prints:

Offsets: [2441, 10854, 30257, 51419 ....

6. Read from Offset Position

To test the code, I've written a class that reads each Wikipedia page into a Java object

@JacksonXmlRootElement
class Page {
   public Page(){};
   public String title;
}

using basically this code

private Page readPage(Integer offset, String filename) {
    try (Reader in = new FileReader(filename)) {
        in.skip(offset);
        ObjectMapper mapper = new XmlMapper();
        mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);
        Page object = mapper.readValue(in, Page.class);
        return object;
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

Find complete example on github.
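
If you would rather not pull in Jackson for the read step, the same idea can be sketched with plain StAX: skip to the stored character offset, start a parser there, and stop as soon as the wanted value has been read. This is an illustration under assumptions, not the code from the answer: the names OffsetReader, skipFully and readTitleAt are invented, and the sketch relies on the parser accepting a fragment as long as it is not asked to read past the end of the first element. The skip loop is there because Reader.skip is allowed to skip fewer characters than requested.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class OffsetReader {

    // Hypothetical helper: keeps skipping until exactly count characters are consumed.
    static void skipFully(Reader in, long count) throws IOException {
        long remaining = count;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                throw new EOFException("Input ended before the requested offset");
            }
            remaining -= skipped;
        }
    }

    // Hypothetical helper: returns the text of the first <title> found at or after
    // the offset, or null if the parser runs out of events first.
    static String readTitleAt(Reader in, long offset) throws IOException, XMLStreamException {
        skipFully(in, offset);
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    return reader.getElementText(); // stop here, never read past the fragment
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><page><title>A</title></page><page><title>B</title></page></root>";
        int secondPage = xml.indexOf("<page>", xml.indexOf("<page>") + 1);
        System.out.println(readTitleAt(new StringReader(xml), secondPage));
    }
}
```

If the element at the offset has no title, the loop continues into whatever follows the fragment and the parser will most likely report an error, so a real implementation would want to stop at the matching end tag.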

Java how to use XMLStreamReader and XMLStreamWriter in the same method

Use an XMLOutputFactory instead to create the output stream:

XMLInputFactory factory = XMLInputFactory.newInstance();
XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
try {
    XMLStreamReader dataXML = factory.createXMLStreamReader(new FileReader(path));
    XMLStreamWriter dataWXML = outFactory.createXMLStreamWriter(new FileWriter(otherPath));
    ...
}

Note the use of another path for the output file
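
For a fuller picture of a reader and a writer working side by side in one method, here is a small sketch that copies elements, attributes and text from one stream to the other. The class name StaxCopy and the method copy are made up for this example. It takes a Reader and a Writer so it can be tried on strings, and it ignores namespaces, comments and processing instructions to stay short.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

class StaxCopy {

    // Hypothetical helper: reads from 'in' and writes an equivalent document to 'out'.
    static void copy(Reader in, Writer out) throws XMLStreamException {
        // The reader comes from the input factory, the writer from the output factory.
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        try {
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        writer.writeStartElement(reader.getLocalName());
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            writer.writeAttribute(reader.getAttributeLocalName(i),
                                    reader.getAttributeValue(i));
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        writer.writeCharacters(reader.getText());
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        writer.writeEndElement();
                        break;
                    default:
                        break; // other event types are skipped in this sketch
                }
            }
            writer.flush();
        } finally {
            // Closing the StAX objects does not close the underlying Reader/Writer.
            writer.close();
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        StringWriter out = new StringWriter();
        copy(new StringReader("<a x=\"1\"><b>hi</b></a>"), out);
        System.out.println(out);
    }
}
```

With files, the call would look like copy(new FileReader(path), new FileWriter(otherPath)), with both streams closed by the caller.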

Problems getting XML node text in StAX XMLStreamConstants.CHARACTERS event

I have solved the problem after struggling and researching a bit.

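The problem was reading text that contains escaped entity references. You need to set the IS_COALESCING property to true on your XMLInputFactory instance (setProperty is an instance method, so call it on the factory object before creating the reader):

factory.setProperty(XMLInputFactory.IS_COALESCING, true);

This tells the parser to coalesce adjacent character data. Without it, the parser may split text around an entity reference such as &amp; into several CHARACTERS events, so reading a single event returns only part of the text. With coalescing enabled, the whole run of text, with the entity already replaced by its character, arrives as one CHARACTERS event.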
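
A quick way to check the effect of the property is to collect the text of every CHARACTERS event for a small input. The sketch below does that; the class name CoalescingDemo and the method textEvents are invented for illustration. How the text is split with coalescing turned off depends on the parser implementation, so the example makes no claim about that case.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class CoalescingDemo {

    // Hypothetical helper: returns the text of each CHARACTERS event, one list entry per event.
    static List<String> textEvents(String xml, boolean coalescing) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Must be set on the factory instance before the reader is created.
        factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> chunks = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.CHARACTERS) {
                    chunks.add(reader.getText());
                }
            }
        } finally {
            reader.close();
        }
        return chunks;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<a>x &amp; y</a>";
        System.out.println("coalescing on:  " + textEvents(xml, true));
        System.out.println("coalescing off: " + textEvents(xml, false));
    }
}
```

With coalescing on, the list holds a single entry containing the full text, which is what code that reads just one CHARACTERS event per element expects.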