Java Sax Parser Split Calls to Characters()

JAVA SAX parser split calls to characters()

The parser calls the characters method more than once because it is allowed to under the SAX specification. This lets parsers stay fast and keep their memory footprint low. If you want a single string, create a new StringBuilder in startElement, append to it in characters, and process it in the endElement method.

SAX parsing and special characters

My guess is that you are treating each call to characters as delivering the complete text for a cat element. You should code your handler so that successive calls to characters accumulate the text, and you only capture it on the endElement event:

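import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CatHandler extends DefaultHandler {
    // Accumulates text across successive characters() calls
    private final StringBuilder chars = new StringBuilder();

    @Override
    public void startElement(String uri, String lName, String qName, Attributes a) {
        // qName is an empty string (not null) when qualified names are unavailable
        final String name = (qName == null || qName.isEmpty()) ? lName : qName;
        if ("cat".equals(name)) {
            chars.setLength(0); // start of a cat element: clear the buffer
        } // else . . .
    }

    @Override
    public void endElement(String uri, String lName, String qName) {
        final String name = (qName == null || qName.isEmpty()) ? lName : qName;
        if ("cat".equals(name)) {
            String catName = chars.toString(); // the text is complete only here
            // do something with cat name
        } // else . . .
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        chars.append(ch, start, length);
    }
}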
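
To try the accumulate-then-capture pattern end to end, here is a minimal sketch that runs it against an in-memory document. The class name CatNameDemo, the method parseCatNames, and the sample XML are made up for illustration, and it assumes the JDK's default, non-namespace-aware SAXParserFactory, so qName carries the element name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of the original answer.
class CatNameDemo {

    // Collects the full text of every <cat> element, however the parser chunks it.
    static List<String> parseCatNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder chars = new StringBuilder();

            @Override
            public void startElement(String uri, String lName, String qName, Attributes a) {
                if ("cat".equals(qName)) {
                    chars.setLength(0); // fresh buffer for this element
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                chars.append(ch, start, length); // may run several times per element
            }

            @Override
            public void endElement(String uri, String lName, String qName) {
                if ("cat".equals(qName)) {
                    names.add(chars.toString()); // text is complete only at this point
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(
                parseCatNames("<cats><cat>Tom &amp; Jerry</cat><cat>Felix</cat></cats>"));
    }
}
```

Because the name is read only in endElement, the result does not depend on how many times the parser chose to call characters() for the text around the entity reference.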

Java SaxParser trim the string after &

This is a lesson everyone has to learn when using SAX: the parser can break up text nodes and report the content in multiple calls to characters(), and it's the application's job to reassemble it (e.g. by using a StringBuilder). It's very common for parsers to break the text at any point where it would otherwise have to shunt characters around in memory, e.g. where entity references occur or where it hits an I/O buffer boundary.

It was designed this way to make SAX parsers super-efficient by minimizing text copying, but I suspect there's no real benefit, because the text copying just has to be done by the application instead.

Don't try and second-guess the parser as @DavidWallace suggests. The parser is allowed to break the text up any way it likes, and your application should cater for that.
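
If you want to see the splitting for yourself, a small sketch like the following records every chunk handed to characters(). ChunkRecorder and the sample document are invented for this illustration. How many chunks you get is up to the parser, so the count may differ between implementations and versions; only the joined text is something you can rely on.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical diagnostic handler: keeps each characters() chunk separately.
class ChunkRecorder extends DefaultHandler {
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy the slice; the parser may reuse the ch array after this call returns.
        chunks.add(new String(ch, start, length));
    }

    // Parses the given XML string and returns the chunks in the order they arrived.
    static List<String> record(String xml) throws Exception {
        ChunkRecorder recorder = new ChunkRecorder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), recorder);
        return recorder.chunks;
    }

    public static void main(String[] args) throws Exception {
        List<String> chunks = record("<dish>Fish &amp; Chips</dish>");
        System.out.println(chunks.size() + " call(s): " + chunks);
        System.out.println("joined: " + String.join("", chunks));
    }
}
```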

Sax characters breaking element apart

The parser is allowed to call the ContentHandler characters method multiple times for each run of element text; a split does not necessarily mean it found a line terminator. The Java tutorial on SAX has a short explanation of the characters method:

Parsers are not required to return any particular number of characters at one time. A parser can return anything from a single character at a time up to several thousand and still be a standard-conforming implementation. So if your application needs to process the characters it sees, it is wise to have the characters() method accumulate the characters in a java.lang.StringBuffer and operate on them only when you are sure that all of them have been found.

Also this Javaworld article has good explanations and examples.

Parse value containing special character / gives wrong output using SAX parser

Just change the characters() method so that it appends to tmpValue:

@Override
public void characters(char[] buffer, int start, int length) {
    tmpValue += new String(buffer, start, length);
}

And reset tmpValue on the last line of the endElement method:

public void endElement(String s, String s1, String element) throws SAXException {
    if (OrgDataPartitonObj != null
            && "fs:FinancialStatementLineItemDataItem".equals(OrgDataPartitonObj.getType())) {
        FinancialStatementLineItemParser.getEndElementFinancialStatementLineItemParser(
                financialStatementLineItemObj, element, tmpValue);
    }
    tmpValue = ""; // reset so the next element starts with an empty buffer
}

Sax Parser - Unable to split XML file to specified size

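You should call setContentHandler before calling parse. The parser delivers events only while parse is running, so a handler registered afterwards never sees them.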
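
A minimal sketch of that ordering, using the standard XMLReader API. The class HandlerOrderDemo, the method countElements, and the element-counting handler are placeholders made up for this example; they are not the code from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class showing the registration order.
class HandlerOrderDemo {

    // Counts start tags to show that the handler actually receives events.
    static int countElements(String xml) throws Exception {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String lName, String qName, Attributes a) {
                count[0]++;
            }
        };

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);                    // 1. register the handler
        reader.parse(new InputSource(new StringReader(xml))); // 2. then parse
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<a><b/><c/></a>"));
    }
}
```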


