What HTML parsing libraries do you recommend in Java?
NekoHTML, TagSoup, and JTidy will allow you to parse HTML and then process it with XML tools, like XPath.
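For instance, here is a minimal sketch using NekoHTML's lenient DOM parser together with the standard javax.xml.xpath API; the HTML snippet is made up for illustration:

import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NekoXPathSketch {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>Hello <b>world</b></body></html>"; // malformed on purpose
        DOMParser parser = new DOMParser(); // NekoHTML's forgiving HTML-to-DOM parser
        parser.parse(new InputSource(new StringReader(html)));
        Document doc = parser.getDocument();

        // NekoHTML uppercases element names by default, hence "//P" rather than "//p".
        XPath xpath = XPathFactory.newInstance().newXPath();
        System.out.println(xpath.evaluate("//P", doc)); // prints "Hello world"
    }
}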
Which HTML DOM parser library for Java is best?
JSoup is fantastic. Highly recommended.
How can I efficiently parse HTML with Java?
Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a"); // all <a> elements (empty in this snippet)
Element head = doc.select("head").first();
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!
What are the pros and cons of the leading Java HTML parsers?
General
Almost all known HTML parsers implement the W3C DOM API (part of the JAXP API, the Java API for XML Processing) and give you an org.w3c.dom.Document back which is ready for direct use by the JAXP API. The major differences are usually found in the features of the parser in question. Most parsers are to a certain degree forgiving and lenient with non-wellformed HTML ("tagsoup"), like JTidy, NekoHTML, TagSoup and HtmlCleaner. You usually use this kind of HTML parser to "tidy" the HTML source (e.g. replacing the HTML-valid <br> by an XML-valid <br />), so that you can traverse it "the usual way" using the W3C DOM and JAXP APIs.
The only ones which jump out are HtmlUnit and Jsoup.
HtmlUnit
HtmlUnit provides an API entirely of its own which lets you act like a web browser programmatically: enter form values, click elements, invoke JavaScript, et cetera. It's much more than an HTML parser alone; it's a real "GUI-less web browser" and HTML unit-testing tool.
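For illustration, a minimal sketch of driving a page with HtmlUnit, assuming a recent version where WebClient is AutoCloseable; the URL and the form field/button names here are hypothetical:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitSketch {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // HtmlUnit fetches the page and runs its JavaScript, like a real browser would.
            HtmlPage page = webClient.getPage("http://example.com/search"); // hypothetical URL
            HtmlForm form = page.getForms().get(0);
            form.getInputByName("q").setValueAttribute("html parsers");     // hypothetical field name
            HtmlPage result = form.getInputByName("submit").click();        // click() returns the next page
            System.out.println(result.getTitleText());
        }
    }
}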
Jsoup
Jsoup also provides an API entirely of its own. It lets you select elements using jQuery-like CSS selectors and provides a slick API to traverse the HTML DOM tree to get the elements of interest.
Particularly the traversing of the HTML DOM tree is the major strength of Jsoup. Anyone who has worked with org.w3c.dom.Document knows what a pain it is to traverse the DOM using the verbose NodeList and Node APIs. True, XPath makes life easier, but still, it's another learning curve, and the code can end up just as verbose.
Here's an example which uses a "plain" W3C DOM parser like JTidy in combination with XPath to extract the first paragraph of your question and the names of all answerers (I am using XPath since, without it, the code needed to gather the information of interest would grow ten times as big, short of writing utility/helper methods).
import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

String url = "http://stackoverflow.com/questions/3152138";
Document document = new Tidy().parseDOM(new URL(url).openStream(), null);
XPath xpath = XPathFactory.newInstance().newXPath();
Node question = (Node) xpath.compile("//*[@id='question']//*[contains(@class,'post-text')]//p[1]").evaluate(document, XPathConstants.NODE);
System.out.println("Question: " + question.getFirstChild().getNodeValue());
NodeList answerers = (NodeList) xpath.compile("//*[@id='answers']//*[contains(@class,'user-details')]//a[1]").evaluate(document, XPathConstants.NODESET);
for (int i = 0; i < answerers.getLength(); i++) {
    System.out.println("Answerer: " + answerers.item(i).getFirstChild().getNodeValue());
}
And here's an example how to do exactly the same with Jsoup:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

String url = "http://stackoverflow.com/questions/3152138";
Document document = Jsoup.connect(url).get();
Element question = document.select("#question .post-text p").first();
System.out.println("Question: " + question.text());
Elements answerers = document.select("#answers .user-details a");
for (Element answerer : answerers) {
    System.out.println("Answerer: " + answerer.text());
}
Do you see the difference? It's not only less code; Jsoup is also relatively easy to grasp if you already have moderate experience with CSS selectors (e.g. from developing websites and/or using jQuery).
Summary
The pros and cons of each should be clear enough now. If you just want to use the standard JAXP API to traverse the document, then go for the first-mentioned group of parsers. There are quite a lot of them. Which one to choose depends on the features it provides (how is HTML cleaning made easy for you? are there listeners/interceptors and tag-specific cleaners?) and the robustness of the library (how often is it updated/maintained/fixed?). If you want to unit test the HTML, then HtmlUnit is the way to go. If you want to extract specific data from the HTML (which is more often than not the real-world requirement), then Jsoup is the way to go.
How to parse and modify an HTML file in Java
If you want to modify a web page and return the modified content, I think the best way is to use an XSL transformation.
http://en.wikipedia.org/wiki/XSLT
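For illustration, a minimal sketch using the standard JAXP transform API (javax.xml.transform); the file names are hypothetical, and note that XSLT requires well-formed input, so real-world HTML usually has to be tidied into XHTML first (e.g. with JTidy):

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XsltSketch {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // "modify.xsl" and "page.xhtml" are hypothetical file names.
        Transformer transformer = factory.newTransformer(new StreamSource("modify.xsl"));
        transformer.transform(new StreamSource("page.xhtml"), new StreamResult("modified.xhtml"));
    }
}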
Java HTML parser to extract specific data?
Personally, I favour JSoup over JTidy. It has CSS-like selectors, and the documentation is much better, imho. With JSoup, you can easily extract those values with the following lines:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect("your_url").get();
Elements spans = doc.select("span[itemprop]");
for (Element span : spans) {
    System.out.println(span.text()); // will print 234 and 690
}
Text extraction with Java HTML parsers
I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results.
I would recommend CyberNekoHtml. CyberNekoHtml can do all of the things you mentioned. It is very easy to extract a list of all elements and their attributes, for example. It would be possible to traverse the DOM tree building each element back into HTML if you wanted to reconstruct the page.
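As a minimal sketch of that element/attribute extraction, assuming NekoHTML (CyberNeko) is on the classpath; the HTML snippet is made up:

import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ListElementsSketch {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href='/x' id='link'>x</a></body></html>";
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource(new StringReader(html)));
        Document doc = parser.getDocument();

        // "*" matches every element in the document.
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element el = (Element) all.item(i);
            System.out.print(el.getTagName());
            NamedNodeMap attrs = el.getAttributes();
            for (int j = 0; j < attrs.getLength(); j++) {
                System.out.print(" " + attrs.item(j).getNodeName() + "=" + attrs.item(j).getNodeValue());
            }
            System.out.println();
        }
    }
}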
There's a list of open-source Java HTML parsers here:
http://java-source.net/open-source/html-parsers
Is there a Standard Java SE HTML Parser? If so, why use non-standard ones?
The JDK has a built-in HTML parser (in javax.swing.text.html) that supports HTML 3.2 or so. It supports parsing of basic text-formatting tags and forms.
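For example, a minimal sketch of that built-in Swing callback parser (the HTML snippet is made up):

import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingParserSketch {
    public static void main(String[] args) throws Exception {
        String html = "<html><body><a href='/x'>link</a></body></html>";
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    System.out.println("href = " + attrs.getAttribute(HTML.Attribute.HREF));
                }
            }
        };
        // The boolean tells the parser to ignore any charset declared in the document.
        new ParserDelegator().parse(new StringReader(html), callback, true);
    }
}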
The reason to use other, third-party parsers is the requirement to support "real" HTML pages: DHTML, JavaScript, etc.
JSoup is one of the popular parsers that can do the job. For more information about other implementations, take a look at the following discussion:
Pure Java HTML viewer/renderer for use in a Scrollable pane