HTML/XML Parser for Java

Parsing XML file containing HTML entities in Java without changing the XML

I would use a library like Jsoup for this purpose. I tested the snippet below and it works: Jsoup's XML parser mode accepts input that a strict XML parser would reject. Jsoup can be downloaded here: http://jsoup.org/download

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public static void main(String[] args) {
    String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><foo>"
            + "<bar>Some text — invalid!</bar></foo>";
    // Parse in XML mode; Jsoup is lenient about content a strict XML parser rejects.
    Document doc = Jsoup.parse(html, "", Parser.xmlParser());

    for (Element e : doc.select("bar")) {
        System.out.println(e);
    }
}

Result:

<bar>
Some text — invalid!
</bar>

Instructions for loading a document from a file can be found here:

http://jsoup.org/cookbook/input/load-document-from-file
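As a sketch of the file-loading case (assuming the Jsoup jar is on the classpath; `input.xml` and the `bar` element are placeholders, and the file is read into a String first so the same `Parser.xmlParser()` overload as above can be reused):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;

public class LoadFromFile {
    // Read an XML file and return the combined text of its <bar> elements.
    static String barText(String path) throws IOException {
        String xml = new String(
                Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        return doc.select("bar").text();
    }

    public static void main(String[] args) throws IOException {
        // "input.xml" is a placeholder path; replace it with your own file.
        System.out.println(barText("input.xml"));
    }
}
```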

HTML/XML Parser for Java

Apache Tika is a strong choice. Apache has spun many sub-projects out of its existing projects and made them independent; Tika is one of them, having previously been a component of Apache Lucene. Backed by Apache and proven inside the widely used Lucene project, it is a solid option, and it is open source.

A brief introduction from Apache Tika web site:

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

And the supported formats are:

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image formats
Video formats
Java class files and archives
The mbox format
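As a sketch of how little code extraction takes (this assumes the `tika-core` and `tika-parsers` jars are on the classpath, and `page.html` is a placeholder file name; `Tika` is the library's convenience facade):

```java
import java.io.File;

import org.apache.tika.Tika;

public class TikaExample {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        File input = new File("page.html"); // placeholder path
        // Detect the file type (HTML, PDF, DOCX, ...) from its content.
        String type = tika.detect(input);
        // Extract the plain-text content, whatever the format turned out to be.
        String text = tika.parseToString(input);
        System.out.println(type);
        System.out.println(text);
    }
}
```

The same two calls work unchanged for any of the formats listed above, which is the main appeal of the facade.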

Parse xml using java and keep html tags

One option is to use an XML CDATA section:

<results>
  <result>
    <news><![CDATA[
      <ul><li><p>as part of its growth plan,</p></li><li><p>in a bid to achieve the target</p></li><li><p>it is pointed out that most of ccl's production came from opencast mines and only 2 mt from underground (ug) mines. ccl is now trying to increase the share underground production. the board of ccl has, thus, approved the introduction of continuous mine in chiru ug at a cost of about rs 145 crore to raise this mine's production from 2 mt to 8 mt per annum.</p></li><li><p>mr ritolia said that.</p></li></ul>
    ]]></news>
  </result>
</results>

Then your parser will not treat the HTML tags as XML, and you get access to the raw content of the element. The other option is to escape the HTML, i.e., convert every < into &lt;, every > into &gt;, every & into &amp;, and so on. For more on escaping see here
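A minimal sketch of the CDATA approach using only the JDK's built-in DOM parser (the `results`/`result`/`news` element names mirror the example above; no third-party library is needed):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class CdataExample {
    // Return the raw HTML stored inside the first <news> element.
    static String newsHtml(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        // getTextContent() returns the CDATA body verbatim: the HTML inside
        // is treated as opaque text, not parsed as child elements.
        return doc.getElementsByTagName("news").item(0).getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<results><result><news><![CDATA[<ul><li><p>as part of its "
                + "growth plan,</p></li></ul>]]></news></result></results>";
        System.out.println(newsHtml(xml));
    }
}
```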

Best way for XML parsing in java

This question is excessively broad, so I had to downvote it. I have no idea what the circumstances of your XML interpretation are, so this answer will be limited.

However, I can tell you that classically SAX and JAXP have been used; they don't strictly require a DTD, and with some clever enumerations you can parse just about anything.
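As a minimal SAX sketch using the JDK's built-in JAXP implementation (the `bar` element name is just an illustration; SAX streams the document and fires callbacks, so no DTD is required):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExample {
    // Collect the text content of every <bar> element in the document.
    static List<String> barTexts(String xml) throws Exception {
        List<String> result = new ArrayList<>();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                new DefaultHandler() {
                    private StringBuilder current;

                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes attrs) {
                        if (qName.equals("bar")) current = new StringBuilder();
                    }

                    @Override
                    public void characters(char[] ch, int start, int length) {
                        if (current != null) current.append(ch, start, length);
                    }

                    @Override
                    public void endElement(String uri, String localName, String qName) {
                        if (qName.equals("bar")) {
                            result.add(current.toString());
                            current = null;
                        }
                    }
                });
        return result;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(barTexts("<foo><bar>one</bar><bar>two</bar></foo>"));
    }
}
```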

JSoup, as mentioned by Rafael Cardoso, is generally an HTML parser, not an HTML-in-XML parser; but it may work for you. If all you're looking for are the attributes to a specific tag, along with (presumably) associated data, then the JDK may have all that you need.

We also have JDOM, DOM4J, and a bunch of others, all of which have their strengths and weaknesses. This question, thus, isn't particularly constructive, and is basically a duplicate of this one, which you might take a look at.

I recommend looking at this tutorial, which explains how to build a parser with the standard library.

In the future, if possible please specify the conditions that your program is operating under, provide us with an objective and clearly defined question, and research Stack Overflow a little more thoroughly first. All the same, I hope this does it for you. Good luck!

Parsing an html document using an XML-parser

You can try parsing an HTML file using a XML parser, but it’s likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don’t understand.

  • elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
  • elements that don’t need end tags; e.g., <p> <dt> <li> (their end tags can be implied)
  • elements that can contain unescaped markup "<" characters; e.g., style, textarea, title, script; <script> if (a < b) … </script>, <title>Using the "<" operator</title>
  • attributes with unquoted values; for example, <meta charset=utf-8>
  • attributes that are empty, with no separate value given at all; e.g., <input disabled>

XML parsers will fail to parse any HTML document that uses any of those features.

HTML parsers, on the other hand, will basically never fail no matter what a document contains.
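To see the failure concretely, a sketch using only the JDK's DOM parser (the markup is a tiny fragment with a void `<br>` element, the first case in the list above):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.SAXException;

public class XmlVsHtml {
    // Return whether a strict XML parser accepts the given markup.
    static boolean parsesAsXml(String markup) throws Exception {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(
                    new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // A strict XML parser rejects HTML-only syntax such as <br>
            // without a matching end tag or a self-closing slash.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parsesAsXml("<p>one<br>two</p>"));   // false: void element
        System.out.println(parsesAsXml("<p>one<br/>two</p>"));  // true: well-formed XML
    }
}
```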


All that said, there’s also been work done toward developing a new type of XML parsing: so-called XML5 parsing, capable of handling things like empty and unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.



The intended use is to make an HTML parser, that is part of a web crawler application

If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.

These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:

  • parse5 (node.js/JavaScript)
  • html5lib (python)
  • html5ever (rust)
  • validator.nu html5 parser (java)
  • gumbo (c, with bindings for ruby, objective c, c++, php, c#, perl, lua, D, julia…)



