How to Efficiently Parse HTML With Java

How can I efficiently parse HTML with Java?

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

Parse HTML with Java

You can use the library JSoup.

Here is the link http://jsoup.org/

It is very simple to use. Here a simple example.

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

How to parse HTML with java properly?

Solved.
I didn't save changes. Need to add code befire "downloadImage()"

    int i = 0;
File input = new File(result_doc);
Document doc = Jsoup.parse(input, "UTF-8");
for( Element img : doc.select("img[src]") ) {
img.attr("src",array[i]); // set attribute 'src' to 'your-source-here'
++i;
}

try {
String strmb = doc.outerHtml();
bw = new BufferedWriter(new FileWriter(result_doc));
bw.write(strmb);
bw.close();
}
catch (Exception ex) {
System.out.println("Program stopped. The problem is " + "\"" +
ex.getMessage()+"\"");
}

How to correctly parse HTML in Java

So after a little playing around with the site I came up with a solution.

Now the site uses API responses to get the prices for each item, this is why you are not getting the prices in your HTML that you are receiving from Jsoup. Unfortunately there's a little more code than first expected, and you'll have to do some working out on how it should know which product Id to use instead of the hardcoded value. However, other than that the following code should work in your case.

I've included comments that hopefully explain each step, and I recommend taking a look at the API response, as there maybe some other data you require, in fact this maybe the same with the product details and description, as further data will need to be parsed out of elementById field.

Good luck and let me know if you need any further help!

import org.json.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Main
{
final String productID = "8513070";
final String productURL = "http://www.asos.com/prd/";
final Product product = new Product();

public static void main( String[] args )
{
new Main();
}

private Main()
{
getProductDetails( productURL, productID );
System.out.println( "ID: " + product.productID + ", Name: " + product.productName + ", Price: " + product.productPrice );
}

private void getProductDetails( String url, String productID )
{
try
{
// Append the product url and the product id to retrieve the product HTML
final String appendedURL = url + productID;

// Using Jsoup we'll connect to the url and get the HTML
Document document = Jsoup.connect( appendedURL ).get();
// We parse the HTML only looking for the product section
Element elementById = document.getElementById( "asos-product" );
// To simply get the title we look for the H1 tag
Elements h1 = elementById.getElementsByTag( "h1" );

// Because more than one H1 tag is returned we only want the tag that isn't empty
if ( !h1.text().isEmpty() )
{
// Add all data to Product object
product.productID = productID;
product.productName = h1.text().trim();
product.productPrice = getProductPrice(productID);
}
}
catch ( IOException e )
{
e.printStackTrace();
}
}

private String getProductPrice( String productID )
{
try
{
// Append the api url and the product id to retrieve the product price JSON document
final String apiURL = "http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=" + productID + "&store=COM";
// Using Jsoup again we connect to the URL ignoring the content type and retrieve the body
String jsonDoc = Jsoup.connect( apiURL ).ignoreContentType( true ).execute().body();

// As its JSON we want to parse the JSONArray until we get to the current price and return it.
JSONArray jsonArray = new JSONArray( jsonDoc );
JSONObject currentProductPriceObj = jsonArray
.getJSONObject( 0 )
.getJSONObject( "productPrice" )
.getJSONObject( "current" );
return currentProductPriceObj.getString( "text" );
}
catch ( IOException e )
{
e.printStackTrace();
}

return "";
}

// Simple Product object to store the data
class Product
{
String productID;
String productName;
String productPrice;
}
}

Oh, and you'll also need org.json for parse the JSON response from the API.

Efficient way to parse HTML dump found in the form of string

If you have used Jsoup and it worked yet, you should continue using it.

Document doc = Jsoup.parse("<html>...");

should do.

see: The API

Parsing huge amount of HTML with java

I think you should look for html parsers from this page :

Comparison of HTML parsers

Creating a script might not be a good idea. You may have inline css, javascript, escape quotes already. It will be a huge amount of pain to do this correctly.Previously, I had tried writing a script but found it cumbersome.Finally, I tried with html parsers and it worked like a charm!



Related Topics



Leave a reply



Submit