Parse Web Site HTML With Java

Parse Web Site HTML with Java

There is a much easier way to do this: use jsoup. With jsoup you can do things like:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

Or if you want the body:

Elements body = doc.select("body");

Or if you want all links:

Elements links = doc.select("body a");

You no longer need to open connections or handle streams yourself. If you have ever used jQuery, the selector syntax will feel very familiar.
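To try the selector calls above without hitting the network, a minimal sketch can parse an inline HTML string instead. The markup, the base URL http://example.com/ and the names SelectorSketch, linkUrls and headline are made up for illustration; only Jsoup.parse, select, text and attr come from jsoup itself.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

class SelectorSketch {

    // Hypothetical page content standing in for a real site.
    static final String HTML =
            "<html><body>"
            + "<div id=\"news\"><b><a href=\"/story-1\">First headline</a></b></div>"
            + "<p><a href=\"http://example.org/other\">Other link</a></p>"
            + "</body></html>";

    // Returns the absolute URL of every link inside <body>.
    static List<String> linkUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : doc.select("body a")) {
            // "abs:href" resolves relative hrefs against the base URI.
            urls.add(link.attr("abs:href"));
        }
        return urls;
    }

    // Returns the text of links nested in <b> under the element with id "news".
    static String headline(String html) {
        Document doc = Jsoup.parse(html);
        Elements headlines = doc.select("#news b a");
        return headlines.text();
    }

    public static void main(String[] args) {
        System.out.println(headline(HTML));
        System.out.println(linkUrls(HTML, "http://example.com/"));
    }
}
```

Swapping Jsoup.parse(html, baseUri) for Jsoup.connect(url).get() gives the same Document type, so the selector code stays unchanged when moving to a live page.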

How to correctly parse HTML in Java

So after a little playing around with the site I came up with a solution.

The site fetches the price of each item through a separate API call, which is why the prices are missing from the HTML you receive through jsoup. Unfortunately that takes a little more code than first expected, and you'll have to work out how to determine the product ID instead of using the hardcoded value. Other than that, the following code should work in your case.

I've included comments that hopefully explain each step. I recommend taking a look at the API response, as there may be other data you require. The same may apply to the product details and description, since further data will need to be parsed out of the elementById element.

Good luck and let me know if you need any further help!

import org.json.*;
import org.jsoup.Jsoup;
import org.jsoup.nodes.*;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Main
{
    final String productID = "8513070";
    final String productURL = "http://www.asos.com/prd/";
    final Product product = new Product();

    public static void main( String[] args )
    {
        new Main();
    }

    private Main()
    {
        getProductDetails( productURL, productID );
        System.out.println( "ID: " + product.productID + ", Name: " + product.productName + ", Price: " + product.productPrice );
    }

    private void getProductDetails( String url, String productID )
    {
        try
        {
            // Append the product url and the product id to retrieve the product HTML
            final String appendedURL = url + productID;

            // Using Jsoup we'll connect to the url and get the HTML
            Document document = Jsoup.connect( appendedURL ).get();
            // We parse the HTML only looking for the product section
            Element elementById = document.getElementById( "asos-product" );
            // getElementById returns null when the section is missing, so stop here in that case
            if ( elementById == null )
            {
                return;
            }
            // To simply get the title we look for the H1 tag
            Elements h1 = elementById.getElementsByTag( "h1" );

            // More than one H1 tag may be returned, so only continue if their combined text isn't empty
            if ( !h1.text().isEmpty() )
            {
                // Add all data to Product object
                product.productID = productID;
                product.productName = h1.text().trim();
                product.productPrice = getProductPrice( productID );
            }
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
    }

    private String getProductPrice( String productID )
    {
        try
        {
            // Append the api url and the product id to retrieve the product price JSON document
            final String apiURL = "http://www.asos.com/api/product/catalogue/v2/stockprice?productIds=" + productID + "&store=COM";
            // Using Jsoup again we connect to the URL ignoring the content type and retrieve the body
            String jsonDoc = Jsoup.connect( apiURL ).ignoreContentType( true ).execute().body();

            // As it's JSON we walk the JSONArray until we get to the current price and return it.
            JSONArray jsonArray = new JSONArray( jsonDoc );
            JSONObject currentProductPriceObj = jsonArray
                    .getJSONObject( 0 )
                    .getJSONObject( "productPrice" )
                    .getJSONObject( "current" );
            return currentProductPriceObj.getString( "text" );
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }

        return "";
    }

    // Simple Product object to store the data
    class Product
    {
        String productID;
        String productName;
        String productPrice;
    }
}

Oh, and you'll also need the org.json library to parse the JSON response from the API.
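One possible way to avoid the hardcoded product ID is to pull it out of the product page URL. The sketch below assumes URLs of the form http://www.asos.com/prd/<digits>, which is the only shape the code above uses; other URL layouts on the site are not covered, and ProductIdSketch and extractProductId are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProductIdSketch {

    // Assumption: the ID is the run of digits directly after "/prd/".
    private static final Pattern PRD_ID = Pattern.compile("/prd/(\\d+)");

    // Returns the product ID if the URL matches the assumed shape, otherwise an empty Optional.
    static Optional<String> extractProductId(String url) {
        Matcher m = PRD_ID.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(extractProductId("http://www.asos.com/prd/8513070").orElse("no id found"));
    }
}
```

Returning an Optional keeps the "no ID in this URL" case explicit, so the caller can skip the price lookup instead of requesting a malformed API URL.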

How to parse a specific part of web-page in java?

You can use jsoup to parse your HTML into a Document and then use jsoup's DOM methods to navigate it.

String yourHtml = "<someHtml/>";
Document doc = Jsoup.parseBodyFragment(yourHtml);

// getElementsByTag returns all matches; first() is null if there is no <tbody>
Element table = doc.getElementsByTag("tbody").first();
if (table != null) {
    Elements rows = table.getElementsByTag("tr");
    for (Element row : rows) {
        for (Element cell : row.getElementsByTag("td")) {
            String content = cell.text();
        }
    }
}
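A self-contained variant of the loop above, with real table markup in place of the <someHtml/> placeholder, might look like the following. The table contents and the names TableSketch and readRows are invented for the example; it uses select("tr") rather than looking up <tbody> first, which is a simplification and not the only way to do it.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class TableSketch {

    // Collects the text of every <td>, grouped by the <tr> it belongs to.
    static List<List<String>> readRows(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        List<List<String>> rows = new ArrayList<>();
        for (Element row : doc.select("tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.select("td")) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample data.
        String html = "<table><tbody>"
                + "<tr><td>Alice</td><td>30</td></tr>"
                + "<tr><td>Bob</td><td>25</td></tr>"
                + "</tbody></table>";
        System.out.println(readRows(html));
    }
}
```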

Getting Content From a Website with Java

For the temperature try this:

String url = "http://weather.gc.ca/city/pages/on-4_metric_e.html";
Document document = Jsoup.connect(url).get();

String temp = document.select("p").get(1).text();
System.out.println("Temperature: " + temp);

For formulating the CSS queries refer to the syntax sheet: http://jsoup.org/cookbook/extracting-data/selector-syntax

Also try: http://try.jsoup.org/, great for testing!
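Picking an element by position, as get(1) does above, throws an IndexOutOfBoundsException when the page has fewer paragraphs than expected. A small guard like the one sketched here avoids that; the markup is made up and does not mirror the real weather page, and NthParagraphSketch and paragraphAt are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

import java.util.Optional;

class NthParagraphSketch {

    // Returns the text of the <p> at the given zero-based index, if it exists.
    static Optional<String> paragraphAt(Document doc, int index) {
        Elements paragraphs = doc.select("p");
        if (index < 0 || index >= paragraphs.size()) {
            return Optional.empty();
        }
        return Optional.of(paragraphs.get(index).text());
    }

    public static void main(String[] args) {
        // Made-up markup for illustration only.
        Document doc = Jsoup.parse("<p>Observed at: Toronto</p><p>Temperature: 12</p>");
        System.out.println(paragraphAt(doc, 1).orElse("n/a"));
        System.out.println(paragraphAt(doc, 5).orElse("n/a"));
    }
}
```

Where the page offers one, a selector based on an id or class is usually more robust than a position, since it survives paragraphs being added or removed.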

How can I efficiently parse HTML with Java?

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();

See the Selector javadoc for more info.

This is a new project, so any ideas for improvement are very welcome!

Parse HTML with Java

You can use the jsoup library.

Here is the link: http://jsoup.org/

It is very simple to use. Here is a simple example:

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
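Note that the fragment above never closes its <div>. As a rough illustration of what parseBodyFragment does with it, the sketch below queries the resulting tree; FragmentSketch and paragraphText are hypothetical names chosen for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FragmentSketch {

    // Parses a fragment and returns the text of <p> elements that are direct children of a <div>.
    static String paragraphText(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body().select("div > p").text();
    }

    public static void main(String[] args) {
        // The <div> is left open; the parser closes it while building the tree.
        String html = "<div><p>Lorem ipsum.</p>";
        System.out.println(paragraphText(html));
        // Number of top-level elements placed inside <body>.
        System.out.println(Jsoup.parseBodyFragment(html).body().children().size());
    }
}
```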

Parse HTML Web Page

The scorer information is loaded by an AJAX request that fires when you click the score link. You'll have to make that request yourself and parse the result.

For instance, take the first game (Manchester United 1 - 2 Manchester City). Its tag is:

<a data-y="r1-1229442" data-v="england-premierleague-manchesterunited-manchestercity-13april2013" style="cursor: pointer;">1 - 2</a>

Take data-y, remove the leading r, and make a GET request to:

http://www.skore.com/en/scores/soccer/id/<DATA-Y_HERE>?fmt=html

Such as: http://www.skore.com/en/scores/soccer/id/1-1229442?fmt=html. And then parse the result.
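The URL construction described above can be sketched in isolation as follows. ScoreUrlSketch and scoreUrl are hypothetical names, and the method assumes the data-y value carries at most one leading r, as in the example tag.

```java
class ScoreUrlSketch {

    private static final String BASE = "http://www.skore.com/en/scores/soccer/id/";

    // Drops a single leading "r" from the data-y value and builds the detail URL.
    static String scoreUrl(String dataY) {
        String id = dataY.startsWith("r") ? dataY.substring(1) : dataY;
        return BASE + id + "?fmt=html";
    }

    public static void main(String[] args) {
        System.out.println(scoreUrl("r1-1229442"));
    }
}
```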

Full working example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseScore {

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.skore.com/en/soccer/england/premier-league/results/all/").get();
        System.out.println("title: " + doc.title());

        Elements dls = doc.select("dl");

        for (Element link : dls) {
            String id = link.attr("id");

            /* check whether this <dl> is a game: its id starts with "rid" */
            if (id != null && id.length() > 3 && "rid".equals(id.substring(0, 3))) {

                System.out.println("Game: " + link.text());

                String idNoRID = id.replace("rid", "");
                // String idNoRID = "1-1229442";
                String scoreURL = "http://www.skore.com/en/scores/soccer/id/" + idNoRID + "?fmt=html";
                Document docScore = Jsoup.connect(scoreURL).get();

                Elements trs = docScore.select("tr");
                for (Element tr : trs) {
                    Elements spanGoal = tr.select("span.goal");
                    /* only enter if there is a goal */
                    if (spanGoal.size() > 0) {
                        Elements score = tr.select("td.score");
                        String playerName = spanGoal.get(0).text();
                        String currentScore = score.get(0).text();
                        System.out.println("\t\tGOAL: " + currentScore + ": " + playerName);
                    }

                    Elements spanGoalPenalty = tr.select("span.goalpenalty");
                    /* only enter if there is a penalty goal */
                    if (spanGoalPenalty.size() > 0) {
                        Elements score = tr.select("td.score");
                        String playerName = spanGoalPenalty.get(0).text();
                        String currentScore = score.get(0).text();
                        System.out.println("\t\tGOAL: " + currentScore + ": " + playerName + " (penalty)");
                    }

                    Elements spanGoalOwn = tr.select("span.goalown");
                    /* only enter if there is an own goal */
                    if (spanGoalOwn.size() > 0) {
                        Elements score = tr.select("td.score");
                        String playerName = spanGoalOwn.get(0).text();
                        String currentScore = score.get(0).text();
                        System.out.println("\t\tGOAL: " + currentScore + ": " + playerName + " (own goal)");
                    }
                }
            }
        }
    }
}

Output:

title: Skore : Premier League, England - Soccer Results (All)
Game: F T Arsenal 3 - 1 Norwich
        GOAL: 0 - 1: Michael Turner
        GOAL: 1 - 1: Mikel Arteta (penalty)
        GOAL: 2 - 1: Sébastien Bassong (own goal)
        GOAL: 3 - 1: Lukas Podolski
Game: F T Aston Villa 1 - 1 Fulham
        GOAL: 1 - 0: Charles N´Zogbia
        GOAL: 1 - 1: Fabian Delph (own goal)
Game: F T Everton 2 - 0 Queens Park Rangers
        GOAL: 1 - 0: Darron Gibson
        GOAL: 2 - 0: Victor Anichebe
Game: F T Reading 0 - 0 Liverpool
Game: F T Southampton 1 - 1 West Ham
        GOAL: 1 - 0: Gaston Ramirez
        GOAL: 1 - 1: Andrew Carroll
Game: F T Manchester United 1 - 2 Manchester City
        GOAL: 0 - 1: James Milner
...

jsoup 1.7.1 was used. If you are using Maven, add this to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.1</version>
</dependency>

Parsing HTML page containing JS in Java

Selenium's WebDriver is fantastic: http://docs.seleniumhq.org/docs/03_webdriver.jsp

See this answer for an example of what you are trying to do:
Using Selenium Web Driver to retrieve value of a HTML input


