Java HTML Parsing

How can I efficiently parse HTML with Java?

Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.

Its party trick is a CSS selector syntax to find elements, e.g.:

String html = "<html><head><title>First parse</title></head>"
        + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Elements links = doc.select("a");
Element head = doc.select("head").first();

See the Selector javadoc for more info.
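For example, a few more selector forms it supports (the element and class names below are purely illustrative):

Elements pngs = doc.select("img[src$=.png]");          // img elements whose src ends in .png
Element masthead = doc.select("div.masthead").first(); // div with class "masthead"
Elements resultLinks = doc.select("h3.r > a");         // direct a children of h3 with class "r"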

This is a new project, so any ideas for improvement are very welcome!

Parse HTML with Java

You can use the JSoup library.

Here is the link: http://jsoup.org/

It is very simple to use. Here is a simple example.

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();
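And pulling the content back out of the parsed fragment is just as simple; continuing the example above:

String text = body.select("p").first().text(); // "Lorem ipsum."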

Parsing HTML in Jsoup

Full answer: you can get the text outside of the tags via childNodes(), which gives you a List<Node>. Note that I'm selecting body because your HTML fragment doesn't have a parent element, and parsing an HTML fragment with jsoup adds <html> and <body> automatically.

If a Node contains only text, it is of type TextNode and you can get the content using toString().

Otherwise you can cast it to Element and get the text using element.text().

String str = "There are <span class='newStyle'> two </span> workers from the <span class='oldStyle'>Front of House</span>";
Document doc = Jsoup.parse(str);
Element body = doc.selectFirst("body");
List<Node> childNodes = body.childNodes();
for (int i = 0; i < childNodes.size(); i++) {
    Node node = childNodes.get(i); // reuse the list instead of re-querying body each iteration
    if (node instanceof TextNode) {
        System.out.println(i + " -> " + node.toString());
    } else {
        Element element = (Element) node;
        System.out.println(i + " -> " + element.text());
    }
}

Output:

0 -> 
There are
1 -> two
2 -> workers from the
3 -> Front of House

By the way: the leading line break before There are comes from toString(), which returns the node's outer HTML with the document's pretty-printing (and hence its whitespace) intact.
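One way to avoid it, assuming you only need the visible text: TextNode exposes text(), which normalises whitespace, so the TextNode branch of the loop can become:

if (node instanceof TextNode) {
    // text() normalises internal whitespace; trim() drops what's left at the edges
    System.out.println(i + " -> " + ((TextNode) node).text().trim());
}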

Parsing and updating HTML using jsoup

It's rather ironic that you are attempting to hack[1] some data out of an open-data website. There is surely an API!

The problem is that websites aren't static resources; they have JavaScript, and that JavaScript can fetch more data in response to, e.g., the user clicking a 'next page' button.

What you're doing is called 'scraping': using automated tools to query for data via a communication channel (namely: this website) that is definitely not meant for that. This website is not meant to be read by software; it's meant to be read with eyeballs. If someone decides to change the design of this page, for example, a previously working scraper would fail after the update.

You have, in broad strokes, 3 options:

Abort this plan, this is crazy

This data is surely open, and open data tends to come with APIs: things meant to be queried by software, not by eyeballs. Go look for it, and call the German government; I'm sure they'll help you out! If they've really embraced the REST principles of design, send an Accept header that includes e.g. application/json and application/xml and does not include text/html, and see if the site just responds with the data in JSON or XML format, as in the sketch below.
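To make that concrete, here's a minimal sketch of such a probe using Java 11's built-in HttpClient (the URL is a placeholder; point it at the endpoint you're investigating):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AcceptProbe {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.bundestag.de/services/opendata")) // placeholder URL
                // Ask for machine-readable formats only; deliberately no text/html.
                .header("Accept", "application/json, application/xml")
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        // If the server does content negotiation, Content-Type should not be text/html.
        System.out.println(response.headers().firstValue("Content-Type").orElse("?"));
        System.out.println(response.body().substring(0, Math.min(200, response.body().length())));
    }
}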

I strongly advise you to fully exhaust this option before moving on, as the next options are really bad: lots of work, and the code will be extremely fragile (any update to the site by the bundestag website folks will break it).

Use your browser's network inspection tools

In just about every browser there are 'dev tools'. For example, in Vivaldi, they're under the "Tools" menu, called "Developer tools". You can also usually right-click anywhere on a web page and pick 'Inspect', 'Inspector', or 'Development Tools'. Open that now, and find the 'network' tab. When you (re)load the page, you'll see all the resources it's loading (images, the HTML itself, CSS, the works). Look through it and find the interesting stuff. In this specific case, the loading of wahlperioden.json is of particular interest.

Let's try this out:

curl 'https://www.bundestag.de/static/appdata/filter/wahlperioden.json'

[{"value":"20","label":"WP 20: seit 2021"},{"value":"19","label":"WP 19: 2017 - 2021"},(rest omitted - there are a lot of these)]

That sounds useful, and as it's JSON you can just read this stuff with a JSON parser. No need to use JSoup (JSoup is great as a library, but it's a library you reach for when all other options have failed, because any code written with JSoup is fragile and complicated, simply because scraping sites is fragile and complicated).
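For example, a sketch using the Jackson library (just one JSON parser among many; any will do):

import java.net.URL;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class Wahlperioden {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode periods = mapper.readTree(
                new URL("https://www.bundestag.de/static/appdata/filter/wahlperioden.json"));
        for (JsonNode period : periods) {
            // Each entry has "value" and "label" fields, per the curl output above.
            System.out.println(period.get("value").asText() + ": " + period.get("label").asText());
        }
    }
}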

Then, click the buttons that load new data and check whether network traffic ensues. And so it does; you'll notice a call going out. I'm seeing this URL being loaded:

https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354?limit=10&noFilterSet=true&offset=10

The format is rather obvious: offset=10 means 'start from the 10th element' (as I just clicked 'next page'), and limit=10 means 'return no more than 10 elements'.

This HTML is also incredibly basic, which is great news, as that makes it easy to scrape. Just write a loop that keeps calling this URL, increasing the offset each time (first loop: no offset; second: offset=10; third: offset=20; keep going until the HTML you get back is blank, then you've got it all), as in the sketch below.
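Here's a sketch of that loop with jsoup; the stop condition ('the body text is blank') is an assumption based on the description above, so adjust it to what the endpoint actually returns:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchAll {
    public static void main(String[] args) throws Exception {
        String base = "https://www.bundestag.de/ajax/filterlist/de/services/opendata/866354-866354"
                + "?limit=10&noFilterSet=true";
        for (int offset = 0; ; offset += 10) {
            Document page = Jsoup.connect(base + "&offset=" + offset).get();
            String text = page.body().text().trim();
            if (text.isEmpty()) {
                break; // blank response: we've paged past the last element
            }
            System.out.println("offset " + offset + ": " + text);
        }
    }
}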

For future reference: Browser emulation

JavaScript can also generate entire HTML on its own; that is not something jsoup can ever do for you. The only way to obtain such HTML is to actually let the JavaScript do its work, which means you need an entire browser. Tools like Selenium will start a real browser but let you use JSoup-like constructs to retrieve information from the page (instead of what browsers usually do, which is transmit the rendered data to your eyeballs). This tends to always work, but it is incredibly complicated and quite slow: you're running an entire browser and really rendering the site, even if you can't see it; that's all happening under the hood.

Selenium isn't meant as a scraping tool; it's meant as a front-end testing tool. But you can use it to scrape stuff, and you will have to if the HTML is generated by script. Fortunately, you're lucky here and don't need it.
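For reference, a minimal Selenium sketch (assuming the selenium-java dependency and a matching browser driver installed; the URL and selector are placeholders):

import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class BrowserScrape {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // render without a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.org/some-js-rendered-page"); // placeholder URL
            // By this point the browser has run the page's JavaScript,
            // so the DOM contains the generated HTML.
            List<WebElement> rows = driver.findElements(By.cssSelector("table tr")); // placeholder selector
            for (WebElement row : rows) {
                System.out.println(row.getText());
            }
        } finally {
            driver.quit();
        }
    }
}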

Option 1 is vastly superior to option 2, and option 2 is vastly superior to option 3, at least for this case. Good luck!

[1] I'm using the definition of: Using a tool or site to accomplish something it was obviously not designed for. The sense of 'I bought half an ikea cupboard and half of an ikea bookshelf that are completely unrelated, and put them together anyway, look at how awesome this thingie is' - that sense of 'hack'. Not the sense of 'illegal'.

Parse HTML Web Page

The scorers' info is fetched by an AJAX request (which occurs when you click the score link). You'll have to make that request yourself and parse the result.

For instance, take the first game (Manchester United 1x2 Manchester City); its tag is:

<a data-y="r1-1229442" data-v="england-premierleague-manchesterunited-manchestercity-13april2013" style="cursor: pointer;">1 - 2</a>

Take data-y, remove the leading r, and make a GET request to:

http://www.skore.com/en/scores/soccer/id/<DATA-Y_HERE>?fmt=html

Such as: http://www.skore.com/en/scores/soccer/id/1-1229442?fmt=html. And then parse the result.

Full working example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseScore {

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.skore.com/en/soccer/england/premier-league/results/all/").get();
        System.out.println("title: " + doc.title());

        Elements dls = doc.select("dl");

        for (Element link : dls) {
            String id = link.attr("id");

            /* check if it is a game <dl> */
            if (id != null && id.startsWith("rid")) {

                System.out.println("Game: " + link.text());

                String idNoRID = id.replace("rid", "");
                // String idNoRID = "1-1229442";
                String scoreURL = "http://www.skore.com/en/scores/soccer/id/" + idNoRID + "?fmt=html";
                Document docScore = Jsoup.connect(scoreURL).get();

                Elements trs = docScore.select("tr");
                for (Element tr : trs) {
                    Elements spanGoal = tr.select("span.goal");
                    /* only enter if there is a goal */
                    if (spanGoal.size() > 0) {
                        Elements score = tr.select("td.score");
                        String playerName = spanGoal.get(0).text();
                        String currentScore = score.get(0).text();
                        System.out.println("\t\tGOAL: " + currentScore + ": " + playerName);
                    }

                    Elements spanGoalPenalty = tr.select("span.goalpenalty");
                    /* only enter if there is a penalty goal */
                    if (spanGoalPenalty.size() > 0) {
                        Elements score = tr.select("td.score");
                        String playerName = spanGoalPenalty.get(0).text();
                        String currentScore = score.get(0).text();
                        System.out.println("\t\tGOAL: " + currentScore + ": " + playerName + " (penalty)");
                    }

                    Elements spanGoalOwn = tr.select("span.goalown");
                    /* only enter if there is an own goal */
                    if (spanGoalOwn.size() > 0) {
                        Elements score = tr.select("td.score");
                        String playerName = spanGoalOwn.get(0).text();
                        String currentScore = score.get(0).text();
                        System.out.println("\t\tGOAL: " + currentScore + ": " + playerName + " (own goal)");
                    }
                }
            }
        }
    }
}

Output:

title: Skore : Premier League, England - Soccer Results (All)
Game: F T Arsenal 3 - 1 Norwich
GOAL: 0 - 1: Michael Turner
GOAL: 1 - 1: Mikel Arteta (penalty)
GOAL: 2 - 1: Sébastien Bassong (own goal)
GOAL: 3 - 1: Lukas Podolski
Game: F T Aston Villa 1 - 1 Fulham
GOAL: 1 - 0: Charles N´Zogbia
GOAL: 1 - 1: Fabian Delph (own goal)
Game: F T Everton 2 - 0 Queens Park Rangers
GOAL: 1 - 0: Darron Gibson
GOAL: 2 - 0: Victor Anichebe
Game: F T Reading 0 - 0 Liverpool
Game: F T Southampton 1 - 1 West Ham
GOAL: 1 - 0: Gaston Ramirez
GOAL: 1 - 1: Andrew Carroll
Game: F T Manchester United 1 - 2 Manchester City
GOAL: 0 - 1: James Milner
...

JSoup 1.7.1 was used. If you are using Maven, add this to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.1</version>
</dependency>

Parsing an HTML file using JSoup and mapping it to key-value pairs in Java

You can just use this:

Document document = Jsoup.parse(html);

// getElementsByClass() matches a single class name; for elements carrying
// both classes ("dt" and "dlterm"), use a compound class selector instead.
Elements dts = document.select(".dt.dlterm");
Elements dds = document.getElementsByClass("dd");

if (dts.size() != dds.size()) {
    // the lists should pair up one-to-one; handle the mismatch here
}

HashMap<String, String> values = new HashMap<>();
for (int i = 0; i < dts.size(); i++) {
    values.put(dts.get(i).text(), dds.get(i).text());
}

Or in just one statement using Java Streams:

Map<String, String> values = IntStream.range(0, Math.min(dts.size(), dds.size())).boxed()
        .collect(Collectors.toMap(i -> dts.get(i).text(), i -> dds.get(i).text()));

The result will be this:

{Risk=details of it two, Event=detials of it three., Incident=detials of one}

If you want to make sure the order in the map is the same as in the HTML, use a LinkedHashMap instead of a HashMap; for the stream version, pass a map supplier to Collectors.toMap, as sketched below.
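For example (continuing with the dts and dds lists from above; the merge function in the four-argument toMap just keeps the first value if a key repeats):

// Loop variant: LinkedHashMap preserves insertion order.
Map<String, String> ordered = new LinkedHashMap<>();
for (int i = 0; i < Math.min(dts.size(), dds.size()); i++) {
    ordered.put(dts.get(i).text(), dds.get(i).text());
}

// Stream variant: the four-argument toMap takes a map supplier.
Map<String, String> orderedStream = IntStream.range(0, Math.min(dts.size(), dds.size())).boxed()
        .collect(Collectors.toMap(
                i -> dts.get(i).text(),
                i -> dds.get(i).text(),
                (first, second) -> first,   // keep the first value on duplicate keys
                LinkedHashMap::new));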


