Jsoup Get Dynamically Generated HTML

Getting Jsoup to support HTML dynamically generated by JavaScript

Jsoup does not support JavaScript and does not emulate a browser, so forget about it if you plan to execute JavaScript. In my experience, HtmlUnit, which is a headless browser, has given me the best results (speaking only of Java frameworks).

One thing worth trying in HtmlUnit is changing the BrowserVersion (Chrome / Internet Explorer / Firefox) when creating the WebClient instance. Some sites respond differently depending on the browser, and sometimes changing that value alone gives you the results you expect.

How to read/parse dynamically generated client side content in Android using Java

Jsoup can't execute JavaScript, so it can't be used here.

It can be done with Selenium WebDriver or, on Android, with Selendroid.

Get full HTML using Jsoup

Most likely the elements you see in the browser are added to the DOM dynamically by some JavaScript code. That means they are not present in the body of the HTTP response, which is all Jsoup gets to parse.

Jsoup Scraping HTML dynamic content

You can use the .select(String cssQuery) method:

doc.select("h1") gives you all h1 Elements.
If you need the actual text inside these tags, call .text() on each Element.
If you need an attribute such as class or id, call .attr(String attributeKey) on an Element, e.g.:

doc.getElementsByClass("hover_item_name").first().attr("id")

gives you "iteminfo0_item_name"

But if you need to perform clicks on a website, you can't do that with Jsoup, because Jsoup is an HTML parser and not a browser alternative. Jsoup can't handle dynamic content.

What you could do instead is first scrape the relevant data from your h1 tags and then send a new .post() request yourself, i.e. replicate the AJAX call the page would make.

If you would rather use a real web driver, have a look at Selenium.
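To make "replicate the AJAX call" concrete, here is a minimal sketch of what such a form POST consists of. It uses the JDK's built-in java.net.http classes rather than Jsoup's connection API, purely to keep the example dependency-free, and it only builds the request without sending it. The URL and the field names (item, page) are hypothetical placeholders; the real ones have to be read from the browser's network tab for the site in question.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class FormPostSketch {

    /** Encodes key/value pairs as an application/x-www-form-urlencoded body. */
    public static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds (but does not send) a POST request carrying the given form fields. */
    public static HttpRequest buildPost(String url, Map<String, String> fields) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(encodeForm(fields)))
                .build();
    }

    public static void main(String[] args) {
        // Hypothetical field names and URL, for illustration only
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("item", "iteminfo0_item_name"); // value scraped in the earlier step
        fields.put("page", "2");

        HttpRequest request = buildPost("https://example.com/items", fields);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(encodeForm(fields));
    }
}
```

With Jsoup the same idea is expressed through its connection builder, passing each field via .data(key, value) and finishing with .post().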

Jsoup parse dynamically loading webpage in Java

EDIT - After a few comments from the OP, I understood exactly what he wants to achieve. I've changed my original solution a bit and tested it.

You can do it with Jsoup. After the first page, getting the next one requires you to send a POST request with some form parameters. The parameters contain (among others) the start number and how many records to get. If you send an illegal number (e.g. you ask for the page that contains game number 700 but the results contain only 600 games), you get the first page again. You can loop through the pages until you get a result that you already have.

Sometimes the server returns 600 results and sometimes only 540; I could not figure out why.

The code for that is:
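The stopping rule described above (keep requesting pages until one contains an item you already have) can be sketched in isolation from the HTTP details. This is only an illustration: fetchPage is a hypothetical stand-in for the real request, and the fake server in main simply mimics the "out-of-range start returns the first page" behaviour.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

public class PagedCollector {

    /**
     * Requests pages of pageSize items until a page is empty or contains an
     * item that was already collected (the server has wrapped around).
     * fetchPage maps a start offset to the items of that page.
     */
    public static List<String> collectUntilRepeat(IntFunction<List<String>> fetchPage, int pageSize) {
        Set<String> seen = new LinkedHashSet<>(); // keeps order, rejects duplicates
        int start = 0;
        while (true) {
            List<String> page = fetchPage.apply(start);
            if (page.isEmpty()) {
                break; // nothing more to read
            }
            boolean repeated = false;
            for (String item : page) {
                if (!seen.add(item)) { // add() returns false for a known item
                    repeated = true;
                    break;
                }
            }
            if (repeated) {
                break; // we are looking at a page we already have
            }
            start += pageSize;
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Fake server: 5 items, pages of 2, first page again for an out-of-range start
        List<String> all = List.of("a", "b", "c", "d", "e");
        IntFunction<List<String>> fakeServer = start -> {
            int from = start < all.size() ? start : 0;
            return all.subList(from, Math.min(from + 2, all.size()));
        };
        System.out.println(collectUntilRepeat(fakeServer, 2)); // [a, b, c, d, e]
    }
}
```

Against a real server it would be prudent to also cap the number of requests, since this loop relies on the server eventually repeating itself or returning an empty page.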

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HelloWorld {

    public static void main(String[] args) {

        Connection.Response res;
        Document doc;
        boolean ok = true;
        int start = 0;
        String query;
        ArrayList<String> tempList = new ArrayList<>();
        ArrayList<String> games = new ArrayList<>();
        // Captures the value of the title attribute from the serialized <a> element
        Pattern r = Pattern.compile("title=\"(.*)\" a");

        try { // first connection with a GET request
            res = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free")
                    .method(Method.GET)
                    .execute();
            doc = res.parse();
        } catch (Exception ex) {
            // Without the first page there is nothing to parse, so stop here
            ex.printStackTrace();
            return;
        }
        for (int i = 1; i <= 60; i++) { // parse the result and add it to the list
            query = "div.card:nth-child(" + i + ") > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)";
            tempList.add(doc.select(query).toString());
        }

        while (ok) { // loop until you get the same results again
            start += 60;
            System.out.println("now at number " + start);
            try { // send a POST request for each new page
                doc = Jsoup.connect("https://play.google.com/store/apps/category/GAME_ACTION/collection/topselling_free?authuser=0")
                        .cookies(res.cookies())
                        .data("start", String.valueOf(start))
                        .data("num", "60")
                        .data("numChildren", "0")
                        .data("ipf", "1")
                        .data("xhr", "1")
                        .post();
            } catch (Exception ex) {
                // A failed request would leave the previous page in doc, so stop paging
                ex.printStackTrace();
                break;
            }
            for (int i = 1; i <= 60; i++) { // parse the result and add it to the list
                query = "div.card:nth-child(" + i + ") > div:nth-child(1) > div:nth-child(3) > h2:nth-child(2) > a:nth-child(1)";
                String entry = doc.select(query).toString();
                if (!tempList.contains(entry)) {
                    tempList.add(entry);
                } else { // we've seen these games before, time to quit
                    ok = false;
                    break;
                }
            }
        }
        for (int i = 0; i < tempList.size(); i++) { // remove all redundant info
            Matcher m = r.matcher(tempList.get(i));
            if (m.find()) {
                games.add(m.group(1));
                // Number by games.size(), since entries without a title are skipped
                System.out.println(games.size() + " " + m.group(1));
            }
        }
    }
}

The code can be further improved (for example, by handling the lists in a separate method), so that part is up to you.

I hope this works for you.
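As one possible way to move the list handling into its own method, the title extraction could look roughly like this. It is a sketch, not a drop-in replacement: it swaps the greedy pattern for one that stops at the first closing quote, and the sample anchor markup in main is invented for illustration rather than taken from the actual page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Matches title="..." and captures the value; [^"]* stops at the first closing quote
    private static final Pattern TITLE = Pattern.compile("title=\"([^\"]*)\"");

    /** Returns the title attribute of every serialized element that has one. */
    public static List<String> extractTitles(List<String> serializedElements) {
        List<String> titles = new ArrayList<>();
        for (String html : serializedElements) {
            Matcher m = TITLE.matcher(html);
            if (m.find()) {
                titles.add(m.group(1));
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Invented sample markup, for illustration only
        List<String> sample = List.of(
                "<a class=\"title\" href=\"/details?id=x\" title=\"Some Game\" aria-hidden=\"true\">1. Some Game</a>",
                "<a href=\"#\">no title here</a>");
        System.out.println(extractTitles(sample)); // [Some Game]
    }
}
```

Since the elements are already parsed by Jsoup, reading the attribute directly with .attr("title") would avoid the regular expression altogether.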


