Web Scraping with Java

Web scraping with Java

jsoup

Extracting the title is not difficult, and you have many options; search here on Stack Overflow for "Java HTML parsers". One of them is Jsoup.

You can navigate the page using the DOM if you know the page structure; see
http://jsoup.org/cookbook/extracting-data/dom-navigation

It's a good library and I've used it in my recent projects.
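For example, a minimal sketch of extracting a title with Jsoup (the URL and class name here are placeholders):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TitleExample {
    public static void main(String[] args) throws Exception {
        // Fetch the page and parse it into a DOM
        Document doc = Jsoup.connect("https://example.com").get();
        // title() returns the text of the <title> element
        System.out.println(doc.title());
    }
}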

Scrape information from Web Pages with Java?

As @Alex R pointed out, you'll need a Web Scraping library for this.

The one he recommended, JSoup, is quite robust and is pretty commonly used for this task in Java, at least in my experience.

You'd first need to construct a Document that fetches your page, e.g.:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

int localID = 25022; // your player's ID
Document doc = Jsoup.connect("http://www.chess.org.il/Players/Player.aspx?Id=" + localID).get();

From this Document object you can fetch a lot of information, for example the FIDE ID you requested. Unfortunately, the page you linked isn't very simple to scrape, and you'll basically need to go through every link on the page to find the relevant one, for example:

Elements fidelinks = doc.select("a[href*=fide.com]");

This Elements object gives you a list of all links whose href contains the text fide.com, but you probably only want the first one, e.g.:

Element fideurl = doc.selectFirst("a[href*=fide.com]");

From that point on, I don't want to write all the code for you, but hopefully this answer serves as a good starting point!

You can get the ID alone by calling the text() method on your Element object, but you can also get the link itself by calling Element.attr("href").
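For instance, continuing from the fideurl element above (a sketch; the variable names are illustrative):

if (fideurl != null) {
    String fideId = fideurl.text();      // the visible link text, e.g. the FIDE ID
    String link = fideurl.attr("href");  // the link target itself
    System.out.println(fideId + " -> " + link);
}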

The CSS selector you can use to get the other value is
div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type, which will get you the Std score specifically. It's standard CSS, so it should work with jsoup as well.
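Applied with jsoup, that would look something like this (a sketch, reusing the doc fetched above; the variable name is illustrative):

String stdScore = doc.select("div#main-col table.contentpaneopen tbody tr td table tbody tr td table tbody tr:nth-of-type(4) td table tbody tr td:first-of-type").text();
System.out.println(stdScore);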

Real time web scraping with Java

The quick answer is a headless browser. Most of those sites serve up new information via sockets, AJAX, or other asynchronous requests after the page load, so to crawl dynamic sites you are absolutely right: the easiest way is to behave more like a browser than a script. There are plenty of ways to do that with Selenium or PhantomJS. People normally use something like Apache Nutch to control the crawling flow at scale. You may also want to look into a proxy farm.
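For illustration, a minimal sketch with Selenium and headless Chrome (the class name and URL are placeholders, and you'd need the Selenium and ChromeDriver dependencies on your classpath):

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class HeadlessFetch {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new"); // run Chrome without a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com"); // placeholder URL
            // getPageSource() returns the DOM after the page's JavaScript has run
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}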

Going to next page when web scraping with Jsoup

It seems the pagination for that site is controlled by the ?page=<int> query parameter.
Simply wrap your existing code in a for loop that will control the current page.

int numPages = 5; // the number of pages to scrape
for (int i = 0; i < numPages; i++) {
    String url = "https://www.actksa.com/ar/training-courses/training-in/Jeddah?page=" + i;

    Document doc = Jsoup.connect(url).get();

    Elements data = doc.select("tr");
    int size = data.size();
    Log.d("doc", "doc: " + doc);
    Log.d("data", "data: " + data);
    Log.d("size", "" + size);
    for (int j = 0; j < size; j++) {
        String title = data.select("td.wp-60")
                .eq(j)
                .text();
        String detailUrl = data.select("td.wp-60")
                .select("a")
                .eq(j)
                .attr("href");
        parseItems.add(new ParseItem(title, detailUrl));
        Log.d("items", " . title: " + title);
    }
}

If you want to get all the pages without hardcoding the number, put the increment in a while loop that breaks when the table on the page has no contents, as in the sketch below. For example, https://www.actksa.com/ar/training-courses/training-in/jeddah?page=6 is not a valid page and just shows a page with an empty table.
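A minimal sketch of that approach, assuming the same URL and selector as above (and that a page past the end yields no matching rows):

int page = 0;
while (true) {
    Document doc = Jsoup.connect("https://www.actksa.com/ar/training-courses/training-in/Jeddah?page=" + page).get();
    Elements data = doc.select("tr");
    if (data.isEmpty()) {
        break; // an empty table means we've run past the last page
    }
    // ... parse the rows exactly as in the for loop above ...
    page++;
}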

Web Scraping with Java using HTMLUnit

I used this code to verify your problem:

import java.io.IOException;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public static void main(String[] args) throws IOException {
    final String url = "https://www.nba.com/standings#/";

    try (final WebClient webClient = new WebClient()) {
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setUseInsecureSSL(true);

        HtmlPage page = webClient.getPage(url);
        webClient.waitForBackgroundJavaScript(10000);

        System.out.println(page.asXml());
    }
}

When running this I got a bunch of warnings and errors in the log.

(BTW: the page also produces many errors/warnings when run with real browsers. It seems the maintainer of the page has an interesting view on quality.)

I guess the problematic error is this one:

TypeError: Cannot modify readonly property: constructor. (https://www.nba.com/ng/game/main.js#1)

There is a known bug in the JavaScript support of HtmlUnit (https://sourceforge.net/p/htmlunit/bugs/1897/). Because the error is thrown from main.js, I guess this stops the processing of the page's JavaScript before the content you are looking for is generated.

So far I have found no time to fix this (it looks like it has to be fixed in Rhino), but it is on the list.

Have a look at https://twitter.com/HtmlUnit to get informed about updates.

Scraping Java Driven site with Selenium, BS

Looking at the element breadcrumbs in your image, it would appear your content is inside a frame. Selenium treats each frame as a separate document, so you need to switch into the frame before you can operate on its content.

 driver.switch_to.frame(0)

The above should work if it is the first frame. After that, driver.page_source and your locators should begin to work.

I explained this in a little more detail in this answer.


