Android Web Scraping with a Headless Browser
OK, after two weeks I admit defeat and am using a workaround that works well for me at the moment.
The problem:
Porting HTMLUnit to Android is too difficult, at least at my level of expertise. I am sure it's a worthwhile project, and probably not that time-consuming for an experienced Java programmer. I emailed the HTMLUnit developers, who replied that they are not looking into a port or into the effort it would involve. They suggested that anyone who wants to start such a project send a message to their mailing list to get more developers involved (http://htmlunit.sourceforge.net/mail-lists.html).
The workaround:
I used Android's built-in WebView and overrode the onPageFinished method of WebViewClient to inject JavaScript that grabs all the HTML once the page has fully loaded. WebView can also be used to run further JavaScript actions, such as clicking buttons or filling in forms.
Code:
webView.getSettings().setJavaScriptEnabled(true);
// Expose a Java object to page scripts as window.HtmlViewer
MyJavaScriptInterface jInterface = new MyJavaScriptInterface();
webView.addJavascriptInterface(jInterface, "HtmlViewer");
webView.setWebViewClient(new WebViewClient() {
    @Override
    public void onPageFinished(WebView view, String url) {
        // Page has loaded: hand the rendered HTML to the Java interface
        view.loadUrl("javascript:window.HtmlViewer.showHTML('<html>'+document.getElementsByTagName('html')[0].innerHTML+'</html>');");
    }
});
webView.loadUrl(StartURL);

public class MyJavaScriptInterface {
    public String html;

    @JavascriptInterface
    public void showHTML(String _html) {
        html = _html;
        // loadUrl() returns before the page loads, so parse here, not right after loadUrl()
        ParseHtml(html);
    }
}
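The "further JavaScript actions" mentioned above (filling in forms, clicking buttons) come down to building small script strings and passing them to webView.loadUrl(...), or to webView.evaluateJavascript(...) on API 19 and later. Here is a minimal sketch of such a builder in plain Java. The class name JsSnippets, its methods, and the element IDs "user" and "login" are hypothetical and not part of the original answer.

```java
public class JsSnippets {

    // Wraps a Java string in a single-quoted JavaScript literal,
    // escaping backslashes, single quotes and line breaks.
    public static String quoteJs(String s) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                default: sb.append(c);
            }
        }
        return sb.append("'").toString();
    }

    // Builds a javascript: URL that sets the value of the element with the given id.
    public static String fillField(String elementId, String value) {
        return "javascript:(function(){var e=document.getElementById(" + quoteJs(elementId)
                + ");if(e){e.value=" + quoteJs(value) + ";}})();";
    }

    // Builds a javascript: URL that clicks the element with the given id.
    public static String clickElement(String elementId) {
        return "javascript:(function(){var e=document.getElementById(" + quoteJs(elementId)
                + ");if(e){e.click();}})();";
    }

    public static void main(String[] args) {
        // "user" and "login" are made-up element ids for illustration.
        System.out.println(fillField("user", "bob"));
        System.out.println(clickElement("login"));
    }
}
```

On Android the resulting strings would be passed to the WebView from the UI thread, for example webView.loadUrl(JsSnippets.clickElement("login")). The function wrapper keeps the script from returning a value, which matters because a javascript: URL that evaluates to a value can replace the current page.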
Scrape a dynamically-produced page on Android
Selenium would be a good option for web scraping (https://www.selenium.dev/), since it drives a real browser and so has access to the website's DOM. In my experience, dynamically generated web pages can be difficult to scrape. Regular expressions will be your friend for pulling values out of the markup (https://regexone.com/).
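As a rough illustration of the regular-expression approach, the sketch below pulls href values out of anchor tags in an HTML string using only java.util.regex. The class name LinkExtractor and the sample markup are made up for this example, and the pattern is a deliberate simplification: it handles quoted attributes only and is not a substitute for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Matches href="..." or href='...' inside an <a ...> tag.
    // Group 1 is the quote character, group 2 is the attribute value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns every quoted href value found in the given HTML, in document order.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup invented for the example.
        String html = "<html><body><a href=\"https://example.com/a\">A</a>"
                + "<a class='x' href='/b'>B</a></body></html>";
        System.out.println(extractLinks(html)); // prints [https://example.com/a, /b]
    }
}
```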
Selendroid as a web scraper
Unfortunately I couldn't get Selendroid to work, but I found a workaround for scraping dynamic content using just Android's built-in WebView with JavaScript enabled.
mWebView = new WebView(context); // context: your Activity or application Context
mWebView.getSettings().setJavaScriptEnabled(true);
mWebView.addJavascriptInterface(new HtmlHandler(), "HtmlHandler");
mWebView.setWebViewClient(new WebViewClient() {
    @Override
    public void onPageFinished(WebView view, String url) {
        super.onPageFinished(view, url);
        // Compare string contents with equals(); == only compares references
        if (url.equals(urlToLoad)) {
            // Pass html source to the HtmlHandler
            view.loadUrl("javascript:HtmlHandler.handleHtml(document.documentElement.outerHTML);");
        }
    }
});
The JavaScript expression document.documentElement.outerHTML returns the full HTML of the loaded page as currently rendered. The retrieved HTML string is then passed to the handleHtml method of the HtmlHandler class.
class HtmlHandler {
@JavascriptInterface
@SuppressWarnings("unused")
public void handleHtml(String html) {
// scrape the content here
}
}
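Methods exposed through addJavascriptInterface are called on a background thread owned by the WebView, some time after loadUrl() has returned, so the HTML is not available to the code that started the load. One way to hand the result to a waiting worker thread is a CountDownLatch. The sketch below shows that idea in plain Java; the class name HtmlCapture and the method awaitHtml are hypothetical, and on Android handleHtml would also carry the @JavascriptInterface annotation.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class HtmlCapture {

    private final CountDownLatch ready = new CountDownLatch(1);
    private volatile String html;

    // Called from the JavaScript bridge once the page source is available.
    public void handleHtml(String pageHtml) {
        html = pageHtml;
        ready.countDown();
    }

    // Blocks until handleHtml has been called or the timeout expires.
    // Returns null on timeout or interruption. Do not call this on the
    // Android UI thread: the WebView needs that thread to load the page.
    public String awaitHtml(long timeout, TimeUnit unit) {
        try {
            return ready.await(timeout, unit) ? html : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    public static void main(String[] args) {
        HtmlCapture capture = new HtmlCapture();
        // Simulates the WebView's bridge thread delivering the page source.
        new Thread(() -> capture.handleHtml("<html><body>hi</body></html>")).start();
        System.out.println(capture.awaitHtml(2, TimeUnit.SECONDS));
    }
}
```

Scraping directly inside handleHtml, as the comment in the class above suggests, avoids the waiting thread altogether. The latch is mainly useful when the scraping logic already runs on its own worker thread.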
You can then use a library such as Jsoup to extract the necessary content from the HTML string.