Android - Parse Js Generated Urls with Jsoup

Android - Parse JS generated urls with JSOUP

(See UPDATE below, first/accepted solution didn't met the android requirement, but is left for reference.)


Desktop Solution

HtmlUnit doesn't seem able to handle this site (often the case, lately). So I don't have a plain java solution either, but you could use PhantomJS: download the binary for your os, create a script file, start the process from within your java code and parse the output with a dom parser like jsoup.

Script file (here called simple.js):

var page = require('webpage').create();
var fs = require('fs');
var system = require('system');

var url = "";
var fileName = "output";
// first parameter: url
// second parameter: filename for output
console.log("args length: " + system.args.length);

if (system.args.length > 1) {
url=system.args[1];
}
if (system.args.length > 2){
fileName=system.args[2];
}
if(url===""){
phantom.exit();
}

page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.settings.loadImages = false;

page.open(url, function(status) {
console.log("Status: " + status);
if(status === "success") {
var path = fileName+'.html';
fs.write(path, page.content, 'w');
}
phantom.exit();
});

Java code (example to get title and cover-url):

try {
//change path to phantomjs binary and your script file
String outputFileName = "srulad";
String phantomJSPath = "phantomjs" + File.separator + "bin" + File.separator + "phantomjs";
String scriptFile = "simple.js";

String urlParameter = "http://srulad.com/#page-2";

new File(outputFileName+".html").delete();

Process process = Runtime.getRuntime().exec(phantomJSPath + " " + scriptFile + " " + urlParameter + " " + outputFileName);
process.waitFor();

Document doc = Jsoup.parse(new File(outputFileName + ".html"),"UTF-8"); // output.html is created by phantom.js, same path as page.js
Elements elements = doc.select("#list_page-2 > div");

for (Element element : elements) {
System.out.println(element.select("div.l-description.float-left > div:nth-child(1) > a").first().attr("title"));
System.out.println(element.select("div.l-image.float-left > a > img.lazy").first().attr("data-original"));
}
} catch (IOException | InterruptedException e) {
e.printStackTrace();
}

Output:

სიყვარული და მოწყალება / Love & Mercy
http://srulad.com/assets/uploads/42410_Love_and_Mercy.jpg
მუზა / The Muse
http://srulad.com/assets/uploads/43164_large_qRzsimNz0eDyFLFJcbVLIxlqii.jpg
...

UPDATE

Parsing of websites with javascript based dynamic content in Android is possible using WebView and jsoup.
The following example app uses a javascript enabled WebView to render a Javascript dependent website. With a JavascriptInterface the html source is returned, parsed with jsoup and as a proof of concept the titles and the urls to the cover-images are used to populate a ListView. The buttons decrement or increment the page number triggering an update of the ListView. Note: tested on an Android 5.1.1/API 22 device.

add internet permission to your AndroidManifest.xml

<uses-permission android:name="android.permission.INTERNET" />

activity_main.xml

<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:orientation="vertical"
android:layout_width="match_parent"
android:layout_height="match_parent">

<LinearLayout
android:orientation="horizontal"
android:layout_width="match_parent"
android:layout_height="wrap_content">

<Button
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:text="@string/page_down"
android:id="@+id/buttonDown"
android:layout_weight="0.5" />

<Button
android:layout_width="wrap_content"
android:layout_height="wrap_content"
android:text="@string/page_up"
android:id="@+id/buttonUp"
android:layout_weight="0.5" />
</LinearLayout>

<ListView
android:layout_width="match_parent"
android:layout_height="0dp"
android:id="@+id/listView"
android:layout_gravity="bottom"
android:layout_weight="0.5" />
</LinearLayout>

MainActivity.java

public class MainActivity extends AppCompatActivity {

private final Handler uiHandler = new Handler();
private ArrayAdapter<String> adapter;
private ArrayList<String> entries = new ArrayList<>();
private ProgressDialog progressDialog;

private class JSHtmlInterface {
@android.webkit.JavascriptInterface
public void showHTML(String html) {
final String htmlContent = html;

uiHandler.post(
new Runnable() {
@Override
public void run() {
Document doc = Jsoup.parse(htmlContent);
Elements elements = doc.select("#online_movies > div > div");
entries.clear();
for (Element element : elements) {
String title = element.select("div.l-description.float-left > div:nth-child(1) > a").first().attr("title");
String imgUrl = element.select("div.l-image.float-left > a > img.lazy").first().attr("data-original");
entries.add(title + "\n" + imgUrl);
}
adapter.notifyDataSetChanged();
}
}
);
}
}

@Override
protected void onCreate(Bundle savedInstanceState) {
super.onCreate(savedInstanceState);
setContentView(R.layout.activity_main);

ListView listView = (ListView) findViewById(R.id.listView);
adapter = new ArrayAdapter<>(this, android.R.layout.simple_list_item_1, android.R.id.text1, entries);
listView.setAdapter(adapter);

progressDialog = ProgressDialog.show(this, "Loading","Please wait...", true);
progressDialog.setCancelable(false);

try {
final WebView browser = new WebView(this);
browser.setVisibility(View.INVISIBLE);
browser.setLayerType(View.LAYER_TYPE_NONE,null);
browser.getSettings().setJavaScriptEnabled(true);
browser.getSettings().setBlockNetworkImage(true);
browser.getSettings().setDomStorageEnabled(false);
browser.getSettings().setCacheMode(WebSettings.LOAD_NO_CACHE);
browser.getSettings().setLoadsImagesAutomatically(false);
browser.getSettings().setGeolocationEnabled(false);
browser.getSettings().setSupportZoom(false);

browser.addJavascriptInterface(new JSHtmlInterface(), "JSBridge");

browser.setWebViewClient(
new WebViewClient() {

@Override
public void onPageStarted(WebView view, String url, Bitmap favicon) {
progressDialog.show();
super.onPageStarted(view, url, favicon);
}

@Override
public void onPageFinished(WebView view, String url) {
browser.loadUrl("javascript:window.JSBridge.showHTML('<html>'+document.getElementsByTagName('html')[0].innerHTML+'</html>');");
progressDialog.dismiss();
}
}
);

findViewById(R.id.buttonDown).setOnClickListener(new View.OnClickListener() {
@Override
public void onClick(View view) {
uiHandler.post(new Runnable() {
@Override
public void run() {
int page = Integer.parseInt(browser.getUrl().split("-")[1]);
int newPage = page > 1 ? page-1 : 1;
browser.loadUrl("http://srulad.com/#page-" + newPage);
browser.loadUrl(browser.getUrl()); // not sure why this is needed, but doesn't update without it on my device
if(getSupportActionBar()!=null) getSupportActionBar().setTitle(browser.getUrl());
}
});
}
});

findViewById(R.id.buttonUp).setOnClickListener(new View.OnClickListener() {
@Override
public void onClick(View view) {
uiHandler.post(new Runnable() {
@Override
public void run() {
int page = Integer.parseInt(browser.getUrl().split("-")[1]);
int newPage = page+1;
browser.loadUrl("http://srulad.com/#page-" + newPage);
browser.loadUrl(browser.getUrl()); // not sure why this is needed, but doesn't update without it on my device
if(getSupportActionBar()!=null) getSupportActionBar().setTitle(browser.getUrl());
}
});
}
});

browser.loadUrl("http://srulad.com/#page-1");
if(getSupportActionBar()!=null) getSupportActionBar().setTitle(browser.getUrl());

} catch (Exception e) {
e.printStackTrace();
}
}
}

Jsoup with JavaScript dependent web page?

After trying with no luck using Android's WebView, I accidentally found a solution in setting my userAgent, all I did was change the

    Jsoup.connect(url).get();

line to

    Jsoup.connect(url).userAgent("YOUR_USER_AGENT_HERE").get();

and it worked like a charm. Thanks for the reply anyway Fred!

Parsing web javascript content to string using android

I'd suggest looking at the network tab in chrome developer tools and then submitting the request to load up the URL ... you'll see a lot of requests going back/forth.

Two that seem to contain relevant content are:

https://merhav.nli.org.il/primo_library/libweb/webservices/rest/primo-explore/v1/pnxs?blendFacetsSeparately=false&getMore=0&inst=NNL&lang=iw_IL&limit=10&newspapersActive=false&newspapersSearch=false&offset=0&pcAvailability=true&q=any,contains,%D7%94%D7%90%D7%A8%D7%99+%D7%A4%D7%95%D7%98%D7%A8&qExclude=&qInclude=&refEntryActive=false&rtaLinks=true&scope=Local&skipDelivery=Y&sort=rank&tab=default_tab&vid=NLI

which requires a token to access token which comes from:

https://merhav.nli.org.il/primo_library/libweb/webservices/rest/v1/guestJwt/NNL?isGuest=true&lang=iw_IL&targetUrl=https%253A%252F%252Fmerhav.nli.org.il%252Fprimo-explore%252Fsearch%253Ftab%253Ddefault_tab%2526search_scope%253DLocal%2526vid%253DNLI%2526lang%253Diw_IL%2526query%253Dany%252Ccontains%252C%2525D7%252594%2525D7%252590%2525D7%2525A8%2525D7%252599%252520%2525D7%2525A4%2525D7%252595%2525D7%252598%2525D7%2525A8&viewId=NLI

.. which likely requires the JSessoinId which comes from:

https://merhav.nli.org.il/primo_library/libweb/webservices/rest/v1/configuration/NLI

.. so in order to replicate the chain of calls you could use JSoup to make these (and any other relevant) HTTP GET requests, pull out the relevant HTTP headers (typically: session, referer, accept and some other cookie values potentially)

Its not going to be straight forward, but you're essentially looking for a url on the page in one of the JSON responses from one of the network requests:

Sample Image

Once you know which request you want to recreate, you just have to work back up the list of requests and try to recreate them.

This one is not an easy one and would require a lot of time to recreate - my advice if you're going to attempt it, forget trying to parse HTML, try to rebuild/recreate the chain of 3 or so HTTP requests to the back end to get the relevant JSON and parse that. You can often pick apart the website but this ones a big job

How to read/parse dynamically generated client side content in Android using Java

Jsoup can't parse JavaScript so it can't be used here.

It can be done with Selenium webdriver or in case of Android use Selendroid.

Parsing instagram with java jsoup not give Elements gives source

If you check the source of the page (inspect the video element) you'll find:

<video class="tWeCl"
playsinline=""
poster="https://instagram.flhr4-2.fna.fbcdn.net/v/t51.2885-15/e35/117157253_120443486171759_7332785595039685871_n.jpg?_nc_ht=instagram.flhr4-2.fna.fbcdn.net&_nc_cat=111&_nc_ohc=aX7rVh9IbGoAX_lj74j&oh=ba74c5c8ad97ba14c35710addd523dfd&oe=5F363C59"
preload="none"
type="video/mp4"
src="https://instagram.flhr4-2.fna.fbcdn.net/v/t50.2886-16/117284962_313567919762486_3343704909021624596_n.mp4?_nc_ht=instagram.flhr4-2.fna.fbcdn.net&_nc_cat=102&_nc_ohc=3wvoN4vNzkUAX_DLFTR&oe=5F3659EF&oh=7a38d593469a99239a7cb07050cc47f2">
</video>

If you then search the html for the mp4 url you'll find it in one of the javascript html tags... it is delivered as a json value. So by breaking up the javascript text on the " = " and taking the latter half, you get the raw json which can then be parsed for the "video_url" using JayWay's JsonPath.read method.

It would seem the video tag is therefore generated in the html by the javascript as it doesn't appear possible to filter the html for any <video> elements.

import com.jayway.jsonpath.JsonPath;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class Instagram {

private final String url;

public Instagram(String url) {
this.url = url;
}

public void start() {
Document doc = getHtmlPage(url);
Elements videoElement = getScriptElementContainingVideoUrl(doc);

List<String> relevantTagWithMp4Url = getSingleScriptElementWithVideoUrl(videoElement);
String scriptInnerHtml = relevantTagWithMp4Url.get(0);

System.out.println("Video Url: " + getVideoUrl(scriptInnerHtml));
}

private List<String> getSingleScriptElementWithVideoUrl(Elements scriptElements) {
List<String> relevantTagWithMp4Url = new ArrayList<>();

for (Element element : scriptElements) {
if (element.data().contains("mp4")) {
relevantTagWithMp4Url.add(element.data());
}
}

return relevantTagWithMp4Url;
}

private Elements getScriptElementContainingVideoUrl(Document doc) {
return doc.select("script");
}

private String getVideoUrl(String videoElement) {
String jsonResponse = videoElement.split(" = ")[1];
// $.. is equivalent to $.[*] - (a wild card matcher) - you may need to play with this
List<String> videoUrl = JsonPath.read(jsonResponse, "$..video_url");
return videoUrl.get(0);
}

private Document getHtmlPage(String url) {
try {
return Jsoup.connect(url).get();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}

public static void main(String[] args) {
new Instagram("https://www.instagram.com/reel/CDok74FJzHp/?igshid=cam8ylb7okl7").start();
}
}

Heed help parsing mp3 files' urls from html with jsoup

Tiarait from Russian Stackoverflow helped to find the solution. The point is that the above mentioned element is created by js. I needed to get the document body and then get the following array via splits.

var audioPlaylist = new Playlist("1", [
{name:"Chapter 01", free:true, mp3:"http://www.archive.org/download/huckleberry_mfs_librivox/huckleberry_finn_01_twain_64kb.mp3"},
{name:"Chapter 02", free:true, mp3:"http://www.archive.org/download/huckleberry_mfs_librivox/huckleberry_finn_02_twain_64kb.mp3"},
...

doInBackground method should change into this:

@Override
protected SingleBook doInBackground(Void... params) {

Document doc = null;
book = new SingleBook(imgLink, title, false, false, null, new ArrayList<String>());

Elements mLines;

try {
doc = Jsoup.connect(link).get();

} catch (IOException | RuntimeException e) {
e.printStackTrace();
}
if (doc != null) {

mLines = doc.getElementsByClass("book-description");

for (Element mLine : mLines) {
String description= mLine.text();
book.setDescription(description);

}

String arr = "";
String html = doc.body().html();
if (html.contains("var audioPlaylist = new Playlist(\"1\", ["))
arr = html.split("var audioPlaylist = new Playlist\\(\"1\", \\[")[1];
if (arr.contains("]"))
arr = arr.split("\\]")[0];
//-----------------------------------------
if (arr.contains("},{")) {
for (String mLine2 : arr.split("\\},\\{")) {
if (mLine2.contains("mp3:\""))
tmpChapters.add(mLine2.split("mp3:\"")[1].split("\"")[0]);
}
} else if (arr.contains("mp3:\""))
tmpChapters.add(arr.split("mp3:\"")[1].split("\"")[0]);
}else
System.out.println("ERROR");

book.setChapters(tmpChapters);
return book;

}



Related Topics



Leave a reply



Submit