Scraping an AngularJS Application

If you're trying to scrape AngularJS pages in a fully generic fashion, then you're likely going to need something like what @tadman mentioned in the comments (PhantomJS) -- some type of headless browser that fully processes the AngularJS JavaScript and opens the DOM up to inspection afterwards.
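
For illustration, here is a minimal sketch of that approach in Python, using Selenium with headless Chrome rather than the now-discontinued PhantomJS; the URL and CSS selector are placeholders, not a real site:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

try:
    # placeholder URL for an AngularJS page
    driver.get('https://example.com/angular-app')
    # give Angular time to fetch data and render its templates
    driver.implicitly_wait(10)
    for element in driver.find_elements(By.CSS_SELECTOR, '.media-title'):
        print(element.text)
finally:
    driver.quit()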

If you have a specific site or sites that you are looking to scrape, the path of least resistance is likely to avoid the AngularJS frontend entirely and directly query the API from which the Angular code is pulling content. The standard scenario for many/most AngularJS sites is that they pull down the static JS and HTML code/templates, and then make AJAX calls back to a server (either their own or some third-party API) to fetch the content to be rendered. If you take a look at their code, you can likely directly query whatever Angular is calling (e.g. via $http, ngResource, or Restangular). The returned data is typically JSON and much easier to gather than true scraping of the post-rendered HTML.
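
As a sketch, querying such an API directly might look like this; the endpoint, parameters, and field names below are hypothetical stand-ins for whatever you find in the browser's Network tab:

import requests

# hypothetical endpoint discovered in the browser's Network tab
api_url = 'https://example.com/api/v1/articles'

response = requests.get(api_url, params={'page': 1, 'size': 50})
response.raise_for_status()

# the JSON structure depends entirely on the site's API
for article in response.json()['items']:
    print(article['title'])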

Web scraping - how to access content rendered in JavaScript via Angular.js?

This page uses JavaScript to read data from the server and fill in the page.

I see you already use the developer tools in Chrome: look in the Network tab at the XHR or JS requests.

I found this URL:

http://data.asx.com.au/data/1/company/ACB?fields=primary_share,latest_annual_reports,last_dividend,primary_share.indices&callback=angular.callbacks._0

This URL gives all the data, almost in JSON format.

But if you use this link without &callback=angular.callbacks._0, then you get the data in pure JSON format, and you can use the json module to convert it to a Python dictionary.
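
If you do keep the callback parameter, the response comes back as JSONP, and the wrapper has to be stripped before json.loads() will accept it. A minimal sketch, using a made-up payload:

import json
import re

# made-up JSONP response of the form angular.callbacks._0({...})
jsonp = 'angular.callbacks._0({"code": "ACB"})'

# keep only the JSON between the outer parentheses
payload = re.search(r'^[\w.$]+\((.*)\)\s*;?\s*$', jsonp, re.S).group(1)
data = json.loads(payload)

print(data['code'])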


EDIT: working code

import urllib2
import json

# new url
url = 'http://data.asx.com.au/data/1/company/ACB?fields=primary_share,latest_annual_reports,last_dividend,primary_share.indices'

# read all data
page = urllib2.urlopen(url).read()

# convert json text to python dictionary
data = json.loads(page)

print(data['principal_activities'])

Output:

Mineral exploration in Botswana, China and Australia.

EDIT (2020.12.23)

This answer is almost 5 years old and was created for Python 2. In Python 3 it would need urllib.request.urlopen() or requests.get(), but the real problem is that over those 5 years the page has changed its structure and technology. The URLs (in both the question and the answer) no longer exist. The page would need a new analysis and a new method.

The question used the URL

http://www.asx.com.au/asx/research/company.do#!/ACB/details

but currently the page uses the URL

https://www2.asx.com.au/markets/company/acb

And it uses different URLs for AJAX/XHR requests:

https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/about

https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/announcements

https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/key-statistics

etc.

You can find more URLs using the DevTools in Chrome/Firefox (tab: Network, filter: XHR).

import urllib.request
import json

# new url
url = 'https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/about'

# read all data
page = urllib.request.urlopen(url).read()

# convert json text to python dictionary
data = json.loads(page)

print(data['data']['description'])

Output:

Minerals exploration & development
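
For comparison, the same request with the requests module mentioned above (assuming it is installed):

import requests

url = 'https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/about'

# requests decodes the JSON response directly
data = requests.get(url).json()

print(data['data']['description'])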

Scrape an AngularJS website with Java

In the end, I followed Madusudanan's excellent advice and looked into the PhantomJS/Selenium combination. And there actually is a solution! It's called PhantomJSDriver.

You can find the Maven dependency here. Here is more info on GhostDriver.

For the Maven setup, I added the following dependencies:

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.41.0</version>
</dependency>
<dependency>
    <groupId>com.github.detro</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.0</version>
</dependency>

It also runs with Selenium version 2.45, which is the latest version as of this writing. I mention this because of some articles I read in which people said that the PhantomJS driver isn't compatible with every version of Selenium, but it seems they have addressed that problem in the meantime.

If you are already using a Selenium/PhantomJSDriver combination and you are getting 'strict javascript errors' on a certain site, update your version of Selenium. That will fix it.

And here is some sample code:

public void testPhantomDriver() throws Exception {
    DesiredCapabilities options = new DesiredCapabilities();
    // the website I am scraping uses SSL, but I don't know which version
    options.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new String[] {
        "--ssl-protocol=any"
    });

    PhantomJSDriver driver = new PhantomJSDriver(options);

    driver.get("https://www.mywebsite");

    List<WebElement> elements = driver.findElementsByClassName("media-title");

    for (WebElement element : elements) {
        System.out.println(element.getText());
    }

    driver.quit();
}

How can I web scrape in Python a website that uses AngularJS ng-include?

The data you see on the page is loaded from an external URL. This script uses the requests and json modules to retrieve and parse it:

import json
import requests
from bs4 import BeautifulSoup

url = 'https://api-us.melaleuca.com/search/v3'
params = {
    'id': 'products_by_category_facets',
    'index': 'en_us',
    'params': json.dumps(
        {"sortField": "", "sortDir": "", "from": 0, "size": 1000, "categoryId": "52", "filters": []}
    )
}
headers = {
    'Referer': 'https://www.melaleuca.com/ProductStore/content/category?c=52'
}

data = requests.get(url, params=params, headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data to screen:
for h in data['hits']['hits']:
    # the product name contains HTML markup, so strip the tags
    s = BeautifulSoup(h['_source']['name']['enhanced'], 'html.parser')
    print(s.text)
    print(h['_source']['sku'])
    print('-' * 80)

Prints:

EcoSense® Clean Home Pack Save over $38.00 
8219
--------------------------------------------------------------------------------
EcoSense Laundry Value 4-Pack with 96 Load bottles Save up to $10.00
6012
--------------------------------------------------------------------------------
Limited Time EcoSense Laundry Regular 4-Pack with 48 Load bottles Save $4.75
6013
--------------------------------------------------------------------------------
EcoSense® Kitchen Pack Save up to $4.00 (Mixing Spray Bottle not included)
8220
--------------------------------------------------------------------------------
EcoSense® Bathroom Pack Save $4.00 (Mixing Spray Bottles not included)
5679
--------------------------------------------------------------------------------
MelaSoft® Fabric Softener 96 Load 2-Pack Save $2.00
2154
--------------------------------------------------------------------------------

...and so on.

