Scraping an AngularJS application
If you're trying to scrape AngularJS pages in a fully generic fashion, then you're likely going to need something like what @tadman mentioned in the comments (PhantomJS) -- some type of headless browser that fully processes the AngularJS JavaScript and opens the DOM up to inspection afterwards.
If you have a specific site or sites that you are looking to scrape, the path of least resistance is likely to avoid the AngularJS frontend entirely and directly query the API from which the Angular code is pulling content. The standard scenario for many/most AngularJS sites is that they pull down the static JS and HTML code/templates, and then they make ajax calls back to a server (either their own, or some third party API) to get content that will be rendered. If you take a look at their code, you can likely directly query whatever angular is calling (i.e. via $http, ngResource, or restangular). The return data is typically JSON and would be much easier to gather vs. true scraping in the post-rendered html result.
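A hedged sketch of that approach: suppose DevTools shows the Angular app fetching a JSON payload from some backend endpoint (the payload and field names below are made up for illustration). Parsing that response is far simpler than picking data out of post-rendered HTML:

```python
import json

# Hypothetical JSON payload, like one an Angular app's $http call might
# receive from its backend API (field names are illustrative only).
response_body = '''
{
    "items": [
        {"id": 1, "title": "First article"},
        {"id": 2, "title": "Second article"}
    ]
}
'''

# In practice you would fetch this with e.g. requests.get(api_url).text;
# here we parse the canned response directly.
data = json.loads(response_body)

for item in data["items"]:
    print(item["id"], item["title"])
```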
Web scraping - how to access content rendered in JavaScript via Angular.js?
This page uses JavaScript to read data from the server and fill in the page. You can see this with the Developer Tools in Chrome: open the Network tab and look at the XHR or JS requests. I found this URL:
http://data.asx.com.au/data/1/company/ACB?fields=primary_share,latest_annual_reports,last_dividend,primary_share.indices&callback=angular.callbacks._0
This URL returns all the data, almost in JSON format. If you request it without &callback=angular.callbacks._0, you get pure JSON and can use the json module to convert it to a Python dictionary.
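If you can't drop the callback parameter, you can still recover pure JSON by stripping the JSONP wrapper. A minimal sketch (the wrapped response below is a made-up example of the angular.callbacks._0(...) format):

```python
import json
import re

# Example JSONP response, as returned when &callback=angular.callbacks._0
# is left in the URL (the payload itself is made up for illustration).
jsonp = 'angular.callbacks._0({"code": "ACB", "name_full": "A-Cap Resources Limited"})'

# Strip the "callback(...)" wrapper to get at the JSON inside.
match = re.match(r'^[\w.$]+\((.*)\)\s*$', jsonp, re.DOTALL)
data = json.loads(match.group(1))

print(data['name_full'])  # A-Cap Resources Limited
```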
EDIT: working code
import urllib2
import json
# new url
url = 'http://data.asx.com.au/data/1/company/ACB?fields=primary_share,latest_annual_reports,last_dividend,primary_share.indices'
# read all data
page = urllib2.urlopen(url).read()
# convert json text to python dictionary
data = json.loads(page)
print(data['principal_activities'])
Output:
Mineral exploration in Botswana, China and Australia.
EDIT (2020.12.23)
This answer is almost 5 years old and was written for Python 2. In Python 3 it would need urllib.request.urlopen() or requests.get(), but the real problem is that over those 5 years the page has changed its structure and technology. The URLs in the question and answer no longer exist, so the page needs a fresh analysis and a new method.
The question used the URL
http://www.asx.com.au/asx/research/company.do#!/ACB/details
but the page currently uses the URL
https://www2.asx.com.au/markets/company/acb
and it makes its AJAX/XHR requests to different URLs:
https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/about
https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/announcements
https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/key-statistics
etc.
You can find more URLs using DevTools in Chrome/Firefox (tab: Network, filter: XHR).
import urllib.request
import json
# new url
url = 'https://asx.api.markitdigital.com/asx-research/1.0/companies/acb/about'
# read all data
page = urllib.request.urlopen(url).read()
# convert json text to python dictionary
data = json.loads(page)
print(data['data']['description'])
Output:
Minerals exploration & development
scrape an angularjs website with java
In the end, I followed Madusudanan's excellent advice and looked into the PhantomJS/Selenium combination. And there actually is a solution! It's called PhantomJSDriver.
You can find the maven dependency here. Here is more info on ghost driver.
For the Maven setup, I added the following dependencies:
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.41.0</version>
</dependency>
<dependency>
    <groupId>com.github.detro</groupId>
    <artifactId>phantomjsdriver</artifactId>
    <version>1.2.0</version>
</dependency>
It also runs with Selenium version 2.45, the latest version as of this writing. I mention this because some articles I read claimed the PhantomJS driver isn't compatible with every version of Selenium, but that problem seems to have been addressed in the meantime.
If you are already using a Selenium/PhantomJS driver combination and you are getting 'strict javascript errors' on a certain site, update your version of Selenium; that will fix it.
And here is some sample code:
public void testPhantomDriver() throws Exception {
    DesiredCapabilities options = new DesiredCapabilities();
    // the website I am scraping uses SSL, but I don't know which version
    options.setCapability(PhantomJSDriverService.PHANTOMJS_CLI_ARGS, new String[] {
        "--ssl-protocol=any"
    });

    PhantomJSDriver driver = new PhantomJSDriver(options);
    driver.get("https://www.mywebsite");

    List<WebElement> elements = driver.findElementsByClassName("media-title");
    for (WebElement element : elements) {
        System.out.println(element.getText());
    }

    driver.quit();
}
How can I Webscrape in python with a website that uses AngularJS ng-include
The data you see on the page is loaded from an external URL. This script uses the requests and json modules to fetch and parse it:
import json
import requests
from bs4 import BeautifulSoup

url = 'https://api-us.melaleuca.com/search/v3'
params = {
    'id': 'products_by_category_facets',
    'index': 'en_us',
    'params': json.dumps(
        {"sortField": "", "sortDir": "", "from": 0, "size": 1000, "categoryId": "52", "filters": []}
    )
}
headers = {
    'Referer': 'https://www.melaleuca.com/ProductStore/content/category?c=52'
}

data = requests.get(url, params=params, headers=headers).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

# print some data to screen:
for h in data['hits']['hits']:
    s = BeautifulSoup(h['_source']['name']['enhanced'], 'html.parser')
    print(s.text)
    print(h['_source']['sku'])
    print('-' * 80)
Prints:
EcoSense® Clean Home Pack Save over $38.00
8219
--------------------------------------------------------------------------------
EcoSense Laundry Value 4-Pack with 96 Load bottles Save up to $10.00
6012
--------------------------------------------------------------------------------
Limited Time EcoSense Laundry Regular 4-Pack with 48 Load bottles Save $4.75
6013
--------------------------------------------------------------------------------
EcoSense® Kitchen Pack Save up to $4.00 (Mixing Spray Bottle not included)
8220
--------------------------------------------------------------------------------
EcoSense® Bathroom Pack Save $4.00 (Mixing Spray Bottles not included)
5679
--------------------------------------------------------------------------------
MelaSoft® Fabric Softener 96 Load 2-Pack Save $2.00
2154
--------------------------------------------------------------------------------
...and so on.
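One detail in the script above worth noting: the API expects its `params` query argument to be a single JSON string, so the nested dictionary is serialized with `json.dumps` before the query is URL-encoded. A small offline sketch of that encoding (roughly what requests does internally with a params dict):

```python
import json
from urllib.parse import urlencode

# The nested search options are serialized to one JSON string first...
params = {
    'id': 'products_by_category_facets',
    'params': json.dumps({"from": 0, "size": 1000, "categoryId": "52"}),
}

# ...and then the whole key/value mapping is URL-encoded as the query string.
query = urlencode(params)
print(query)
```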