Reading Dynamically Generated Web Pages Using Python


You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:

http://code.google.com/p/spynner/

http://phantomjs.org/

http://zombie.labnotes.org/

http://github.com/ryanpetrello/python-zombie

http://jeanphix.me/Ghost.py/

http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
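For instance, here is a minimal sketch using Selenium with headless Chrome (my choice of tool here is an assumption; any of the options above does the same job): it runs the page's JavaScript and hands you the rendered HTML.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)  # assumes a matching chromedriver is installed
driver.get("http://example.com/")  # placeholder for your JavaScript-heavy page
html = driver.page_source  # the DOM after the page's JavaScript has run
driver.quit()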

How do I scrape pages with dynamically generated URLs using Python?

I like this question, and because of that, I'll give a very thorough answer. For this, I'll use my favorite Requests library along with BeautifulSoup4. Porting over to Mechanize, if you really want to use that, is up to you. Requests will save you tons of headaches, though.


First off, you're probably looking for a POST request. However, a POST request is often not needed if a search function takes you straight to the page you're looking for. So let's inspect it, shall we?

When I land on the base URL, http://www.dailyfinance.com/, a quick check via Firebug or Chrome's inspect tool shows that when I put CSCO or AAPL into the search bar and trigger the "jump", the response is a 301 Moved Permanently status code. What does this mean?

[Screenshot: the network inspector showing the 301 Moved Permanently response]

In simple terms, I was transferred somewhere. The URL for this GET request is the following:

http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input=CSCO

Now, we test if it works with AAPL by using a simple URL manipulation.

import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
print(r.url)

The above gives the following result:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
[Finished in 2.3s]
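As an aside, Requests follows the 301 automatically; the redirect chain is recorded on the response if you want to see the hop:

print([resp.status_code for resp in r.history])  # e.g. [301] for the jump redirect
print(r.url)  # the final, resolved URL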

See how the URL of the response changed? Let's take the URL manipulation one step further and look for the /financial-ratios page by appending the following to the code above:

new_url = r.url + "/financial-ratios"
p = rq.get(new_url)
print(p.url)

When run, this gives the following result:

http://www.dailyfinance.com/quote/nasdaq/apple/aapl
http://www.dailyfinance.com/quote/nasdaq/apple/aapl/financial-ratios
[Finished in 6.0s]

Now we're on the right track. Next, I try to parse the data using BeautifulSoup. My complete code is as follows:

from bs4 import BeautifulSoup as bsoup
import requests as rq

apl_tick = "AAPL"
url = "http://www.dailyfinance.com/quote/jump?exchange-input=&ticker-input="
r = rq.get(url + apl_tick)
new_url = r.url + "/financial-ratios"
p = rq.get(new_url)

soup = bsoup(p.content, "html.parser")
div = soup.find("div", id="clear").table
rows = div.find_all("tr")
for row in rows:
    print(row)

I then try running this code, only to encounter an error with the following traceback:

  File "C:\Users\nanashi\Desktop\test.py", line 13, in <module>
div = soup.find("div", id="clear").table
AttributeError: 'NoneType' object has no attribute 'table'

Of note is the 'NoneType' object part of the message. This means our target div does not exist! Egads, but then why am I seeing the following?!

[Screenshot: the financial ratios table visible on the rendered page]

There can only be one explanation: the table is loaded dynamically! Rats. Let's see if we can find another source for the table. I study the page and see that there are scrollbars at the bottom. This might mean that the table was loaded inside a frame or was loaded straight from another source entirely and placed into a div in the page.
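Before chasing that down, a quick sanity check (a sketch reusing the soup object from the failed attempt) confirms the div really isn't in the HTML that Requests received:

div = soup.find("div", id="clear")
print(div)  # prints None: the div is not in the static HTML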

I refresh the page and watch the GET requests again. Bingo, I found something that seems a bit promising:

[Screenshot: the network tab showing a promising GET request to a third-party URL]

A third-party source URL, and look, it's easily manipulable using the ticker symbol! Let's try loading it into a new tab. Here's what we get:

[Screenshot: the financial ratios table served directly from the third-party URL]

WOW! We now have the exact source of our data. The last hurdle is whether it will work when we pull the CSCO data using this string (remember we went CSCO -> AAPL, and now back to CSCO again, so you're not confused). Let's clean up the string and ditch www.dailyfinance.com entirely. Our new URL is as follows:

http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US=AAPL

Let's try using that in our final scraper!

from bs4 import BeautifulSoup as bsoup
import requests as rq

csco_tick = "CSCO"
url = "http://www.motleyfool.idmanagedsolutions.com/stocks/financial_ratios.idms?SYMBOL_US="
new_url = url + csco_tick

r = rq.get(new_url)
soup = bsoup(r.content, "html.parser")

table = soup.find("div", id="clear").table
rows = table.find_all("tr")
for row in rows:
    print(row.get_text())

And our raw results for CSCO's financial ratios data are as follows:

Company
Industry

Valuation Ratios

P/E Ratio (TTM)
15.40
14.80

P/E High - Last 5 Yrs
24.00
28.90

P/E Low - Last 5 Yrs
8.40
12.10

Beta
1.37
1.50

Price to Sales (TTM)
2.51
2.59

Price to Book (MRQ)
2.14
2.17

Price to Tangible Book (MRQ)
4.25
3.83

Price to Cash Flow (TTM)
11.40
11.60

Price to Free Cash Flow (TTM)
28.20
60.20

Dividends

Dividend Yield (%)
3.30
2.50

Dividend Yield - 5 Yr Avg (%)
N.A.
1.20

Dividend 5 Yr Growth Rate (%)
N.A.
144.07

Payout Ratio (TTM)
45.00
32.00

Sales (MRQ) vs Qtr 1 Yr Ago (%)
-7.80
-3.70

Sales (TTM) vs TTM 1 Yr Ago (%)
5.50
5.60

Growth Rates (%)

Sales - 5 Yr Growth Rate (%)
5.51
5.12

EPS (MRQ) vs Qtr 1 Yr Ago (%)
-54.50
-51.90

EPS (TTM) vs TTM 1 Yr Ago (%)
-54.50
-51.90

EPS - 5 Yr Growth Rate (%)
8.91
9.04

Capital Spending - 5 Yr Growth Rate (%)
20.30
20.94

Financial Strength

Quick Ratio (MRQ)
2.40
2.70

Current Ratio (MRQ)
2.60
2.90

LT Debt to Equity (MRQ)
0.22
0.20

Total Debt to Equity (MRQ)
0.31
0.25

Interest Coverage (TTM)
18.90
19.10

Profitability Ratios (%)

Gross Margin (TTM)
63.20
62.50

Gross Margin - 5 Yr Avg
66.30
64.00

EBITD Margin (TTM)
26.20
25.00

EBITD - 5 Yr Avg
28.82
0.00

Pre-Tax Margin (TTM)
21.10
20.00

Pre-Tax Margin - 5 Yr Avg
21.60
18.80

Management Effectiveness (%)

Net Profit Margin (TTM)
17.10
17.65

Net Profit Margin - 5 Yr Avg
17.90
15.40

Return on Assets (TTM)
8.30
8.90

Return on Assets - 5 Yr Avg
8.90
8.00

Return on Investment (TTM)
11.90
12.30

Return on Investment - 5 Yr Avg
12.50
10.90

Efficiency

Revenue/Employee (TTM)
637,890.00
556,027.00

Net Income/Employee (TTM)
108,902.00
98,118.00

Receivable Turnover (TTM)
5.70
5.80

Inventory Turnover (TTM)
11.30
9.70

Asset Turnover (TTM)
0.50
0.50

[Finished in 2.0s]

Cleaning up the data is up to you.
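As a starting point, here is one way you might tidy the rows into a dict, assuming the three-column layout (metric, company, industry) shown above:

# hedged sketch: pair each metric with its company and industry values
ratios = {}
for row in rows:
    cells = [c.get_text(strip=True) for c in row.find_all(["td", "th"])]
    if len(cells) == 3:
        metric, company, industry = cells
        ratios[metric] = {"company": company, "industry": industry}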


One good lesson to learn from this scrape is that not all the data is contained in one page alone. It's pretty nice to see it coming from another static site. If it had been produced via JavaScript or AJAX calls or the like, our approach would likely have run into some difficulties.

Hopefully you learned something from this. Let us know if this helps and good luck.

How do I output a dynamically generated web page to a .html page instead of a .py CGI page?

First, I'd suggest that you remember that URLs are URLs and that file extensions don't matter, and that you should just leave it.

If that isn't enough, then remember that URLs are URLs and that file extensions don't matter, and configure Apache to use a different rule to determine that a given path is a CGI program rather than a static file to be served up as-is. You can use AddHandler to register a CGI handler for on-disk files with a .html extension, as sketched below.
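For example, a minimal httpd.conf excerpt along those lines (the directory path is a placeholder) might look like this:

# hypothetical httpd.conf excerpt: execute .html files in this directory as CGI
<Directory "/var/www/app">
    Options +ExecCGI
    AddHandler cgi-script .html
</Directory>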

Alternatively, you could use mod_rewrite to tell Apache that …/foo.html means …/foo.py, along these lines:
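# hypothetical .htaccess sketch: map foo.html onto the foo.py CGI program
RewriteEngine On
RewriteRule ^(.+)\.html$ $1.py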

Finally, I'd suggest that if you do muck around with what URLs look like, you remove any sign of something that looks like a file extension (so that …/foo is requested rather than …/foo.anything).

As for keeping the user on the same address for results as for the request: that is just a matter of having the program output the basic page, without results, when it doesn't get the query-string parameters that indicate a search term was passed.
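A minimal sketch of that idea (a hypothetical script; it assumes the search term arrives as a q query-string parameter):

#!/usr/bin/env python3
import cgi
import html

form = cgi.FieldStorage()
print("Content-Type: text/html")
print()
if "q" in form:
    # query string present: render the results page
    print("<p>Results for {}</p>".format(html.escape(form["q"].value)))
else:
    # no search term yet: render the basic page with just the form
    print('<form><input name="q"><input type="submit"></form>')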

How do I scrape content from a dynamically generated page using selenium and python?

I think you are looking for something like

browser.find_elements_by_css_selector('[class*="product-information__Title"]')

This should find all elements whose class attribute contains that string.
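Note that find_elements_by_css_selector is the older Selenium API; in Selenium 4 the equivalent (a sketch, assuming the same browser object) is:

from selenium.webdriver.common.by import By

# same selector, Selenium 4 style
elements = browser.find_elements(By.CSS_SELECTOR, '[class*="product-information__Title"]')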

Scraping data from a dynamic web table

The data is loaded dynamically from a different URL. You can use this example to load it with just requests/BeautifulSoup:

import json
import requests
from bs4 import BeautifulSoup

data = {
    "sort": "Einfahrtzeit-desc",
    "page": "1",
    "pageSize": "10",
    "group": "",
    "filter": "",
    "__RequestVerificationToken": "",
    "locid": "1",
}

headers = {"X-Requested-With": "XMLHttpRequest"}

url = "https://www.laerm-monitoring.de/zug/"
api_url = "https://www.laerm-monitoring.de/zug/train_read"

with requests.Session() as s:
    soup = BeautifulSoup(s.get(url).content, "html.parser")
    data["__RequestVerificationToken"] = soup.select_one(
        '[name="__RequestVerificationToken"]'
    )["value"]
    data = s.post(api_url, data=data, headers=headers).json()

    # pretty print the data
    print(json.dumps(data, indent=4))

Prints:

{
    "Data": [
        {
            "id": 2536954,
            "Einfahrtzeit": "2021-04-24T20:56:26.1703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 7.3,
            "Zugl\u00e4nge": 181.85884,
            "Geschwindigkeit": 115.57797,
            "Maximalpegel": 88.611084,
            "Vorbeifahrtpegel": 85.421326,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536944,
            "Einfahrtzeit": "2021-04-24T20:52:25.1703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 6.3,
            "Zugl\u00e4nge": 211.10226,
            "Geschwindigkeit": 152.60104,
            "Maximalpegel": 91.81743,
            "Vorbeifahrtpegel": 87.95224,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536929,
            "Einfahrtzeit": "2021-04-24T20:44:31.4703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 5.3,
            "Zugl\u00e4nge": 104.69964,
            "Geschwindigkeit": 110.10052,
            "Maximalpegel": 82.100815,
            "Vorbeifahrtpegel": 79.98168,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536924,
            "Einfahrtzeit": "2021-04-24T20:42:30.3703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 2.9,
            "Zugl\u00e4nge": 49.305683,
            "Geschwindigkeit": 125.18,
            "Maximalpegel": 98.63289,
            "Vorbeifahrtpegel": 97.25019,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536925,
            "Einfahrtzeit": "2021-04-24T20:42:20.5703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 0.0,
            "Zugl\u00e4nge": 0.0,
            "Geschwindigkeit": 0.0,
            "Maximalpegel": 0.0,
            "Vorbeifahrtpegel": 0.0,
            "G\u00fcltig": "-"
        },
        {
            "id": 2536911,
            "Einfahrtzeit": "2021-04-24T20:35:19.3703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 4.1,
            "Zugl\u00e4nge": 103.97647,
            "Geschwindigkeit": 132.2034,
            "Maximalpegel": 87.111984,
            "Vorbeifahrtpegel": 85.6776,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536907,
            "Einfahrtzeit": "2021-04-24T20:33:31.2703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "GZ",
            "Zugkategorie": "G\u00fcterzug",
            "Vorbeifahrtdauer": 23.8,
            "Zugl\u00e4nge": 583.19586,
            "Geschwindigkeit": 95.63598,
            "Maximalpegel": 88.02967,
            "Vorbeifahrtpegel": 85.02115,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536890,
            "Einfahrtzeit": "2021-04-24T20:25:36.1703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 3.5,
            "Zugl\u00e4nge": 104.63446,
            "Geschwindigkeit": 160.47487,
            "Maximalpegel": 88.60612,
            "Vorbeifahrtpegel": 86.46721,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536882,
            "Einfahrtzeit": "2021-04-24T20:22:05.8703+02:00",
            "Gleis": 2,
            "Richtung": "Hamburg",
            "Category": "GZ",
            "Zugkategorie": "G\u00fcterzug",
            "Vorbeifahrtdauer": 26.6,
            "Zugl\u00e4nge": 653.52515,
            "Geschwindigkeit": 94.59859,
            "Maximalpegel": 91.9396,
            "Vorbeifahrtpegel": 85.50632,
            "G\u00fcltig": "OK"
        },
        {
            "id": 2536869,
            "Einfahrtzeit": "2021-04-24T20:16:24.3703+02:00",
            "Gleis": 1,
            "Richtung": "Kiel",
            "Category": "PZ",
            "Zugkategorie": "Personenzug",
            "Vorbeifahrtdauer": 3.3,
            "Zugl\u00e4nge": 87.8222,
            "Geschwindigkeit": 160.01207,
            "Maximalpegel": 91.3928,
            "Vorbeifahrtpegel": 89.54336,
            "G\u00fcltig": "OK"
        }
    ],
    "Total": 8657,
    "AggregateResults": null,
    "Errors": null
}
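Since the form payload includes "page" and "pageSize" fields and the response reports "Total": 8657, it is a reasonable guess (untested against the live API) that you can page through the rest by bumping the page number. A sketch, run inside the with block above, keeping the form dict under its own name so the JSON response doesn't clobber it:

# assumes the same session `s`, `api_url`, and `headers` as above, and that
# `payload` is the form dict with the verification token already filled in
all_rows = []
for page in range(1, 4):  # first three pages, as an illustration
    payload["page"] = str(page)
    chunk = s.post(api_url, data=payload, headers=headers).json()
    all_rows.extend(chunk["Data"])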

How to retrieve the values of dynamic html content using Python

Assuming you are trying to get values from a page that is rendered using JavaScript templates (for instance, something like Handlebars), then the raw, unrendered template is all you will get with any of the standard solutions (e.g. BeautifulSoup or requests).

This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:

  1. parse the AJAX JSON directly
  2. use an offline JavaScript interpreter to process the request (SpiderMonkey, Crowbar)
  3. use a browser automation tool (e.g. splinter)

This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and it's pretty handy.


EDIT

From your comments it looks like it is a Handlebars-driven site. I'd recommend Selenium and Beautiful Soup. This answer gives a good code example that may be useful:

from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

html = driver.page_source
soup = BeautifulSoup(html, "html.parser")

# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)

Basically, Selenium gets the rendered HTML from your browser, and then you can parse it with BeautifulSoup from the page_source property. Good luck :)
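One caveat: page_source reflects whatever has rendered so far, so for slow pages it can help to wait for the target element first. A sketch using Selenium's standard explicit wait (the selector is a placeholder):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for at least one matching element to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a.my_class"))
)
html = driver.page_source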

Scraping dynamic content in a website

The polite option would be to ask the owners of the site if they have an API which allows you access to their news stories.

The less polite option would be to trace the HTTP transactions that take place while the page is loading and work out which one is the AJAX call which pulls in the data.

Looks like it's this one. But it looks like it might contain session data, so I don't know how long it will continue to work for.
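Once you've identified the AJAX URL, the usual trick is to replay it directly with Requests, using a session to carry cookies in case the call depends on session data. A generic sketch (both URLs below are placeholders, not the site's real endpoints):

import requests

with requests.Session() as s:
    s.get("https://example.com/news")  # visit the page first to pick up session cookies
    stories = s.get("https://example.com/api/news").json()  # hypothetical AJAX endpoint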


