Scrape Web Page Data Generated by JavaScript

Scrape web page data generated by javascript

You need to look at PhantomJS.

From their site:

PhantomJS is a headless WebKit with JavaScript API. It has fast and
native support for various web standards: DOM handling, CSS selector,
JSON, Canvas, and SVG.

Using the API, you can script the "browser" to interact with that page and scrape the data you need. You can then do whatever you need with it, including passing it to a PHP script if necessary.

That being said, if at all possible, try not to "scrape" the data. If the page is making an AJAX call, maybe there is an API you can use instead? If not, maybe you can convince them to make one. That would of course be much easier and more maintainable than screen scraping.

web scraping of tables generated using JavaScript

I saw neither a robots.txt nor a T&C, but I did read through the (quite daunting) "APPLICATION TO USE RESTRICTED MICRODATA" (I forgot I had an account that can access IPUMS, though I don't recall ever using it). I'm impressed by how clearly they stress the potentially sensitive nature of their data up front, before download.

Since this metadata has no "microdata" in it (it appears the metadata is provided to help folks decide what data elements they can select) and since acquisition & use of it doesn't violate any of the stated restrictions, the following should be OK. If a rep of IPUMS sees this and disagrees, I'll gladly remove the answer and ask the SO admins to really delete it, too (for those who aren't aware, folks w/high enough rep can see deleted answers).

Now, you don't need Selenium or Splash for this, but you'll need to do some post-processing of the data retrieved by the code below.

The data that builds the metadata tables is in a JavaScript blob in a <script> tag (use "View Source" to see it; you're going to need it later). We can use some string munging & the V8 package to get it:

library(V8)
library(rvest)
library(jsonlite)
library(stringi)

pg <- read_html("https://international.ipums.org/international-action/variables/MIGYRSBR#codes_section")

html_nodes(pg, xpath=".//script[contains(., 'Less than')]") %>% 
  html_text() %>% 
  stri_split_lines() %>% 
  .[[1]] -> js_lines

idx <- which(stri_detect_fixed(js_lines, '$(document).ready(function() {')) - 1

That finds the target <script> element, gets the contents, converts it to lines and finds the last line before the jQuery code, i.e. the end of the data. We pull out only the JavaScript code that holds the data, since the V8 engine in R isn't a full browser and can't execute the jQuery code that follows it.
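
For readers working in Python, the same line-slicing step could be sketched roughly as below. The function name and the sample script are made up for illustration (they are not taken from the IPUMS page), and actually running the extracted JavaScript would still need a JS engine, which this sketch does not cover.

```python
from typing import List

READY_MARKER = "$(document).ready(function() {"


def data_lines_before_ready(script_text: str, marker: str = READY_MARKER) -> List[str]:
    """Return the script lines that come before the first line containing `marker`."""
    lines = script_text.splitlines()
    for i, line in enumerate(lines):
        if marker in line:
            # Everything before the jQuery "ready" call is assumed to be plain data.
            return lines[:i]
    raise ValueError("marker not found in script text")


# Hypothetical stand-in for the contents of the target <script> element.
SAMPLE_SCRIPT = """var codeData = {"jsonPath": "/example/path"};
var other = 1;
$(document).ready(function() {
  render(codeData);
});"""

data_js = "\n".join(data_lines_before_ready(SAMPLE_SCRIPT))
print(data_js)
```

This mirrors `js_lines[1:idx]` in the R code: the slice stops just before the line that starts the jQuery block.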

We now create a "V8 context", extract the code, execute it in that context and retrieve the resulting codeData object:

ctx <- v8()

ctx$eval(paste0(js_lines[1:idx], collapse="\n"))

code_data <- ctx$get("codeData")

str(code_data)
## List of 14
##  $ jsonPath                  : chr "/international-action/frequencies/MIGYRSBR"
##  $ samples                   :'data.frame': 6 obs. of  2 variables:
##   ..$ name: chr [1:6] "br1960a" "br1970a" "br1980a" "br1991a" ...
##   ..$ id  : int [1:6] 2416 2417 2418 2419 2420 2651
##  $ categories                :'data.frame': 100 obs. of  5 variables:
##   ..$ id     : int [1:100] 4725113 4725114 4725115 4725116 4725117 4725118 4725119 4725120 4725121 4725122 ...
##   ..$ label  : chr [1:100] "Less than 1 year" "1" "2" "3" ...
##   ..$ indent : int [1:100] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ code   : chr [1:100] "00" "01" "02" "03" ...
##   ..$ general: logi [1:100] FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ longSamplesHeader         : chr "<tr class=\"fullHeader grayHeader\">\n\n          <th class=\"codesColumn\">Code</th>\n          <th class=\"la"| __truncated__
##  $ samplesHeader             : chr "\n<tr class=\"fullHeader grayHeader\">\n      <th class=\"codesColumn\">Code</th>\n      <th class=\"labelColum"| __truncated__
##  $ showCounts                : logi FALSE
##  $ generalWidth              : int 2
##  $ width                     : int 2
##  $ interval                  : int 25
##  $ isGeneral                 : logi FALSE
##  $ frequencyType             : NULL
##  $ project_uses_survey_groups: logi FALSE
##  $ variables_show_tab_1      : chr ""
##  $ header_type               : chr "short"

The jsonPath component suggests the page loads more data when building the codes & frequencies tables, so we can get that, too:

code_json <- fromJSON(sprintf("https://international.ipums.org%s", code_data$jsonPath))

str(code_json, 1)
## List of 6
##  $ 2416:List of 100
##  $ 2417:List of 100
##  $ 2418:List of 100
##  $ 2419:List of 100
##  $ 2420:List of 100
##  $ 2651:List of 100

Those "Lists of 100" are 100 numbers each.

You'll need to look at the code in the "View Source" (as suggested above) to see how you might be able to use those two bits of data to re-create the metadata table.

I do think you'd be better off following the path @alistaire started you on, but follow it fully. I saw no questions about obtaining "codes and frequencies" or "metadata" (such as this) in the forum (http://answers.popdata.org/) and read in at least 5 places that the IPUMS staff read and answer questions in the forums and also at their info-email address: ipums@umn.edu.

They obviously have this metadata somewhere electronically and could likely give you a complete dump of it across all data products to avoid further scraping (which I'm guessing is your goal, since I can't imagine a scenario where one would want to go through this trouble for one extract).

Python Scraping JavaScript page without the need of an installed browser

Aside from automating a browser, your other 2 options are as follows:

Try to find the backend request that loads the data via JavaScript. There is no guarantee one exists, but open your browser's Developer Tools, go to the Network tab, filter by Fetch/XHR and then refresh the page. Hopefully you'll see requests to a backend API that loads the data you want. If you do find a request, click on it and explore the endpoint, the headers and possibly the payload that is sent to get the response you are looking for; these can all be recreated in Python using requests against that hidden endpoint.
The other possibility is that the data is hidden in the HTML within a script tag, possibly as JSON. Open the Elements tab of your developer tools, where you can see the HTML of the page, right-click on the tag and click "Expand recursively". This will open every tag (it might take a second) and you'll be able to scroll down and search for the data you want. Ignore the regular HTML tags; we know the data is loaded by JavaScript, so look through any "script" tag. If you do find it, you can hopefully extract it in your script with a combination of Beautiful Soup to get the script tag and string slicing to pull out just the JSON.
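
The second option (data embedded in a script tag) could be sketched as follows. To stay dependency-free, this sketch uses the standard library's `html.parser` in place of Beautiful Soup; the sample HTML, the `window.pageData` variable name and the helper names are all made up for illustration. The string slicing assumes the JSON object is the last thing in the script that ends with a closing brace.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def json_from_script(html, needle):
    """Find the first script containing `needle` and slice out the JSON object after it."""
    parser = ScriptCollector()
    parser.feed(html)
    parser.close()
    for script in parser.scripts:
        if needle in script:
            start = script.index("{", script.index(needle))
            end = script.rindex("}") + 1  # assumes nothing with braces follows the object
            return json.loads(script[start:end])
    return None


# Hypothetical page source; a real page would come from an HTTP response body.
SAMPLE_HTML = """<html><body>
<div id="app"></div>
<script src="app.js"></script>
<script>window.pageData = {"items": [{"name": "a"}, {"name": "b"}]};</script>
</body></html>"""

page_data = json_from_script(SAMPLE_HTML, "window.pageData")
print([item["name"] for item in page_data["items"]])
```

With Beautiful Soup installed, the collector class could be swapped for a search over the page's script tags; the slicing step would stay the same.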

If neither of those produces results, try the requests_html package, specifically its "render" method. It automatically downloads a headless browser (Chromium) the first time you call render in your script.

What site is it? Perhaps I can offer more help if I can see it.

scrape html generated by javascript with python

In Python, I think Selenium 1.0 is the way to go. It’s a library that allows you to control a real web browser from your language of choice.

You need to have the web browser in question installed on the machine your script runs on, but it looks like the most reliable way to programmatically interrogate websites that use a lot of JavaScript.

Web scraping with python in javascript dynamic website

The website makes 3 API calls to get the data.

The code below makes the same calls and prints the responses.

(In the browser, press F12 -> Network -> XHR to see the API calls.)

import requests

# Each call is a form-encoded POST to the site's REST API.
# 1) Traceability standards for the document
payload1 = {'language': 'ca', 'documentId': 680124}
r1 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListTraceabilityStandard', data=payload1)
if r1.status_code == 200:
    print(r1.json())

print('------------------')

# 2) Validity list for the document
payload2 = {'documentId': 680124, 'orderBy': 'DESC', 'language': 'ca', 'traceability': '02'}
r2 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/getListValidityByDocument', data=payload2)
if r2.status_code == 200:
    print(r2.json())

print('------------------')

# 3) The document itself
payload3 = {'documentId': 680124, 'traceabilityStandard': '02', 'language': 'ca'}
r3 = requests.post('https://portaldogc.gencat.cat/eadop-rest/api/pjc/documentPJC', data=payload3)
if r3.status_code == 200:
    print(r3.json())
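
Passing a dict as `data=` makes requests send it as a form-encoded body. To inspect what that body looks like before contacting the server, a rough standard-library sketch could build the same POST without sending it. The helper name `build_post` is made up for illustration; the endpoint and payload are the ones from the first call above.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://portaldogc.gencat.cat/eadop-rest/api/pjc"


def build_post(endpoint, payload):
    """Build (but do not send) a form-encoded POST request for one API endpoint."""
    body = urlencode(payload).encode("ascii")
    return Request(f"{BASE_URL}/{endpoint}", data=body, method="POST")


req = build_post("getListTraceabilityStandard", {"language": "ca", "documentId": 680124})
print(req.get_method(), req.full_url)
print(req.data)
```

Comparing this body with the "Payload" shown in the browser's Network tab is a quick way to check that the script sends what the page sends.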

How can I scrape from a webpage that uses javascript to load in elements as you scroll?

The data is actually present in the page source. See view-source:https://www.smogon.com/dex/ss/pokemon/ (it sits inside a script tag as a JavaScript variable).

import json
import re

import requests

response = requests.get('https://www.smogon.com/dex/ss/pokemon/')

# Pull the JSON text assigned to the `dexSettings` JavaScript variable out of the page source.
data = "".join(re.findall(r'dexSettings = (\{.*\})', response.text))

# The regex returns a string; parse it with `json.loads()` to get regular Python objects.
data = json.loads(data)

# Navigate the parsed structure down to the list of items.
data = data.get('injectRpcs', [])[1][1].get('items', [])

for row in data:
    print(row.get('name', ''))
    print(row.get('description', ''))
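
The greedy pattern `\{.*\}` runs to the last closing brace on the line, so it can capture too much if anything containing a brace follows the object. One possible alternative, sketched here with a made-up sample string rather than the real page, is to let `json.JSONDecoder.raw_decode` stop at the end of the first complete JSON value. The helper name `json_after` is hypothetical.

```python
import json


def json_after(text, marker):
    """Decode the first JSON value that follows `marker` in `text`."""
    start = text.index(marker) + len(marker)
    # raw_decode does not skip leading whitespace, so skip it by hand.
    while start < len(text) and text[start].isspace():
        start += 1
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


# Made-up stand-in for the page source; note the extra braces after the object.
SAMPLE_PAGE = (
    'dexSettings = {"injectRpcs": [["a", {}], '
    '["b", {"items": [{"name": "Bulbasaur"}]}]]}; var tail = {"x": 1};'
)

settings = json_after(SAMPLE_PAGE, "dexSettings = ")
names = [row["name"] for row in settings["injectRpcs"][1][1]["items"]]
print(names)
```

The decoder ignores whatever comes after the object, so trailing JavaScript on the same line does not break the parse.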

