Screen Scraping from a Web Page with a Lot of JavaScript

Screen Scraping from a web page with a lot of Javascript

You may consider using HTMLunit
It's a java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino Javascript engine to process javascript on the pages it loads. There's also a JRuby wrapper for that, named Celerity. Its javascript support is not really perfect right now, but if your pages don't use many hacks things should work fine the performance should be way better than controlling a browser. Furthermore, you don't have to worry about cookies being persisted after your scraping is over and all the other nasty things connected to controlling a browser (history, autocomplete, temp files etc).

What's a good tool to screen-scrape with Javascript support?

You could use Selenium or Watir to drive a real browser.

Ther are also some JavaScript-based headless browsers:

  • PhantomJS is a headless Webkit browser.
    • pjscrape is a scraping framework based on PhantomJS and jQuery.
    • CasperJS is a navigation scripting & testing utility bsaed on PhantomJS, if you need to do a little more than point at URLs to be scraped.
  • Zombie for Node.js

Personally, I'm most familiar with Selenium, which has support for writing automation scripts in a good number of languagues and has more mature tooling, such as the excellent Selenium IDE extension for Firefox, which can be used to write and run testcases, and can export test scripts to many languages.

How to scrape HTTPS javascript web pages

Take a look at HTMLUnit - a headless Java browser that can be fully controlled by your code. A simple example can be seen here: http://htmlunit.sourceforge.net/gettingStarted.html

(obligatory warning: by screen-scraping the site, you may be breaking its ToS, and possibly open yourself to lawsuits; check whether you are allowed to do it before you start)

How can I, scrape data from a Javascript Content of a website?

try

from msilib.schema import Error
from tkinter import ON
from turtle import goto
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import numpy as np
from random import randint
import pandas as pd
import requests
import csv

browser = webdriver.Chrome(
r'C:\Users\paart\.wdm\drivers\chromedriver\win32\97.0.4692.71\chromedriver.exe')

browser.maximize_window() # For maximizing window
browser.implicitly_wait(20) # gives an implicit wait for 20 seconds

browser.get(
"https://www.nykaa.com/nykaa-skinshield-matte-foundation/p/460512?productId=460512&pps=1&skuId=460502")

browser.execute_script("document.body.style.zoom='50%'")
time.sleep(1)
browser.execute_script("document.body.style.zoom='100%'")

# Creates "load more" button object.
browser.implicitly_wait(20)
loadMore = browser.find_element_by_xpath(xpath='//div [@class="css-mqbsar"]')
loadMore.click()

browser.implicitly_wait(20)
desc_data = browser.find_elements_by_xpath('//div[@id="content-details"]/p')

# desc_data = browser.find_elements_by_class_name('content-details')
# here in your previous code this class('content-details') which is a single element so it is not iterable
# I used xpath to locate every every element <p> under the (id="content-details) attrid=bute

for desc in desc_data:
para_detail = desc.text
print(para_detail)

# if you you want to specify try this
# para_detail = desc_data[0].text
# expiry_ date = desc_data[1].text

and don't just copy the XPath from the chrome dev tools it's not reliable for dynamic content.



Related Topics



Leave a reply



Submit