Selenium with Scrapy for Dynamic Page

It really depends on how you need to scrape the site and what data you want to get.

Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # the "Next" link disappears on the last results page
                next_link = self.driver.find_element(By.XPATH, '//td[@class="pagn-next"]/a')
                next_link.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                break

        self.driver.close()
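The commented line marks where extraction would go. One way to fill it in (a sketch, not part of the original answer; the .s-item__title selector is an assumption you'd adapt to the live eBay markup) is to wrap the rendered source in a Scrapy Selector so the usual extraction API still applies:

from scrapy import Selector

def extract_items(driver):
    # reuse Scrapy's selector API on the Selenium-rendered HTML
    sel = Selector(text=driver.page_source)
    # '.s-item__title' is an assumed selector for result titles; adjust as needed
    for title in sel.css('.s-item__title::text').getall():
        yield {'title': title}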

Here are some examples of "selenium spiders":

  • Executing Javascript Submit form functions using scrapy in python
  • https://gist.github.com/cheekybastard/4944914
  • https://gist.github.com/irfani/1045108
  • http://snipplr.com/view/66998/

There is also an alternative to using Selenium with Scrapy. In some cases, the ScrapyJS middleware is enough to handle the dynamic parts of a page. Sample real-world usage (a minimal configuration sketch follows the list):

  • Scraping dynamic content using python-Scrapy
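Here is a minimal sketch of that approach, assuming scrapy-splash (the current packaging of the ScrapyJS integration) is installed and a Splash instance is running on localhost:8050; the spider name and target URL are placeholders:

import scrapy
from scrapy_splash import SplashRequest

class JsSpider(scrapy.Spider):
    name = "js_spider"  # placeholder name

    custom_settings = {
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
    }

    def start_requests(self):
        # Splash renders the JavaScript before the response reaches parse()
        yield SplashRequest('http://example.com', self.parse, args={'wait': 2})

    def parse(self, response):
        # response.body now holds the rendered HTML, so plain selectors work
        for href in response.css('a::attr(href)').getall():
            yield {'link': href}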

How can I send Dynamic website content to scrapy with the html content generated by selenium browser?

The two solutions are not very different. Solution #2 fits your question better, but choose whichever you prefer.

Solution 1 - create a response with the HTML body from the driver and scrape it right away (you can also pass it as an argument to a function):

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from scrapy.http import HtmlResponse

class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):
        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
        driver = webdriver.Chrome()

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # , up to till date '01/10/2022'

        for date in floorsheet_dates:
            driver.get("https://merolagani.com/Floorsheet.aspx")

            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                ).send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH,
                                               "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = driver.page_source

                response = HtmlResponse(url=driver.current_url, body=self.body, encoding='utf-8')
                for value in response.xpath('//tbody/tr'):
                    print(value.css('td::text').extract()[1])
                print("ok" * 200)

        # return an empty requests list
        return []
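Note that because start_requests returns an empty list, Scrapy itself never downloads anything in this solution: Selenium does all the fetching, and the HtmlResponse is just a convenient wrapper for running Scrapy selectors over the rendered HTML.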

Solution 2 - with a super simple downloader middleware:

(You might see a delay here before the parse method runs, so be patient.)

import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        url = spider.driver.current_url
        body = spider.driver.page_source
        return HtmlResponse(url=url, body=body, encoding='utf-8', request=request)

class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'tempbuffer.spiders.yetanotherspider.SeleniumMiddleware': 543,
            # 'projects_name.path.to.your.pipeline': 543
        }
    }
    driver = webdriver.Chrome()

    def start_requests(self):
        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # , up to till date '01/10/2022'

        for date in floorsheet_dates:
            self.driver.get("https://merolagani.com/Floorsheet.aspx")

            self.driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']"
                                     ).send_keys(date)
            self.driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = self.driver.find_element(By.XPATH,
                                                    "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                self.driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = self.driver.page_source
                self.url = self.driver.current_url

                yield Request(url=self.url, callback=self.parse, dont_filter=True)

    def parse(self, response, **kwargs):
        print('test ok')
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
        print("ok" * 200)

Notice that I've used Chrome, so change it back to Firefox as in your original code.

Next Page Selenium with Scrapy not working

For this particular website your code won't work, because the URL doesn't change after you click the Next button. Try waiting until the current page number changes (instead of waiting for the URL to change):

# needed imports
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def parse(self, response):
    self.driver.get(response.url)
    page_number_css = '#ContentPlaceHolder1_lvDataPager1>span'
    current_page_number = self.driver.find_element(By.CSS_SELECTOR, page_number_css).text
    while True:
        try:
            elem = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable(
                (By.XPATH, '//*[@id="ContentPlaceHolder1_lvDataPager1"]/a[text()="Next"]')))
            elem.click()
        except TimeoutException:
            break
        # wait until the pager shows a new page number instead of waiting on the URL
        WebDriverWait(self.driver, 10).until(
            lambda driver: driver.find_element(By.CSS_SELECTOR, page_number_css).text != current_page_number)
        current_page_number = self.driver.find_element(By.CSS_SELECTOR, page_number_css).text
        # the URL never changes between pages, so request the current URL and disable the dupe filter
        yield scrapy.Request(url=self.driver.current_url, callback=self.parse_page, dont_filter=True)
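The parse_page callback is left to the reader; a hypothetical skeleton (the table selectors are assumptions, not from the original answer) could look like this:

def parse_page(self, response):
    # hypothetical row/cell selectors; adjust them to the actual table markup
    for row in response.css('table tr'):
        yield {'cells': row.css('td::text').getall()}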

Python web scraping with Selenium on Dynamic Page - Issue with looping to next element

If you study the HTML of the page, the elements have an onclick script that triggers the JS and renders the pop-up. You can make use of it.
You can find the onclick script in the child img element.
So your logic should be: (1) get the child elements, (2) go to the first child element (which is always img in your case), (3) get the onclick script text, (4) execute the script.


for member in members:
    print(member.text)  # to see that the loop went on to the next iteration
    # member.find_element_by_xpath('//*[@class="projectImage"]').click()

    # Begin of modification
    child_elems = member.find_elements(By.CSS_SELECTOR, "*")   # (1)(2) get the child elements
    onclick_script = child_elems[0].get_attribute('onclick')   # (3) get the img's onclick value
    driver.execute_script(onclick_script)                      # (4) execute the JS
    time.sleep(5)                                              # wait for the pop-up to render
    # End of modification
    wait = WebDriverWait(driver, 10)
    element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'VIEW FULL PROFILE')))
    links = driver.find_element(By.PARTIAL_LINK_TEXT, "VIEW FULL PROFILE")
    href = links.get_attribute("href")
    member_list.append(href)
    member.find_element(By.XPATH, "/html/body/div[5]/div[1]/button").click()

print(member_list)

You need to import the time module. I prefer time.sleep over wait.until; it's easier to use when you are starting with web scraping.
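For contrast, here are the two styles of pausing side by side (a sketch assuming driver is an active WebDriver instance, reusing the link text from the code above):

import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

time.sleep(5)  # always blocks for 5 seconds, even if the pop-up appeared sooner

# blocks only until the condition holds, up to 10 seconds, then raises TimeoutException
WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.LINK_TEXT, 'VIEW FULL PROFILE')))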

Scrape urls from dynamic webpage using Scrapy

You can add URL-gathering logic to your parse() method by collecting the href attributes via a CSS selector:

def parse(self, response):
    self.driver.get(response.url)
    pause_time = 1
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    # start = datetime.datetime.now()
    urls = []
    while True:
        if len(urls) <= 10000:
            # re-read the rendered source each pass so newly loaded links are seen
            # (the original static `response` never changes while Selenium scrolls)
            sel = scrapy.Selector(text=self.driver.page_source)
            for href in sel.css('a::attr(href)'):
                urls.append(href)  # follow the tutorial to learn how to use the href object as you need
        else:
            break  # exit the while True loop once 10,000 links have been collected
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
        time.sleep(pause_time)
        print("\n\n\nend\n\n\n")
        new_height = self.driver.execute_script("return document.body.scrollHeight")

There's a lot of information on handling links in the following links section of the Scrapy tutorial. You can use the information there to learn what else you can do with links in Scrapy.
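As a taste of what that section covers, the canonical pattern from the tutorial is response.follow, which accepts relative URLs and selector objects directly:

def parse(self, response):
    for href in response.css('a::attr(href)'):
        # response.follow resolves relative URLs and extracts href from <a> selectors
        yield response.follow(href, callback=self.parse)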

I haven't tested this with the infinite scroll, so you may need to make some changes, but this should get you going in the right direction.


