selenium with scrapy for dynamic page
It really depends on how you need to scrape the site and what data you want to get.
Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:
import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # find_element raises NoSuchElementException on the last page,
                # so it has to live inside the try block
                next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next.click()
                # get the data and write it to scrapy items
            except NoSuchElementException:
                break

        self.driver.close()
Here are some examples of "selenium spiders":
- Executing Javascript Submit form functions using scrapy in python
- https://gist.github.com/cheekybastard/4944914
- https://gist.github.com/irfani/1045108
- http://snipplr.com/view/66998/
There is also an alternative to using Selenium with Scrapy. In some cases, the ScrapyJS middleware is enough to handle the dynamic parts of a page. Sample real-world usage:
- Scraping dynamic content using python-Scrapy
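ScrapyJS has since been folded into the scrapy-splash package; hooking it up is mostly a settings change. A sketch of the usual wiring, assuming a Splash instance listening on localhost:8050 (the middleware orders are the ones recommended by the scrapy-splash README):

```python
# settings.py sketch -- assumes a running Splash instance at localhost:8050
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

In the spider you then yield `SplashRequest(url, self.parse, args={'wait': 1})` instead of a plain `Request`, and the page arrives already rendered.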
How can I send Dynamic website content to scrapy with the html content generated by selenium browser?
The two solutions are not very different. Solution #2 fits your question better, but choose whichever you prefer.
Solution 1 - create a response from the driver's HTML body and scrape it right away (you can also pass the body as an argument to a function):
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from scrapy.http import HtmlResponse


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    def start_requests(self):
        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
        driver = webdriver.Chrome()

        floorsheet_dates = ['01/03/2016', '01/04/2016']  # , up to till date '01/10/2022'

        for date in floorsheet_dates:
            driver.get("https://merolagani.com/Floorsheet.aspx")
            driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']").send_keys(date)
            driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = driver.find_element(By.XPATH, "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = driver.page_source

                response = HtmlResponse(url=driver.current_url, body=self.body, encoding='utf-8')
                for value in response.xpath('//tbody/tr'):
                    print(value.css('td::text').extract()[1])
                    print("ok" * 200)

        # return an empty requests list
        return []
Solution 2 - with a super simple downloader middleware (you might see a delay before the parse method runs, so be patient):
import scrapy
from scrapy import Request
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.common.by import By


class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        url = spider.driver.current_url
        body = spider.driver.page_source
        return HtmlResponse(url=url, body=body, encoding='utf-8', request=request)


class FloorSheetSpider(scrapy.Spider):
    name = "nepse"

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'tempbuffer.spiders.yetanotherspider.SeleniumMiddleware': 543,
            # 'projects_name.path.to.your.pipeline': 543
        }
    }

    driver = webdriver.Chrome()

    def start_requests(self):
        # driver = webdriver.Firefox(executable_path=GeckoDriverManager().install())
        floorsheet_dates = ['01/03/2016', '01/04/2016']  # , up to till date '01/10/2022'

        for date in floorsheet_dates:
            self.driver.get("https://merolagani.com/Floorsheet.aspx")
            self.driver.find_element(By.XPATH, "//input[@name='ctl00$ContentPlaceHolder1$txtFloorsheetDateFilter']").send_keys(date)
            self.driver.find_element(By.XPATH, "(//a[@title='Search'])[3]").click()
            total_length = self.driver.find_element(By.XPATH, "//span[@id='ctl00_ContentPlaceHolder1_PagerControl2_litRecords']").text
            z = int((total_length.split()[-1]).replace(']', ''))
            for data in range(1, z + 1):
                self.driver.find_element(By.XPATH, "(//a[@title='Page {}'])[2]".format(data)).click()
                self.body = self.driver.page_source
                self.url = self.driver.current_url

                yield Request(url=self.url, callback=self.parse, dont_filter=True)

    def parse(self, response, **kwargs):
        print('test ok')
        for value in response.xpath('//tbody/tr'):
            print(value.css('td::text').extract()[1])
            print("ok" * 200)
Notice that I've used Chrome, so change it back to Firefox as in your original code.
Next Page Selenium with Scrapy not working
For this particular website your code won't work, because the URL doesn't change after you click the Next button. Try waiting until the current page number changes (instead of waiting for the URL to change):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def parse(self, response):
    self.driver.get(response.url)
    current_page_number = self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text
    while True:
        try:
            elem = WebDriverWait(self.driver, 10).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="ContentPlaceHolder1_lvDataPager1"]/a[text()="Next"]')))
            elem.click()
        except TimeoutException:
            break
        # wait until the displayed page number differs from the one seen before the click
        WebDriverWait(self.driver, 10).until(lambda driver: self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text != current_page_number)
        current_page_number = self.driver.find_element_by_css_selector('#ContentPlaceHolder1_lvDataPager1>span').text
        yield scrapy.Request(url=self.driver.current_url, callback=self.parse_page, dont_filter=False)
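The lambda passed to WebDriverWait can also be packaged as a reusable condition class, mirroring the shape of the classes in Selenium's expected_conditions module. A sketch, exercised here with fake driver/element stand-ins so it runs without a browser (all names below are made up):

```python
class text_changed:
    """Wait condition: truthy once the located element's text differs from old_text."""
    def __init__(self, locator, old_text):
        self.locator = locator
        self.old_text = old_text

    def __call__(self, driver):
        # WebDriverWait calls this repeatedly until it returns a truthy value
        return driver.find_element(*self.locator).text != self.old_text

# Fake stand-ins so the condition can be demonstrated offline.
class FakeElement:
    def __init__(self, text):
        self.text = text

class FakeDriver:
    def __init__(self, text):
        self._text = text
    def find_element(self, by, selector):
        return FakeElement(self._text)

cond = text_changed(('css selector', '#ContentPlaceHolder1_lvDataPager1>span'), '1')
print(cond(FakeDriver('1')))  # -> False (pager still shows the old number)
print(cond(FakeDriver('2')))  # -> True  (pager advanced)
```

With a real driver you would write `WebDriverWait(self.driver, 10).until(text_changed((By.CSS_SELECTOR, '#ContentPlaceHolder1_lvDataPager1>span'), current_page_number))`.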
Python web scraping with Selenium on Dynamic Page - Issue with looping to next element
If you study the HTML of the page, it has an onclick attribute which triggers the JS that renders the pop-up, and you can make use of it. The onclick script sits on the child element img.
So your logic should be: (1) get the child elements, (2) take the first one (which is always img in your case), (3) get its onclick script text, (4) execute that script.
for member in members:
    print(member.text)  # to see that the loop went to the next iteration
    # member.find_element_by_xpath('//*[@class="projectImage"]').click()
    # Begin of modification
    child_elems = member.find_elements_by_css_selector("*")  # get the child elements
    onclick_script = child_elems[0].get_attribute('onclick')  # get the img's onclick value
    driver.execute_script(onclick_script)  # execute the JS
    time.sleep(5)  # wait for some time
    # End of modification
    wait = WebDriverWait(driver, 10)
    element = wait.until(EC.element_to_be_clickable((By.LINK_TEXT, 'VIEW FULL PROFILE')))
    links = driver.find_element_by_partial_link_text("VIEW FULL PROFILE")
    href = links.get_attribute("href")
    member_list.append(href)
    member.find_element_by_xpath("/html/body/div[5]/div[1]/button").click()
print(member_list)
You need to import the time module. I prefer time.sleep over wait.until; it's easier to use when you are starting with web scraping.
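For comparison, wait.until is just a polling loop under the hood. A hypothetical minimal re-implementation (wait_until and page_loaded are made-up names, not Selenium API), which may make the explicit-wait style less intimidating:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.1):
    """Poll condition() until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)  # same fixed sleep, just bounded and condition-driven
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Demo: a condition that becomes true on the third poll
counter = {'n': 0}
def page_loaded():
    counter['n'] += 1
    return counter['n'] >= 3

print(wait_until(page_loaded, timeout=2.0))  # -> True
```

The practical difference: time.sleep(5) always burns 5 seconds, while a condition-driven wait returns as soon as the page is ready and fails loudly when it never is.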
Scrape urls from dynamic webpage using Scrapy
You can add logic for gathering URLs to your parse() method by collecting the href attributes via CSS selectors:
def parse(self, response):
    self.driver.get(response.url)
    pause_time = 1
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    # start = datetime.datetime.now()
    urls = []
    while True:
        if len(urls) <= 10000:
            for href in response.css('a::attr(href)'):
                urls.append(href)  # Follow the tutorial to learn how to use the href object as you need
        else:
            break  # Exit the while True loop when 10,000 links have been collected
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight + 400);")
        time.sleep(pause_time)
        print("\n\n\nend\n\n\n")
        new_height = self.driver.execute_script("return document.body.scrollHeight")
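The snippet above computes new_height but never compares it to last_height; the usual infinite-scroll stop condition is to break when the height stops growing. A sketch of that logic, driven here by a fake height sequence instead of a real driver (collect_until_stable and the fakes are made-up names):

```python
def collect_until_stable(get_height, scroll, max_rounds=100):
    """Scroll until the page height stops growing, then return the final height."""
    last_height = get_height()
    for _ in range(max_rounds):
        scroll()
        new_height = get_height()
        if new_height == last_height:  # nothing new loaded -> we've hit the bottom
            break
        last_height = new_height
    return last_height

# Fake page: height grows twice, then stabilises (stand-in for execute_script calls)
heights = iter([1000, 1400, 1800, 1800])
current = {'h': next(heights)}
get_height = lambda: current['h']
def scroll():
    current['h'] = next(heights, current['h'])

print(collect_until_stable(get_height, scroll))  # -> 1800
```

In the spider, get_height maps to `driver.execute_script("return document.body.scrollHeight")` and scroll to the window.scrollTo call, so the loop ends either at 10,000 links or at the true bottom of the page, whichever comes first.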
There's a lot of information on handling links in the following links section of the Scrapy tutorial; you can use it to learn what else you can do with links in Scrapy.
I haven't tested this with infinite scroll, so you may need to make some changes, but it should get you going in the right direction.