Navigating through pagination with Selenium in Python
Before automating any scenario, always write down the manual steps you would perform to execute it. The manual steps for what you want to do (as I understand from the question) are -
1) Go to site - https://services.wiltshire.gov.uk/PlanningGIS/LLPG/WeeklyList
2) Select first week option
3) Click search
4) Get the data from every page
5) Load the url again
6) Select second week option
7) Click search
8) Get the data from every page
.. and so on.
You have a loop that selects the different weeks, but inside each iteration of that loop you also need a loop that iterates over all the pages. Since you are not doing that, your code returns only the data from the first page.
Another problem is with how you are locating the 'Next' button -
driver.find_element_by_xpath('//*[@id="form1"]/div[3]/a[4]').click()
You are selecting the 4th <a> element, which is of course not robust because on different pages the Next button's index will be different. Instead, use this better locator -
driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
Logic for creating the loop that will iterate through the pages -
First you will need the number of pages. I got that by locating the <a> element immediately before the "Next" button; in the pagination bar, the text of this element equals the total number of pages -
I did that using the following code -
number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
Now that you have the number of pages as number_of_pages, you only need to click the "Next" button number_of_pages - 1 times!
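As a sanity check, the click count follows directly from starting on page 1; a tiny helper (the function name is mine, not from the code below) makes the arithmetic explicit:

```python
def next_clicks_needed(number_of_pages):
    # you start on page 1, so reaching the last page takes one fewer click
    return max(number_of_pages - 1, 0)

print(next_clicks_needed(5))  # 4 clicks walk you through pages 1 -> 5
print(next_clicks_needed(1))  # 0: a single page needs no clicking at all
```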
Final code for your main function -
def main():
    all_data = []
    select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
    list_options = select.options
    for item in range(len(list_options)):
        select = Select(driver.find_element_by_xpath("//select[@class='formitem' and @id='selWeek']"))
        select.select_by_index(str(item))
        driver.find_element_by_css_selector("input.formbutton#csbtnSearch").click()
        number_of_pages = int(driver.find_element_by_xpath("//a[contains(text(),'Next')]/preceding-sibling::a[1]").text)
        for j in range(number_of_pages - 1):
            all_data.extend(getData())
            driver.find_element_by_xpath("//a[contains(text(),'Next')]").click()
            time.sleep(1)
        all_data.extend(getData())  # collect the last page too, which the loop above stops short of
        driver.get(url)
    with open('wiltshire.json', 'w+') as f:
        json.dump(all_data, f)
    driver.quit()
Navigate pagination with Selenium Webdriver
Since the urls are in the format BASE_URL+page=NUM_PAGE, you could simply get the maximum page number (7 in your case).
In that way you can build all the urls with something like:
BASE_URL = "https://dati.comune.milano.it/dataset?groups=heal"
MAX_PAGES = 7  # the maximum page number you found
urls = []
for page_num in range(1, MAX_PAGES + 1):  # + 1 so the last page is included
    urls.append(f"{BASE_URL}&page={page_num}")
In this way you'll have all the pages without having to click anything; you only need the maximum number of pages, which you can easily find as you already did.
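If you'd rather not hard-code the maximum page number, you can read it off the last page-bar link's href with the standard library. A minimal sketch, assuming the site uses a page query parameter as in the URLs above (the helper name is mine):

```python
from urllib.parse import urlparse, parse_qs

def max_page_from_href(href):
    # pull the "page" query parameter out of a pagination link;
    # default to 1 when the parameter is absent
    query = parse_qs(urlparse(href).query)
    return int(query.get("page", ["1"])[0])

# e.g. the href of the last link in the page bar
print(max_page_from_href("https://dati.comune.milano.it/dataset?groups=heal&page=7"))  # 7
```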
SELENIUM SOLUTION
There are probably a thousand ways and more to do this more cleanly, but this one works for me. Simply loop over the list of page numbers until you find the "active" one, as you said, and click the following one.
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

BASE_URL = "https://dati.comune.milano.it/dataset?groups=heal&page=1"  # first page

driver = webdriver.Chrome(
    executable_path=ChromeDriverManager().install()
)
driver.get(BASE_URL)

url_list_xpath = "/html/body/div[2]/div/div[3]/div/section[1]/div[2]/ul"  # this is the page bar at the bottom
to_click = False
last_page = driver.find_element_by_xpath("/html/body/div[2]/div/div[3]/div/section[1]/div[2]/ul/li[5]/a") \
    .get_attribute("href")  # find last page
current_page = BASE_URL

# iterate over the urls and click the next url after the active one
while current_page != last_page:
    ul = driver.find_element_by_xpath(url_list_xpath)
    for li in ul.find_elements_by_tag_name("li"):
        if to_click:
            break  # li is now the element right after the active one
        if li.get_attribute("class") == 'active':
            to_click = True
    to_click = False
    current_page = li.find_elements_by_tag_name("a")[0].get_attribute("href")
    driver.get(current_page)
How to navigate through pagination? (Selenium)
You can wrap the code inside a while loop, with a variable pagination_starting_point whose initial value is set to 2; for each iteration we increase the counter.
Code :
driver = webdriver.Chrome(driver_path)
driver.maximize_window()
driver.implicitly_wait(30)
wait = WebDriverWait(driver, 30)
pagination_starting_point = 2  # page 1 is already loaded
while True:
    try:
        col1 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[1]')
        col2 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[2]')
        col3 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[3]')
        col4 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[4]')
        col5 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[5]')
        col6 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[6]')
        col7 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[7]')
        col8 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[8]')
        col9 = driver.find_elements_by_xpath('//table[@id="ctl00_cph1_grdRfqSearch"]/tbody/tr/td[9]')
        col1_data = [s.text for s in col1]
        col2_data = [s.text for s in col2]
        col3_data = [s.text for s in col3]
        col4_data = [s.text for s in col4]
        col5_data = [s.text for s in col5]
        col6_data = [s.text for s in col6]
        col7_data = [s.text for s in col7]
        col8_data = [s.text for s in col8]
        col9_data = [s.text for s in col9]
        wait.until(EC.element_to_be_clickable((By.XPATH, f"//a[contains(@href,'Page${pagination_starting_point}')]"))).click()
        print("Click on page " + str(pagination_starting_point))
        pagination_starting_point = pagination_starting_point + 1
        if pagination_starting_point == 45:
            break
    except:
        print("Looks like job done !")
        break
Imports :
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Updated to use the xpath //a[contains(@href,'Page${pagination_starting_point}')] instead of the page-number link text.
Issue in navigating ASPX pagination using Selenium
Instead of clicking on every number, use the next button.
from selenium.common.exceptions import NoSuchElementException

driver.implicitly_wait(5)
driver.get("https://www.shoroukbookstores.com/books/publishing-house.aspx?id=429c2704-5fa3-43b0-9ad7-c2ec9062beb3")
while True:
    try:
        navigation_links = driver.find_element_by_xpath("//*[@id='Body_AspNetPager']").find_elements_by_tag_name("a")
        next_page = len(navigation_links) - 1  # the next-page button is second from the end
        driver.find_element_by_xpath(f"//*[@id='Body_AspNetPager']/a[{next_page}]").click()
        print("Next page clicked")
    except NoSuchElementException:
        print("No more pages")
        break
driver.quit()
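The "second from the end" index arithmetic can be sanity-checked without a browser. The link texts below are assumptions about the pager layout, not taken from the real page:

```python
# Assumed pager layout: page links, then a "next" arrow (>), then a "last" arrow (>>)
links = ["1", "2", "3", "10", ">", ">>"]
next_page = len(links) - 1   # XPath a[n] is 1-indexed, so a[next_page] targets ">"
print(links[next_page - 1])  # ">" (subtract 1 to index the 0-based Python list)
```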
How can i navigate through pagination using selenium python?
In such cases, it is always worth taking a closer look at the page to understand how the data is actually updated. I did so by opening the console in Firefox and looking at the XHR network traffic.
... interesting. The page gets its results from an endpoint we can identify. It returns json data, which is great:
{'totalJobs': 2541,
 'jobs': [{'location': [{'jobLocationID': 0,
                         'jobID': 24986,
                         'countryID': 0,
                         'country': 'Pakistan',
                         'cityID': None,
                         'cityText': 'Karachi',
                         'jobShiftID': 0,
                         'name': None}],
           'jobID': 24986,
           'jobIDEncrypted': '26cfb27ee6b2abad',
           'title': 'Marketing Officer - Freelancer',
           'jobDescription': '<p>We are growing, energetic, and highly-reputed Public Relation (PR) and Digital Marketing Agency.<br />\nCurrently, we are looking for ...
Let's use this to write our script:
import requests
import math
import json
import pandas as pd

# The scraping function
def getJobs(pageNumber):
    # Defining the headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
        'X-Requested-With': 'XMLHttpRequest',
        'Content-Type': 'application/json;charset=utf-8',
        'DNT': '1',
        'Connection': 'keep-alive',
        'Referer': 'https://jobee.pk/jobs-in-pakistan',
        'Pragma': 'no-cache'
    }
    # Setting the right params for the request we will make, pageSize is set to 200 (results by page)
    data = {"model":{"titles":[],"cities":[],"shifts":[],"experinces":[],"careerLevels":[],"functionalAreas":[],"genders":[],"industries":[],"degreeLevels":[],"companies":[]},"pageNumber":1,"pageSize":200}
    # Updating the page number
    data['pageNumber'] = pageNumber
    data = json.dumps(data)
    # Collecting the results
    response = requests.post('https://jobee.pk/job/jobsearch', headers=headers, data=data)
    # Just in case an error shows up
    try:
        return json.loads(response.content)
    except:
        return {'jobs': []}

# First, get the total number of pages from page 1
data = getJobs(1)
totalJobs = data['totalJobs']
number_of_pages = math.ceil(totalJobs / 200)

# Initializing our job list
jobs_list = []

# Looping through the pages
for pageNumber in range(1, number_of_pages + 1):
    results = getJobs(pageNumber)
    # If no results, we end the loop
    if len(results['jobs']) == 0:
        break
    else:
        # Append the results under the 'jobs' key to our list
        jobs_list += results['jobs']
        print('Page', pageNumber, '-', len(jobs_list), "jobs collected")

# Let's have a look at the data in a dataframe
df = pd.DataFrame(jobs_list)
print(df)
Output
Page 1 - 200 jobs collected
Page 2 - 400 jobs collected
Page 3 - 600 jobs collected
...
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
| | appliedByDate | companyName | experience | expiredDate | isSalaryVisible | jobDescription | jobID | jobIDEncrypted | location | logo | numberOfPositions | postDate | publishDate | salaryRange | skills | title | titleWithoutSpecialCharacters | viewCount |
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
| 0 | 0001-01-01T00:00:00 | Custom House | Fresh | 2019-09-19T00:00:00 | True | <p>We require Mean Stack Developer Interns who... | 27925 | a0962bea0bc174a1 | [{'jobLocationID': 0, 'jobID': 27925, 'country... | 14564Logo.jpg | 3 | 2019-06-21T14:04:01.363 | 2019-06-21T19:26:24.213 | 5000 - 10000 | [AngularJs, Mongo DB, JavaScript, Node Js, Mea... | Mean Stack Developer - Intern | Mean-Stack-Developer-Intern | 10 |
| 1 | 0001-01-01T00:00:00 | Custom House | Fresh | 2019-09-19T00:00:00 | True | <p>We requires SEO, Digital Marketing and Grap... | 27924 | 81e4e7f7d672dffd | [{'jobLocationID': 0, 'jobID': 27924, 'country... | 14564Logo.jpg | 2 | 2019-06-21T14:00:26.45 | 2019-06-21T19:25:04.493 | 5000 - 10000 | [Graphic Design, Search Engine Optimization (S... | SEO Executive / Graphic Designer - Intern | SEO-Executive-Graphic-Designer-Intern | 10 |
| 2 | 0001-01-01T00:00:00 | Printoscan Lahore | 1 Year | 2019-09-19T00:00:00 | True | <p>We require an <strong>Accounts Assistant / ... | 27923 | 137a257e9e5bbb5d | [{'jobLocationID': 0, 'jobID': 27923, 'country... | None | 1 | 2019-06-21T13:59:37.373 | 2019-06-21T19:19:07.36 | 15000 - 20000 | [Accounts Services, Administrative Skills, Acc... | Accounts Assistant / Administrator | Accounts-Assistant-Administrator | 6 |
+----+----------------------+--------------------+-------------+----------------------+------------------+----------------------------------------------------+--------+-------------------+----------------------------------------------------+----------------+--------------------+--------------------------+--------------------------+----------------+----------------------------------------------------+--------------------------------------------+----------------------------------------+-----------+
This is what we wanted.
Trouble Navigating To the Next Page Using Python Selenium
The following calculates the number of pages based on the result count and the known max number of results per page.
It loops through clicking on the appropriate href containing this page number. Where this number is not visible, the raised exception is handled and the initial pagination ellipsis is clicked to reveal the page.
I print the first td of the first tr, on pages greater than 1, to show that the page has been visited. I also swap out hard-coded waits for wait conditions.
I have used ChromeDriver.
This is to give you a framework to use. I tested it and it ran for all region selections and pages.
import time
import math
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException, NoSuchElementException

results_per_page = 50
url = 'https://database.ib-net.org/search_utilities?type=2'
browser = webdriver.Chrome()  # FireFox()
browser.get(url)
print("Retrieving the site...")

# All regions available
regions = ['Africa', 'East Asia and Pacific', 'Europe and Central Asia', 'Latin America (including USA and Canada)', 'Middle East and Northern Africa', 'South Asia']

for region in regions:
    # Select all options from drop down menu
    selectOption = Select(browser.find_element_by_id('MainContent_ddRegion'))
    print("Now constructing output for: " + region)
    # Select table and wait for data to populate
    selectOption.select_by_visible_text(region)
    WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))
    num_results = int(browser.find_element_by_id('MainContent_lblqResults').text)
    num_pages = math.ceil(num_results / results_per_page)
    print(f'pages to scrape are: {num_pages}')
    for page in range(2, num_pages + 1):
        print(f'visiting page {page}')
        try:
            browser.find_element_by_css_selector(f'.pagination > li > [href*="Page\${page}"]').click()
            WebDriverWait(browser, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '#MainContent_gvUtilities tr > td')))
            print(browser.find_element_by_css_selector('#MainContent_gvUtilities tr:nth-child(2) span').text)
        except NoSuchElementException:
            # page number hidden behind the ellipsis: click it to reveal more pages
            browser.find_element_by_css_selector('.pagination > li > a').click()
        except Exception as e:
            print(e)
            continue
how to handle pagination and scrape using selenium
If I understand your question in the right way, the following should do it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.amazon.in/Skybags-Brat-Black-Casual-Backpack/dp/B08Z1HHHTD/ref=sr_1_2?dchild=1&keywords=skybags&qid=1627786382&sr=8-2')

product_title = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "productTitle"))).text
print(product_title)

WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@data-hook='see-all-reviews-link-foot']"))).click()

while True:
    for item in WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "[data-hook='review']"))):
        reviewer = item.find_element_by_css_selector("span.a-profile-name").text
        review = ' '.join([i.text.strip() for i in item.find_elements_by_xpath(".//span[@data-hook='review-body']")])
        print(reviewer, review)
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//*[@data-hook='pagination-bar']//a[contains(@href,'/product-reviews/') and contains(text(),'Next page')]"))).click()
        WebDriverWait(driver, 10).until(EC.staleness_of(item))
    except Exception as e:
        break
driver.quit()